Xin Zhao
Institute of Information Engineering, Chinese Academy of Sciences, China
zhaoxin@iie.ac.cn

Xiaojun Chen
Institute of Information Engineering, Chinese Academy of Sciences, China
chenxiaojun@iie.ac.cn

Haoyu Gao
School of Computer Science, Georgia Institute of Technology, USA
gao.howard517@gmail.com
Abstract
Due to the remarkable generative potential of diffusion-based models, numerous studies have investigated jailbreak attacks targeting these frameworks. A particularly concerning threat within image models is the generation of Not-Safe-for-Work (NSFW) content. Despite the implementation of security filters, numerous efforts continue to explore ways to circumvent these safeguards. Current attack methodologies primarily rely on adversarial prompt engineering or concept obfuscation, yet they frequently suffer from slow search efficiency, conspicuous attack characteristics, and poor alignment with targets. To overcome these challenges, we propose Antelope, a more robust and covert jailbreak attack strategy designed to expose security vulnerabilities inherent in generative models. Specifically, Antelope exploits the confusability of sensitive concepts with visually similar ones, searches in the semantically adjacent space of these related concepts, and aligns the result with the target imagery, thereby generating sensitive images that are consistent with the target while evading detection. Moreover, we successfully exploit the transferability of model-based attacks to penetrate online black-box services. Experimental evaluations demonstrate that Antelope outperforms existing baselines across multiple defensive mechanisms, underscoring its efficacy and versatility.
Disclaimer: This paper contains unsafe imagery that might be offensive to some readers.
1 Introduction
Recent advancements have highlighted the revolutionary capabilities of generative models, particularly those built on transformer [37, 2, 6] and diffusion [11, 35] architectures. The convergence of these technologies has produced increasingly powerful models for image [31, 28, 29, 21] and video [7, 24] generation. However, the vulnerabilities inherent in these models give rise to emerging safety concerns [44, 16, 4, 26, 30, 41]. Chief among these is the issue of misalignment, which facilitates the generation of harmful or inappropriate content, such as Not-Safe-for-Work (NSFW) imagery that includes nudity, violence, gore, and other potentially sensitive materials [26, 40].
To mitigate the issue of inappropriate generation, developers of Text-to-Image (T2I) models have implemented external defense measures like text filters [28, 29, 20] and image filters [31], as shown in Fig.1. Additionally, significant efforts [15, 34, 8] have been made to enhance the internal safety and robustness of T2I models through retraining or fine-tuning. Under these defense mechanisms, the generation of explicit NSFW content from inappropriate prompts is effectively blocked, while normal prompts continue to produce appropriate and non-sensitive imagery.
Despite these safeguards, the inherent ambiguity of the text space and the misalignment between the text and image spaces continually create fertile ground for jailbreak attacks, which seek to circumvent the system’s safety and ethical guardrails. For instance, SneakyPrompt [41] demonstrates that perturbing similar words (e.g., “nice” vs. “n1ce”) or using synonyms to paraphrase the input while preserving the original semantics can alter the prediction results of T2I models. As illustrated in Fig. 1, the overarching goal of jailbreaking T2I models is to craft adversarial prompts that, while classified as benign, generate harmful imagery capable of evading multiple defense mechanisms. To achieve this, JPA [17] identifies adversarial prompts within the sensitive regions of the text space by appending learnable tokens to each input, while SneakyPrompt [41] employs reinforcement learning to uncover such adversarial prompts. MMA [39] introduces a greedy search method based on gradient optimization, which requires perturbations in both the text and image modalities to bypass post-synthesis image checkers. Although these methods achieve successful jailbreaks, they are computationally expensive due to their extensive search processes. In contrast, QF-Attack [45] employs three strategies (greedy, genetic, and PGD [18]) and achieves greater efficiency by restricting the search space to a character table. However, the adversarial prompts generated by QF-Attack [45] suffer from poor alignment with the original semantic intent of the target image.
Additionally, our analysis of existing jailbreak attack methods reveals that their adversarial prompts frequently contain superfluous or nonsensical words and symbols, making the anomalies easy to observe and detect. Such prompts may produce sensitive images when tested on offline models such as Stable Diffusion [31]. However, on more advanced models such as GPT-4o [23] and Midjourney [20], these prompts are flagged and require further clarification, as shown in Fig. 2. This finding underscores the need for more subtle adversarial prompts capable of bypassing various safety mechanisms.
In light of these challenges, our primary task is to develop an efficient method for searching adversarial prompts that can bypass safety content moderation systems. Our approach is guided by three key objectives:
Objective I: Identifying adversarial prompts that can effectively bypass safety filters.
Objective II: Improving the alignment and concealment of adversarial prompts.
Objective III: Minimizing the total searching time.
To achieve Objective I, we replace conspicuous adversarial terms in the original prompts and append specific suffix words to compose adversarial prompts that can bypass safety filters. For Objective II, we ensure that these suffix words are inconspicuous and maintain high cosine similarity between the adversarial text embedding and both the reference image embedding and the original text embedding, providing strong alignment and concealment. To meet Objective III, we optimize the search process by filtering the candidate vocabulary list, setting an appropriate threshold, and stopping early once a suitable prompt is identified.
The main contributions are summarized as follows:
We design and implement a highly effective jailbreak attack strategy, Antelope, to explore adversarial prompts that can bypass the safety mechanisms of T2I models.
Antelope is compared with multiple attack methods across various defense baselines, demonstrating outstanding superiority and exceptional robustness.
Extensive evaluation and analysis highlight Antelope’s efficiency in generating adversarial prompts with minimal detection risk and high semantic alignment.
2 Related Work
Defensive methods against NSFW generation. Current defense strategies for Text-to-Image (T2I) models can be broadly divided into external and internal defenses. External defenses typically involve post-hoc content moderation, employing prompt checkers to identify and filter malicious prompts or image checkers to censor NSFW elements in synthesized images. For instance, Rando et al. [30] describe how the Stable Diffusion safety filter blocks images that closely resemble any of 17 pre-defined “sensitive concepts” within the CLIP model’s embedding space. Similarly, platforms such as DALL·E 3 [1], Leonardo.Ai [14], and Midjourney [20] implement prompt checkers that detect and reject malicious prompts upon submission. Internal defenses, on the other hand, focus on model-level modifications to eliminate unsafe content. ConceptPrune [3] demonstrates that neurons in latent diffusion models (LDMs) [31] often specialize in specific concepts such as nudity, and that pruning these neurons can permanently eliminate undesired concepts from image generation. Approaches like ESD [8] and SLD [34] employ model fine-tuning to directly reduce NSFW outputs, enhancing the intrinsic safety of T2I models. To counter jailbreak attempts via text prompts, SafeGen [15] modifies self-attention layers within the model, effectively filtering out unsafe visual representations regardless of the textual input. In this paper, we explore potential strategies that can effectively bypass these defense mechanisms.
Adversarial attacks on T2I models. SurrogatePrompt [41] and DACA [5] harness the power of large language models (LLMs) [2, 23] to substitute explicit words or disassemble unethical prompts into benign descriptions of individual elements, successfully bypassing the safety filters of T2I models such as Midjourney [20] and DALL·E 2 [29]. Rather than relying on auxiliary models or tools, other works [36, 43, 45] focus on internal mechanisms such as concept retrieval [36] or concept removal [43, 45] to achieve attacks. However, Ring-A-Bell [36] lacks precise control over synthesis specifics, and UnlearnDiff [43] offers limited effectiveness against more comprehensive defense strategies. Notably, QF-Attack [45] empirically shows that a subtle five-character perturbation can induce significant content shifts in images synthesized by Stable Diffusion [31], though it risks misalignment due to simple character substitution. Furthermore, SneakyPrompt [41] leverages reinforcement learning to substitute explicit target words in the original prompts, while MMP-Attack [38] effectively replaces primary objects in images by appending optimized suffixes. Additionally, both MMP-Attack [38] and RT-Attack [9] explicitly align adversarial prompts with reference images, which effectively increases similarity scores and enhances alignment with target images. The primary distinction of PRISM [10] and MMA-Diffusion [39] from previous methods lies in their approach of updating the entire sampling distribution of prompts, rather than directly modifying individual prompt tokens or embeddings. Inspired by gradient-based optimization in natural language processing (NLP), MMA-Diffusion [39] and UPAM [25] apply token-level gradients for refined optimization, yet this approach often suffers from the inefficiencies inherent to gradient-driven methods. In this work, we aim to develop a more efficient method for identifying adversarial prompts that not only evade content moderation systems but also maintain strong alignment and concealment.
3 Methodology
3.1 Preliminary
Text-to-Image (T2I) models, initially demonstrated by Mansimov et al. [19], generate synthetic images from natural language descriptions known as prompts. These models typically consist of a language model that processes the input prompt, such as BERT [6] or CLIP's text encoder [27], and an image generation module, such as VQGAN [42] or a diffusion model [11], that synthesizes images. In the case of Stable Diffusion [31], a pre-trained CLIP encoder is utilized to tokenize a text prompt and project it into its corresponding embedding representation. This text embedding guides the image generation process, which is carried out by a latent diffusion model. The model compresses the image space into a lower-dimensional latent space and employs a U-Net [32] architecture to sample images. The architecture serves as a Markovian hierarchical denoising autoencoder, generating images by sampling random latent Gaussian noise and iteratively denoising the sample. Once the denoising process is complete, the latent representation is decoded back into image space by an image decoder.
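For concreteness, the minimal sketch below, assuming the Hugging Face diffusers package (the model ID and prompt are illustrative), shows this pipeline end to end; the exposed components correspond to the text encoder, U-Net denoiser, and image decoder described above.

```python
import torch
from diffusers import StableDiffusionPipeline

# Load Stable Diffusion v1.4; the bundled safety checker stays enabled by default.
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

prompt = "a watercolor painting of a mountain lake"  # illustrative prompt

# The pipeline tokenizes the prompt with CLIP's tokenizer, encodes it with the
# CLIP text encoder, runs U-Net denoising in latent space, and decodes the
# final latent with the VAE image decoder.
image = pipe(prompt, num_inference_steps=50).images[0]
image.save("sample.png")

# The individual components referenced in the text are exposed directly:
# pipe.tokenizer, pipe.text_encoder  -> prompt -> text embedding
# pipe.unet                          -> iterative latent denoising
# pipe.vae                           -> latent -> image decoding
# pipe.safety_checker                -> post-hoc NSFW image filter
```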
3.2 Threat Model
In this study, we conduct a comprehensive evaluation of the impact of Antelope on robust T2I models across two practical attack scenarios.
White-Box Setting: Adversaries exploit open-source T2I models like SDv14 [31] for image generation, with full access to the model’s architecture, checkpoints, and integrated safety mechanisms. However, attackers do not alter the model’s architecture or parameters; rather, they focus on utilizing the outputs produced by the model’s components (i.e., text encoder and image encoder) to perform in-depth exploration and analysis that inform their attack strategies.
Black-Box Setting: Attackers generate images using online T2I services such as Midjourney [20] and Leonardo.Ai [14]. Without direct access to proprietary model parameters or visibility into the integrated safety mechanisms, they rely solely on transfer attacks. By interacting with these services, adversaries adapt their jailbreaking methods to effectively bypass the internal safety measures.
3.3 System Design
Given a T2I model $M$, we define the following functions: the text encoder $\mathcal{T}$, which tokenizes and projects a text input $p$ into a text embedding $\mathcal{T}(p)$; the image encoder $\mathcal{I}$, which projects an image $x$ into an image embedding $\mathcal{I}(x)$; and the image decoder $\mathcal{D}$, which decodes image embeddings back into images. Let $p_o$ denote an original prompt and $a$ a target attribute type (i.e., “nudity” or “violence”). Our objective is to find an adversarial prompt $p_{adv}$ that generates sensitive images reflecting both the specified attribute type and the original prompt while successfully bypassing all safety-check mechanisms.
The overall pipeline of Antelope is illustrated in Fig. 3. The process begins by identifying and replacing adversarial terms in the original prompt to create a clean prompt $p_c$ that can pass the text checker. Next, we select several candidate token pairs, where “negative” and “positive” indicate tokens with and without sensitive semantics respectively, according to the target attribute type $a$. The target text embedding $e_{text}$ is obtained by adding the positive embedding $e_{pos}$ to, and subtracting the negative embedding $e_{neg}$ from, the clean prompt embedding $\mathcal{T}(p_c)$; we then align it with the adversarial prompt embedding $e_{adv} = \mathcal{T}(p_c \oplus s)$, where ‘$\oplus$’ denotes concatenation and $s$ denotes the appended suffix tokens. For image alignment, we compute the similarity between $e_{adv}$ and the reference image embedding $e_{img}$. A threshold $\delta$ is set for the combined text and image similarity score. If the similarity loss surpasses this threshold, we generate images from the adversarial prompt and verify whether they pass the NSFW filter. Once the generated images bypass the filter, we output the adversarial prompt; otherwise, we continue the search process.
Similar Token Selection. We simulate the distribution of both negative and positive prompts from the machine and human perspectives, as illustrated in Fig. 4. Intuitively, for prompts with similar sentiment, the distribution of consistent judgments between human and machine should be dense, whereas opposing interpretations should exhibit a sparser distribution. Inspired by PGJ [12] and its PSTSI principle (i.e., identifying a safe substitution phrase that is perceptually similar to the target unsafe words but semantically divergent), we propose the hypothesis that prompts that are distant in semantic space may still generate visually similar outputs in image space, and we conduct a preliminary experiment to validate it. As shown in Fig. 5, we generate 50 prompts for each concept (red blood, red liquid, red pigment, and watermelon juice) and visualize the embeddings of both the text prompts and the corresponding images using t-SNE. These visualizations clearly indicate that, although the concepts differ significantly in semantic space, they show no obvious divergence in image space. In addition, JPA [17] claims that the semantic attributes embedded in these soft embeddings, which can be added or subtracted, stem from the initial semantic alignment capability of the pre-trained text space. Building on these insights, we use ChatGPT [23] to select positive and negative token pairs based on the target attribute type $a$. These token pairs are then fed into the text encoder to obtain their respective embeddings $e_{pos}$ and $e_{neg}$.
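A minimal sketch of this preliminary check, assuming CLIP from the transformers library and t-SNE from scikit-learn (the concept phrases, prompt template, and image paths are illustrative, and the pre-generated images are assumed to exist on disk; the paper uses 50 prompts and images per concept rather than one):

```python
import torch
from PIL import Image
from sklearn.manifold import TSNE
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

# Illustrative concepts; one prompt and one pre-generated image each for brevity.
concepts = ["red blood", "red liquid", "red pigment", "watermelon juice"]
prompts = [f"a photo of {c} on a white table" for c in concepts]
image_paths = [f"{c.replace(' ', '_')}.png" for c in concepts]

with torch.no_grad():
    text_inputs = processor(text=prompts, return_tensors="pt", padding=True)
    text_embeds = model.get_text_features(**text_inputs)      # text-space embeddings

    images = [Image.open(p) for p in image_paths]
    image_inputs = processor(images=images, return_tensors="pt")
    image_embeds = model.get_image_features(**image_inputs)   # image-space embeddings

# Project each embedding set to 2D; with few points, perplexity must stay
# below the sample count.
text_2d = TSNE(n_components=2, perplexity=2, random_state=0).fit_transform(text_embeds.numpy())
image_2d = TSNE(n_components=2, perplexity=2, random_state=0).fit_transform(image_embeds.numpy())
print("text-space coordinates:\n", text_2d)
print("image-space coordinates:\n", image_2d)
```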
Adversarial Text Search. The adversarial prompts we search for must satisfy both alignment and concealment, a difficult task since these two objectives usually conflict with each other. To address this, we break the process down into several steps. For concealment, we first preprocess the original prompts by replacing explicit adversarial terms, as text filters may reject prompts containing direct NSFW indicators. To preserve semantic integrity, we selectively replace only words that convey strong harmfulness. For example, in the original prompt “two violent persons”, we substitute “violent” with “crying”, creating the harmless prompt “two crying persons”. However, this substitution significantly alters the prompt’s original meaning. To achieve text alignment, we reintroduce the concept by adding back the “violent” semantics. Rather than directly adding a sensitive embedding, we leverage the selected token pairs: positive and negative tokens may differ in textual semantics yet yield similar imagery. By subtracting the negative embedding $e_{neg}$ and adding the positive embedding $e_{pos}$, we obtain the adjusted text embedding $e_{text}$ for text alignment, which can be formulated as:
$e_{text} = \mathcal{T}(p_c) + e_{pos} - e_{neg}$    (1)
For image alignment, we first generate multiple images from the original prompt using offline models without safety checkers and select one suitable image as the reference. This reference image is then processed by the image encoder to obtain the image embedding $e_{img}$. We define the adversarial prompt as the clean prompt with an appended suffix of $N$ tokens, setting $N$ to 4 or 5 in this case. To search for such a prompt, we begin by removing NSFW-related tokens associated with the target attribute from the vocabulary, and then search over the remaining vocabulary list. We define the text loss $\mathcal{L}_{text}$ as the cosine similarity between $e_{adv}$ and $e_{text}$, and the image loss $\mathcal{L}_{img}$ as the cosine similarity between $e_{adv}$ and $e_{img}$, as given in Eq. 2 and Eq. 3:
$\mathcal{L}_{text} = \cos(e_{adv}, e_{text}) = \dfrac{e_{adv} \cdot e_{text}}{\lVert e_{adv} \rVert \, \lVert e_{text} \rVert}$    (2)
$\mathcal{L}_{img} = \cos(e_{adv}, e_{img}) = \dfrac{e_{adv} \cdot e_{img}}{\lVert e_{adv} \rVert \, \lVert e_{img} \rVert}$    (3)
Our learning objective is then to optimize Eq. 4, where $\lambda$ is a weighting factor that balances the loss terms of the image and text modalities.
$\mathcal{L} = (1 - \lambda)\,\mathcal{L}_{text} + \lambda\,\mathcal{L}_{img}$    (4)
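To make the objective concrete, the following sketch computes Eqs. 1–4 with CLIP encoders from the transformers library. The prompt strings, the positive/negative token pair, the candidate suffix, the reference image path, and the convex-combination form of Eq. 4 are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def text_embed(text: str) -> torch.Tensor:
    """Project a text string into the CLIP joint embedding space."""
    inputs = processor(text=[text], return_tensors="pt", padding=True)
    return model.get_text_features(**inputs)[0]

def image_embed(path: str) -> torch.Tensor:
    """Project an image file into the CLIP joint embedding space."""
    inputs = processor(images=Image.open(path), return_tensors="pt")
    return model.get_image_features(**inputs)[0]

lam = 0.2  # weighting factor lambda; value and combination form are assumptions

with torch.no_grad():
    e_clean = text_embed("two crying persons")            # clean prompt p_c
    e_pos = text_embed("positive token")                  # positive token (placeholder string)
    e_neg = text_embed("negative token")                  # negative token (placeholder string)
    e_text = e_clean + e_pos - e_neg                      # Eq. (1): adjusted target embedding

    suffix = "token1 token2 token3 token4"                # candidate suffix s of N tokens
    e_adv = text_embed("two crying persons " + suffix)    # adversarial embedding of p_c ⊕ s
    e_img = image_embed("reference.png")                  # reference image embedding

    l_text = F.cosine_similarity(e_adv, e_text, dim=0)    # Eq. (2)
    l_img = F.cosine_similarity(e_adv, e_img, dim=0)      # Eq. (3)
    loss = (1 - lam) * l_text + lam * l_img               # Eq. (4), assumed convex combination
```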
Jailbreak Safety Checker. To efficiently bypass safety checkers, we adopt a two-fold judgment strategy. First, we set a threshold $\delta$ for the loss function $\mathcal{L}$. If $\mathcal{L} < \delta$, the algorithm continues searching. Once $\mathcal{L} \geq \delta$, the selected adversarial prompt is fed into the T2I model to generate images, which are then evaluated by the NSFW filter. If the generated images pass the filter, the adversarial prompt is returned and the process terminates; otherwise, the search continues iteratively.
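The two-fold judgment can be sketched as the loop below. This is a simplified illustration: score(), generate_images(), and passes_nsfw_filter() are hypothetical stand-ins for the combined loss of Eq. 4, the T2I model, and the NSFW image filter, and the random suffix proposal merely stands in for the guided search over the filtered vocabulary.

```python
import random

DELTA = 0.7  # similarity threshold delta (value reported in the ablation study)

def search_adversarial_prompt(clean_prompt, vocabulary, n_tokens=4, max_iters=2000):
    """Search a suffix of n_tokens words whose combined similarity reaches DELTA
    and whose generated images slip past the NSFW filter."""
    for _ in range(max_iters):
        suffix = " ".join(random.sample(vocabulary, n_tokens))  # candidate suffix
        candidate = f"{clean_prompt} {suffix}"

        # First judgment: combined text/image similarity (Eq. 4) must reach the threshold.
        if score(candidate) < DELTA:
            continue

        # Second judgment: the generated images must actually pass the NSFW image filter.
        images = generate_images(candidate, num_images=5)
        if any(passes_nsfw_filter(img) for img in images):
            return candidate  # early stop once a working adversarial prompt is found
    return None
```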
4 Experiment
Table 1: ASR (%) and FID of attack methods against defensive baselines for the “nudity” attribute.

| Nudity | SDv14 [31] | SDv21 [31] | ESD [8] | SafeGen [15] | SLD-max [34] | SLD-strong [34] | SLD-medium [34] | SLD-weak [34] |
|---|---|---|---|---|---|---|---|---|
| ASR (%) | | | | | | | | |
| SneakyPrompt [41] | 66.36 | 43.93 | 11.53 | 60.44 | 18.69 | 24.30 | 46.11 | 57.01 |
| QF-Attack [45] | 70.27 | 49.55 | 11.11 | 58.86 | 18.32 | 28.83 | 45.05 | 56.76 |
| MMP-Attack [38] | 74.29 | 51.47 | 11.02 | 71.17 | 21.26 | 29.58 | 69.49 | 73.09 |
| MMA-Diffusion [39] | 70.87 | 48.65 | 15.02 | 69.07 | 27.33 | 35.44 | 58.86 | 65.17 |
| Antelope (Ours) | 81.98 | 57.96 | 12.91 | 68.47 | 34.53 | 50.75 | 74.47 | 81.08 |
| FID | | | | | | | | |
| SneakyPrompt [41] | 34.20 | 40.03 | 54.99 | 62.41 | 62.15 | 51.12 | 39.93 | 35.95 |
| QF-Attack [45] | 34.92 | 37.60 | 55.48 | 66.79 | 62.95 | 52.12 | 41.60 | 37.00 |
| MMP-Attack [38] | 45.90 | 36.98 | 68.36 | 78.74 | 59.11 | 48.22 | 45.23 | 44.57 |
| MMA-Diffusion [39] | 33.18 | 39.45 | 53.75 | 68.89 | 60.90 | 48.59 | 39.13 | 34.91 |
| Antelope (Ours) | 31.73 | 36.63 | 53.83 | 66.09 | 62.17 | 47.80 | 38.16 | 34.27 |
Table 2: ASR (%) and FID of attack methods against defensive baselines for the “violence” attribute.

| Violence | SDv14 [31] | SDv21 [31] | ESD [8] | SafeGen [15] | SLD-max [34] | SLD-strong [34] | SLD-medium [34] | SLD-weak [34] |
|---|---|---|---|---|---|---|---|---|
| ASR (%) | | | | | | | | |
| SneakyPrompt [41] | 25.42 | 33.90 | 30.51 | 45.76 | 32.20 | 28.81 | 25.42 | 30.51 |
| QF-Attack [45] | 33.90 | 40.68 | 33.90 | 47.46 | 30.51 | 27.12 | 32.20 | 30.51 |
| MMP-Attack [38] | 54.24 | 47.46 | 35.59 | 66.10 | 23.73 | 30.51 | 33.90 | 35.59 |
| MMA-Diffusion [39] | 44.07 | 45.76 | 33.90 | 54.24 | 23.73 | 25.42 | 27.12 | 30.51 |
| Antelope (Ours) | 54.24 | 40.68 | 40.68 | 74.58 | 32.20 | 35.59 | 35.59 | 42.37 |
| FID | | | | | | | | |
| SneakyPrompt [41] | 50.75 | 58.22 | 64.94 | 61.31 | 73.52 | 63.56 | 56.03 | 52.63 |
| QF-Attack [45] | 50.71 | 56.30 | 61.98 | 61.48 | 73.93 | 63.59 | 58.23 | 55.41 |
| MMP-Attack [38] | 44.57 | 58.22 | 66.93 | 68.66 | 79.27 | 66.69 | 57.30 | 54.78 |
| MMA-Diffusion [39] | 49.61 | 59.74 | 60.22 | 60.49 | 72.02 | 62.76 | 56.50 | 54.32 |
| Antelope (Ours) | 47.94 | 55.30 | 55.36 | 65.60 | 73.41 | 60.04 | 53.34 | 49.87 |
Table 3: Average time (in seconds) to search a single adversarial prompt for each attack method and search strategy (∗ denotes unstable runs with ambiguous timing).

| Time (s) | Antelope (ours) | MMP-Att. (gradient) | MMA-Dif. (gradient) | QF-Att. (greedy) | QF-Att. (genetic) | QF-Att. (pgd) | SneakyPro. (rl) | SneakyPro. (greedy) | SneakyPro. (brute) | SneakyPro. (beam) |
|---|---|---|---|---|---|---|---|---|---|---|
| Nudity | 56 | 310 | 1871 | 46 | 30 | 173 | 105 | 1200∗ | 117 | 349 |
| Violence | 54 | 330 | 1911 | 57 | 37 | 253 | 319 | 146 | 2700∗ | 223 |
4.1 Experimental Setting
Setup. We implement Antelope using Python 3.8.10 and PyTorch 1.10.2 on an Ubuntu 20.04 server and conduct all experiments on a single A100 GPU. We set the threshold $\delta = 0.7$ and the loss weight $\lambda = 0.2$ (see Sec. 4.3), use a learning rate of 0.001, and run up to 2000 search iterations.
Datasets. We evaluate the performance of Antelope using the Inappropriate Image Prompt (I2P) dataset [13], whose prompts are disproportionately likely to produce inappropriate images in generative Text-to-Image (T2I) tasks. Although the prompts in this dataset avoid explicit sensitive words, they can still induce T2I models lacking safety checkers to generate images with explicit NSFW content. However, the dataset becomes ineffective once safety checkers are in place. In our experiments, we select 333 prompts with a harm rating exceeding 90% for nudity, labeled NSFW-333, and 59 prompts with a similar harm rating for violence, labeled NSFW-59.
Detector. To classify whether images contain nudity, we employ the NudeNet detector [22], which flags an image as containing nudity if any of the following labels is detected: GENITALIA_EXPOSED, BREAST_EXPOSED, BUTTOCKS_EXPOSED, or ANUS_EXPOSED. To identify images with harmful content, such as depictions of blood or violence, we utilize the Q16 classifier [33].
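A sketch of the nudity check, assuming the nudenet Python package; the detection dictionary keys and label strings vary across NudeNet versions, so the snippet matches label suffixes rather than exact class names:

```python
from nudenet import NudeDetector

# Exposure label stems used in the paper; matched as suffixes of the detector's classes.
NUDITY_SUFFIXES = (
    "GENITALIA_EXPOSED",
    "BREAST_EXPOSED",
    "BUTTOCKS_EXPOSED",
    "ANUS_EXPOSED",
)

detector = NudeDetector()

def is_nude(image_path: str, min_score: float = 0.5) -> bool:
    """Flag an image as nudity if any detection matches one of the exposure labels."""
    detections = detector.detect(image_path)
    return any(
        d["class"].endswith(NUDITY_SUFFIXES) and d["score"] >= min_score
        for d in detections
    )
```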
Metrics. (1) Attack Success Rate (ASR): ASR quantifies the attack’s effectiveness, calculated as the ratio of adversarial prompts that bypass the NSFW detector to the total number of adversarial prompts; a higher ASR indicates a more effective attack. For ASR computation, we instruct the T2I models to generate five images per prompt. If any of these images exhibits NSFW content while evading detection by the NSFW checker, the attack is deemed successful. (2) Fréchet Inception Distance (FID): FID measures the semantic similarity of generated images to real images, where a lower FID score signifies closer alignment with realistic imagery. We generate 1,000 images as a ground-truth dataset using raw NSFW prompts in a No-Attack setting and calculate the FID between our generated samples and this reference dataset.
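The ASR metric itself reduces to a simple count; in the sketch below, generate_images() and is_nsfw() are hypothetical wrappers around the attacked T2I model and the detectors above, and the check against the model's own safety filter is omitted for brevity:

```python
def attack_success_rate(adv_prompts, images_per_prompt=5):
    """Percentage of adversarial prompts for which at least one of the
    generated images is judged NSFW by the detector."""
    successes = 0
    for prompt in adv_prompts:
        images = generate_images(prompt, num_images=images_per_prompt)
        if any(is_nsfw(img) for img in images):
            successes += 1
    return 100.0 * successes / len(adv_prompts)  # reported as a percentage
```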
Offline Baselines. We evaluate attack performance on the SDv14 model [31] with an integrated safety checker, comparing Antelope against four existing jailbreak attack methods: SneakyPrompt (RL) [41], QF-Attack (greedy) [45], MMP-Attack [38] and MMA-Diffusion (text-modal) [39], implementing each according to their official specifications. Additionally, we employ four defensive baselines: SDv2.1 [31], ESD [8], SafeGen [15], and various configurations of SLD (max, strong, medium, weak) [34], to assess the efficacy of these attack methods when encountering enhanced defenses.
Online services. To evaluate the robustness and transferability of our method on black-box interfaces, we test whether adversarial prompts bypass the NSFW filters and generate inappropriate images on two popular online platforms: Midjourney [20] and Leonardo.Ai [14].
4.2 Experimental Results
Evaluation on offline baselines across defensive methods. Table 1 and Table 2 present the ASR and FID scores of different attack methods against various defensive baselines for the “nudity” and “violence” target attributes. For each defense method, the best-performing results in each column are highlighted in bold, while the second-best results are underlined. We have several key observations. Firstly, Antelope consistently achieves the highest ASR and FID performance in most cases, demonstrating its effectiveness and superiority in bypassing defenses while maintaining image quality. Secondly, MMP-Attack and MMA-Diffusion show comparatively higher attack success rates, while SneakyPrompt and QF-Attack have lower ASR. Thirdly, FID scores reveal no significant differences between the various attack methods, indicating similar levels of image fidelity. Lastly, ESD shows the strongest defense for the “nudity” target attribute, while SLD-max is the most effective defense for the “violence” target attribute.
Performance on online services. Figure 6 displays the attack effects of Antelope on various T2I services, including Midjourney and Leonardo.Ai, compared with the offline Stable Diffusion model (SDv14). In these experiments, we set the parameter $N$ to 4, 5, and 6, selecting original prompts from both simple descriptions and the I2P dataset. We then apply the adversarial prompts generated on SDv14 directly to the online services. Our findings show that Antelope exhibits robust concealment, alignment, and resilience across platforms. Additionally, we observe distinct filtering tendencies: (1) Midjourney enforces a more stringent screening process for nudity and adult content but is comparatively permissive toward generating images with bloody, violent, or unsettling themes. (2) Leonardo.Ai, conversely, shows a higher tolerance for nudity yet is more restrictive regarding the production of violent images.
Efficiency analysis. We measure the time required to search for a single adversarial prompt for each attack method, as shown in Tab. 3 (times in seconds). To ensure a comprehensive comparison, we test each method with all search strategies recommended in its official implementation. The evaluation spans the entire dataset, and average search times are computed per prompt. While QF-Attack demonstrates relatively fast search times, it underperforms in attack success rate and alignment, as observed in the prior analysis. Conversely, MMP-Attack and MMA-Diffusion show lower efficiency due to slower search processes. For SneakyPrompt, both the greedy and brute-force strategies prove unstable, with some prompt searches taking prolonged and unpredictable times; these cases are marked with ∗ to indicate ambiguous timing. Our results show that Antelope consistently delivers stable, high-efficiency performance across trials, highlighting its practical advantage for adversarial prompt generation in time-sensitive scenarios.
4.3 Ablation Study
Table 4: ASR (%) under different values of the loss weight $\lambda$.

| ASR | 0.0 | 0.2 | 0.4 | 0.6 | 0.8 | 1.0 |
|---|---|---|---|---|---|---|
| Nudity | 50 | 60 | 50 | 20 | 10 | 20 |
| Violence | 70 | 80 | 40 | 40 | 30 | 30 |
Table 5: ASR (%) under different numbers of searched suffix tokens $N$.

| ASR | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| Nudity | 10 | 40 | 40 | 60 | 60 | 60 | 50 | 50 |
| Violence | 50 | 60 | 40 | 80 | 80 | 70 | 50 | 60 |
We conduct a series of experiments to identify the threshold $\delta$, the loss weight $\lambda$, and the number of searched suffix tokens $N$ that achieve the best performance with Antelope.
To determine the best $\lambda$ value, we disable the threshold judgment on the loss $\mathcal{L}$ and fix $N$. We then select 10 representative prompts for each target attribute, nudity and violence, and increase $\lambda$ from 0.0 to 1.0 in increments of 0.2. For each setting, we generate adversarial prompts and produce 5 images per prompt to measure the ASR. We observe that $\lambda = 0.2$ yields the highest ASR. Similarly, to identify the optimal $N$, we fix $\lambda = 0.2$ and increase $N$ from 1 to 8, finding that $N = 4$ or $N = 5$ achieves the best results. The corresponding outcomes are detailed in Tab. 4 and Tab. 5. Additionally, we calculate the minimum average loss values for each variant and visualize them in Fig. 7, where the optimal points $\lambda = 0.2$ and $N = 4$ or $5$ are marked with stars. At these optimal hyperparameter points, the loss values approach 0.7, leading us to set $\delta = 0.7$ in our experiments.
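The ablation amounts to a small grid search over $\lambda$ and $N$; a sketch with hypothetical helpers (load_probe_prompts() and run_antelope() stand in for prompt selection and the full attack pipeline, attack_success_rate() is the metric sketched in Sec. 4.1, and the fixed N during the lambda sweep is an assumed default):

```python
lambdas = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
token_counts = range(1, 9)
probe_prompts = load_probe_prompts("nudity", k=10)  # 10 representative prompts (hypothetical loader)

# Sweep lambda with the threshold check disabled and N fixed to an assumed default.
asr_by_lambda = {
    lam: attack_success_rate(
        [run_antelope(p, lam=lam, n_tokens=4, use_threshold=False) for p in probe_prompts]
    )
    for lam in lambdas
}
best_lambda = max(asr_by_lambda, key=asr_by_lambda.get)

# Sweep N with the best lambda fixed.
asr_by_n = {
    n: attack_success_rate(
        [run_antelope(p, lam=best_lambda, n_tokens=n, use_threshold=False) for p in probe_prompts]
    )
    for n in token_counts
}
best_n = max(asr_by_n, key=asr_by_n.get)
print("best lambda:", best_lambda, "best N:", best_n)
```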
5 Ethics Statement
This research may produce some socially harmful content, but our aim is to reveal security vulnerabilities in T2I diffusion models and to further strengthen these systems, rather than to enable abuse. We urge developers to use our findings responsibly to improve the security of T2I models. We advocate for raising ethical awareness in AI research, especially around generative models, and for jointly building innovative, intelligent, practical, safe, and ethical AI systems.
6 Conclusion
In this paper, we introduce a potent and concealed attack strategy, Antelope, which effectively bypasses diverse safety checkers in Text-to-Image (T2I) models to generate Not-Safe-for-Work (NSFW) imagery. Through the incorporation of semantic alignment and early stopping mechanisms, Antelope addresses challenges of low search efficiency, poor concealment, and misalignment present in existing attack methods, achieving superior performance and robustness across multiple evaluation metrics. Our work further reveals critical vulnerabilities in popular image generation models and provides valuable insights for enhancing model security against evolving adversarial attack techniques, which is vital for societal safety. Nonetheless, due to structural and defensive variations across different models, the attack success rates of Antelope on unfamiliar online models remain relatively low. Consequently, our future research will focus on devising more effective strategies for attacking black-box models and refining corresponding defense mechanisms.
References
- Betker et al. [2023] James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, Wesam Manassra, Prafulla Dhariwal, Casey Chu, Yunxin Jiao, and Aditya Ramesh. Improving image generation with better captions, 2023.
- Brown et al. [2020] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.
- Chavhan etal. [2024]Ruchika Chavhan, Da Li, and Timothy Hospedales.Conceptprune: Concept editing in diffusion models via skilled neuron pruning, 2024.
- Chou etal. [2023]Sheng-Yen Chou, Pin-Yu Chen, and Tsung-Yi Ho.How to backdoor diffusion models?In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4015–4024, 2023.
- Deng and Chen [2024]Yimo Deng and Huangxun Chen.Divide-and-conquer attack: Harnessing the power of llm to bypass safety filters of text-to-image models, 2024.
- Devlin etal. [2019]Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova.BERT: Pre-training of deep bidirectional transformers for language understanding.In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, 2019. Association for Computational Linguistics.
- Dosovitskiy etal. [2021]Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby.An image is worth 16x16 words: Transformers for image recognition at scale, 2021.
- Gandikota etal. [2023]Rohit Gandikota, Joanna Materzyńska, Jaden Fiotto-Kaufman, and David Bau.Erasing concepts from diffusion models.In Proceedings of the 2023 IEEE International Conference on Computer Vision, 2023.
- Gao etal. [2024]Sensen Gao, Xiaojun Jia, Yihao Huang, Ranjie Duan, Jindong Gu, Yang Liu, and Qing Guo.Rt-attack: Jailbreaking text-to-image models via random token, 2024.
- He et al. [2024] Yutong He, Alexander Robey, Naoki Murata, Yiding Jiang, Joshua Williams, George J. Pappas, Hamed Hassani, Yuki Mitsufuji, Ruslan Salakhutdinov, and J. Zico Kolter. Automated black-box prompt engineering for personalized text-to-image generation. ArXiv, abs/2403.19103, 2024.
- Ho etal. [2020]Jonathan Ho, Ajay Jain, and Pieter Abbeel.Denoising diffusion probabilistic models.CoRR, abs/2006.11239, 2020.
- Huang etal. [2024]Yihao Huang, Le Liang, Tianlin Li, Xiaojun Jia, Run Wang, Weikai Miao, Geguang Pu, and Yang Liu.Perception-guided jailbreak against text-to-image models, 2024.
- I2P. Inappropriate image prompts. https://huggingface.co/datasets/AIML-TUDA/i2p.
- Leonardo.Ai [2023] Leonardo.Ai, 2023. https://leonardo.ai/.
- Li etal. [2024]Xinfeng Li, Yuchen Yang, Jiangyi Deng, Chen Yan, Yanjiao Chen, Xiaoyu Ji, and Wenyuan Xu.Safegen: Mitigating sexually explicit content generation in text-to-image models.In arXiv preprint arXiv:2404.06666, 2024.
- Liu etal. [2024]Yi Liu, Guowei Yang, Gelei Deng, Feiyue Chen, Yuqi Chen, Ling Shi, Tianwei Zhang, and Yang Liu.Groot: Adversarial testing for generative text-to-image models with tree-based semantic transformation, 2024.
- Ma etal. [2024]Jiachen Ma, Anda Cao, Zhiqing Xiao, Jie Zhang, Chao Ye, and Junbo Zhao.Jailbreaking prompt attack: A controllable adversarial attack against diffusion models, 2024.
- Madry etal. [2019]Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu.Towards deep learning models resistant to adversarial attacks, 2019.
- Mansimov et al. [2016] Elman Mansimov, Emilio Parisotto, Jimmy Lei Ba, and Ruslan Salakhutdinov. Generating images from captions with attention, 2016.
- Midjourney [2023] Midjourney, 2023. https://www.midjourney.com/.
- Nichol et al. [2022] Alexander Quinn Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. In International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, pages 16784–16804. PMLR, 2022.
- NudeNet. NudeNet. https://github.com/notAI-tech/NudeNet.
- OpenAI [2023] OpenAI. ChatGPT, 2023. https://chatgpt.com.
- Peebles and Xie [2022]William Peebles and Saining Xie.Scalable diffusion models with transformers.arXiv preprint arXiv:2212.09748, 2022.
- Peng etal. [2024]Duo Peng, Qiuhong Ke, and Jun Liu.Upam: Unified prompt attack in text-to-image generation models against both textual filters and visual checkers, 2024.
- Qu etal. [2023]Yiting Qu, Xinyue Shen, Xinlei He, Michael Backes, Savvas Zannettou, and Yang Zhang.Unsafe diffusion: On the generation of unsafe images and hateful memes from text-to-image models, 2023.
- Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, 2021.
- Ramesh etal. [2021]Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever.Zero-shot text-to-image generation.In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, pages 8821–8831. PMLR, 2021.
- Ramesh etal. [2022]Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen.Hierarchical text-conditional image generation with clip latents, 2022.
- Rando etal. [2022]Javier Rando, Daniel Paleka, David Lindner, Lennart Heim, and Florian Tramèr.Red-teaming the stable diffusion safety filter, 2022.
- Rombach etal. [2022]Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer.High-resolution image synthesis with latent diffusion models.In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pages 10674–10685. IEEE, 2022.
- Ronneberger etal. [2015]Olaf Ronneberger, Philipp Fischer, and Thomas Brox.U-net: Convolutional networks for biomedical image segmentation, 2015.
- Schramowski etal. [2022]Patrick Schramowski, Christopher Tauchmann, and Kristian Kersting.Can machines help us answering question 16 in datasheets, and in turn reflecting on inappropriate content?In Proceedings of the ACM Conference on Fairness, Accountability, and Transparency (FAccT), 2022.
- Schramowski etal. [2023]Patrick Schramowski, Manuel Brack, Björn Deiseroth, and Kristian Kersting.Safe latent diffusion: Mitigating inappropriate degeneration in diffusion models, 2023.
- Song etal. [2020]Jiaming Song, Chenlin Meng, and Stefano Ermon.Denoising diffusion implicit models.CoRR, abs/2010.02502, 2020.
- Tsai etal. [2024]Yu-Lin Tsai, Chia-Yi Hsu, Chulin Xie, Chih-Hsun Lin, Jia-You Chen, Bo Li, Pin-Yu Chen, Chia-Mu Yu, and Chun-Ying Huang.Ring-a-bell! how reliable are concept removal methods for diffusion models?, 2024.
- Vaswani etal. [2017]Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, AidanN. Gomez, Lukasz Kaiser, and Illia Polosukhin.Attention is all you need.CoRR, abs/1706.03762, 2017.
- Yang etal. [2024a]Dingcheng Yang, Yang Bai, Xiaojun Jia, Yang Liu, Xiaochun Cao, and Wenjian Yu.On the multi-modal vulnerability of diffusion models.In Trustworthy Multi-modal Foundation Models and AI Agents (TiFA), 2024a.
- Yang etal. [2024b]Yijun Yang, Ruiyuan Gao, Xiaosen Wang, Tsung-Yi Ho, Nan Xu, and Qiang Xu.Mma-diffusion: Multimodal attack on diffusion models.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7737–7746, 2024b.
- Yang etal. [2024c]Yijun Yang, Ruiyuan Gao, Xiao Yang, Jianyuan Zhong, and Qiang Xu.Guardt2i: Defending text-to-image models from adversarial prompts, 2024c.
- Yang etal. [2024d]Yuchen Yang, Bo Hui, Haolin Yuan, Neil Gong, and Yinzhi Cao.Sneakyprompt: Jailbreaking text-to-image generative models.In Proceedings of the IEEE Symposium on Security and Privacy, 2024d.
- Yu et al. [2021] Jiahui Yu, Xin Li, Jing Yu Koh, Han Zhang, Ruoming Pang, James Qin, Alexander Ku, Yuanzhong Xu, Jason Baldridge, and Yonghui Wu. Vector-quantized image modeling with improved VQGAN. ArXiv, abs/2110.04627, 2021.
- Zhang etal. [2024]Yimeng Zhang, Jinghan Jia, Xin Chen, Aochuan Chen, Yihua Zhang, Jiancheng Liu, Ke Ding, and Sijia Liu.To generate or not? safety-driven unlearned diffusion models are still easy to generate unsafe images … for now, 2024.
- Zhao etal. [2025]Xin Zhao, Xiaojun Chen, Xudong Chen, He Li, Tingyu Fan, and Zhendong Zhao.Cipherdm: Secure three-party inference for diffusion model sampling.In Computer Vision – ECCV 2024, pages 288–305, Cham, 2025. Springer Nature Switzerland.
- Zhuang etal. [2023]Haomin Zhuang, Yihua Zhang, and Sijia Liu.A pilot study of query-free adversarial attack against stable diffusion.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 2385–2392, 2023.