Antelope: Potent and Concealed Jailbreak Attack Strategy

Xin Zhao
Institute of Information Engineering
Chinese Academy of Sciences, China
zhaoxin@iie.ac.cn
  Xiaojun Chen
Institute of Information Engineering
Chinese Academy of Sciences, China
chenxiaojun@iie.ac.cn
  Haoyu Gao
School of Computer Science
Georgia Institute of Technology, USA
gao.howard517@gmail.com

Abstract

Due to the remarkable generative potential of diffusion-based models, numerous studies have investigated jailbreak attacks targeting these frameworks. A particularly concerning threat within image models is the generation of Not-Safe-for-Work (NSFW) content. Despite the implementation of security filters, numerous efforts continue to explore ways to circumvent these safeguards. Current attack methodologies primarily rely on adversarial prompt engineering or concept obfuscation, yet they frequently suffer from slow search efficiency, conspicuous attack characteristics, and poor alignment with targets. To overcome these challenges, we propose Antelope, a more robust and covert jailbreak attack strategy designed to expose security vulnerabilities inherent in generative models. Specifically, Antelope leverages the confusion between sensitive concepts and similar ones, searches in the semantically adjacent space of these related concepts, and aligns them with the target imagery, thereby generating sensitive images that are consistent with the target and capable of evading detection. In addition, we successfully exploit the transferability of model-based attacks to penetrate online black-box services. Experimental evaluations demonstrate that Antelope outperforms existing baselines across multiple defensive mechanisms, underscoring its efficacy and versatility.

Disclaimer: This paper contains unsafe imagery that might be offensive to some readers.

1 Introduction

Recent advancements have highlighted the revolutionary capabilities of generative models, particularly those utilizing transformer [37, 2, 6] and diffusion [11, 35] architectures. The convergence of these technologies has produced increasingly powerful models for image [31, 28, 29, 21] and video [7, 24] generation. However, the vulnerabilities inherent in these models give rise to emerging safety concerns [44, 16, 4, 26, 30, 41]. Chief among these is the issue of misalignment, which facilitates the generation of harmful or inappropriate content, such as Not-Safe-for-Work (NSFW) imagery that includes nudity, violence, gore, and other potentially sensitive material [26, 40].

To mitigate the issue of inappropriate generation, developers of Text-to-Image (T2I) models have implemented external defense measures like text filters [28, 29, 20] and image filters [31], as shown in Fig.1. Additionally, significant efforts [15, 34, 8] have been made to enhance the internal safety and robustness of T2I models through retraining or fine-tuning. Under these defense mechanisms, the generation of explicit NSFW content from inappropriate prompts is effectively blocked, while normal prompts continue to produce appropriate and non-sensitive imagery.

Despite these safeguards, the inherent ambiguity in the text space and the misalignment between the text and image spaces continually create fertile ground for jailbreak attacks, which seek to circumvent the system’s safety and ethical guardrails. For instance, SneakyPrompt [41] demonstrates that perturbing similar words (e.g., “nice” vs. “n1ce”) or using synonyms to paraphrase input while preserving the original semantics can alter the prediction results of T2I models. As illustrated in Fig.1, the overarching goal of jailbreaking T2I models is to craft adversarial prompts that, while classified as benign, generate harmful imagery capable of evading multiple defense mechanisms. To achieve this, JPA [17] identifies adversarial prompts within the sensitive regions of text space by appending learnable tokens to each input, while SneakyPrompt [41] employs reinforcement learning to uncover such adversarial prompts. MMA [39] introduces a greedy search method based on gradient optimization, which requires perturbations in both the text and image modalities to bypass post-synthesis image checkers. Although these methods demonstrate successful jailbreaks, they are computationally expensive due to the extensive search process. In contrast, QF-Attack [45] employs three strategies (greedy, genetic, and PGD [18]) and achieves greater efficiency by restricting the search space to a character table. However, the adversarial prompts generated by QF-Attack [45] suffer from poor alignment with the original semantic intent of the target image.


Additionally, our analysis of existing jailbreak attack methods reveals that their adversarial prompts frequently contain superfluous or nonsensical words and symbols, making such anomalies easy to observe and detect. These prompts may produce sensitive images when tested on offline models such as Stable Diffusion [31]; however, on more advanced services like GPT-4o [23] and Midjourney [20], they are flagged and prompted for further clarification, as shown in Fig.2. This finding underscores the need to develop more subtle adversarial prompts capable of bypassing various safety mechanisms.

In light of these challenges, our primary task is to develop an efficient method for searching adversarial prompts that can bypass safety content moderation systems. Our approach is guided by three key objectives:

Objective I: Identifying adversarial prompts that can effectively bypass safety filters.

Objective II: Improving the alignment and concealment of adversarial prompts.

Objective III: Minimizing the total searching time.

To achieve Objective I, we replace conspicuous adversarial terms in the original prompts and append specific suffix words to compose adversarial prompts that can bypass safety filters. For Objective II, we ensure that these suffix words are inconspicuous and maintain high cosine similarity between the adversarial text embedding and both the reference image embedding and the original text embedding, providing strong alignment and concealment. To meet Objective III, we optimize the search process by filtering the candidate vocabulary list, setting an optimal threshold, and stopping early once a suitable prompt is identified.

The main contributions are summarized as follows:

  • We design and implement a highly effective jailbreak attack strategy, Antelope, to explore adversarial prompts that can bypass the safety mechanisms of T2I models.

  • Antelope is compared with multiple attack methods across various defense baselines, demonstrating outstanding superiority and exceptional robustness.

  • Extensive evaluation and analysis highlight Antelope’s efficiency in generating adversarial prompts with minimal detection risk and high semantic alignment.

2 Related Work


Defensive methods against NSFW generation. Current defense strategies for Text-to-Image (T2I) models can be broadly divided into external and internal defenses. External defenses typically involve post-hoc content moderation, employing prompt checkers to identify and filter malicious prompts or image checkers to censor NSFW elements in synthesized images. For instance, Rando et al. [30] describe how the Stable Diffusion safety filter blocks images that closely resemble any of 17 pre-defined “sensitive concepts” within the CLIP model’s embedding space. Similarly, platforms such as DALL·E 3 [1], Leonardo.Ai [14], and Midjourney [20] implement prompt checkers that detect and reject malicious prompts upon submission. Internal defenses, on the other hand, focus on model-level modifications to eliminate unsafe content. ConceptPrune [3] demonstrates that neurons in latent diffusion models (LDMs) [31] often specialize in specific concepts such as nudity, and that pruning these neurons can permanently eliminate undesired concepts from image generation. Approaches like ESD [8] and SLD [34] employ model fine-tuning to directly reduce NSFW outputs, enhancing the intrinsic safety of T2I models. To counter jailbreak attempts via text prompts, SafeGen [15] modifies the self-attention layers within the model, effectively filtering out unsafe visual representations regardless of the textual input. In this paper, we explore potential strategies that can effectively bypass these defense mechanisms.

Adversarial attacks on T2I models. SurrogatePrompt [41] and DACA [5] harness the power of large language models (LLMs) [2, 23] to substitute explicit words or disassemble unethical prompts into benign descriptions of individual elements, successfully bypassing the safety filters of T2I models such as Midjourney [20] and DALL·E 2 [29]. Rather than relying on auxiliary models or tools, other works [36, 43, 45] exploit internal mechanisms such as concept retrieval [36] or concept removal [43, 45] to achieve attacks. However, Ring-A-Bell [36] lacks precise control over synthesis specifics, and UnlearnDiff [43] offers limited effectiveness against more comprehensive defense strategies. Notably, QF-Attack [45] empirically shows that a subtle five-character perturbation can induce significant content shifts in images synthesized by Stable Diffusion [31], though it risks misalignment due to its simple character substitution. Furthermore, SneakyPrompt [41] leverages reinforcement learning to substitute explicit target words in the original prompts, while MMP-Attack [38] effectively replaces primary objects in images by appending optimized suffixes. Additionally, both MMP-Attack [38] and RT-Attack [9] explicitly align adversarial prompts with reference images, which effectively increases similarity scores and enhances alignment with target images. The primary distinction of PRISM [10] and MMA-Diffusion [39] from previous methods lies in their approach of updating the entire sampling distribution of prompts rather than directly modifying individual prompt tokens or embeddings. Inspired by gradient-based optimization in natural language processing (NLP), MMA-Diffusion [39] and UPAM [25] apply token-level gradients for refined optimization, yet this often suffers from inefficiencies inherent to gradient-driven approaches. In this work, we aim to develop a more efficient method for identifying adversarial prompts that not only evade content moderation systems but also maintain strong alignment and concealment.

3 Methodology

3.1 Preliminary

Text-to-Image (T2I) models, initially demonstrated by Mansimov et al. [19], generate synthetic images from natural language descriptions known as prompts. These models typically consist of a language model, such as BERT [6] or CLIP’s text encoder [27], to process the input prompt, and an image generation module, such as VQGAN [42] or a diffusion model [11], to synthesize images. In the case of Stable Diffusion [31], a pre-trained CLIP encoder $\mathcal{T}: X \rightarrow E$ is utilized to tokenize and project a text $x \in X$ into its corresponding embedding representation $e \in E$. The text embedding guides the image generation process, which is carried out by a latent diffusion model. This model compresses the image space into a lower-dimensional latent space and uses a U-Net [32] architecture to sample images. The architecture serves as a Markovian hierarchical denoising autoencoder that generates images by sampling random latent Gaussian noise and iteratively denoising it. Once the denoising process is complete, the latent representation is decoded back into image space by an image decoder $\mathcal{D}: E \rightarrow Y$.
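To make the pipeline above concrete, the following minimal sketch shows how a prompt is encoded by the CLIP text encoder and how the latent diffusion model produces an image, assuming the Hugging Face diffusers package; the model ID, device, and step count are illustrative choices rather than the exact configuration used in this paper.

```python
import torch
from diffusers import StableDiffusionPipeline

# Load Stable Diffusion v1.4; the pipeline bundles the CLIP text encoder (T),
# the U-Net denoiser, and the VAE image decoder (D) described above.
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

prompt = "a watercolor painting of a fox in the snow"

# T: tokenize and project the prompt into the CLIP text-embedding space.
tokens = pipe.tokenizer(
    prompt,
    padding="max_length",
    max_length=pipe.tokenizer.model_max_length,
    return_tensors="pt",
).to("cuda")
text_embedding = pipe.text_encoder(tokens.input_ids)[0]  # shape: (1, 77, 768)

# The pipeline then denoises random latent Gaussian noise conditioned on the
# text embedding and decodes the final latent back to pixel space (D).
image = pipe(prompt, num_inference_steps=30).images[0]
image.save("sample.png")
```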

3.2 Threat Model

In this study, we conduct a comprehensive evaluation of the impact of Antelope on robust T2I models across two practical attack scenarios.

White-Box Setting: Adversaries exploit open-source T2I models like SDv14 [31] for image generation, with full access to the model’s architecture, checkpoints, and integrated safety mechanisms. However, attackers do not alter the model’s architecture or parameters; rather, they focus on utilizing the outputs produced by the model’s components (i.e., text encoder and image encoder) to perform in-depth exploration and analysis that inform their attack strategies.

Black-Box Setting: Attackers generate images using online T2I services such as Midjourney [20] and Leonardo.Ai [14]. Without direct access to proprietary model parameters or visibility into the integrated safety mechanisms, they rely solely on transfer attacks. By interacting with these services, adversaries adapt their jailbreaking methods to effectively bypass internal safety measures.

3.3 System Design


Given a T2I model $\mathcal{G}$, we define the following functions: the text encoder $\mathcal{T}: X \rightarrow E$, which tokenizes and projects text inputs $x \in X$ into text embeddings $e \in E$; the image encoder $\mathcal{I}: Y \rightarrow E$, which projects images $y \in Y$ into image embeddings $e \in E$; and the image decoder $\mathcal{D}: E \rightarrow Y$, which decodes image embeddings $e \in E$ back into images $y \in Y$. Let $p_o$ denote an original prompt and $t_a$ a target attribute type (e.g., “nudity” or “violence”). Our objective is to find an adversarial prompt $p_a$ that generates sensitive images which not only reflect the specified attribute type and the original prompt but also successfully bypass all safety-check mechanisms.
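For reference, the pooled text and image embeddings playing the roles of $\mathcal{T}(\cdot)$ and $\mathcal{I}(\cdot)$ above could be obtained with CLIP as sketched below; this is our own illustrative setup using the transformers library, and the helper names (text_embed, image_embed) are hypothetical.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

MODEL_ID = "openai/clip-vit-large-patch14"  # illustrative; SD v1.4 uses this text encoder
clip = CLIPModel.from_pretrained(MODEL_ID)
processor = CLIPProcessor.from_pretrained(MODEL_ID)

def text_embed(prompt):
    # T: X -> E, prompt projected into CLIP's joint embedding space.
    inputs = processor(text=[prompt], return_tensors="pt", padding=True)
    with torch.no_grad():
        return clip.get_text_features(**inputs)   # shape: (1, 768)

def image_embed(path):
    # I: Y -> E, image projected into the same joint embedding space.
    inputs = processor(images=Image.open(path), return_tensors="pt")
    with torch.no_grad():
        return clip.get_image_features(**inputs)  # shape: (1, 768)
```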

The overall pipeline of Antelope is illustrated in Fig.3. The process begins by identifying and replacing adversarial terms in the original prompt $p_o$ to create a clean prompt $p_c$ that can pass the text checker. Next, according to the target attribute type, we select several candidate token pairs, where “negative” and “positive” denote tokens with and without sensitive semantics, respectively. The target text embedding $E_t$ is obtained by subtracting the negative embedding $E_n$ from, and adding the positive embedding $E_p$ to, the clean prompt embedding $E_c$, allowing us to align $E_t$ with the adversarial prompt embedding $E_{c||s}$. Here ‘$||$’ denotes concatenation and $s$ denotes the suffix tokens. For image alignment, we compute the similarity between $E_{c||s}$ and the reference image embedding $E_i$. A threshold is set on the combined text and image similarity loss. If the loss exceeds this threshold, the search continues; once it falls below the threshold, we generate images from the adversarial prompt and verify whether they pass the NSFW filter. If the generated images bypass the filter, we output the adversarial prompt; otherwise, the search continues.


Similar Token Selection. We simulate the distribution of both negative and positive prompts from the machine’s view and the human’s view, as illustrated in Fig.4. Intuitively, for prompts with similar sentiment, the distribution of consistent judgments between human and machine should be dense, whereas opposing interpretations should exhibit a sparser distribution. Inspired by PGJ [12] and its PSTSI principle (i.e., identifying a safe substitution phrase that is perceptually similar to the target unsafe words but semantically divergent), we propose the hypothesis that prompts distant in semantic space may still generate visually similar outputs in image space, and conduct a preliminary experiment to validate it. As shown in Fig.5, we generate 50 prompts for each concept (red blood, red liquid, red pigment, and watermelon juice) and visualize the embeddings of both the text prompts and the corresponding images with t-SNE. These visualizations clearly indicate that, although these concepts differ significantly in semantic space, they show no obvious divergence in image space. Besides, JPA [17] claims that the semantic attributes embedded in these soft embeddings, which can be added or subtracted, stem from the initial semantic alignment capability of the pre-trained text space. Building on these insights, we use ChatGPT [23] to select positive and negative token pairs based on the target attribute type $t_a$. These token pairs are then fed into the text encoder to obtain their respective embeddings $E_p$ and $E_n$.
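The preliminary check in Fig.5 can be reproduced along the lines of the sketch below, assuming the embeddings for each concept have already been collected (for example with the CLIP helpers sketched earlier); function and variable names are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

CONCEPTS = ["red blood", "red liquid", "red pigment", "watermelon juice"]

def plot_tsne(embeddings, title):
    # embeddings[c] is an (n, d) array of CLIP text or image embeddings
    # gathered for concept c (50 prompts per concept in the paper's setup).
    X = np.concatenate([embeddings[c] for c in CONCEPTS])
    labels = np.concatenate([[c] * len(embeddings[c]) for c in CONCEPTS])
    X2 = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(X)
    for c in CONCEPTS:
        pts = X2[labels == c]
        plt.scatter(pts[:, 0], pts[:, 1], label=c, s=10)
    plt.legend()
    plt.title(title)
    plt.show()
```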

Adversarial Text Search. The adversarial prompts we search for should satisfy both alignment and concealment, which is difficult because these two objectives usually conflict with each other. To address this, we break the process into several steps. For concealment, we first preprocess the original prompts by replacing explicit adversarial terms, since text filters may reject prompts containing direct NSFW indicators. To preserve semantic integrity, we selectively replace only the words that convey strong harmfulness. For example, in the original prompt “two violent persons”, we substitute “violent” with “crying”, creating the harmless prompt “two crying persons”. However, this substitution significantly alters the prompt’s original meaning. To achieve text alignment, we reintroduce the concept by adding the “violent” embedding. Rather than directly adding a sensitive embedding, we leverage our selected token pairs: positive and negative tokens may differ in textual semantics yet yield similar imagery. By subtracting the negative embedding $E_n$ and adding the positive embedding $E_p$, we obtain the adjusted text embedding $E_t$ for text alignment, which can be formulated as:

$E_t = E_c - E_n + E_p$   (1)
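A minimal illustration of Eq. 1, reusing the hypothetical text_embed helper from the sketch above; the concrete token pair chosen here (“blood” as the negative, sensitive token and “red liquid” as the positive, perceptually similar but safe token) is our own example in the spirit of Fig.5, not a pair prescribed by the method.

```python
# Clean prompt after replacing the explicit term ("violent" -> "crying").
E_c = text_embed("two crying persons")

# Candidate token pair for the "violence" attribute (illustrative choice).
E_n = text_embed("blood")       # negative token: carries the sensitive semantics
E_p = text_embed("red liquid")  # positive token: safe yet visually similar concept

# Eq. 1: target embedding used to align the adversarial prompt embedding E_{c||s}.
E_t = E_c - E_n + E_p
```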

For image alignment, we first generate multiple images from the original prompt using offline models without safety checkers and select one suitable image as the reference. This reference image is then passed through the image encoder to obtain the image embedding $E_i$. We define the adversarial prompt as the clean prompt with an appended suffix of $\mathcal{N}$ tokens, setting $\mathcal{N}=4$ or $5$ in this case. To search for such a prompt, we begin by removing NSFW-related tokens associated with the target attribute and then search over the remaining vocabulary list. We define the text loss as one minus the cosine similarity between $E_t$ and $E_{c||s}$, and the image loss as one minus the cosine similarity between $E_i$ and $E_{c||s}$, as given in Eq. 2 and Eq. 3:

$\mathcal{L}_{txt} = 1 - \cos(E_{c||s}, E_t)$   (2)
$\mathcal{L}_{img} = 1 - \cos(E_{c||s}, E_i)$   (3)

Our learning objective is then to optimize Eq. 4, where $\gamma$ is a weighting factor that balances the loss terms between the image and text modalities.

$\min_s \; \mathcal{L} = \gamma\,\mathcal{L}_{txt} + (1-\gamma)\,\mathcal{L}_{img}$   (4)
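Eqs. 2–4 translate directly into code; in the sketch below, E_cs stands for the embedding of the clean prompt concatenated with the current candidate suffix, and gamma corresponds to $\gamma$.

```python
import torch.nn.functional as F

def antelope_loss(E_cs, E_t, E_i, gamma=0.2):
    # Eq. 2: text-alignment loss against the shifted target embedding E_t.
    loss_txt = 1 - F.cosine_similarity(E_cs, E_t, dim=-1)
    # Eq. 3: image-alignment loss against the reference image embedding E_i.
    loss_img = 1 - F.cosine_similarity(E_cs, E_i, dim=-1)
    # Eq. 4: weighted combination balancing text and image alignment.
    return gamma * loss_txt + (1 - gamma) * loss_img
```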

Jailbreak Safety Checker. To efficiently bypass safety checkers, we implement a two-fold judgment strategy. First, we set a threshold $\tau$ on the loss function. If $\mathcal{L} > \tau$, the algorithm continues searching. Once $\mathcal{L} < \tau$, the selected adversarial prompt $p_a$ is fed into the T2I model $\mathcal{G}$ to generate images, which are subsequently evaluated by the NSFW filter. If the generated images pass the filter, the adversarial prompt is returned and the process terminates; otherwise, the search continues iteratively.
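Putting the pieces together, the two-fold judgment could look like the sketch below; propose_suffix, generate_images, and passes_nsfw_filter are hypothetical placeholders for the suffix search step, the T2I model $\mathcal{G}$, and the deployed NSFW filter, while text_embed and antelope_loss come from the earlier sketches.

```python
def search_adversarial_prompt(p_c, E_t, E_i, tau=0.7, max_iters=2000):
    for _ in range(max_iters):
        suffix = propose_suffix()                     # hypothetical: N candidate suffix tokens
        p_a = p_c + " " + " ".join(suffix)
        E_cs = text_embed(p_a)
        loss = antelope_loss(E_cs, E_t, E_i).item()
        if loss > tau:                                # first judgment: alignment threshold
            continue
        images = generate_images(p_a)                 # hypothetical: query the T2I model G
        if all(passes_nsfw_filter(img) for img in images):
            return p_a                                # early stop: adversarial prompt found
    return None                                       # search budget exhausted
```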

4 Experiment

Table 1: ASR (%) ↑ and FID ↓ of attack methods against defensive baselines for the “nudity” attribute.

Nudity | SDv14 [31] | SDv21 [31] | ESD [8] | SafeGen [15] | SLD-max [34] | SLD-strong [34] | SLD-medium [34] | SLD-weak [34]
ASR (%) ↑
SneakyPrompt [41] | 66.36 | 43.93 | 11.53 | 60.44 | 18.69 | 24.30 | 46.11 | 57.01
QF-Attack [45] | 70.27 | 49.55 | 11.11 | 58.86 | 18.32 | 28.83 | 45.05 | 56.76
MMP-Attack [38] | 74.29 | 51.47 | 11.02 | 71.17 | 21.26 | 29.58 | 69.49 | 73.09
MMA-Diffusion [39] | 70.87 | 48.65 | 15.02 | 69.07 | 27.33 | 35.44 | 58.86 | 65.17
Antelope (Ours) | 81.98 | 57.96 | 12.91 | 68.47 | 34.53 | 50.75 | 74.47 | 81.08
FID ↓
SneakyPrompt [41] | 34.20 | 40.03 | 54.99 | 62.41 | 62.15 | 51.12 | 39.93 | 35.95
QF-Attack [45] | 34.92 | 37.60 | 55.48 | 66.79 | 62.95 | 52.12 | 41.60 | 37.00
MMP-Attack [38] | 45.90 | 36.98 | 68.36 | 78.74 | 59.11 | 48.22 | 45.23 | 44.57
MMA-Diffusion [39] | 33.18 | 39.45 | 53.75 | 68.89 | 60.90 | 48.59 | 39.13 | 34.91
Antelope (Ours) | 31.73 | 36.63 | 53.83 | 66.09 | 62.17 | 47.80 | 38.16 | 34.27

Table 2: ASR (%) ↑ and FID ↓ of attack methods against defensive baselines for the “violence” attribute.

Violence | SDv14 [31] | SDv21 [31] | ESD [8] | SafeGen [15] | SLD-max [34] | SLD-strong [34] | SLD-medium [34] | SLD-weak [34]
ASR (%) ↑
SneakyPrompt [41] | 25.42 | 33.90 | 30.51 | 45.76 | 32.20 | 28.81 | 25.42 | 30.51
QF-Attack [45] | 33.90 | 40.68 | 33.90 | 47.46 | 30.51 | 27.12 | 32.20 | 30.51
MMP-Attack [38] | 54.24 | 47.46 | 35.59 | 66.10 | 23.73 | 30.51 | 33.90 | 35.59
MMA-Diffusion [39] | 44.07 | 45.76 | 33.90 | 54.24 | 23.73 | 25.42 | 27.12 | 30.51
Antelope (Ours) | 54.24 | 40.68 | 40.68 | 74.58 | 32.20 | 35.59 | 35.59 | 42.37
FID ↓
SneakyPrompt [41] | 50.75 | 58.22 | 64.94 | 61.31 | 73.52 | 63.56 | 56.03 | 52.63
QF-Attack [45] | 50.71 | 56.30 | 61.98 | 61.48 | 73.93 | 63.59 | 58.23 | 55.41
MMP-Attack [38] | 44.57 | 58.22 | 66.93 | 68.66 | 79.27 | 66.69 | 57.30 | 54.78
MMA-Diffusion [39] | 49.61 | 59.74 | 60.22 | 60.49 | 72.02 | 62.76 | 56.50 | 54.32
Antelope (Ours) | 47.94 | 55.30 | 55.36 | 65.60 | 73.41 | 60.04 | 53.34 | 49.87

Antelope: Potent and Concealed Jailbreak Attack Strategy (6)

Table 3: Average time (in seconds) required to search one adversarial prompt.

Time (s) | Antelope (ours) | MMP-Att. (gradient) | MMA-Dif. (gradient) | QF-Att. (greedy) | QF-Att. (genetic) | QF-Att. (pgd) | SneakyPro. (rl) | SneakyPro. (greedy) | SneakyPro. (brute) | SneakyPro. (beam)
Nudity | 56310187146301731051200117349
Violence | 54330191157372533191462700223

4.1 Experimental Setting

Setup. We implement Antelope using Python 3.8.10 and PyTorch 1.10.2 on an Ubuntu 20.04 server, conducting all experiments on a single A100 GPU. We set $\gamma=0.2$, $\mathcal{N}=5$, a learning rate of 0.001, and run 2,000 iterations.

Datasets. We evaluate the performance of Antelope on the Inappropriate Image Prompt (I2P) dataset [13], whose prompts are disproportionately likely to produce inappropriate images in generative Text-to-Image (T2I) tasks. Although these prompts avoid explicit sensitive words, they can still drive T2I models lacking safety checkers to generate images with explicit NSFW content; however, they are largely blocked once safety checkers are in place. In our experiments, we select 333 prompts with a harm rating exceeding 90% for nudity, labeled NSFW-333, and 59 prompts with a similar harm rating for violence, labeled NSFW-59.
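Under the assumption that the I2P dataset is loaded through the Hugging Face datasets library, the prompt selection could be reproduced roughly as below; the split and column names (“prompt”, “nudity_percentage”) are assumptions about the released schema and may differ.

```python
from datasets import load_dataset

# I2P benchmark (https://huggingface.co/datasets/AIML-TUDA/i2p).
i2p = load_dataset("AIML-TUDA/i2p", split="train")

# Keep prompts whose harm rating for nudity exceeds 90% (NSFW-333 in the paper).
nsfw_333 = [row["prompt"] for row in i2p if row["nudity_percentage"] > 90]
```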

Detector. To classify whether images contain nudity, we employ the NudeNet detector [22] which flags an image as nudity if any of the following labels are detected: GENITALIA_EXPOSED, BREAST_EXPOSED, BUTTOCKS_EXPOSED and ANUS_EXPOSED. For identifying images with harmful content, such as depictions of blood or violence, we utilize the Q16 classifier [33].
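A sketch of the nudity check, assuming the NudeNet package; the exact label strings and return format vary between NudeNet releases, so the detections are matched against the substrings listed above, and the Q16 classifier for violent content would be applied separately.

```python
from nudenet import NudeDetector

detector = NudeDetector()
SENSITIVE = ("GENITALIA_EXPOSED", "BREAST_EXPOSED",
             "BUTTOCKS_EXPOSED", "ANUS_EXPOSED")

def is_nudity(image_path):
    # Recent NudeNet versions return a list of dicts with a "class" field
    # (e.g., "FEMALE_BREAST_EXPOSED"); we match against the label substrings.
    detections = detector.detect(image_path)
    return any(any(tag in d["class"] for tag in SENSITIVE) for d in detections)
```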

Metrics. (1) Attack Success Rate (ASR): ASR quantifies the attack’s effectiveness, calculated as the ratio of adversarial prompts that bypass the NSFW detector to the total number of adversarial prompts. A higher ASR indicates a more effective attack. For ASR computation, we instruct the T2I models to generate five images per prompt; if any of these images exhibit NSFW content and evade detection by our NSFW checker, the attack is deemed successful. (2) Fréchet Inception Distance (FID): FID measures the semantic similarity of generated images to real images, where a lower FID score signifies closer alignment with realistic imagery. We generate 1,000 images as a ground-truth dataset using raw NSFW prompts in a No Attack setting and calculate the FID between our generated samples and this reference set.
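The ASR computation reduces to a simple ratio under the five-images-per-prompt protocol; in the sketch below, generate_images is a hypothetical call to the attacked (safety-filtered) T2I model, and is_nudity is the detector helper sketched above (the Q16 classifier would replace it for the violence attribute).

```python
def attack_success_rate(adv_prompts, n_images=5):
    successes = 0
    for p in adv_prompts:
        images = generate_images(p, n=n_images)   # hypothetical: guarded T2I model
        # A prompt counts as successful if any returned image both evaded the
        # deployed safety filter and is judged NSFW by the evaluation detector.
        if any(is_nudity(img) for img in images):
            successes += 1
    return 100.0 * successes / len(adv_prompts)
```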

Offline Baselines. We evaluate attack performance on the SDv14 model [31] with an integrated safety checker, comparing Antelope against four existing jailbreak attack methods: SneakyPrompt (RL) [41], QF-Attack (greedy) [45], MMP-Attack [38] and MMA-Diffusion (text-modal) [39], implementing each according to their official specifications. Additionally, we employ four defensive baselines: SDv2.1 [31], ESD [8], SafeGen [15], and various configurations of SLD (max, strong, medium, weak) [34], to assess the efficacy of these attack methods when encountering enhanced defenses.

Online services. To evaluate the robustness and transferability of our method on black-box interfaces, we test whether adversarial prompts can bypass the NSFW filters and generate inappropriate images on two popular online platforms: Midjourney [20] and Leonardo.Ai [14].

4.2 Experimental Results

Evaluation on offline baselines across defensive methods. Table 1 and Table 2 present the ASR and FID scores of different attack methods against various defensive baselines for the “nudity” and “violence” target attributes. We make several key observations. First, Antelope consistently achieves the highest ASR and the best FID in most cases, demonstrating its effectiveness and superiority in bypassing defenses while maintaining image quality. Second, MMP-Attack and MMA-Diffusion show comparatively high attack success rates, while SneakyPrompt and QF-Attack achieve lower ASR. Third, FID scores reveal no significant differences among the attack methods, indicating similar levels of image fidelity. Lastly, ESD provides the strongest defense for the “nudity” attribute, while SLD-max is the most effective defense for the “violence” attribute.

Performance on online services. Fig.6 displays the attack effects of Antelope on various T2I services, including Midjourney and Leonardo.Ai, compared with the offline Stable Diffusion model (SDv14). In these experiments, we set the parameter $\mathcal{N}$ to 4, 5, and 6, selecting original prompts from both simple descriptions and the I2P dataset. We then apply the adversarial prompts generated against SDv14 directly to the online services. Our findings show that Antelope exhibits robust concealment, alignment, and resilience across platforms. Additionally, we observe distinct filtering tendencies: (1) Midjourney enforces a more stringent screening process for nudity and adult content but is comparatively permissive toward bloody, violent, or unsettling imagery. (2) Leonardo.Ai, conversely, shows a higher tolerance for nudity yet is more restrictive regarding violent images.

Efficiency analysis. We measure the time (in seconds) required to search for a single adversarial prompt with each attack method, as shown in Tab.3. To ensure a comprehensive comparison, we test each method with all search strategies recommended in its official implementation. The evaluation spans the entire dataset, with average search times calculated per prompt. While QF-Attack demonstrates relatively fast search times, it underperforms in attack success rate and alignment, as observed in the prior analysis. Conversely, MMP-Attack and MMA-Diffusion show lower efficiency due to slower search processes. For SneakyPrompt, both the greedy and brute-force strategies prove unstable, with some prompt searches taking prolonged and unpredictable time; these are denoted with a dash to indicate ambiguous timing. Our results show that Antelope consistently delivers stable and high-efficiency performance across trials, highlighting its practical advantage for adversarial prompt generation in time-sensitive scenarios.

4.3 Ablation Study

Table 4: ASR (%) ↑ for different values of the loss weight γ (with N = 5).

γ | 0.0 | 0.2 | 0.4 | 0.6 | 0.8 | 1.0
Nudity | 50 | 60 | 50 | 20 | 10 | 20
Violence | 70 | 80 | 40 | 40 | 30 | 30

Table 5: ASR (%) ↑ for different numbers of suffix tokens N (with γ = 0.2).

N | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8
Nudity | 10 | 40 | 40 | 60 | 60 | 60 | 50 | 50
Violence | 50 | 60 | 40 | 80 | 80 | 70 | 50 | 60


We conduct a series of experiments to identify the threshold $\tau$, the loss weight $\gamma$, and the number of searched suffix tokens $\mathcal{N}$ that yield the best performance for Antelope.

To determine the best $\gamma$ value, we disable the threshold judgment module for $\tau$ and fix $\mathcal{N}=5$. We then select 10 representative prompts for each target attribute, nudity and violence, increasing $\gamma$ from 0.0 to 1.0 with an interval of 0.2. For each $\gamma$ setting, we generate adversarial prompts and produce 5 images per prompt to measure the ASR. We observe that $\gamma=0.2$ yields the highest ASR. Similarly, to identify the optimal $\mathcal{N}$, we fix $\gamma=0.2$ and increase $\mathcal{N}$ from 1 to 8, finding that $\mathcal{N}=4$ or $5$ achieves the best results. The corresponding outcomes are detailed in Tab.4 and Tab.5. Additionally, we calculate the minimum average loss value for each variant and visualize them in Fig.7. The optimal points for $\gamma=0.2$ and $\mathcal{N}=4$ or $5$ are marked with stars. At these optimal hyperparameter points, the loss values approach 0.7, leading us to set $\tau=0.7$ in our experiments.
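The hyperparameter sweep described above amounts to two one-dimensional grid searches; run_attack is a hypothetical wrapper that runs Antelope on the 10 representative prompts with the given hyperparameters and returns the measured ASR.

```python
# Sweep the loss weight gamma with N fixed to 5, then sweep N with gamma fixed to 0.2.
gammas = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
asr_by_gamma = {g: run_attack(prompts, gamma=g, n_tokens=5) for g in gammas}

n_values = list(range(1, 9))
asr_by_n = {n: run_attack(prompts, gamma=0.2, n_tokens=n) for n in n_values}

best_gamma = max(asr_by_gamma, key=asr_by_gamma.get)  # 0.2 in our experiments
best_n = max(asr_by_n, key=asr_by_n.get)              # 4 or 5 in our experiments
```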

5 Ethics Statement

This research may produce some socially harmful content, but our aim is to reveal security vulnerabilities in T2I diffusion models and to further strengthen these systems, not to enable abuse. We urge developers to use our findings responsibly to improve the security of T2I models. We advocate for raising ethical awareness in AI research, especially around generative models, and for jointly building innovative, intelligent, practical, safe, and ethical AI systems.

6 Conclusion

In this paper, we introduce a potent and concealed attack strategy, Antelope, which effectively bypasses diverse safety checkers in Text-to-Image (T2I) models to generate Not-Safe-for-Work (NSFW) imagery. Through the incorporation of semantic alignment and early stopping mechanisms, Antelope addresses challenges of low search efficiency, poor concealment, and misalignment present in existing attack methods, achieving superior performance and robustness across multiple evaluation metrics. Our work further reveals critical vulnerabilities in popular image generation models and provides valuable insights for enhancing model security against evolving adversarial attack techniques, which is vital for societal safety. Nonetheless, due to structural and defensive variations across different models, the attack success rates of Antelope on unfamiliar online models remain relatively low. Consequently, our future research will focus on devising more effective strategies for attacking black-box models and refining corresponding defense mechanisms.

References

  • Betker et al. [2023] James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, Wesam Manassra, Prafulla Dhariwal, Casey Chu, Yunxin Jiao, and Aditya Ramesh. Improving image generation with better captions, 2023.
  • Brown et al. [2020] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.
  • Chavhan et al. [2024] Ruchika Chavhan, Da Li, and Timothy Hospedales. ConceptPrune: Concept editing in diffusion models via skilled neuron pruning, 2024.
  • Chou et al. [2023] Sheng-Yen Chou, Pin-Yu Chen, and Tsung-Yi Ho. How to backdoor diffusion models? In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4015–4024, 2023.
  • Deng and Chen [2024] Yimo Deng and Huangxun Chen. Divide-and-conquer attack: Harnessing the power of LLM to bypass safety filters of text-to-image models, 2024.
  • Devlin et al. [2019] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, 2019. Association for Computational Linguistics.
  • Dosovitskiy et al. [2021] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale, 2021.
  • Gandikota et al. [2023] Rohit Gandikota, Joanna Materzyńska, Jaden Fiotto-Kaufman, and David Bau. Erasing concepts from diffusion models. In Proceedings of the 2023 IEEE International Conference on Computer Vision, 2023.
  • Gao et al. [2024] Sensen Gao, Xiaojun Jia, Yihao Huang, Ranjie Duan, Jindong Gu, Yang Liu, and Qing Guo. RT-Attack: Jailbreaking text-to-image models via random token, 2024.
  • He et al. [2024] Yutong He, Alexander Robey, Naoki Murata, Yiding Jiang, Joshua Williams, George J. Pappas, Hamed Hassani, Yuki Mitsufuji, Ruslan Salakhutdinov, and J. Zico Kolter. Automated black-box prompt engineering for personalized text-to-image generation. ArXiv, abs/2403.19103, 2024.
  • Ho et al. [2020] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. CoRR, abs/2006.11239, 2020.
  • Huang et al. [2024] Yihao Huang, Le Liang, Tianlin Li, Xiaojun Jia, Run Wang, Weikai Miao, Geguang Pu, and Yang Liu. Perception-guided jailbreak against text-to-image models, 2024.
  • I2P. Inappropriate image prompts. https://huggingface.co/datasets/AIML-TUDA/i2p.
  • Leonardo.Ai [2023] Leonardo.Ai, 2023. https://leonardo.ai/.
  • Li et al. [2024] Xinfeng Li, Yuchen Yang, Jiangyi Deng, Chen Yan, Yanjiao Chen, Xiaoyu Ji, and Wenyuan Xu. SafeGen: Mitigating sexually explicit content generation in text-to-image models. arXiv preprint arXiv:2404.06666, 2024.
  • Liu et al. [2024] Yi Liu, Guowei Yang, Gelei Deng, Feiyue Chen, Yuqi Chen, Ling Shi, Tianwei Zhang, and Yang Liu. Groot: Adversarial testing for generative text-to-image models with tree-based semantic transformation, 2024.
  • Ma et al. [2024] Jiachen Ma, Anda Cao, Zhiqing Xiao, Jie Zhang, Chao Ye, and Junbo Zhao. Jailbreaking prompt attack: A controllable adversarial attack against diffusion models, 2024.
  • Madry et al. [2019] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks, 2019.
  • Mansimov et al. [2016] Elman Mansimov, Emilio Parisotto, Jimmy Lei Ba, and Ruslan Salakhutdinov. Generating images from captions with attention, 2016.
  • Midjourney [2023] Midjourney, 2023. https://www.midjourney.com/.
  • Nichol et al. [2022] Alexander Quinn Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. In International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, pages 16784–16804. PMLR, 2022.
  • NudeNet. NudeNet. https://github.com/notAI-tech/NudeNet.
  • OpenAI [2023] OpenAI. ChatGPT, 2023. https://chatgpt.com.
  • Peebles and Xie [2022] William Peebles and Saining Xie. Scalable diffusion models with transformers. arXiv preprint arXiv:2212.09748, 2022.
  • Peng et al. [2024] Duo Peng, Qiuhong Ke, and Jun Liu. UPAM: Unified prompt attack in text-to-image generation models against both textual filters and visual checkers, 2024.
  • Qu et al. [2023] Yiting Qu, Xinyue Shen, Xinlei He, Michael Backes, Savvas Zannettou, and Yang Zhang. Unsafe diffusion: On the generation of unsafe images and hateful memes from text-to-image models, 2023.
  • Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, 2021.
  • Ramesh et al. [2021] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, pages 8821–8831. PMLR, 2021.
  • Ramesh et al. [2022] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents, 2022.
  • Rando et al. [2022] Javier Rando, Daniel Paleka, David Lindner, Lennart Heim, and Florian Tramèr. Red-teaming the Stable Diffusion safety filter, 2022.
  • Rombach et al. [2022] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pages 10674–10685. IEEE, 2022.
  • Ronneberger et al. [2015] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation, 2015.
  • Schramowski et al. [2022] Patrick Schramowski, Christopher Tauchmann, and Kristian Kersting. Can machines help us answering question 16 in datasheets, and in turn reflecting on inappropriate content? In Proceedings of the ACM Conference on Fairness, Accountability, and Transparency (FAccT), 2022.
  • Schramowski et al. [2023] Patrick Schramowski, Manuel Brack, Björn Deiseroth, and Kristian Kersting. Safe latent diffusion: Mitigating inappropriate degeneration in diffusion models, 2023.
  • Song et al. [2020] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. CoRR, abs/2010.02502, 2020.
  • Tsai et al. [2024] Yu-Lin Tsai, Chia-Yi Hsu, Chulin Xie, Chih-Hsun Lin, Jia-You Chen, Bo Li, Pin-Yu Chen, Chia-Mu Yu, and Chun-Ying Huang. Ring-A-Bell! How reliable are concept removal methods for diffusion models?, 2024.
  • Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. CoRR, abs/1706.03762, 2017.
  • Yang et al. [2024a] Dingcheng Yang, Yang Bai, Xiaojun Jia, Yang Liu, Xiaochun Cao, and Wenjian Yu. On the multi-modal vulnerability of diffusion models. In Trustworthy Multi-modal Foundation Models and AI Agents (TiFA), 2024a.
  • Yang et al. [2024b] Yijun Yang, Ruiyuan Gao, Xiaosen Wang, Tsung-Yi Ho, Nan Xu, and Qiang Xu. MMA-Diffusion: Multimodal attack on diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7737–7746, 2024b.
  • Yang et al. [2024c] Yijun Yang, Ruiyuan Gao, Xiao Yang, Jianyuan Zhong, and Qiang Xu. GuardT2I: Defending text-to-image models from adversarial prompts, 2024c.
  • Yang et al. [2024d] Yuchen Yang, Bo Hui, Haolin Yuan, Neil Gong, and Yinzhi Cao. SneakyPrompt: Jailbreaking text-to-image generative models. In Proceedings of the IEEE Symposium on Security and Privacy, 2024d.
  • Yu et al. [2021] Jiahui Yu, Xin Li, Jing Yu Koh, Han Zhang, Ruoming Pang, James Qin, Alexander Ku, Yuanzhong Xu, Jason Baldridge, and Yonghui Wu. Vector-quantized image modeling with improved VQGAN. ArXiv, abs/2110.04627, 2021.
  • Zhang et al. [2024] Yimeng Zhang, Jinghan Jia, Xin Chen, Aochuan Chen, Yihua Zhang, Jiancheng Liu, Ke Ding, and Sijia Liu. To generate or not? Safety-driven unlearned diffusion models are still easy to generate unsafe images ... for now, 2024.
  • Zhao et al. [2025] Xin Zhao, Xiaojun Chen, Xudong Chen, He Li, Tingyu Fan, and Zhendong Zhao. CipherDM: Secure three-party inference for diffusion model sampling. In Computer Vision – ECCV 2024, pages 288–305, Cham, 2025. Springer Nature Switzerland.
  • Zhuang et al. [2023] Haomin Zhuang, Yihua Zhang, and Sijia Liu. A pilot study of query-free adversarial attack against Stable Diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 2385–2392, 2023.