Revisiting CroPA: A Reproducibility Study and Enhancements for Cross-Prompt Adversarial Transferability in Vision-Language Models
This is the official code for the paper *Revisiting CroPA: A Reproducibility Study and Enhancements for Cross-Prompt Adversarial Transferability in Vision-Language Models*.
Large Vision-Language Models (VLMs) such as Flamingo, BLIP-2, InstructBLIP, and LLaVA have achieved remarkable performance across tasks like image classification, captioning, and Visual Question Answering (VQA). However, these models remain vulnerable to adversarial attacks, especially when both visual and textual modalities can be manipulated. In “Revisiting CroPA: A Reproducibility Study and Enhancements for Cross-Prompt Adversarial Transferability in Vision-Language Models,” we undertake a comprehensive reproducibility study of CroPA (“Cross-Prompt Attack”), first introduced by Luo et al. (2024), and propose three major enhancements that improve adversarial efficacy and transferability:
- Semantic Initialization via Vision-Encoding Optimization
- Cross-Image Transferability with Universal Perturbations (SCMix & CutMix)
- Value-Vector–Guided Loss for Vision-Encoder Attention (D-UAP Integration)
Our experiments validate CroPA’s original claims and demonstrate that the proposed improvements yield substantial Attack Success Rate (ASR) gains across diverse VLM architectures and tasks.
“Large Vision-Language Models (VLMs) have revolutionized computer vision, enabling tasks such as image classification, captioning, and VQA. However, they remain highly vulnerable to adversarial attacks, especially in scenarios where both visual and textual modalities can be manipulated. We conduct a comprehensive reproducibility study of “An Image is Worth 1000 Lies: Adversarial Transferability Across Prompts on Vision-Language Models” (Luo et al. 2024), validating the Cross-Prompt Attack (CroPA) and confirming its superior cross-prompt transferability compared to existing baselines. Beyond replication, we propose three key enhancements: (1) A novel semantic initialization strategy that significantly improves Attack Success Rate (ASR) and convergence; (2) Investigating cross-image transferability by learning universal perturbations via SCMix and CutMix; (3) A novel value-vector–guided loss function targeting the vision encoder’s attention to improve generalization. Our evaluation across Flamingo, BLIP-2, InstructBLIP, and LLaVA validates the original findings and demonstrates that our improvements consistently boost adversarial effectiveness. Our work reinforces the importance of studying adversarial vulnerabilities in VLMs and provides a more robust framework for generating transferable adversarial examples.”
This section guides you through reproducing all experiments: baseline CroPA, semantic‐init CroPA, cross‐image UAPs, and D-UAP–guided CroPA.
- Operating System: Ubuntu 20.04 LTS (or equivalent Linux distribution)
- Python: ≥ 3.8
- CUDA: ≥ 11.6 (for GPU acceleration)
- NVIDIA GPU: ≥ 40 GB VRAM (e.g., L40S, A100) recommended for large‐scale training.
- Clone the Repository

      git clone https://github.com/Swadesh06/Revisiting-CroPA.git
      cd Revisiting-CroPA

- Install Core Requirements

      pip install -r requirements.txt

  `requirements.txt` pins the library versions tested in our experiments.
- Model Weights
  - CLIP ViT-L/14: downloaded automatically via `transformers` or `open_clip_torch`.
  - OPT-2.7B & Vicuna-7B/13B:
    - For BLIP-2 / InstructBLIP, follow the instructions in the Salesforce/blip-2 repo.
    - For Flamingo / LLaVA, refer to OpenFlamingo and LLaVA respectively.
  - Stable Diffusion XL Checkpoint: place the checkpoint under `checkpoints/sdxl/`.
- Install Dataset
  - Navigate to the `dataset` folder and run the following commands to set up the dataset:

        wget http://images.cocodataset.org/zips/train2014.zip
        wget http://images.cocodataset.org/zips/val2014.zip
        wget https://s3.amazonaws.com/cvmlp/vqa/mscoco/vqa/v2_Questions_Train_mscoco.zip
        wget https://s3.amazonaws.com/cvmlp/vqa/mscoco/vqa/v2_Annotations_Train_mscoco.zip
        wget https://s3.amazonaws.com/cvmlp/vqa/mscoco/vqa/v2_Annotations_Val_mscoco.zip
        for zip in *.zip; do unzip "$zip"; done
        rm *.zip

  - Note: Add the path to your dataset folder in `data/config.json`.
- Comprehensive Reproducibility Study of CroPA
  - Claim 1: CroPA’s cross-prompt transferability across diverse target texts
  - Claim 2: CroPA outperforms Single-P and Multi-P baselines when varying the number of prompts
  - Claim 3: CroPA converges to higher ASRs as the number of PGD iterations increases (per-task convergence nuance discussed)
  - Claim 4: CroPA remains effective under few-shot (in-context) settings (0-shot → 2-shot)
- Enhancement #1: Semantic Initialization via Vision-Encoding Optimization
  - Leverage a diffusion-based approach (e.g., Stable Diffusion XL) to generate a semantically relevant “target image” for a given prompt.
  - Initialize adversarial perturbations by minimizing the $\ell_2$ distance in the vision-encoder embedding space between the clean image $x$ and the generated target $x_{\text{target}}$.
  - Demonstrate 6–15 % overall ASR improvements on BLIP-2, faster convergence, and higher CLIP-score alignment with target semantics.
- Enhancement #2: Cross-Image Transferability with Universal Perturbations (SCMix & CutMix)
  - Integrate SCMix (self-mixing + cross-mixing) and CutMix augmentations into CroPA’s PGD loop to learn universal perturbations across multiple images.
  - Achieve substantial improvements in untargeted ASR on unseen images for Flamingo and BLIP-2, while still highlighting the inherent challenges of cross-image generalization.
- Enhancement #3: Value-Vector–Guided Loss for Vision-Encoder Attention (D-UAP Integration)
  - Adapt the Doubly-Universal Adversarial Perturbation (D-UAP) concept by aligning perturbations in the value-vector space of the vision encoder’s attention layers, targeting the layers most crucial for semantic representation.
  - Formulate a joint objective $L_{\text{re-CroPA}} = L_{\text{CroPA}}(x+\delta_v,\; t+\delta_t) - \lambda\, L_{\text{D-UAP}}(\delta_v)$, where $L_{\text{D-UAP}} = \sum_{i=1}^N \bigl(1 - \cos\bigl(V_i(x+\delta_v),\, V_t\bigr)\bigr)$, with $V_i(\cdot)$ denoting the value vectors of the $i$-th attention head for an input image, and $V_t$ the value vectors of the target text’s reference image.
  - Realize 12.5 % overall ASR gains on BLIP-2 and large improvements in cross-model transferability when source and target share the same vision encoder.
- Insightful Analyses
  - Convergence Behavior: Discuss per-task instability in CroPA’s convergence despite the original overall ASR trends (Figures 5 & 6).
  - Cross-Model Transferability: D-UAP enhancements yield strong intra-family (BLIP-2 → InstructBLIP) transferability and moderate cross-architecture (BLIP-2 → Flamingo, LLaVA) gains.
  - Value-Vector Consistency: Show that value vectors for a given target text remain consistent across different input images, highlighting the tight coupling to the vision encoder and the implications for transferability.
Vision-Language Models (VLMs) such as BLIP, Flamingo, and LLaVA have shown state-of-the-art performance on a variety of downstream tasks. These models typically fuse frozen vision encoders (e.g., CLIP ViT-L/14) with large language models (e.g., OPT, Vicuna) in one of two architectural paradigms:
- Fusion-Based Models (e.g., BLIP-2, InstructBLIP): Visual features are extracted via a frozen CLIP encoder, then aligned with textual embeddings via a “querying transformer.” The fused embedding is passed to an LLM (e.g., OPT-2.7B or Vicuna-7B) for text generation. Early fusion ensures persistent visual grounding throughout decoding.
- Decoder-Only Models (e.g., Flamingo, LLaVA): Visual tokens (frozen CLIP embeddings) are injected late, as special tokens or additional context to a causal LLM (e.g., Vicuna-13B). Visual influence attenuates as decoding progresses.
Adversarial Vulnerabilities in VLMs
- Cross-Model Transferability: Perturbations crafted for one VLM often succeed on different VLMs (Liu et al. 2017, Tramèr et al. 2017).
- Cross-Image Transferability: Universal Adversarial Perturbations (UAPs) generalize across multiple images (Moosavi-Dezfooli et al. 2017, Mopuri et al. 2017).
- Cross-Prompt Transferability: Introduced by Luo et al. (2024), CroPA optimizes image perturbations across multiple textual prompts via a min–max PGD loop, ensuring adversarial effectiveness is maintained across prompts.
Despite CroPA’s novelty, there remained gaps:
- Initialization Sensitivity: CroPA initializes with random noise, which may slow convergence or trap in suboptimal local optima.
- Lack of Cross-Image Generalization: Original CroPA focuses solely on per-image perturbations across prompts; it does not produce universal perturbations for unseen images.
- Vision-Encoder–Agnostic Perturbations: CroPA’s objectives ignore architectural specifics of the vision encoder’s attention, leaving potential structural vulnerabilities unexploited.
Our work addresses these gaps via principled enhancements, yielding stronger, more generalizable adversarial examples.
CroPA Overview (Luo et al. 2024)
- Objective:

  $\min_{\delta_v} \max_{\delta_t}\, L\bigl(f(x_v + \delta_v,\; x_t + \delta_t),\; T\bigr)$

  - $f$: vision-language model
  - $x_v$: input image
  - $x_t$: text prompt
  - $T$: target text (for targeted attacks)
  - $\delta_v$: adversarial perturbation on the image, $\|\delta_v\|_\infty \le \epsilon_v$
  - $\delta_t$: adversarial (learnable) prompt perturbation on the text, updated every $N$ image-update steps
- Algorithm 1: CroPA (PGD-Based)
  - Initialize $x_v' \leftarrow x_v$ (image) and, for each prompt $x_t^{(i)}$, initialize $x_t^{(i)\prime} \leftarrow x_t^{(i)}$.
  - For $\text{step}=1\ldots K$:
    - Uniformly sample a prompt $x_t^{(i)}$ from the set $\{x_t^{(1)},\ldots,x_t^{(k)}\}$.
    - Compute the gradient $g_v = \nabla_{x_v'} L\bigl(f(x_v', x_t^{(i)}), T\bigr)$.
    - Update the image: $x_v' \leftarrow \text{Clip}_{x_v,\epsilon_v}\bigl(x_v' - \alpha_1\,\text{sign}(g_v)\bigr)$.
    - Every $N$ steps, compute the gradient $g_t = \nabla_{x_t^{(i)\prime}} L\bigl(f(x_v', x_t^{(i)\prime}), T\bigr)$ and update the prompt: $x_t^{(i)\prime} \leftarrow x_t^{(i)\prime} + \alpha_2\,\text{sign}(g_t)$.
  - Return the adversarial image $x_v'$ (discard $\delta_t$ at inference).
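For concreteness, a minimal PyTorch sketch of this loop is shown below. `model_loss` is a hypothetical helper returning the language-modeling loss $L(f(x_v, x_t), T)$ for an image, a prompt embedding, and the target text; the repository's `src/attacks/cropa.py` is the authoritative implementation.

```python
# Minimal sketch of Algorithm 1 (CroPA); hyperparameter defaults follow the
# implementation details below. `model_loss` and `prompt_embs` are assumptions.
import random
import torch

def cropa(model_loss, x_v, prompt_embs, target, eps=16/255, alpha1=1/255,
          alpha2=0.01, K=2000, N=10):
    x_adv = x_v.clone()
    prompt_advs = [p.clone() for p in prompt_embs]    # learnable prompt perturbations
    for step in range(1, K + 1):
        i = random.randrange(len(prompt_advs))        # uniformly sample a prompt
        x_adv.requires_grad_(True)
        loss = model_loss(x_adv, prompt_advs[i], target)
        g_v = torch.autograd.grad(loss, x_adv)[0]
        # targeted attack: descend on the image, then project into the eps-ball
        x_adv = x_adv.detach() - alpha1 * g_v.sign()
        x_adv = (x_v + (x_adv - x_v).clamp(-eps, eps)).clamp(0, 1)
        if step % N == 0:                             # ascend on the prompt embedding
            prompt_advs[i].requires_grad_(True)
            loss_t = model_loss(x_adv, prompt_advs[i], target)
            g_t = torch.autograd.grad(loss_t, prompt_advs[i])[0]
            prompt_advs[i] = (prompt_advs[i] + alpha2 * g_t.sign()).detach()
    return x_adv  # delta_t is discarded at inference
```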
Baseline Approaches
- Single-P: Optimize $\delta_v$ using a single prompt (Eq. 3).
- Multi-P: Optimize $\delta_v$ across multiple prompts, summing their losses (Eq. 4).
We replicate Luo et al.’s experiments on Flamingo, BLIP-2, InstructBLIP, and LLaVA to validate four main claims:
- Cross-Prompt Transferability (Claim 1)
- Performance vs. Number of Prompts (Claim 2)
- Convergence w.r.t. Iterations (Claim 3)
- Robustness under Few-Shot (0→2-Shot) Context (Claim 4)
Motivation: Random noise initialization is agnostic to target semantics. We hypothesize that semantically informed initialization accelerates convergence and yields stronger perturbations.
- Diffusion-Based Target Synthesis
  - Given target text $T$, generate an image $x_{\text{target}} = \mathcal{D}(T, z;\, \theta_{\text{SDXL}})$ using a diffusion model (e.g., Stable Diffusion XL).
  - $z \sim \mathcal{N}(0, I)$.
  - $x_{\text{target}}$ captures the target prompt's semantics in pixel space.
- Vision-Encoder-Anchored Perturbation
  - Let $f_v(\cdot)$ be the VLM's vision encoder.
  - Solve: $\delta_{\text{init}} = \arg\min_{\|\delta\|_\infty \le \epsilon} \bigl\| f_v(x + \delta) - f_v(x_{\text{target}}) \bigr\|_2^2$
  - This is typically solved via PGD on the $\ell_2$ loss in feature space until a semantically aligned initialization $\delta_{\text{init}}$ is obtained.
- PGD Refinement
  - Starting from $x_{\text{adv}}^{(0)} = x + \delta_{\text{init}}$, refine via
    $x_{\text{adv}}^{(t+1)} = \Pi_{B_\epsilon(x)}\Bigl[\, x_{\text{adv}}^{(t)} - \alpha\, \nabla_{x_{\text{adv}}} \bigl\| f_v(x_{\text{adv}}^{(t)}) - f_v(x_{\text{target}}) \bigr\|_2^2 \,\Bigr]$
  - After initialization, we plug this $x_{\text{adv}}^{(0)}$ into the standard CroPA PGD loop, replacing random noise with a semantically informed initialization (see the sketch after the Key Benefits list below).
- Key Benefits
  - Higher initial CLIP score between the perturbation and $T$.
  - Faster convergence (fewer iterations needed to reach peak ASR).
  - Up to +15 % overall ASR gain on BLIP-2, with consistent improvements across VQA, Classification, and Captioning (Table 3).
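Below is a minimal sketch of the feature-space initialization, assuming `vision_encoder` is the frozen CLIP vision tower returning image embeddings and `x_target` was generated from the target text with SDXL; the step count is illustrative, and `src/attacks/init_cropa.py` contains the actual implementation.

```python
# Minimal sketch of semantic initialization: PGD in feature space that pulls
# f_v(x + delta) toward f_v(x_target). `vision_encoder` is an assumption.
import torch

def semantic_init(vision_encoder, x, x_target, eps=16/255, alpha=1/255, steps=200):
    with torch.no_grad():
        feat_target = vision_encoder(x_target)
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        feat = vision_encoder(x + delta)
        loss = (feat - feat_target).pow(2).sum()           # squared l2 in embedding space
        grad = torch.autograd.grad(loss, delta)[0]
        with torch.no_grad():
            delta -= alpha * grad.sign()                   # descend on the feature loss
            delta.clamp_(-eps, eps)                        # project into the eps-ball
            delta.copy_((x + delta).clamp(0, 1) - x)       # keep the image valid
    return (x + delta).detach()                            # x_adv^(0) for the CroPA loop
```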
Motivation: CroPA optimizes per-image perturbations across prompts but does not generalize to unseen images. We leverage modern universal-adversarial‐perturbation (UAP) techniques to address this.
- SCMix (Structured Cross-Mixing) [Yun et al. 2019; Zhang et al. 2024]
  - Self-Mixing: For an image $I_1$, generate $I'_1 = \eta\,\text{Resize}(\text{Crop}(I_1)) + (1-\eta)\,\text{Resize}(\text{Crop}(I_1))$, with fixed $\eta=0.5$ (the two crops are sampled independently).
  - Cross-Mixing: Given $I'_1$ and a second image $I_2$, form $I_3 = \beta_1\, I'_1 + \beta_2\, I_2$, where $\beta_1 \gg \beta_2$.
  - Mixed images force the adversarial generator to learn features invariant to intra- and inter-image variations.
- CutMix [Yun et al. 2019]
  - For image pairs $x_A$, $x_B$, sample a random rectangular mask $M \in \{0,1\}^{H\times W}$, then form $\widetilde{x} = M \odot x_A + (1-M)\odot x_B$.
  - Encourages perturbations to target spatially invariant features (e.g., object shapes) rather than pixel-level artifacts.
Integration with CroPA
- At each PGD iteration, instead of optimizing on a single clean image $x$, we generate a minibatch of mixed images (SCMix or CutMix) and optimize a universal perturbation $\delta_{\text{univ}}$ such that
  $\min_{\|\delta\|_\infty \le \epsilon} \sum_{i=1}^{B} L\bigl(f(x_i + \delta,\, x_t),\, T\bigr),$
  where $x_i$ ranges over the mixed images in a batch of size $B$ and $T$ is the target text. A minimal sketch of the augmentations appears after this list.
- Empirically, SCMix tends to yield stronger cross-image ASRs than CutMix, likely due to smoother feature blending.
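The sketch below illustrates the two augmentations and how the universal-perturbation objective is formed; the crop sizes and helper names (`scmix`, `cutmix`, `rand_crop_resize`) are illustrative assumptions rather than the repository's exact code (`src/attacks/uap_sc_mix.py`, `uap_cutmix.py`).

```python
# Minimal sketch of SCMix / CutMix augmentation for the universal-perturbation loop.
import torch
import torch.nn.functional as F

def scmix(img1, img2, eta=0.5, beta1=0.9, beta2=0.1, crop=0.8):
    """Self-mix two random crops of img1, then cross-mix the result with img2."""
    def rand_crop_resize(x):
        _, _, h, w = x.shape
        ch, cw = int(h * crop), int(w * crop)
        top = torch.randint(h - ch + 1, (1,)).item()
        left = torch.randint(w - cw + 1, (1,)).item()
        return F.interpolate(x[..., top:top + ch, left:left + cw], size=(h, w), mode="bilinear")
    self_mixed = eta * rand_crop_resize(img1) + (1 - eta) * rand_crop_resize(img1)
    return beta1 * self_mixed + beta2 * img2

def cutmix(img_a, img_b, ratio=0.5):
    """Paste a random rectangle of img_b into img_a."""
    _, _, h, w = img_a.shape
    ch, cw = int(h * ratio ** 0.5), int(w * ratio ** 0.5)
    top = torch.randint(h - ch + 1, (1,)).item()
    left = torch.randint(w - cw + 1, (1,)).item()
    out = img_a.clone()
    out[..., top:top + ch, left:left + cw] = img_b[..., top:top + ch, left:left + cw]
    return out

# Inside the PGD loop, the universal delta is then updated on a minibatch of mixed images:
#   loss = sum(model_loss(x_i + delta, prompt, target) for x_i in mixed_batch)
#   delta = (delta - alpha * grad(loss, delta).sign()).clamp(-eps, eps)
```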
Outcome
- CroPA + SCMix obtains untargeted ASR ≈ 0.7055 (vs. 0.4355 baseline) on Flamingo (Table 5).
- CroPA + CutMix obtains untargeted ASR ≈ 0.5565.
- Demonstrates that carefully designed augmentations can promote universal perturbations across unseen images, albeit with some degradation relative to per-image CroPA.
Motivation: CroPA’s PGD optimizes against the final language modeling loss, treating the vision encoder as a black box. We hypothesize that directly perturbing the vision encoder’s value vectors in its attention layers (which carry the core visual information) will yield more transferable and semantically aligned perturbations.
- Doubly-UAP (D-UAP) Primer [Kim et al. 2024]
  - In the original (untargeted) D-UAP formulation, one maximizes
    $\max_{\|\delta\|_\infty \le \epsilon} \frac{1}{|L|}\sum_{\ell\in L} \bigl\|V_\ell(x) - V_\ell(x+\delta)\bigr\|_p,$
    where $V_\ell(\cdot)$ are the value vectors at layer $\ell$.
- Modified Value-Vector Loss
  - For a target text $T$, we first generate a reference image $x_{\text{ref}}$ (e.g., via diffusion) and extract its value vectors $V_t = V_\ell(x_{\text{ref}})$.
  - Let $V_i(x+\delta)$ denote the perturbed value vector at the $i^\text{th}$ attention head (at a chosen layer) for input $x$.
  - Define
    $L_{\text{D-UAP}}(\delta) = \sum_{i=1}^N \Bigl(1 - \cos\bigl(V_i(x+\delta),\, V_t\bigr)\Bigr),$
    where $\cos(\cdot,\cdot)$ is cosine similarity and $N$ is the number of heads.
  - Intuition: Encourage $V_i(x+\delta)$ to align with $V_t$, effectively “hijacking” the vision encoder's semantic representation to match the target text's embedding.
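A minimal sketch of value-vector extraction and the loss above is given here, assuming a Hugging Face `CLIPVisionModel` whose encoder layers expose `self_attn.v_proj`; the layer choice and token averaging are illustrative, and `src/utils/vision_encoder_utils.py` is the reference implementation.

```python
# Minimal sketch: capture value vectors of one attention layer via a forward hook,
# then compute the head-wise cosine loss L_D-UAP.
import torch
import torch.nn.functional as F

def capture_value_vectors(clip_vision_model, pixel_values, layer_idx=-2):
    """Run the vision tower and capture the value projection of one encoder layer."""
    captured = {}
    layer = clip_vision_model.vision_model.encoder.layers[layer_idx]
    handle = layer.self_attn.v_proj.register_forward_hook(
        lambda module, inputs, output: captured.update(v=output)
    )
    clip_vision_model(pixel_values=pixel_values)
    handle.remove()
    return captured["v"]                                   # (batch, tokens, heads * head_dim)

def duap_loss(v_adv, v_target, num_heads):
    """L_D-UAP = sum over heads of (1 - cos(V_i(x + delta), V_t))."""
    b, t, d = v_adv.shape
    v_adv = v_adv.view(b, t, num_heads, d // num_heads)
    v_target = v_target.view(b, t, num_heads, d // num_heads)
    cos = F.cosine_similarity(v_adv, v_target, dim=-1)     # cosine per token and head
    return (1.0 - cos).mean(dim=1).sum()                   # mean over tokens, sum over heads
```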
- Joint Objective (Re-CroPA)
  - Combine CroPA's adversarial loss $L_{\text{CroPA}}$ with $L_{\text{D-UAP}}$:
    $L_{\text{re-CroPA}} = L_{\text{CroPA}}\bigl(x + \delta_v,\; x_t + \delta_t,\; T\bigr) - \lambda\, L_{\text{D-UAP}}(\delta_v).$
  - $\lambda > 0$ is a hyperparameter balancing cross-prompt transfer versus vision-encoder hijacking.
  - At each PGD step:
    - Compute $\nabla_{\delta_v} L_{\text{re-CroPA}}$.
    - Update the image: $x_v' \leftarrow \text{Clip}_{x,\epsilon}\bigl(x_v' - \alpha\,\text{sign}(\nabla_{\delta_v} L_{\text{re-CroPA}})\bigr)$.
    - Every $N$ steps, update $\delta_t$ as in vanilla CroPA.
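Putting the pieces together, one Re-CroPA image update might look like the sketch below; it reuses the hypothetical `model_loss`, `capture_value_vectors`, and `duap_loss` helpers from the earlier sketches, and the sign convention simply follows the joint objective as written above.

```python
# Minimal sketch of a single Re-CroPA image update (lambda = 0.5 as in our defaults).
import torch

def re_cropa_image_step(model_loss, clip_vision, x_adv, x_clean, prompt, target,
                        v_target, num_heads, eps=16 / 255, alpha=1 / 255, lam=0.5):
    x_adv = x_adv.detach().clone().requires_grad_(True)
    loss_cropa = model_loss(x_adv, prompt, target)                   # language-modeling loss
    v_adv = capture_value_vectors(clip_vision, x_adv)                # value vectors under attack
    loss = loss_cropa - lam * duap_loss(v_adv, v_target, num_heads)  # L_re-CroPA as defined above
    grad = torch.autograd.grad(loss, x_adv)[0]
    with torch.no_grad():
        x_adv = x_adv - alpha * grad.sign()                          # descend on the joint loss
        x_adv = x_clean + (x_adv - x_clean).clamp(-eps, eps)         # project into the eps-ball
        return x_adv.clamp(0, 1)
```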
- Key Outcomes
  - On BLIP-2 (target text = “bomb”), D-UAP integration yields overall ASR ≈ 0.9685, a +12.5 % improvement over CroPA (Table 6).
  - On OpenFlamingo (same scenario), ASR ≈ 0.9270 vs. 0.8233 for CroPA (Table 7).
  - Cross-Model Transfer: D-UAP perturbations trained on BLIP-2 transfer strongly to InstructBLIP (same vision encoder), whereas vanilla CroPA shows limited transfer. When transferring to architectures with different vision encoders (Flamingo, LLaVA), D-UAP still outperforms baseline CroPA but at lower absolute ASRs (Table 8).
  - Architectural Insights:
    - Value vectors are “encoder-specific”; adversarial alignment to $V_t$ is most effective when source and target share the same architecture.
    - This confirms that disrupting the vision encoder's core attention representation yields significant gains in adversarial potency and transferability.
- Image Subset: A curated set of 50 diverse images (e.g., an MSCOCO-style subset) used for per-image CroPA.
- Target Text Prompts:
  - “unknown,” “I am sorry,” “not sure,” “very good,” “too late,” “metaphor,” “bomb” (harmful target).
  - Prompts were chosen to span a range from abstract (“metaphor,” “too late”) to semantically grounded (“bomb,” “very good”).
- Prompt Categories:
  - VQAgeneral: Generic questions (e.g., “What is in the image?”).
  - VQAspecific: Specific property queries (e.g., “How many people are wearing red?”).
  - Classification: Standard “What does this image depict?”
  - Captioning: Describe the scene in a full sentence.
Model | Vision Encoder | Language Model | Architecture |
---|---|---|---|
BLIP-2 | CLIP ViT-L/14 | OPT-2.7B | Fusion-Based |
InstructBLIP | CLIP ViT-L/14 | Vicuna-7B | Fusion-Based |
OpenFlamingo (v1) | CLIP ViT-L/14 | Vicuna-7B | Decoder-Only |
LLaVA | CLIP ViT-L/14 | Vicuna-13B | Decoder-Only |
Note: We used open‐source reimplementations where needed (e.g., OpenFlamingo‐9B) to ensure reproducibility.
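For reference, the fusion-based models can be loaded directly through Hugging Face `transformers`; the snippet below is an illustration only and may differ from how the wrappers in `src/models/` load their checkpoints.

```python
# Illustrative only: loading BLIP-2 (OPT-2.7B) via Hugging Face transformers.
import torch
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
).to("cuda").eval()
```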
- Cross-Prompt (Baseline CroPA)
  - Knowledge: White-box access to gradients and parameters of the source VLM.
  - Restriction: $\|\delta_v\|_\infty \le 16/255$.
  - Evaluation: Measure Targeted-ASR (exact $T$ match) for targeted attacks, or Untargeted-ASR (any mismatch) for untargeted attacks.
- Cross-Model
  - Surrogate (Source): White-box.
  - Target: Black-box (architectural family known, but no gradients).

Source | Target | Transfer Setting |
---|---|---|
BLIP-2 (OPT) | InstructBLIP (Vicuna) | Intra-family (same encoder) |
BLIP-2 | Flamingo | Cross-architecture |
BLIP-2 | LLaVA | Cross-architecture |
InstructBLIP | BLIP-2 | Intra-family |
InstructBLIP | Flamingo | Cross-architecture |
InstructBLIP | LLaVA | Cross-architecture |
Flamingo | BLIP-2 | Cross-architecture |
Flamingo | InstructBLIP | Cross-architecture |
Flamingo | LLaVA | Cross-architecture |
- Cross-Image (Universal)
  - Knowledge: White-box on the target VLM.
  - Data: A limited set of training images plus their augmented variants (SCMix/CutMix).
  - Goal: A single perturbation $\delta_{\text{univ}}$ that generalizes to unseen images.
- Framework: PyTorch 1.12+, CUDA 11.6+ (tested on an NVIDIA L40S with PyTorch Lightning).
- Diffusion Model (for Semantic Init): Stable Diffusion XL (via CompVis/stable-diffusion or [Stability AI's SDXL shard]).
- Batch Sizes:
  - Per-Image CroPA: Batch size = 1 (single image + multiple prompts).
  - Universal (SCMix, CutMix): Batch size = 4 (4 images mixed) for faster perturbation aggregation.
- Optimization Hyperparameters (defaults unless specified):
  - $\epsilon=16/255$ (image perturbation budget)
  - $\alpha_1=1.0/255$ (image step size)
  - $\alpha_2=0.01$ (prompt step size, token-embedding scale)
  - $K=2000$ PGD steps
  - $N=10$ (prompt update every 10 image steps)
  - $\lambda=0.5$ (value-vector loss weight)
- SCMix Parameters: $\eta=0.5$ (fixed), $\beta_1=0.9$, $\beta_2=0.1$.
- CutMix: Random rectangle with area ratio sampled from $\text{Beta}(1.0, 1.0)$ when mixing.
- Evaluation: Compute Targeted-ASR (exact match) for targeted attacks and Untargeted-ASR (any mismatch) for cross-model and cross-image settings. Each result is averaged over 50 test images and 6 distinct target texts.
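A minimal sketch of these two metrics follows (hypothetical helper names; `src/utils/eval_metrics.py` holds the actual ASR computation and logging):

```python
# Minimal sketch of the ASR metrics: targeted = exact match with the target text,
# untargeted = any mismatch with the clean prediction.
def targeted_asr(predictions, target_text):
    hits = sum(pred.strip().lower() == target_text.strip().lower() for pred in predictions)
    return hits / len(predictions)

def untargeted_asr(adv_predictions, clean_predictions):
    flips = sum(adv.strip() != clean.strip()
                for adv, clean in zip(adv_predictions, clean_predictions))
    return flips / len(adv_predictions)
```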
Claim | Our Findings (Flamingo) | Original (Luo et al., 2024) |
---|---|---|
1. Cross-Prompt Transferability | CroPA yields high ASRs (≈ ≥ 0.8–0.98) across “unknown,” “I am sorry,” “not sure,” “very good,” “too late,” “metaphor.” Baseline Single-P & Multi-P under‐perform. (Table 1) | Congruent: CroPA significantly outperforms baselines; transfer independent of semantic frequency. |
2. ASR vs. # Prompts | With 1, 2, 5, 10 prompts, CroPA consistently > Multi-P & Single-P; baselines saturate below CroPA’s curve. (Figure 4) | Congruent: CroPA’s ASR improves with # prompts; baselines cannot match. |
3. Convergence w.r.t. PGD Iterations | Overall ASR increases with iterations (Figure 6), but per‐task ASR exhibits fluctuations (Figure 5), highlighting task‐specific instability. | Mostly Congruent: Luo et al. report smooth overall ASR convergence; per‐task convergence details not extensively discussed. |
4. Few‐Shot (0→2-Shot) | CroPA’s ASR drops moderately when switching to 2‐shot context but remains significantly above Multi-P (Table 2). | Congruent: CroPA > baselines under 2‐shot context though ASR reduces. |
Conclusion: We successfully reproduce all four core CroPA claims, with nuances around per‐task convergence.
Task (ASR range across target texts “bomb”, “unknown”, “I am sorry”, “very good”, “too late”) | Multi-P | CroPA | CroPA + Init |
---|---|---|---|
VQAgeneral | 0.684–0.724 | 0.8176–0.968 | 0.8824–0.9145 |
VQAspecific | 0.826–0.988 | 0.9100–0.988 | 0.8920–0.9524 |
Classification | 0.555–0.9498 | 0.707–0.9498 | 0.8362–0.9270 |
Captioning | 0.285–0.696 | 0.420–0.9412 | 0.7550–0.9716 |
Overall ASR | 0.6095–0.8433 | 0.7708–0.9260 | 0.8761–0.9458 |
- Observation: Semantic initialization yields 6 – 15 % overall ASR improvements on BLIP-2 across all tasks (Table 3 & 4).
- Convergence: CroPA + Init reaches peak ASR in ≈ 1600 steps vs. 2000 steps for Vanilla CroPA (Figure 7a).
- CLIP Score: Perturbations from CroPA + Init achieve higher CLIP similarity to target text (23.09 vs. 21.80 for “bomb”), indicating stronger semantic alignment (Figure 7b).
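A minimal sketch of how such a CLIP score can be computed with `transformers` is shown below (using `openai/clip-vit-large-patch14`); the repository's `src/utils/clip_score.py` is the reference, and the exact scaling of the reported numbers may differ.

```python
# Minimal sketch: CLIP similarity between an (adversarial) image and the target text.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def clip_score(image_path: str, target_text: str) -> float:
    inputs = processor(text=[target_text], images=Image.open(image_path), return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return out.logits_per_image.item()   # temperature-scaled cosine similarity
```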
Method | VQAgeneral | VQAspecific | Classification | Captioning | Overall ASR |
---|---|---|---|---|---|
CroPA (Base) | 0.3500 | 0.4920 | 0.4000 | 0.5000 | 0.4355 |
CroPA + SCMix (UAP) | 0.6560 | 0.7160 | 0.5000 | 0.9500 | 0.7055 |
CroPA + CutMix (UAP) | 0.4520 | 0.7240 | 0.5000 | 0.5500 | 0.5565 |
- Observation: Integrating SCMix yields a +0.27 gain in overall untargeted ASR versus Vanilla CroPA, demonstrating that data augmentation fosters universal perturbations.
- Comparison: CutMix also improves cross-image ASR, though SCMix (structured blending) consistently outperforms CutMix in both speed (convergence at 1500 steps vs. 1700) and magnitude (Table 5).
Method | VQAgeneral | VQAspecific | Classification | Captioning | Overall ASR |
---|---|---|---|---|---|
Multi-P | 0.6840 | 0.8260 | 0.9050 | 0.6090 | 0.7560 |
CroPA | 0.8176 | 0.9100 | 0.9498 | 0.6960 | 0.8433 |
CIA [Yang et al. 2024] | 0.2985 | 0.2281 | 0.4857 | 0.4687 | 0.3700 |
CroPA + D-UAP | 0.9420 | 0.9720 | 0.9790 | 0.9810 | 0.9685 |
Method | VQAgeneral | VQAspecific | Classification | Captioning | Overall ASR |
---|---|---|---|---|---|
Multi-P | 0.6960 | 0.8540 | 0.8990 | 0.6060 | 0.7638 |
CroPA | 0.7900 | 0.8860 | 0.9470 | 0.6700 | 0.8233 |
CIA | 0.3027 | 0.4302 | 0.5112 | 0.5080 | 0.4380 |
CroPA + D-UAP | 0.9000 | 0.9520 | 0.9500 | 0.9060 | 0.9270 |
- Observation: D-UAP greatly improves ASR across all categories, with Overall ASR ≈ 0.9685 on BLIP-2 (+12.5 %) and 0.9270 on Flamingo (+10.4 %).
- Cross-Model Transferability (BLIP-2 → InstructBLIP):
  - CroPA → ASR ≈ 0.5038 (Overall)
  - CroPA + D-UAP → ASR ≈ 0.6850 (Overall)
- Cross-Architecture (BLIP-2 → Flamingo; BLIP-2 → LLaVA):
  - Noticeable gains with D-UAP but lower absolute ASRs (Table 8 in Appendix D.4.5).
- Insight: Value vectors are highly encoder-specific; transfer is best when source and target share identical vision encoders (e.g., CLIP ViT-L/14).
- Per-Task Convergence (Vanilla CroPA): Some tasks (e.g., Classification, Captioning) exhibit oscillatory ASR as PGD iterates, whereas overall ASR trends upward (Figures 5 & 6).
- Semantic Initialization Ablation: Removing semantic init reverts convergence to vanilla CroPA, requiring more iterations and reaching a lower ASR plateau.
- $\lambda$ Sensitivity (D-UAP):
  - $\lambda=0$ → vanilla CroPA behavior.
  - $\lambda\in[0.5, 1.0]$ yields the best trade-off. Larger $\lambda$ (≥ 1.0) may degrade cross-prompt transfer due to over-emphasis on value-vector alignment.
- SCMix vs. CutMix: SCMix's two-stage mixing encourages smoother latent blends, yielding higher cross-image ASR at earlier iterations (figure in Appendix D.3).
Revisiting-CroPA/
├── data/
│ ├── images/ # Raw images for per-image CroPA
│ ├── scmixed_images/ # Pre-generated SCMix images (optional)
│ └── cutmixed_images/ # Pre-generated CutMix images (optional)
│
├── checkpoints/
│ ├── sdxl/ # Stable Diffusion XL model weights
│ └── cropa_base/ # Pretrained CroPA perturbations (if any)
│
├── src/
│ ├── __init__.py
│ ├── models/
│ │ ├── blip2_model.py # Wrapper for BLIP-2 + prompt‐PGD
│ │ ├── instructblip_model.py # Wrapper for InstructBLIP
│ │ ├── flamingo_model.py # Wrapper for OpenFlamingo
│ │ └── llava_model.py # Wrapper for LLaVA
│ │
│ ├── attacks/
│ │ ├── cropa.py # Vanilla CroPA implementation (Algorithm 1)
│ │ ├── init_cropa.py # CroPA + Semantic Initialization
│ │ ├── uap_sc_mix.py # CroPA + SCMix universal perturbation
│ │ ├── uap_cutmix.py # CroPA + CutMix universal perturbation
│ │ └── cropa_duap.py # CroPA + D-UAP–guided loss (Algorithm 3)
│ │
│ ├── utils/
│ │ ├── data_loader.py # Custom DataLoader for image & prompt sets
│ │ ├── clip_score.py # Compute CLIP similarity between image & text
│ │ ├── vision_encoder_utils.py # Functions to extract value vectors
│ │ └── eval_metrics.py # ASR computation & logging
│ │
│ └── evaluate/
│ ├── eval_cropa.sh # Script to evaluate vanilla CroPA across models/tasks
│ ├── eval_semantic_init.sh # Script to evaluate CroPA + Semantic Init
│ ├── eval_uap_sc.sh # Script to evaluate CroPA + SCMix
│ ├── eval_uap_cm.sh # Script to evaluate CroPA + CutMix
│ └── eval_duap.sh # Script to evaluate CroPA + D-UAP
│
├── results/
│ ├── cropa_baseline/ # Logs & ASR tables for baseline CroPA
│ ├── semantic_init/ # Logs & ASR tables for CroPA + Semantic Init
│ ├── cross_image/ # Logs & ASR tables for SCMix & CutMix experiments
│ └── duap/ # Logs & ASR tables for D-UAP experiments
│
├── requirements.txt
├── README.md # This file
└── LICENSE.md
- Prepare Prompt List
  - Edit `data/prompts.txt` to include the desired target prompts (one per line).
  - Example:

        unknown
        I am sorry
        not sure
        very good
        too late
        metaphor
        bomb
- Generate Adversarial Examples

  The `run_algorithm.sh` script automates running your chosen algorithm with the specified parameters: pass the parameters as command-line arguments.
Basic Usage:
./run_algorithm.sh --algorithm=<cropa|duap|init> --model_name=<your_model>
Key Parameters:
- `--algorithm`: Required. Choose `cropa`, `duap`, or `init`.
- `--model_name`: Specify the model (e.g., `blip2`, `instructblip`).
- `--device`: Set the GPU ID (e.g., `0`) or `-1` for auto-detect.
- `--target="your_prompt"`: Provide a single target text directly.
- `--prompt_file=/path/to/prompt.txt`: Read multiple target texts from a file (one per line); the script runs the algorithm once for each prompt.
Example:
# Run 'cropa' with a specific model and prompts from a file
./run_algorithm.sh --algorithm=cropa --model_name=blip2 --prompt_file=./my_prompts.txt
- View Logs & Results
  - Adversarial examples are saved under `results/cropa_baseline/<model>/`.
  - ASR tables are generated in the same directory.
This repository is released under the MIT License. Feel free to use and modify for research purposes.
@article{RevisitingCroPA2025,
title = {Revisiting CroPA: A Reproducibility Study and Enhancements for Cross-Prompt Adversarial Transferability in Vision-Language Models},
author = {Anonymous},
journal = {Under Review at TMLR},
year = {2025},
note = {Under double-blind review}
}
Please cite this paper if you use any part of the experiments, code, or insights provided herein.
We thank the authors of Luo et al. (2024) for making CroPA's original code available, and the open-source communities behind the BLIP-2, InstructBLIP, Flamingo, LLaVA, and Stable Diffusion projects.
For any questions or issues, please open an issue on the GitHub repository.