JPS: Jailbreak Multimodal Large Language Models with Collaborative Visual Perturbation and Textual Steering (ACM MM'25 Accepted)
This repository contains the official implementation of JPS, a jailbreak method that combines collaborative visual perturbation and textual steering to improve the quality and specificity of responses from multimodal large language models (MLLMs). Unlike traditional jailbreak methods that focus only on bypassing safety mechanisms, JPS prioritizes response quality and adherence to user intent, producing detailed, relevant outputs.
- Collaborative Attack Strategy: Combines visual perturbation with textual steering for enhanced effectiveness
- Quality-Focused Approach: Emphasizes response quality and relevance rather than simple bypassing
- Strong Generalization: Optimized on only 50 samples from AdvBench-subset, yet transfers effectively to other datasets
- Multi-Agent Framework: Utilizes revision agents to iteratively improve steering prompts
- Comprehensive Evaluation: Includes Malicious Intent Fulfillment Rate (MIFR) quality evaluation metrics
- Batch Processing Support: Efficient configuration management and parallel execution
Figure: Overview of the JPS (Jailbreak with Collaborative Visual Perturbation and Textual Steering) method framework. The approach combines visual perturbation optimization with multi-agent textual steering to generate high-quality responses from multimodal large language models.
- Installation
- Quick Start
- Project Structure
- Configuration
- Usage
- Evaluation
- Supported Models
- Datasets
- Results
- Citation
- License
- Python 3.8+
- CUDA-compatible GPU
- At least 24GB GPU memory (recommended)
- Clone the repository:
git clone https://github.com/your-repo/JPS.git
cd JPS
- Install dependencies:
pip install -r requirements.txt
- Configure paths in the config_batch_process/ directory:
cd config_batch_process
# Edit the YAML files to set correct paths
bash config_process.sh
The JPS method works in two phases: Attack and Inference.
First, optimize steering prompts and adversarial images on AdvBench-subset:
python run_config.py --config_dir './config/advbench_subset' --type attack --gpus 4,5,6,7
Then apply the optimized components to other datasets:
# Run inference on different datasets
python run_config.py --config_dir './config/advbench' --type inference --gpus 4,5,6,7
python run_config.py --config_dir './config/mmsafetybench' --type inference --gpus 4,5,6,7
python run_config.py --config_dir './config/harmbench' --type inference --gpus 4,5,6,7
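The batch calls above differ only in the config directory, so they can be generated from a list. The following is a minimal sketch and not part of the repository: `build_batch_commands` is a hypothetical helper, and it relies only on the `--config_dir`, `--type`, and `--gpus` flags shown above.

```python
import shlex
from typing import List, Sequence


def build_batch_commands(config_dirs: Sequence[str], run_type: str,
                         gpus: Sequence[int]) -> List[List[str]]:
    """Build one run_config.py argument list per config directory."""
    if run_type not in ("attack", "inference"):
        raise ValueError(f"unknown run type: {run_type!r}")
    gpu_arg = ",".join(str(g) for g in gpus)
    return [
        ["python", "run_config.py", "--config_dir", d,
         "--type", run_type, "--gpus", gpu_arg]
        for d in config_dirs
    ]


if __name__ == "__main__":
    dirs = ["./config/advbench", "./config/mmsafetybench", "./config/harmbench"]
    for cmd in build_batch_commands(dirs, "inference", [4, 5, 6, 7]):
        # Print a shell-ready line; to execute instead, pass `cmd` to
        # subprocess.run(cmd, check=True).
        print(shlex.join(cmd))
```

Returning argument lists rather than shell strings avoids quoting problems if a config path contains spaces.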
For single configuration runs:
# Attack phase
python main.py --config config/path/to/config.yaml --run_type attack
# Inference phase
python main.py --config config/path/to/config.yaml --run_type inference
JPS/
├── main.py                        # Main entry point
├── run_config.py                  # Batch configuration runner
├── run_config.sh                  # Example execution script
├── quality_evaluate.py            # MIFR quality evaluation
├── requirements.txt               # Dependencies
├── config/                        # Configuration files
│   ├── advbench/                  # AdvBench dataset configs
│   ├── advbench_subset/           # AdvBench subset configs
│   ├── mmsafetybench/             # MMSafetyBench configs
│   └── harmbench/                 # HarmBench configs
├── config_batch_process/          # Batch processing utilities
│   ├── config_process.sh          # Batch configuration script
│   ├── batch_modify_yaml.py       # YAML modification tool
│   └── *.yaml                     # Template configurations
├── attack/                        # Attack implementations
│   ├── attack_factory.py          # Attack factory
│   ├── base_attack.py             # Base attack class
│   ├── internvl2_8b_attack.py
│   ├── minigpt4_13b_attack.py
│   └── qwen2vl_7b_attack.py
├── model/                         # Model implementations
│   ├── model_factory.py           # Model factory
│   ├── base_model.py              # Base model class
│   ├── internvl2_8b_model.py
│   ├── minigpt4_13b_model.py
│   └── qwen2vl_7b_model.py
├── dataset/                       # Dataset loaders
│   ├── data_loader_factory.py
│   ├── advbench_loader.py
│   ├── mmsafetybench_loader.py
│   └── harmbench_loader.py
├── preprocess/                    # Image preprocessing
│   ├── preprocessor_factory.py
│   └── *_preprocessor.py
├── revision/                      # Multi-agent revision system
│   ├── multi_agent_group.py       # Multi-agent coordinator
│   ├── single_agent_group.py      # Single-agent alternative
│   ├── revision_agent.py          # Core revision agent
│   ├── prompt_feedback_agent.py
│   ├── response_feedback_agent.py
│   └── summarize_agent.py
├── judge/                         # Evaluation judges
│   ├── judge_factory.py
│   ├── gpt4o_mini_judge.py
│   ├── quality_judge.py
│   └── harmbench_judge.py
└── results/                       # Output directory
    ├── images/                    # Adversarial images
    ├── generate/                  # Generated responses
    ├── eval/                      # Evaluation results
    └── qwen2vl_7b/                # Example: Qwen2-VL-7B optimization results
        ├── *_iteration0_.png      # Initial adversarial image
        ├── *_iteration1_.png      # Refined adversarial image
        ├── *_iteration1_.txt      # First steering prompt
        ├── ...                    # Progressive iterations
        ├── *_iteration5_.png      # Final adversarial image
        └── *_iteration5_.txt      # Final steering prompt
JPS provides convenient batch configuration tools:
- Navigate to config directory:
cd config_batch_process
- Modify configuration templates:
  - Edit model_*.yaml for model settings
  - Edit dataset_*.yaml for dataset paths
  - Edit common.yaml for shared settings
- Apply configurations:
bash config_process.sh
This will automatically update all configuration files with your paths and settings.
- attack_config: Visual perturbation settings
  - attack_steps: Number of optimization steps
  - learning_rate: Perturbation learning rate
  - perturbation_constraint: L∞ constraint for perturbations
  - revision_iteration_num: Number of revision iterations
- model: Target model configuration
  - name: Model identifier (internvl2, qwen2vl, minigpt4)
  - model_path: Path to model weights
  - agent_model_path: Path to revision agent model
- inference_config: Generation settings
  - batch_size: Inference batch size
  - max_new_tokens: Maximum generated tokens
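A quick check that a loaded configuration contains the options listed above can catch typos before a long run. This is a minimal sketch under one assumption: that each option is nested directly under its section name, as the list suggests. `missing_keys` is a hypothetical helper, not a function from this repository.

```python
from typing import Any, Dict, List

# Option names are taken from the list above; the nesting is an assumption.
EXPECTED_KEYS: Dict[str, List[str]] = {
    "attack_config": ["attack_steps", "learning_rate",
                      "perturbation_constraint", "revision_iteration_num"],
    "model": ["name", "model_path", "agent_model_path"],
    "inference_config": ["batch_size", "max_new_tokens"],
}


def missing_keys(config: Dict[str, Any]) -> List[str]:
    """Return dotted names of expected options that are absent from `config`."""
    missing: List[str] = []
    for section, keys in EXPECTED_KEYS.items():
        body = config.get(section)
        if not isinstance(body, dict):
            # The whole section is absent or is not a mapping.
            missing.append(section)
            continue
        missing.extend(f"{section}.{k}" for k in keys if k not in body)
    return missing
```

With PyYAML installed, the dictionary can come from `yaml.safe_load(open(path))`; an empty return value means every listed option is present.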
JPS supports multiple perturbation strategies:
- Full: Full-image perturbation
- Patch: Patch-based perturbation
- Grid: Grid-based perturbation
Choose between multi-agent and single-agent revision:
# Multi-agent (default, better performance)
python main.py --config config.yaml --multi_agent 1
# Single-agent (faster, lower memory)
python main.py --config config.yaml --multi_agent 0
The revision process iteratively improves steering prompts through:
- Response Analysis: Analyzing model responses to identify weaknesses
- Prompt Revision: Using specialized agents to improve steering prompts
- Visual Update: Re-optimizing adversarial images with new prompts
- Quality Assessment: Evaluating improvements in response quality
Evaluate response quality using MIFR metrics:
python quality_evaluate.py --input_file results/generate/responses.json
Attack success rate (ASR) is computed automatically during evaluation using the configured judges:
- GPT-4o Mini Judge
- LlamaGuard3 Judge
- HarmBench Judge
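ASR is conventionally the fraction of evaluated responses that a judge marks as successful. The sketch below shows only that aggregation; the verdict format is an assumption (one boolean per response), since the judges' actual output format is not described here, and `attack_success_rate` is a hypothetical name.

```python
from typing import Iterable


def attack_success_rate(verdicts: Iterable[bool]) -> float:
    """Fraction of responses that a judge flagged as successful (True)."""
    verdicts = list(verdicts)
    if not verdicts:
        raise ValueError("no verdicts to aggregate")
    return sum(verdicts) / len(verdicts)
```

Because different judges can disagree on the same response, it is usually clearer to report one ASR per judge than to pool their verdicts.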
Currently supported multimodal models:
- InternVL2-8B: Advanced vision-language model
- Qwen2-VL-7B: Qwen series multimodal model
- MiniGPT4-13B: Compact multimodal model
Supported evaluation datasets:
- AdvBench: Standard jailbreak evaluation benchmark
- AdvBench-Subset: 50-sample optimization subset
- MMSafetyBench: Multimodal safety evaluation
- HarmBench: Comprehensive harm evaluation
JPS demonstrates:
- Superior Quality: Higher MIFR scores compared to existing methods
- Strong Generalization: Effective transfer from 50 training samples
- Efficient Optimization: Faster convergence with multi-agent revision
- Robust Performance: Consistent results across different models and datasets
We provide example results from Qwen2-VL-7B optimization in results/qwen2vl_7b/:
- Adversarial Images: full_con32|255_lr1|255_revision5_iterationX_.png - Visual perturbations for each iteration
- Steering Prompts: full_con32|255_lr1|255_revision5_iterationX_.txt - Textual steering prompts refined through multi-agent revision
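The example file names encode the run settings, which helps when sorting or filtering results. The sketch below parses them with a pattern inferred from the two names above. The field meanings are assumptions: `con` is read as the perturbation constraint, `lr` as the learning rate, and `|` as a stand-in for `/` (so `32|255` would mean 32/255). `parse_result_name` is a hypothetical helper.

```python
import re
from typing import Dict, Optional

# Pattern inferred from the example names; not taken from the repository code.
NAME_RE = re.compile(
    r"^(?P<strategy>[a-z]+)_con(?P<constraint>[^_]+)_lr(?P<lr>[^_]+)"
    r"_revision(?P<revisions>\d+)_iteration(?P<iteration>\d+)_\.(?P<ext>png|txt)$"
)


def parse_result_name(filename: str) -> Optional[Dict[str, str]]:
    """Split a result file name into its fields, or return None if it does not match."""
    match = NAME_RE.match(filename)
    return match.groupdict() if match else None
```

All fields are returned as strings; convert `iteration` with `int()` before sorting so that iteration 10 does not sort before iteration 2.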
The files demonstrate the iterative refinement process:
- iteration0: Initial adversarial image (no steering prompt yet)
- iteration1-5: Progressive improvement of both visual perturbations and textual steering
- Final steering prompts become increasingly sophisticated and effective
These files can be directly used for inference on new questions by:
- Using the final adversarial image (iteration5_.png)
- Prepending the steering prompt (iteration5_.txt) to your target questions
To add a new dataset:
- Create a loader in dataset/your_dataset_loader.py
- Register in dataset/data_loader_factory.py
- Add configuration template in config_batch_process/
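The create-then-register steps above follow a common factory pattern. The sketch below illustrates that pattern in general terms; it is not the contents of `dataset/data_loader_factory.py`, whose real interface may differ. All names (`register_loader`, `get_loader`, `load_your_dataset`) and the one-question-per-line file format are hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical registry; the repository's factory may be organized differently.
_LOADERS: Dict[str, Callable[[str], List[str]]] = {}


def register_loader(name: str):
    """Decorator that records a loader function under a dataset name."""
    def decorator(fn):
        if name in _LOADERS:
            raise ValueError(f"loader {name!r} already registered")
        _LOADERS[name] = fn
        return fn
    return decorator


def get_loader(name: str) -> Callable[[str], List[str]]:
    """Look up a registered loader, listing known names on failure."""
    try:
        return _LOADERS[name]
    except KeyError:
        raise ValueError(
            f"unknown dataset {name!r}; known: {sorted(_LOADERS)}") from None


@register_loader("your_dataset")
def load_your_dataset(path: str) -> List[str]:
    """Read one question per non-empty line of a UTF-8 text file."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]
```

Keeping the dataset name in a single registry means the configuration file can select a loader by string without the calling code importing each loader module by hand.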
To add a new model:
- Implement model class in model/your_model.py
- Create corresponding attack in attack/your_model_attack.py
- Add preprocessor in preprocess/your_model_preprocessor.py
- Register in respective factory files
All experiments are automatically logged with timestamps:
# View logs
ls process_logs/
tail -f process_logs/latest.log
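When several runs have written logs, the most recent file can also be located programmatically. This is a small sketch that assumes log files in `process_logs/` end in `.log`, as `latest.log` above suggests; `newest_log` is a hypothetical helper.

```python
from pathlib import Path
from typing import Optional


def newest_log(log_dir: str = "process_logs") -> Optional[Path]:
    """Return the most recently modified *.log file, or None if there is none."""
    logs = [p for p in Path(log_dir).glob("*.log") if p.is_file()]
    return max(logs, key=lambda p: p.stat().st_mtime, default=None)
```

Selecting by modification time rather than by file name avoids depending on a particular timestamp format in the log names.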
- CUDA Out of Memory: Reduce batch size in configuration
- Model Loading Errors: Check model paths in config files
- Permission Errors: Ensure write permissions for results directory
- Use multiple GPUs with run_config.py
- Adjust batch sizes based on available memory
- Enable mixed precision for faster training
This project is licensed under the MIT License - see the LICENSE file for details.
We welcome contributions! Please feel free to submit a Pull Request.
This tool is intended for research purposes only. Users are responsible for ensuring ethical and legal compliance when using this software. The authors do not condone malicious use of this technology.
Note: This research contributes to understanding multimodal model vulnerabilities and developing more robust AI safety mechanisms. Please use responsibly.