
JPS: Jailbreak Multimodal Large Language Models with Collaborative Visual Perturbation and Textual Steering

Accepted at ACM MM 2025.

This repository contains the official implementation of JPS, a jailbreak method that combines collaborative visual perturbation and textual steering against multimodal large language models (MLLMs). Unlike traditional jailbreak methods that focus only on bypassing safety mechanisms, JPS also optimizes for response quality and adherence to user intent, producing detailed, on-topic outputs.

πŸš€ Key Features

  • Collaborative Attack Strategy: Combines visual perturbation with textual steering for enhanced effectiveness
  • Quality-Focused Approach: Emphasizes response quality and relevance rather than simple bypassing
  • Strong Generalization: Optimized on only 50 samples from AdvBench-subset, yet transfers effectively to other datasets
  • Multi-Agent Framework: Utilizes revision agents to iteratively improve steering prompts
  • Comprehensive Evaluation: Includes the MIFR (Malicious Intent Fulfillment Rate) quality metric
  • Batch Processing Support: Efficient configuration management and parallel execution

πŸ”„ Method Overview

[Figure] Overview of the JPS framework: visual perturbation optimization combined with multi-agent textual steering to elicit high-quality responses from multimodal large language models.


πŸ› οΈ Installation

Requirements

  • Python 3.8+
  • CUDA-compatible GPU
  • At least 24 GB of GPU memory recommended

Setup

  1. Clone the repository:
git clone https://github.com/thu-coai/JPS.git
cd JPS
  2. Install dependencies:
pip install -r requirements.txt
  3. Configure paths in the config_batch_process/ directory:
cd config_batch_process
# Edit the YAML files to set correct paths
bash config_process.sh

πŸš€ Quick Start

The JPS method works in two phases: Attack and Inference.

Phase 1: Attack (Optimization)

First, optimize steering prompts and adversarial images on AdvBench-subset:

python run_config.py --config_dir './config/advbench_subset' --type attack --gpus 4,5,6,7

Phase 2: Inference (Transfer)

Then apply the optimized components to other datasets:

# Run inference on different datasets
python run_config.py --config_dir './config/advbench' --type inference --gpus 4,5,6,7
python run_config.py --config_dir './config/mmsafetybench' --type inference --gpus 4,5,6,7
python run_config.py --config_dir './config/harmbench' --type inference --gpus 4,5,6,7

Alternative: Single Configuration

For single configuration runs:

# Attack phase
python main.py --config config/path/to/config.yaml --run_type attack

# Inference phase  
python main.py --config config/path/to/config.yaml --run_type inference

πŸ“ Project Structure

JPS/
β”œβ”€β”€ main.py                     # Main entry point
β”œβ”€β”€ run_config.py              # Batch configuration runner
β”œβ”€β”€ run_config.sh              # Example execution script
β”œβ”€β”€ quality_evaluate.py        # MIFR quality evaluation
β”œβ”€β”€ requirements.txt           # Dependencies
β”œβ”€β”€ config/                    # Configuration files
β”‚   β”œβ”€β”€ advbench/             # AdvBench dataset configs
β”‚   β”œβ”€β”€ advbench_subset/      # AdvBench subset configs
β”‚   β”œβ”€β”€ mmsafetybench/        # MMSafetyBench configs
β”‚   └── harmbench/            # HarmBench configs
β”œβ”€β”€ config_batch_process/      # Batch processing utilities
β”‚   β”œβ”€β”€ config_process.sh     # Batch configuration script
β”‚   β”œβ”€β”€ batch_modify_yaml.py  # YAML modification tool
β”‚   └── *.yaml                # Template configurations
β”œβ”€β”€ attack/                    # Attack implementations
β”‚   β”œβ”€β”€ attack_factory.py     # Attack factory
β”‚   β”œβ”€β”€ base_attack.py        # Base attack class
β”‚   β”œβ”€β”€ internvl2_8b_attack.py
β”‚   β”œβ”€β”€ minigpt4_13b_attack.py
β”‚   └── qwen2vl_7b_attack.py
β”œβ”€β”€ model/                     # Model implementations
β”‚   β”œβ”€β”€ model_factory.py      # Model factory
β”‚   β”œβ”€β”€ base_model.py         # Base model class
β”‚   β”œβ”€β”€ internvl2_8b_model.py
β”‚   β”œβ”€β”€ minigpt4_13b_model.py
β”‚   └── qwen2vl_7b_model.py
β”œβ”€β”€ dataset/                   # Dataset loaders
β”‚   β”œβ”€β”€ data_loader_factory.py
β”‚   β”œβ”€β”€ advbench_loader.py
β”‚   β”œβ”€β”€ mmsafetybench_loader.py
β”‚   └── harmbench_loader.py
β”œβ”€β”€ preprocess/                # Image preprocessing
β”‚   β”œβ”€β”€ preprocessor_factory.py
β”‚   └── *_preprocessor.py
β”œβ”€β”€ revision/                  # Multi-agent revision system
β”‚   β”œβ”€β”€ multi_agent_group.py  # Multi-agent coordinator
β”‚   β”œβ”€β”€ single_agent_group.py # Single-agent alternative
β”‚   β”œβ”€β”€ revision_agent.py     # Core revision agent
β”‚   β”œβ”€β”€ prompt_feedback_agent.py
β”‚   β”œβ”€β”€ response_feedback_agent.py
β”‚   └── summarize_agent.py
β”œβ”€β”€ judge/                     # Evaluation judges
β”‚   β”œβ”€β”€ judge_factory.py
β”‚   β”œβ”€β”€ gpt4o_mini_judge.py
β”‚   β”œβ”€β”€ quality_judge.py
β”‚   └── harmbench_judge.py
└── results/                   # Output directory
    β”œβ”€β”€ images/               # Adversarial images
    β”œβ”€β”€ generate/             # Generated responses
    β”œβ”€β”€ eval/                 # Evaluation results
    └── qwen2vl_7b/          # Example: Qwen2-VL-7B optimization results
        β”œβ”€β”€ *_iteration0_.png    # Initial adversarial image
        β”œβ”€β”€ *_iteration1_.png    # Refined adversarial image
        β”œβ”€β”€ *_iteration1_.txt    # First steering prompt
        β”œβ”€β”€ ...                  # Progressive iterations
        β”œβ”€β”€ *_iteration5_.png    # Final adversarial image
        └── *_iteration5_.txt    # Final steering prompt

βš™οΈ Configuration

Batch Configuration Management

JPS provides convenient batch configuration tools:

  1. Navigate to the config directory:
cd config_batch_process
  2. Modify configuration templates:

    • Edit model_*.yaml for model settings
    • Edit dataset_*.yaml for dataset paths
    • Edit common.yaml for shared settings
  3. Apply configurations:

bash config_process.sh

This will automatically update all configuration files with your paths and settings.

Key Configuration Parameters

  • attack_config: Visual perturbation settings

    • attack_steps: Number of optimization steps
    • learning_rate: Perturbation learning rate
    • perturbation_constraint: L∞ constraint for perturbations
    • revision_iteration_num: Number of revision iterations
  • model: Target model configuration

    • name: Model identifier (internvl2, qwen2vl, minigpt4)
    • model_path: Path to model weights
    • agent_model_path: Path to revision agent model
  • inference_config: Generation settings

    • batch_size: Inference batch size
    • max_new_tokens: Maximum generated tokens
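
For orientation, here is a minimal sketch that loads a hypothetical configuration mirroring the parameter names above. All values are illustrative, not the repo's defaults; the 32/255 constraint and 1/255 learning rate are simply taken from the example result filenames (con32|255, lr1|255). Requires PyYAML.

import yaml

# Hypothetical config; field names follow the parameter list above,
# values are placeholders for illustration only.
cfg_text = """
attack_config:
  attack_steps: 1000
  learning_rate: 0.0039            # 1/255
  perturbation_constraint: 0.1255  # 32/255 (L-inf)
  revision_iteration_num: 5
model:
  name: qwen2vl
  model_path: /path/to/Qwen2-VL-7B-Instruct
  agent_model_path: /path/to/revision-agent-model
inference_config:
  batch_size: 8
  max_new_tokens: 512
"""
cfg = yaml.safe_load(cfg_text)
print(cfg["attack_config"]["perturbation_constraint"])  # 0.1255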

πŸ“– Usage

Attack Types

JPS supports multiple perturbation strategies (illustrated in the sketch after this list):

  • Full: Full-image perturbation
  • Patch: Patch-based perturbation
  • Grid: Grid-based perturbation
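
Conceptually, the three strategies differ only in which pixels the adversarial perturbation may modify. A minimal sketch of the idea, assuming mask-based region selection (shapes and sizes are illustrative, not the repo's exact implementation):

import torch

def perturbation_mask(mode, h, w, patch=64, stride=4):
    # 1 where the adversarial delta may act, 0 elsewhere.
    mask = torch.zeros(1, 1, h, w)
    if mode == "full":
        mask[...] = 1.0                      # perturb every pixel
    elif mode == "patch":
        mask[..., :patch, :patch] = 1.0      # one contiguous patch
    elif mode == "grid":
        mask[..., ::stride, ::stride] = 1.0  # sparse regular grid
    return mask

# The update is then restricted to the masked region, e.g.:
# adv_image = clean_image + mask * delta.clamp(-eps, eps)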

Multi-Agent vs Single-Agent

Choose between multi-agent and single-agent revision:

# Multi-agent (default, better performance)
python main.py --config config.yaml --multi_agent 1

# Single-agent (faster, lower memory)
python main.py --config config.yaml --multi_agent 0

Revision Process

The revision process iteratively improves steering prompts through four steps (sketched in code after this list):

  1. Response Analysis: Analyzing model responses to identify weaknesses
  2. Prompt Revision: Using specialized agents to improve steering prompts
  3. Visual Update: Re-optimizing adversarial images with new prompts
  4. Quality Assessment: Evaluating improvements in response quality
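
A schematic of this loop, where every callable is a stub standing in for the corresponding repo component (the real logic lives in attack/, revision/, and judge/):

def revision_loop(questions, image, prompt, model, revise, perturb, judge, iterations=5):
    scores = []
    for _ in range(iterations):
        # 1. Response analysis: query the target model with the current attack
        responses = [model(prompt + q, image) for q in questions]
        # 2. Prompt revision: feedback agents rewrite the steering prompt
        prompt = revise(prompt, questions, responses)
        # 3. Visual update: re-optimize the adversarial image for the new prompt
        image = perturb(image, prompt, questions)
        # 4. Quality assessment: track response quality across iterations
        scores.append(judge(responses))
    return image, prompt, scores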

πŸ“Š Evaluation

Quality Evaluation

Evaluate response quality with the MIFR (Malicious Intent Fulfillment Rate) metric:

python quality_evaluate.py --input_file results/generate/responses.json

Attack Success Rate (ASR)

ASR is automatically computed during evaluation using configured judges:

  • GPT-4o Mini Judge
  • LlamaGuard3 Judge
  • HarmBench Judge

πŸ€– Supported Models

Currently supported multimodal models:

  • InternVL2-8B: Advanced vision-language model
  • Qwen2-VL-7B: Qwen series multimodal model
  • MiniGPT4-13B: Vicuna-13B-based multimodal model

πŸ“š Datasets

Supported evaluation datasets:

  • AdvBench: Standard jailbreak evaluation benchmark
  • AdvBench-Subset: 50-sample optimization subset
  • MMSafetyBench: Multimodal safety evaluation
  • HarmBench: Comprehensive harm evaluation

πŸ“ˆ Results

JPS demonstrates:

  • Superior Quality: Higher MIFR scores compared to existing methods
  • Strong Generalization: Effective transfer from 50 training samples
  • Efficient Optimization: Faster convergence with multi-agent revision
  • Robust Performance: Consistent results across different models and datasets

Example Output Files

We provide example results from Qwen2-VL-7B optimization in results/qwen2vl_7b/:

  • Adversarial Images: full_con32|255_lr1|255_revision5_iterationX_.png - Visual perturbations for each iteration
  • Steering Prompts: full_con32|255_lr1|255_revision5_iterationX_.txt - Textual steering prompts refined through multi-agent revision

The files demonstrate the iterative refinement process:

  • iteration0: initial adversarial image (no steering prompt yet)
  • iteration1–5: progressive refinement of both the visual perturbation and the textual steering; the steering prompts become increasingly specific and effective

These files can be used directly for inference on new questions (see the code sketch after these steps) by:

  1. Using the final adversarial image (iteration5_.png)
  2. Prepending the steering prompt (iteration5_.txt) to your target questions
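
A minimal sketch of these two steps, using the example paths above (the final generate call depends on the target model wrapper's actual API, so it is left as a comment):

from pathlib import Path

# Reuse the final optimized artifacts on a new question.
steer = Path("results/qwen2vl_7b/full_con32|255_lr1|255_revision5_iteration5_.txt").read_text()
adv_image = "results/qwen2vl_7b/full_con32|255_lr1|255_revision5_iteration5_.png"

question = "YOUR TARGET QUESTION"
prompt = steer.strip() + "\n" + question  # steering prompt is prepended

# model = ...                              # load via model/model_factory.py
# response = model.generate(prompt, adv_image)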

πŸ”§ Advanced Usage

Custom Datasets

To add a new dataset (a loader skeleton follows these steps):

  1. Create a loader in dataset/your_dataset_loader.py
  2. Register in dataset/data_loader_factory.py
  3. Add configuration template in config_batch_process/
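
A hypothetical loader skeleton; the actual base-class interface should be copied from an existing loader such as dataset/advbench_loader.py, and the JSON field name here is illustrative:

import json

class YourDatasetLoader:
    """Skeleton only; mirror the interface of the existing loaders."""

    def __init__(self, data_path):
        self.data_path = data_path

    def load(self):
        # Return the list of prompts (plus images for multimodal datasets).
        with open(self.data_path) as f:
            return [item["question"] for item in json.load(f)]  # "question" is illustrative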

Custom Models

To add a new model (a registration sketch follows these steps):

  1. Implement model class in model/your_model.py
  2. Create corresponding attack in attack/your_model_attack.py
  3. Add preprocessor in preprocess/your_model_preprocessor.py
  4. Register in respective factory files
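
The factory files suggest a simple name-to-class mapping; a generic sketch of that registration pattern (all names here are illustrative, follow the existing factories for the real interface):

MODEL_REGISTRY = {}

def register_model(name):
    # Illustrative decorator-based registry, not the repo's actual factory code.
    def wrap(cls):
        MODEL_REGISTRY[name] = cls
        return cls
    return wrap

@register_model("your_model")
class YourModel:
    def __init__(self, model_path):
        self.model_path = model_path

def build_model(name, **kwargs):
    return MODEL_REGISTRY[name](**kwargs)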

Logging and Monitoring

All experiments are automatically logged with timestamps:

# View logs
ls process_logs/
tail -f process_logs/latest.log

πŸ› Troubleshooting

Common Issues

  1. CUDA Out of Memory: Reduce batch size in configuration
  2. Model Loading Errors: Check model paths in config files
  3. Permission Errors: Ensure write permissions for results directory

Performance Optimization

  • Use multiple GPUs with run_config.py
  • Adjust batch sizes based on available memory
  • Enable mixed precision for faster training
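
Mixed precision here refers to the standard PyTorch autocast/GradScaler pattern rather than a JPS-specific switch; a minimal, self-contained sketch with a stand-in loss:

import torch

delta = torch.zeros(1, 3, 448, 448, device="cuda", requires_grad=True)
opt = torch.optim.SGD([delta], lr=1 / 255)
scaler = torch.cuda.amp.GradScaler()

with torch.autocast(device_type="cuda", dtype=torch.float16):
    loss = delta.pow(2).mean()  # stand-in for the actual attack loss

scaler.scale(loss).backward()   # scaled gradients for numerical stability
scaler.step(opt)
scaler.update()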

πŸ“ License

This project is licensed under the MIT License - see the LICENSE file for details.

🀝 Contributing

We welcome contributions! Please feel free to submit a Pull Request.

⚠️ Disclaimer

This tool is intended for research purposes only. Users are responsible for ensuring ethical and legal compliance when using this software. The authors do not condone malicious use of this technology.


Note: This research contributes to understanding multimodal model vulnerabilities and developing more robust AI safety mechanisms. Please use responsibly.
