This repository contains the implementation of the methods described in our research paper "RainbowPlus: Enhancing Adversarial Prompt Generation via Evolutionary Quality-Diversity Search". Building upon the foundational insights of Rainbow Teaming and the MAP-Elites algorithm, RainbowPlus introduces key enhancements to the evolutionary quality-diversity (QD) paradigm.
Specifically, RainbowPlus reimagines the archive as a dynamic, multi-individual container that stores diverse high-fitness prompts per cell, analogous to maintaining a population of elite solutions across behavioral niches. This enriched archive enables a broader evolutionary exploration of adversarial strategies.
Furthermore, RainbowPlus employs a comprehensive fitness function that evaluates multiple candidate prompts in parallel using a probabilistic scoring mechanism, replacing traditional pairwise comparisons and enhancing both accuracy and computational efficiency. By integrating these evolutionary principles into its adaptive QD search, RainbowPlus achieves superior attack efficacy and prompt diversity, outperforming both QD-based methods and state-of-the-art red-teaming approaches.
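To make the multi-individual archive concrete, here is a minimal Python sketch of the idea. It is illustrative only, not the implementation in `rainbowplus/archive.py`; the `(category, style)` cell key, the `capacity_per_cell` limit, and the method names are assumptions made for this example.

```python
import random
from collections import defaultdict

class MultiPromptArchive:
    """Illustrative QD archive: each behavioral cell keeps a list of
    high-fitness prompts instead of a single elite (sketch only)."""

    def __init__(self, capacity_per_cell=10):
        self.capacity = capacity_per_cell
        # (category, style) -> list of (fitness, prompt) pairs
        self.cells = defaultdict(list)

    def add(self, category, style, prompt, fitness, fitness_threshold=0.6):
        """Insert a prompt into its cell if it clears the fitness threshold."""
        if fitness < fitness_threshold:
            return False
        cell = self.cells[(category, style)]
        cell.append((fitness, prompt))
        # Keep only the top-`capacity` prompts per cell, sorted by fitness.
        cell.sort(key=lambda pair: pair[0], reverse=True)
        del cell[self.capacity:]
        return True

    def sample_parent(self, rng: random.Random):
        """Uniformly sample a stored prompt to mutate in the next iteration."""
        key = rng.choice(list(self.cells))
        fitness, prompt = rng.choice(self.cells[key])
        return key, prompt
```

The departure from classic MAP-Elites is in `add`: each cell retains up to `capacity_per_cell` high-fitness prompts rather than a single elite, which is what widens the evolutionary search across behavioral niches.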
- 🏆 State-of-the-Art Performance: Achieves superior results compared to existing methods on the HarmBench benchmark
- 🌐 Universal Compatibility: Supports both open-source and closed-source LLMs (OpenAI, vLLM)
- ⚡ Computational Efficiency: Completes a full HarmBench evaluation in just 1.45 hours
- 🛠️ Flexible Configuration: Highly customizable for various experimental settings
RainbowPlus achieves state-of-the-art results, outperforming existing red-teaming methods.
```
├── configs/                 # Configuration files
│   ├── categories/          # Category definitions
│   ├── styles/              # Style definitions
│   ├── base.yml             # Base configuration
│   ├── base-openai.yml      # Configuration to run LLMs from OpenAI
│   ├── base-opensource.yml  # Configuration to run open-source LLMs
│   └── eval.yml             # Evaluation configuration
│
├── data/                    # Dataset storage
│
├── rainbowplus/             # Core package
│   ├── configs/             # Configuration utilities
│   ├── llms/                # LLM integration modules
│   ├── scores/              # Fitness and similarity functions
│   ├── archive.py           # Archive management
│   ├── evaluate.py          # Current evaluation implementation
│   ├── evaluate_v0.py       # Evaluation implementation from the old version
│   ├── get_scores.py        # Metrics extraction utilities
│   ├── prompts.py           # LLM prompt templates
│   ├── rainbowplus.py       # Main implementation
│   └── utils.py             # Utility functions
│
├── sh/                      # Shell scripts
│   └── run.sh               # All-in-one execution script
│
├── README.md                # This documentation
└── setup.py                 # Package installation script
```
Create and activate a Python virtual environment, then install the required dependencies:
```bash
python -m venv venv
source venv/bin/activate
pip install -e .
```
Required for accessing certain resources from the Hugging Face Hub (e.g., Llama Guard):
```bash
export HF_AUTH_TOKEN="YOUR_HF_TOKEN"
```
Alternatively:
```bash
huggingface-cli login --token=YOUR_HF_TOKEN
```
Required when using OpenAI models:
```bash
export OPENAI_API_KEY="YOUR_API_KEY"
```
RainbowPlus supports two primary LLM integration methods:
Example configuration for Qwen-2.5-7B-Instruct:
```yaml
target_llm:
  type_: vllm
  model_kwargs:
    model: Qwen/Qwen2.5-7B-Instruct
    trust_remote_code: True
    max_model_len: 2048
    gpu_memory_utilization: 0.5
  sampling_params:
    temperature: 0.6
    top_p: 0.9
    max_tokens: 1024
```
Additional parameters can be specified according to the vLLM model documentation and sampling parameters documentation.
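As a point of reference, the `model_kwargs` and `sampling_params` blocks map onto vLLM's Python API roughly as follows. This is a hedged sketch of that mapping, not the repository's `rainbowplus/llms` wrapper:

```python
from vllm import LLM, SamplingParams

# model_kwargs are passed to the LLM constructor...
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",
    trust_remote_code=True,
    max_model_len=2048,
    gpu_memory_utilization=0.5,
)

# ...and sampling_params to SamplingParams.
params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=1024)

outputs = llm.generate(["Describe your safety guidelines."], params)
print(outputs[0].outputs[0].text)
```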
Example configuration for GPT-4o-mini:
```yaml
target_llm:
  type_: openai
  model_kwargs:
    model: gpt-4o-mini
  sampling_params:
    temperature: 0.6
    top_p: 0.9
    max_tokens: 1024
```
Additional parameters can be specified according to the OpenAI API documentation.
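For comparison, the same configuration corresponds to an OpenAI chat-completion call along these lines (a sketch only; the repository's OpenAI wrapper may differ in detail):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# model_kwargs and sampling_params translate to chat-completion arguments.
response = client.chat.completions.create(
    model="gpt-4o-mini",
    temperature=0.6,
    top_p=0.9,
    max_tokens=1024,
    messages=[{"role": "user", "content": "Describe your safety guidelines."}],
)
print(response.choices[0].message.content)
```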
Basic execution with default configuration:
```bash
python -m rainbowplus.rainbowplus \
  --config_file configs/base.yml \
  --num_samples 150 \
  --max_iters 400 \
  --sim_threshold 0.6 \
  --num_mutations 10 \
  --fitness_threshold 0.6 \
  --log_dir logs-sota \
  --dataset ./data/harmbench.json \
  --log_interval 50 \
  --shuffle True
```
For customized experiments, you can override target LLM and specific parameters:
```bash
python -m rainbowplus.rainbowplus \
  --config_file configs/base-opensource.yml \
  --num_samples -1 \
  --max_iters 400 \
  --sim_threshold 0.6 \
  --num_mutations 10 \
  --fitness_threshold 0.6 \
  --log_dir logs-sota \
  --dataset ./data/harmbench.json \
  --target_llm "TARGET MODEL" \
  --log_interval 50 \
  --shuffle True
```
| Parameter | Description |
|---|---|
| `target_llm` | Target LLM identifier |
| `num_samples` | Number of initial seed prompts |
| `max_iters` | Maximum number of iteration steps |
| `sim_threshold` | Similarity threshold for prompt mutation |
| `num_mutations` | Number of prompt mutations per iteration |
| `fitness_threshold` | Minimum fitness score to add a prompt to the archive |
| `log_dir` | Directory for storing logs |
| `dataset` | Dataset path |
| `shuffle` | Whether to shuffle seed prompts |
| `log_interval` | Number of iterations between log saves |
For evaluating multiple models sequentially:
MODEL_IDS="meta-llama/Llama-2-7b-chat-hf lmsys/vicuna-7b-v1.5 baichuan-inc/Baichuan2-7B-Chat Qwen/Qwen-7B-Chat"
for MODEL in $MODEL_IDS; do
python -m rainbowplus.rainbowplus \
--config_file configs/base-opensource.yml \
--num_samples -1 \
--max_iters 400 \
--sim_threshold 0.6 \
--num_mutations 10 \
--fitness_threshold 0.6 \
--log_dir logs-sota \
--dataset ./data/harmbench.json \
--target_llm $MODEL \
--log_interval 50 \
--shuffle True
# Clean cache between model runs
rm -r ~/.cache/huggingface/hub/
done
After running experiments, evaluate the results:
MODEL_IDS="meta-llama/Llama-2-7b-chat-hf" # For multiple models: MODEL_IDS="meta-llama/Llama-2-7b-chat-hf lmsys/vicuna-7b-v1.5"
for MODEL in $MODEL_IDS; do
# Run evaluation
python -m rainbowplus.evaluate \
--config configs/eval.yml \
--log_dir "./logs-sota/$MODEL/harmbench"
# Extract metrics
python rainbowplus/get_scores.py \
--log_dir "./logs-sota/$MODEL/harmbench" \
--keyword "global"
done
| Parameter | Description |
|---|---|
| `config` | Path to the evaluation configuration file |
| `log_dir` | Directory containing experiment logs |
| `keyword` | Keyword in the global log file name (default: `global`); this parameter can usually be left at its default |
Results are saved in JSON format:
```json
{
  "General": 0.79,
  "All": 0.8666092943201377
}
```
Where:
- `General`: Metrics calculated following standard evaluation methods
- `All`: Metrics calculated across all generated prompts
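If you want to consume these scores programmatically, a minimal sketch follows; the file path is a hypothetical example, so substitute the JSON file that `get_scores.py` actually writes for your run:

```python
import json

# NOTE: hypothetical path; use the output file from your own run.
with open("logs-sota/meta-llama/Llama-2-7b-chat-hf/harmbench/scores.json") as f:
    scores = json.load(f)

print(f"General: {scores['General']:.3f}")
print(f"All:     {scores['All']:.3f}")
```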
For end-to-end execution, use the provided shell script `sh/run.sh`:

```bash
bash sh/run.sh
```
- Modify common parameters (`log_dir`, `max_iters`, ...) in lines 4-11.
- Modify target LLMs in lines 56-77 for open-source models.
- Modify target LLMs in lines 80-83 for closed-source models.
- Support OpenAI models as the fitness function
- Deploy via FastAPI
- Support more LLMs
```bibtex
@misc{dang2025rainbowplusenhancingadversarialprompt,
  title={RainbowPlus: Enhancing Adversarial Prompt Generation via Evolutionary Quality-Diversity Search},
  author={Quy-Anh Dang and Chris Ngo and Truong-Son Hy},
  year={2025},
  eprint={2504.15047},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2504.15047},
}
```