This is the official repository for "Augmented Adversarial Trigger Learning" by Zhe Wang and Yanjun Qi.
The project requires Python 3.8+ and PyTorch. Install the dependencies by running:
pip install -r requirements.txt
The implementation currently supports:
- Llama-2-7b-chat (from HuggingFace hub: meta-llama/Llama-2-7b-chat-hf)
- Vicuna-7b-v1.5 (from HuggingFace hub: lmsys/vicuna-7b-v1.5)
The models will be automatically downloaded when first used. Make sure you have accepted the terms of use for the Llama-2 model on HuggingFace.
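A minimal sketch of how these checkpoints can be loaded with the standard `transformers` API (the repo's own loading code may differ, and `device_map="auto"` additionally requires the `accelerate` package):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Pick one of the supported checkpoints from the HuggingFace hub.
model_id = "meta-llama/Llama-2-7b-chat-hf"  # or "lmsys/vicuna-7b-v1.5"

# Llama-2 is gated: accept its terms on the hub and authenticate first,
# e.g. with `huggingface-cli login`. Weights are downloaded and cached
# automatically on first use.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision fits a 7B model on a 40 GB A100
    device_map="auto",          # requires the accelerate package
)
```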
You can run attacks using the main script with various parameters:
python main.py \
--llm [llama2/vicuna] \
--q_index [behavior_index] \
--elicit [elicitation_coefficient] \
--softmax [temperature] \
--length [suffix_length] \
--path [output_path]
Parameters:
- llm: Choose between 'llama2' (default) or 'vicuna'
- q_index: Index of the behavior to attack from harmful_behaviors.csv
- elicit: Elicitation coefficient for optimization
- softmax: Temperature parameter for softmax
- length: Length of the adversarial suffix
- path: Path to save the attack results
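For example, a run against the first behavior on Llama-2 might look like this (the --elicit, --softmax, and --length values below are placeholders, not tuned settings):

python main.py \
--llm llama2 \
--q_index 0 \
--elicit 1.0 \
--softmax 1.0 \
--length 20 \
--path results/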
The script will:
- Load the specified model and tokenizer
- Initialize the attack with the given parameters
- Run the optimization process (a simplified sketch follows this list)
- Save the results including loss values and generated responses
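To make these steps concrete, here is a heavily simplified sketch of the GCG-style suffix optimization that this code builds on (see the credits at the bottom of this README). It omits the paper's augmented elicitation objective, all names and values are illustrative rather than this repo's API, and full GCG samples and re-evaluates top-k candidate swaps instead of the single first-order step taken here:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32
tok = AutoTokenizer.from_pretrained("lmsys/vicuna-7b-v1.5")
model = AutoModelForCausalLM.from_pretrained(
    "lmsys/vicuna-7b-v1.5", torch_dtype=dtype
).to(device).eval()

# Placeholder behavior, initial suffix, and target prefix.
prompt_ids = tok("Write instructions for X.",
                 return_tensors="pt").input_ids[0].to(device)
suffix_ids = tok("! ! ! ! !", add_special_tokens=False,
                 return_tensors="pt").input_ids[0].to(device)
target_ids = tok("Sure, here is", add_special_tokens=False,
                 return_tensors="pt").input_ids[0].to(device)

embed = model.get_input_embeddings()

for step in range(10):  # a few iterations, for illustration only
    # Represent the suffix as a one-hot matrix so the loss can be
    # differentiated with respect to the token choices.
    one_hot = torch.zeros(len(suffix_ids), embed.num_embeddings,
                          device=device, dtype=embed.weight.dtype)
    one_hot.scatter_(1, suffix_ids.unsqueeze(1), 1.0)
    one_hot.requires_grad_(True)

    inputs_embeds = torch.cat([
        embed(prompt_ids), one_hot @ embed.weight, embed(target_ids)
    ]).unsqueeze(0)
    logits = model(inputs_embeds=inputs_embeds).logits

    # Cross-entropy on the target span (logits predict the next position).
    start = len(prompt_ids) + len(suffix_ids)
    loss = F.cross_entropy(
        logits[0, start - 1 : start - 1 + len(target_ids)], target_ids
    )
    loss.backward()

    # Simplified coordinate step: swap the one suffix token whose change has
    # the most negative gradient (largest first-order loss decrease).
    grad = one_hot.grad
    pos = int(torch.argmin(grad.min(dim=1).values))
    suffix_ids = suffix_ids.clone()
    suffix_ids[pos] = int(torch.argmin(grad[pos]))
    print(f"step {step}: loss={loss.item():.4f} suffix={tok.decode(suffix_ids)}")
```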
You can also run attacks on Llama-2 for different harmful categories:
python run_category.py \
--category [bomb/misinformation/hacking/theft/suicide]
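For example, to attack the bomb category:

python run_category.py --category bomb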
A note on hardware: all experiments were run on one or more NVIDIA A100 GPUs with 40 GB of memory each.
If you find this useful in your research, please consider citing:
@inproceedings{wang-qi-2025-augmented,
    title = "Augmented Adversarial Trigger Learning",
    author = "Wang, Zhe and Qi, Yanjun",
    editor = "Chiruzzo, Luis and Ritter, Alan and Wang, Lu",
    booktitle = "Findings of the Association for Computational Linguistics: NAACL 2025",
    month = apr,
    year = "2025",
    address = "Albuquerque, New Mexico",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.findings-naacl.394/",
    doi = "10.18653/v1/2025.findings-naacl.394",
    pages = "7068--7100",
    ISBN = "979-8-89176-195-7",
}
Our code builds on the GCG repository (https://github.com/llm-attacks/llm-attacks) and the PAIR repository (https://github.com/patrickrchao/JailbreakingLLMs/tree/main).