
JailbreakRadar: Comprehensive Assessment of Jailbreak Attacks (ACL'25)

arXiv: available · Website: online · Dataset: released

This is the official public repository for the paper JailbreakRadar: Comprehensive Assessment of Jailbreak Attacks Against LLMs.
Warning: this repository may contain harmful or offensive responses. Please use it responsibly.

How to use this repository?


Part I: Install and set up the environment

  1. Clone this repository.
  2. Prepare the Python environment:
conda create -n CJA python=3.10
conda activate CJA
cd PATH_TO_THE_REPOSITORY
pip install -r requirements.txt

Part II: Label - use our labeling method to label the responses after jailbreak.

Option 1: label single file

  1. Switch directory:
cd ./scripts_label
  2. Label a single file:
python label.py \
--model_name gpt-4 --test_mode False \
--start_line 0 \
--raw_questions_path "$QUESTIONS" \
--results_path "$file"

$QUESTIONS is the path to the forbidden questions (ideally a .csv file; see ./forbidden_questions/forbidden_questions.csv for an example).
$file is the path to the LLM responses after jailbreak; it should be a .json file, which can be generated by the following code.

import json

answers = []
answers.append({'response': answer})  # 'answer' is the target LLM's response

# Write the collected responses into the output .json file
with open(output_file, 'w') as out_file:
    json.dump(answers, out_file, indent=4)

Here, answer is the response from the target LLM under the jailbreak attack, and output_file is the path to the resulting .json file.
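
For reference, below is a minimal end-to-end sketch of producing such a response file. It is only an illustration: the query_target_llm function is a hypothetical placeholder for your own attack/target-model pipeline, and the 'question' column name in the forbidden-questions CSV is an assumption; adjust both to your setup.

import csv
import json

def query_target_llm(prompt: str) -> str:
    # Hypothetical placeholder: replace with a call to your target LLM under the jailbreak attack.
    return "PLACEHOLDER_RESPONSE"

# Load the forbidden questions (the 'question' column name is an assumption).
with open('./forbidden_questions/forbidden_questions.csv', newline='') as f:
    questions = [row['question'] for row in csv.DictReader(f)]

# Collect one response per question in the format expected by label.py.
answers = []
for question in questions:
    answer = query_target_llm(question)
    answers.append({'response': answer})

with open('responses.json', 'w') as out_file:
    json.dump(answers, out_file, indent=4)

The resulting responses.json can then be passed to label.py via --results_path.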

Option 2: label files in a directory
You can also use label.sh to label all .json response files in a directory:

bash label.sh PATH_TO_RESPONSES_DIRECTORY

The label files will be saved to the same directory as the jailbreak responses. Note that we have omitted harmful content from this repository, such as the few-shot examples in scripts_label/label.py; feel free to use your own examples.
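
When labeling a whole directory, each attack's responses should live in their own .json file inside that directory, in the same format as above. A hedged sketch of preparing such a directory (the directory name, file names, and the per_attack_responses dict are illustrative assumptions):

import json
from pathlib import Path

# Illustrative assumption: one list of raw responses per attack, keyed by attack name.
per_attack_responses = {
    'attack_a': ['response 1', 'response 2'],
    'attack_b': ['response 1', 'response 2'],
}

responses_dir = Path('./my_jailbreak_responses')
responses_dir.mkdir(exist_ok=True)

for attack_name, responses in per_attack_responses.items():
    answers = [{'response': r} for r in responses]
    with open(responses_dir / f'{attack_name}.json', 'w') as out_file:
        json.dump(answers, out_file, indent=4)

You can then run bash label.sh ./my_jailbreak_responses on the prepared directory.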

Optional:
We provide a version that is compatible with the JailbreakBench artifact JSON format. See

./scripts_label/label_jailbreakbench_compatible.py

Part III: Defense - use our defense scripts to detect jailbreak prompts (adversarial prompts, "adv prompts" for short).

  1. Switch directory:
cd ./scripts_defense
  2. Execute the defense:
bash ./defense_execute.sh DEFENSE_METHOD PATH_TO_YOUR_ADV_PROMPTS_FOLDER

Currently, seven defense methods are supported (refer to ./scripts_defense/defense_execute.sh for details).

The adv prompts folder should follow this structure:

examples_jailbreak_prompts
└─ adv_basic.json

The .json file can be generated by the following code:

import json

adv_prompts = [prompt_1, prompt_2, ...]  # a list of adv prompts (strings)
json_file = OUTPUT_PATH
with open(json_file, 'w') as outfile:
    json.dump(adv_prompts, outfile, indent=4)

Refer to folder ./example_adv_prompts for an example.
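
Putting the pieces together, here is a small sketch that writes adv prompts into a folder matching the structure above before running the defense script (the folder and file names below are just an example, not a requirement):

import json
from pathlib import Path

# Your collected adversarial (jailbreak) prompts.
adv_prompts = [
    'Example adv prompt 1',
    'Example adv prompt 2',
]

prompts_dir = Path('./examples_jailbreak_prompts')
prompts_dir.mkdir(exist_ok=True)

with open(prompts_dir / 'adv_basic.json', 'w') as outfile:
    json.dump(adv_prompts, outfile, indent=4)

Afterwards, run bash ./defense_execute.sh DEFENSE_METHOD ./examples_jailbreak_prompts.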

Add new results to the leaderboard

We welcome submissions of your own jailbreak attack evaluation results (steps = 50).
The submission instructions are available here.
The leaderboard is available here.

TO DO

  • Check the env file requirements.txt.
  • Test the guide in the README.md.
  • Clean the code/comments.
