
JailbreakRadar: Comprehensive Assessment of Jailbreak Attacks (ACL'25)

arXiv: available · Website: online · Dataset: released

This is the official public repository for the paper JailbreakRadar: Comprehensive Assessment of Jailbreak Attacks Against LLMs.
Warning: this repository may contain harmful or offensive responses. Please use it responsibly.

How to use this repository?


Part I: Install and set up the environment

  1. Clone this repository.
  2. Prepare the Python environment:
conda create -n CJA python=3.10
conda activate CJA
cd PATH_TO_THE_REPOSITORY
pip install -r requirements.txt

Part II: Label - use our labeling method to label the responses after jailbreak.

Option 1: label single file

  1. Switch directory:
cd ./scripts_label
  2. Label a single file:
python label.py \
--model_name gpt-4 --test_mode False \
--start_line 0 \
--raw_questions_path "$QUESTIONS" \
--results_path "$file"

$QUESTIONS is the path to the forbidden questions (ideally a .csv file; see ./forbidden_questions/forbidden_questions.csv for an example).
$file is the path to the LLM responses after jailbreak; it should be a .json file, which can be generated by the following code.

import json

answers = []
answers.append({'response': answer})  # 'answer' is the target LLM's response

# Write the collected responses into the output .json file
with open(output_file, 'w') as out_file:
    json.dump(answers, out_file, indent=4)

Here, answer is the response from the target LLM under the jailbreak attack, and output_file is the path to the resulting .json file.
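
For reference, below is a minimal end-to-end sketch of producing such a response file. It is only an illustration: the query_target_llm function is a hypothetical placeholder for your own attack/target-model pipeline, and the 'question' column name in the forbidden-questions CSV is an assumption; adjust both to your setup.

import csv
import json

def query_target_llm(prompt: str) -> str:
    # Hypothetical placeholder: replace with a call to your target LLM under the jailbreak attack.
    return "PLACEHOLDER_RESPONSE"

# Load the forbidden questions (the 'question' column name is an assumption).
with open('./forbidden_questions/forbidden_questions.csv', newline='') as f:
    questions = [row['question'] for row in csv.DictReader(f)]

# Collect one response per question in the format expected by label.py.
answers = []
for question in questions:
    answer = query_target_llm(question)
    answers.append({'response': answer})

with open('responses.json', 'w') as out_file:
    json.dump(answers, out_file, indent=4)

The resulting responses.json can then be passed to label.py via --results_path.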

Option 2: label files in a directory
You can also use label.sh to label all .json response files in a directory:

bash label.sh PATH_TO_RESPONSES_DIRECTORY

The label files will be saved to the same directory as the jailbreak responses. Note that we have omitted harmful content from this repository, such as the few-shot examples in scripts_label/label.py; feel free to use your own examples.
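
When labeling a whole directory, each attack's responses should live in their own .json file inside that directory, in the same format as above. A hedged sketch of preparing such a directory (the directory name, file names, and the per_attack_responses dict are illustrative assumptions):

import json
from pathlib import Path

# Illustrative assumption: one list of raw responses per attack, keyed by attack name.
per_attack_responses = {
    'attack_a': ['response 1', 'response 2'],
    'attack_b': ['response 1', 'response 2'],
}

responses_dir = Path('./my_jailbreak_responses')
responses_dir.mkdir(exist_ok=True)

for attack_name, responses in per_attack_responses.items():
    answers = [{'response': r} for r in responses]
    with open(responses_dir / f'{attack_name}.json', 'w') as out_file:
        json.dump(answers, out_file, indent=4)

You can then run bash label.sh ./my_jailbreak_responses on the prepared directory.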

Optional:
We provide a version that is compatible with the JailbreakBench artifact JSON format. See

./scripts_label/label_jailbreakbench_compatible.py

Part III: Defense - use our defense scripts to detect jailbreak prompts (adversarial prompts, "adv prompts" for short).

  1. Switch directory:
cd ./scripts_defense
  2. Execute the defense:
bash ./defense_execute.sh DEFENSE_METHOD PATH_TO_YOUR_ADV_PROMPTS_FOLDER

Currently, seven defense methods are supported (refer to ./scripts_defense/defense_execute.sh for details).

The adv prompts folder should follow this structure:

examples_jailbreak_prompts
└─ adv_basic.json

The .json file can be generated by the following code:

import json

adv_prompts = [prompt_1, prompt_2, ...]  # a list of adv prompts (strings)
json_file = OUTPUT_PATH
with open(json_file, 'w') as outfile:
    json.dump(adv_prompts, outfile, indent=4)

Refer to folder ./example_adv_prompts for an example.
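
Putting the pieces together, here is a small sketch that writes adv prompts into a folder matching the structure above before running the defense script (the folder and file names below are just an example, not a requirement):

import json
from pathlib import Path

# Your collected adversarial (jailbreak) prompts.
adv_prompts = [
    'Example adv prompt 1',
    'Example adv prompt 2',
]

prompts_dir = Path('./examples_jailbreak_prompts')
prompts_dir.mkdir(exist_ok=True)

with open(prompts_dir / 'adv_basic.json', 'w') as outfile:
    json.dump(adv_prompts, outfile, indent=4)

Afterwards, run bash ./defense_execute.sh DEFENSE_METHOD ./examples_jailbreak_prompts.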

Add new results to the leaderboard

We welcome submissions of your own jailbreak attack evaluation results (steps = 50).
The submission instructions are available here.
The leaderboard is available here.

TO DO

  • Check the env file requirements.txt.
  • Test the guide in the README.md.
  • Clean the code/comments.
