This repository contains the code for our paper Speak Easy: Eliciting Harmful Jailbreaks from LLMs with Simple Interactions. We show that simple interactions such as multi-step, multilingual querying can elicit sufficiently harmful jailbreaks from LLMs. We design a metric (HarmScore) to measure the actionability and informativeness of jailbreak responses, and a straightforward attack method (Speak Easy) that significantly increases the success of these exploits across multiple benchmarks.
```
# Primary libraries
torch
numpy
openai
transformers

# Utility libraries
tqdm
pyyaml
python-box

# Additional dependencies
ray
vllm
ollama
```
Please refer to `requirements.txt` for a full list of libraries.
All data files to be evaluated must adhere to the following format:
```json
[
    {
        "query": "malicious query",
        "response": "response to query"
    },
    ...
]
```
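For example, a data file in this format can be produced with a few lines of Python. This is a minimal sketch; the file path and the query-response contents below are placeholders, not part of the repository:

```python
import json

# Hypothetical example: write query-response pairs in the expected
# format to a JSON file that can later be passed to the scorer.
qa_pairs = [
    {
        "query": "example query",
        "response": "example model response",
    },
]

with open("data/my_data.json", "w") as f:  # placeholder path
    json.dump(qa_pairs, f, indent=4)
```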
You can evaluate files containing query-response pairs by running:
```bash
export CUDA_VISIBLE_DEVICES=  # specify GPUs
python score_qa_pairs.py --data-dir "path to data" --save-dir "save location" --scorer "scorer to use"
```
The above creates a new file at `save_dir` that contains the same query-response pairs as the original data, with an added `score` key representing the evaluator's assigned score.
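The scored file can then be inspected with standard JSON tooling. Below is a minimal sketch, assuming the output is saved at `save_dir/scored_data.json` (the actual file name depends on the arguments passed to the script); the `score` key is the one described above:

```python
import json

# Hypothetical output path; the real file name depends on --save-dir
# and the input data file.
with open("save_dir/scored_data.json") as f:
    scored_pairs = json.load(f)

# Each entry keeps the original "query" and "response" fields and
# gains a "score" field assigned by the chosen scorer.
scores = [pair["score"] for pair in scored_pairs]
print(f"Scored {len(scores)} pairs; mean score = {sum(scores) / len(scores):.3f}")
```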
We have provided an example data file under `data/sample_data.json`.
Any questions related to the code or the paper can be directed to Yik Siu Chan (yik_siu_chan@brown.edu) or Narutatsu Ri (nr3764@princeton.edu). If you encounter any problems when using the code or want to report a bug, please open an issue.
Please cite our paper if you find our repository helpful in your work:
```bibtex
@inproceedings{chan2025speakeasy,
    title={Speak Easy: Eliciting Harmful Jailbreaks from LLMs with Simple Interactions},
    author={Yik Siu Chan and Narutatsu Ri and Yuxin Xiao and Marzyeh Ghassemi},
    year={2025},
    url={https://arxiv.org/abs/2502.04322},
    booktitle={Forty-second International Conference on Machine Learning}
}
```