JailbreakTracer: Explainable Detection of Jailbreaking Prompts in LLMs Using Synthetic Data Generation
Accepted for publication in IEEE Access (Q1 journal).
Overview of the JailbreakTracer Methodology. The methodology comprises five major components: (1) data collection from jailbreak attack research papers and prompt labeling; (2) synthetic toxic prompt generation using a fine-tuned GPT model, followed by attack validation via LLMs; (3) data preprocessing; (4) training of a transformer-based classifier with explainability provided via LIME; and (5) performance evaluation.
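As a rough illustration of the preprocessing step (component 3), a typical prompt-normalization pass might look like the sketch below. The exact steps are assumptions for illustration; the paper's preprocessing recipe may differ.

```python
import re

def preprocess(prompt: str) -> str:
    """Normalize a raw prompt before tokenization.

    Illustrative steps only -- the paper's exact preprocessing
    recipe may differ from this sketch.
    """
    text = prompt.lower()                      # case-fold
    text = re.sub(r"https?://\S+", " ", text)  # drop URLs
    text = re.sub(r"\s+", " ", text).strip()   # collapse whitespace
    return text
```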
Code for toxic prompt classification is in the `G1` directory.
Code for forbidden question reasoning is in the `G2` directory.
Download the Fine-Tuned GPT Model to generate synthetic toxic/jailbreaking prompts.
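A minimal sketch of sampling synthetic prompts from the downloaded checkpoint with the Hugging Face `transformers` pipeline. The local directory name `FineTunedGPT/` and the sampling settings are assumptions, not values from the paper; point `model_dir` at wherever you extracted the model.

```python
def dedupe(prompts):
    """Drop exact duplicates (case-insensitive) while keeping generation order."""
    seen, unique = set(), []
    for p in prompts:
        key = p.strip().lower()
        if key and key not in seen:
            seen.add(key)
            unique.append(p.strip())
    return unique

def generate_synthetic(n=10, model_dir="FineTunedGPT/"):
    """Sample n candidate prompts from the fine-tuned GPT checkpoint."""
    # Lazy import: requires `pip install transformers torch` plus the
    # downloaded checkpoint; the helper above works without either.
    from transformers import pipeline
    gen = pipeline("text-generation", model=model_dir)
    outs = gen("", num_return_sequences=n, do_sample=True, max_new_tokens=60)
    return dedupe(o["generated_text"] for o in outs)
```

Deduplication matters here because sampling-based generation frequently repeats high-probability completions.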
Download the JailBreakBERT Model and the JailBreakRoBERTa Model to classify whether a prompt is regular or toxic.
Download the Forbidden Question Classifier to understand why certain questions are flagged as inappropriate, sensitive, or restricted based on predefined rules and ethical considerations.
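To convey the shape of the LIME-style explanations the classifiers produce, here is a toy word-attribution sketch. It uses a stand-in keyword scorer (an assumption, not the paper's fine-tuned model) and leave-one-out scoring; LIME proper fits a local linear model over many random perturbations, but the output format is similar.

```python
# Stand-in cue list for the toy scorer -- NOT from the paper.
TOXIC_CUES = {"ignore", "jailbreak", "bypass", "pretend"}

def toxicity(words):
    """Fraction of words that are known toxic cues (toy classifier)."""
    return sum(w.lower() in TOXIC_CUES for w in words) / max(len(words), 1)

def explain(prompt):
    """Return (word, score drop when removed) pairs, most influential first."""
    words = prompt.split()
    base = toxicity(words)
    drops = [(w, base - toxicity(words[:i] + words[i + 1:]))
             for i, w in enumerate(words)]
    return sorted(drops, key=lambda pair: pair[1], reverse=True)
```

Words whose removal lowers the toxicity score most are the ones "responsible" for the flag, which is the kind of token-level evidence LIME surfaces for the transformer classifier.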
| Method | Accuracy | ASR |
|---|---|---|
| AutoDefense (Zeng et al., 2024) | 92.91% | 55.74% |
| Llama Guard (Inan et al., 2023) | 94.5% | 37.32% |
| LLM Self Defense (Phute et al., 2023) | 77% | - |
| SMOOTHLLM (Robey et al., 2023) | - | 92% |
| Prompt Adversarial Tuning (Mo et al., 2024) | - | 0.8% |
| Heuristic-based (Chu et al., 2024) | - | 85.0% |
| AutoDAN (Liu et al., 2023) | - | 70% |
| Generation Exploitation (Chu et al., 2024) | - | 68% |
| **JailbreakTracer (Ours)** | **97.25%** | **91.9%** |
```bibtex
@ARTICLE{11036671,
  author={Sayeedi, Md. Faiyaz Abdullah and Bin Hossain, Maaz and Hassan, Md. Kamrul and Afrin, Sabrina and Hossain, Molla Md. Sabit and Hossain, Md. Shohrab},
  journal={IEEE Access},
  title={JailbreakTracer: Explainable Detection of Jailbreaking Prompts in LLMs Using Synthetic Data Generation},
  year={2025},
  volume={13},
  pages={123708-123723},
  keywords={Ethics;Cognition;Synthetic data;Natural language processing;Artificial intelligence;Adaptation models;Security;Robustness;Prevention and mitigation;Passwords;Natural language processing;large language models;jailbreaking;text classification;synthetic data;generative AI;explainable AI},
  doi={10.1109/ACCESS.2025.3579996}
}
```
For any queries, please contact us at msayeedi212049@bscse.uiu.ac.bd