JailbreakTracer: Explainable Detection of Jailbreaking Prompts in LLMs Using Synthetic Data Generation
Accepted for publication in IEEE Access (Q1 journal).
Overview of the JailbreakTracer Methodology. The methodology comprises five major components: (1) data collection from jailbreak attack research papers and prompt labeling; (2) synthetic toxic prompt generation using a fine-tuned GPT model, followed by attack validation via LLMs; (3) data preprocessing; (4) training of a transformer-based classifier with explainability provided via LIME; and (5) performance evaluation.
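As a rough illustration of the preprocessing step (component 3), a typical prompt-normalization pass might look like the sketch below. The exact steps are assumptions for illustration; the paper's preprocessing recipe may differ.

```python
import re

def preprocess(prompt: str) -> str:
    """Normalize a raw prompt before tokenization.

    Illustrative steps only -- the paper's exact preprocessing
    recipe may differ from this sketch.
    """
    text = prompt.lower()                      # case-fold
    text = re.sub(r"https?://\S+", " ", text)  # drop URLs
    text = re.sub(r"\s+", " ", text).strip()   # collapse whitespace
    return text
```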
Code for toxic prompt classification is in the `G1` directory.
Code for forbidden question reasoning is in the `G2` directory.
Download the Fine-Tuned GPT Model to generate synthetic toxic/jailbreaking prompts.
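A minimal sketch of sampling synthetic prompts from the downloaded checkpoint with the Hugging Face `transformers` pipeline. The local directory name `FineTunedGPT/` and the sampling settings are assumptions, not values from the paper; point `model_dir` at wherever you extracted the model.

```python
def dedupe(prompts):
    """Drop exact duplicates (case-insensitive) while keeping generation order."""
    seen, unique = set(), []
    for p in prompts:
        key = p.strip().lower()
        if key and key not in seen:
            seen.add(key)
            unique.append(p.strip())
    return unique

def generate_synthetic(n=10, model_dir="FineTunedGPT/"):
    """Sample n candidate prompts from the fine-tuned GPT checkpoint."""
    # Lazy import: requires `pip install transformers torch` plus the
    # downloaded checkpoint; the helper above works without either.
    from transformers import pipeline
    gen = pipeline("text-generation", model=model_dir)
    outs = gen("", num_return_sequences=n, do_sample=True, max_new_tokens=60)
    return dedupe(o["generated_text"] for o in outs)
```

Deduplication matters here because sampling-based generation frequently repeats high-probability completions.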
Download the JailBreakBERT Model and the JailBreakRoBERTa Model to classify whether a prompt is regular or toxic.
Download the Forbidden Question Classifier to understand why certain questions are flagged as inappropriate, sensitive, or restricted based on predefined rules and ethical considerations.
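To convey the shape of the LIME-style explanations the classifiers produce, here is a toy word-attribution sketch. It uses a stand-in keyword scorer (an assumption, not the paper's fine-tuned model) and leave-one-out scoring; LIME proper fits a local linear model over many random perturbations, but the output format is similar.

```python
# Stand-in cue list for the toy scorer -- NOT from the paper.
TOXIC_CUES = {"ignore", "jailbreak", "bypass", "pretend"}

def toxicity(words):
    """Fraction of words that are known toxic cues (toy classifier)."""
    return sum(w.lower() in TOXIC_CUES for w in words) / max(len(words), 1)

def explain(prompt):
    """Return (word, score drop when removed) pairs, most influential first."""
    words = prompt.split()
    base = toxicity(words)
    drops = [(w, base - toxicity(words[:i] + words[i + 1:]))
             for i, w in enumerate(words)]
    return sorted(drops, key=lambda pair: pair[1], reverse=True)
```

Words whose removal lowers the toxicity score most are the ones "responsible" for the flag, which is the kind of token-level evidence LIME surfaces for the transformer classifier.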
| Method | Accuracy | ASR |
|---|---|---|
| AutoDefense (Zeng et al., 2024) | 92.91% | 55.74% |
| Llama Guard (Inan et al., 2023) | 94.5% | 37.32% |
| LLM Self Defense (Phute et al., 2023) | 77% | - |
| SMOOTHLLM (Robey et al., 2023) | - | 92% |
| Prompt Adversarial Tuning (Mo et al., 2024) | - | 0.8% |
| Heuristic-based (Chu et al., 2024) | - | 85.0% |
| AutoDAN (Liu et al., 2023) | - | 70% |
| Generation Exploitation (Chu et al., 2024) | - | 68% |
| **JailbreakTracer (Ours)** | **97.25%** | **91.9%** |
```bibtex
@ARTICLE{11036671,
  author={Sayeedi, Md. Faiyaz Abdullah and Bin Hossain, Maaz and Hassan, Md. Kamrul and Afrin, Sabrina and Hossain, Molla Md. Sabit and Hossain, Md. Shohrab},
  journal={IEEE Access},
  title={JailbreakTracer: Explainable Detection of Jailbreaking Prompts in LLMs Using Synthetic Data Generation},
  year={2025},
  volume={13},
  pages={123708-123723},
  keywords={Ethics;Cognition;Synthetic data;Natural language processing;Artificial intelligence;Adaptation models;Security;Robustness;Prevention and mitigation;Passwords;Natural language processing;large language models;jailbreaking;text classification;synthetic data;generative AI;explainable AI},
  doi={10.1109/ACCESS.2025.3579996}
}
```
For any queries, please contact us at msayeedi212049@bscse.uiu.ac.bd