Skip to content

faiyazabdullah/JailbreakTracer

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

77 Commits
 
 
 
 
 
 
 
 

Repository files navigation

JailbreakTracer: Explainable Detection of Jailbreaking Prompts in LLMs Using Synthetic Data Generation

Accepted in IEEE Access Journal (Q1)

📄 [paper] 📊 [dataset]

Methodology

Methodology

Overview of the JailbreakTracer Methodology. The methodology comprises five major components: (1) data collection from jailbreak attack research papers and prompt labeling; (2) synthetic toxic prompt generation using a fine-tuned GPT model, followed by attack validation via LLMs; (3) data preprocessing; (4) training of a transformer-based classifier with explainability provided via LIME; and (5) performance evaluation.

Codes

Codes for toxic prompt classification can be found in G1 directory.
Codes for forbidden question reasoning can be found in G2 directory.

Model Weights

Download the Fine-Tuned GPT Model to generate synthetic toxic/jailbreaking prompts.
Download the JailBreakBERT Model and the JailBreakRoBERTa Model to classify the prompt whether it is regular or toxic.
Download the Forbidden Question Classifier to understand the reason why certain questions are flagged as inappropriate, sensitive, or restricted based on predefined rules and ethical considerations.

Result Comparison with Existing Works

Method Accuracy ASR
AutoDefense Zeng et al., 2024 92.91% 55.74%
Llama Guard Inan et al., 2023 94.5% 37.32%
LLM Self Defense Phute et al., 2023 77% -
SMOOTHLLM Robey et al., 2023 - 92%
Prompt Adversarial Tuning Mo et al., 2024 - 0.8%
Heuristic-based Chu et al., 2024 - 85.0%
AutoDAN Liu et al., 2023 - 70%
Generation Exploitation Chu et al., 2024 - 68%
DrAttack Li et al., 2024 - 62%
JailbreakTracer (Ours) 97.25% 91.9%

Cite

@ARTICLE{11036671,
  author={Sayeedi, Md. Faiyaz Abdullah and Bin Hossain, Maaz and Hassan, Md. Kamrul and Afrin, Sabrina and Hossain, Molla Md. Sabit and Hossain, Md. Shohrab},
  journal={IEEE Access}, 
  title={JailbreakTracer: Explainable Detection of Jailbreaking Prompts in LLMs Using Synthetic Data Generation}, 
  year={2025},
  volume={13},
  number={},
  pages={123708-123723},
  keywords={Ethics;Cognition;Synthetic data;Natural language processing;Artificial intelligence;Adaptation models;Security;Robustness;Prevention and mitigation;Passwords;Natural language processing;large language models;jailbreaking;text classification;synthetic data;generative AI;explainable AI},
  doi={10.1109/ACCESS.2025.3579996}}

Contact

For any queries, please contact us at msayeedi212049@bscse.uiu.ac.bd

About

Explainable Detection of Jailbreaking Prompts in LLMs Using Synthetic Data Generation

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages