The code and datasets of our EMNLP 2025 paper "SDGO: Self-Discrimination-Guided Optimization for Consistent Safety in Large Language Models".
This figure illustrates (top) the model’s safety inconsistency, where harmful content is correctly identified yet still successfully bypasses defenses; (middle) our proposed SDGO reinforcement learning framework, which leverages the model’s strong discrimination capabilities to enhance its generation safety without requiring additional annotated data or models, improving safety while maintaining general capabilities; (bottom) the consistency in safety discrimination and generative behaviors exhibited by the LLM after applying SDGO.
You can use the src/revealing_safety_inconsistency/gap_analysis.ipynb script to analyze the safety gaps of any LLMs accessible through API, generating bar charts similar to Figure 1 in our paper, for example:
We use an internal reinforcement learning framework developed by our company for training, so we apologize for not being able to provide the complete training code. However, theoretically, any open-source framework that supports Generative Reward Modeling (GRM) and GRPO can implement SDGO training, such as verl and EasyR1.
We provide SDGO training data under datasets/train, and detailed training parameters are provided in the paper's appendix. What you need to do is simple adaptation operations: including modifying the reward function and model scoring, then you can easily train SDGO. If you encounter any issues during the training process, please feel free to contact dingpeng@smail.nju.edu.cn. We will also reproduce SDGO on open-source frameworks in the future, please stay tuned.
Once you complete SDGO training, you can perform the following evaluations:
- Safety evaluation
- Helpfulness evaluation
- OOD attack evaluation
We use LLaMA-Factory for inference and SFT. Follow these steps to implement all the above evaluations:
-
Install LLaMA-Factory according to the official guide: https://github.com/hiyouga/LLaMA-Factory
-
Put all json files under
data/test/into theLLaMA-Factory/data/directory and register them inLLaMA-Factory/data/dataset_info.json -
Put
sdgo_infer_and_eval.py,sdgo_helpful_eval.py,sdgo_safety_gap.pyfromsrc/andsdgo_run.shfromscripts/into theLLaMA-Factory/directory, then run:bash sdgo_run.sh
You can get all evaluation results and metrics, which will be displayed in the terminal and saved to the corresponding folders in
LLaMA-Factory/.
If you have any questions about our work, please feel free to contact us via the following email:
Peng Ding: dingpeng@smail.nju.edu.cn
Wen Sun: wensun.cs@gmail.com
Dailin Li: ldlbest@mail.dlut.edu.cn
If you find this work useful in your own research, please feel free to leave a star⭐️ and cite our paper:
@article{ding2025sdgo,
title={SDGO: Self-Discrimination-Guided Optimization for Consistent Safety in Large Language Models},
author={Ding, Peng and Sun, Wen and Li, Dailin and Zou, Wei and Wang, Jiaming and Chen, Jiajun and Huang, Shujian},
journal={arXiv preprint arXiv:2508.15648},
year={2025}
}
