Safety evaluation benchmarks ensure LLMs avoid harmful, unethical, or biased outputs by testing across four directions: Content Safety, Multi‐Dimensional Trustworthiness, Adversarial Robustness, and Agentic Safety.
Benchmarks that probe non‐adversarial detection or refusal of toxic/hateful/violent text.
- ToxiGen: Adversarially generated toxic statements
- RealToxicityPrompts: Prompt–response pairs from web sources
- ToxicChat: Chat-based toxicity prompts
- BeaverTails: Dialogue safety corpus
- DiaSafety: Real‐world conversation safety
- FairPrism: Gender & sexuality safety tests
- SafetyBench: MC questions in English/Chinese :contentReference[oaicite:0]{index=0}
- WALLEDEVAL: Toolkit covering 35+ safety datasets :contentReference[oaicite:1]{index=1}
Holistic benchmarks covering toxicity, bias, privacy, ethics, fairness, robustness, etc.
- DecodingTrust: Eight‐aspect safety suite
- HELM Safety: Six harm domains with sub‐benchmarks :contentReference[oaicite:2]{index=2}
- AegisSafety: 13 critical + 9 sparse risk categories
- SorryBench: 45 fine‐grained refusal categories
- XSafety: Multilingual, multi‐dimensional tests
- S-Eval: Auto‐generated adaptive safety tests :contentReference[oaicite:3]{index=3}
Attack–defense datasets evaluating model resistance to prompt exploits.
- AdvBench: Single‐turn adversarial suffix attacks
- ForbiddenQuestions: Jailbreak‐style probes
- AART: Actionable adversarial red teams
- AdvPromptSet: Prompt‐based adversarial sets
- AttaQ: Taxonomy‐driven QA attacks
- CPAD: Comprehensive poisoning & data attacks
- ALERT: Fine‐grained risk taxonomy for prompts
- AnthropicRedTeam: Human‐crafted adversarial dialogues
- AutoDan: Evolutionary red‐teaming pipeline
- AutoDan-Turbo: Genetic algorithm‐driven prompt evolution
Benchmarks for LLMs acting as agents in interactive environments.
- Agent-SafetyBench: 349 environments × 2K tests across 8 risks
- R-Judge: Safety risk awareness for LLM agents
- SG-Bench: Multi‐dimensional safety generalization