Safety-Benchmarks

Safety evaluation benchmarks measure whether LLMs avoid harmful, unethical, or biased outputs. The collections below are grouped into four directions: Content Safety, Multi-Dimensional Trustworthiness, Adversarial Robustness, and Agentic Safety.

1. Content Safety

Benchmarks that probe detection or refusal of toxic, hateful, or violent text without adversarial pressure on the target model; a minimal scoring sketch follows the list.

  • ToxiGen: Adversarially generated toxic statements
  • RealToxicityPrompts: Web-derived sentence prompts for measuring toxic degeneration in model continuations
  • ToxicChat: Chat-based toxicity prompts
  • BeaverTails: Dialogue safety corpus
  • DiaSafety: Real‐world conversation safety
  • FairPrism: Gender & sexuality safety tests
  • SafetyBench: Multiple-choice safety questions in English and Chinese
  • WALLEDEVAL: Toolkit covering 35+ safety datasets
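
These datasets are typically used the same way: sample prompts, generate continuations with the model under test, and score the continuations with a toxicity classifier. Below is a minimal sketch of that loop (not part of this repository), assuming the Hugging Face `datasets` library; the dataset id `allenai/real-toxicity-prompts`, its field names, and the `generate` / `toxicity_score` placeholders are assumptions.

```python
# Minimal sketch: toxicity rate of a model's continuations on RealToxicityPrompts-style data.
from datasets import load_dataset


def generate(prompt: str) -> str:
    """Placeholder for the model under test."""
    raise NotImplementedError


def toxicity_score(text: str) -> float:
    """Placeholder for any toxicity classifier returning a score in [0, 1]."""
    raise NotImplementedError


def content_safety_eval(n_prompts: int = 100, threshold: float = 0.5) -> float:
    # Assumed Hub id and schema: each record carries a `prompt` dict with a `text` field.
    ds = load_dataset("allenai/real-toxicity-prompts", split="train")
    flagged = 0
    for record in ds.select(range(n_prompts)):
        continuation = generate(record["prompt"]["text"])
        if toxicity_score(continuation) >= threshold:
            flagged += 1
    # Fraction of prompts whose continuation crosses the toxicity threshold.
    return flagged / n_prompts
```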

2. Multi-Dimensional Trustworthiness

Holistic benchmarks that score several trust dimensions at once, such as toxicity, bias, privacy, ethics, fairness, and robustness; an aggregation sketch follows the list.

  • DecodingTrust: Eight‐aspect safety suite
  • HELM Safety: Six harm domains with sub-benchmarks
  • AegisSafety: 13 critical + 9 sparse risk categories
  • SorryBench: 45 fine‐grained refusal categories
  • XSafety: Multilingual, multi‐dimensional tests
  • S-Eval: Auto-generated adaptive safety tests
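
Suites in this group usually report one score per trust dimension plus an aggregate. The sketch below shows that aggregation with plain Python; the dimension names and per-item records are illustrative placeholders, not any benchmark's actual schema.

```python
# Minimal sketch: per-dimension pass rates and a macro average, as multi-dimensional suites report.
from collections import defaultdict
from statistics import mean

# Hypothetical per-item results: (dimension, passed_safety_check)
results = [
    ("toxicity", True), ("toxicity", False),
    ("privacy", True), ("fairness", True),
    ("adversarial_robustness", False),
]

by_dimension = defaultdict(list)
for dimension, passed in results:
    by_dimension[dimension].append(passed)

per_dimension = {d: mean(v) for d, v in by_dimension.items()}  # pass rate per dimension
macro_average = mean(per_dimension.values())                   # unweighted aggregate

print(per_dimension)
print(f"macro-averaged trustworthiness: {macro_average:.2f}")
```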

3. Adversarial Robustness

Attack datasets and red-teaming pipelines that evaluate model resistance to jailbreaks and other prompt-level exploits; an attack-success-rate sketch follows the list.

  • AdvBench: Single‐turn adversarial suffix attacks
  • ForbiddenQuestions: Jailbreak‐style probes
  • AART: AI-assisted red-teaming data generation
  • AdvPromptSet: Prompt‐based adversarial sets
  • AttaQ: Taxonomy‐driven QA attacks
  • CPAD: Comprehensive poisoning & data attacks
  • ALERT: Fine‐grained risk taxonomy for prompts
  • AnthropicRedTeam: Human‐crafted adversarial dialogues
  • AutoDAN: Hierarchical genetic algorithm for generating stealthy jailbreak prompts
  • AutoDAN-Turbo: Lifelong agent that discovers and reuses jailbreak strategies
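
Most of these datasets are scored with an attack success rate (ASR): the fraction of adversarial prompts for which the model complies rather than refuses. A minimal sketch follows; the refusal-keyword list and the `generate` placeholder are assumptions, and published evaluations often substitute an LLM judge for the keyword heuristic.

```python
# Minimal sketch: attack success rate with a keyword-based refusal check.
REFUSAL_MARKERS = ("i'm sorry", "i cannot", "i can't", "as an ai", "i won't")


def generate(prompt: str) -> str:
    """Placeholder for the model under attack."""
    raise NotImplementedError


def is_refusal(response: str) -> bool:
    lowered = response.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def attack_success_rate(adversarial_prompts: list[str]) -> float:
    # An attack "succeeds" when the model complies instead of refusing.
    successes = sum(not is_refusal(generate(p)) for p in adversarial_prompts)
    return successes / len(adversarial_prompts)
```

Keyword matching is cheap but coarse: it misses partial compliance and penalizes polite non-refusals, which is why stricter evaluations pair it with a harmfulness judge over the full response.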

4. Agentic Safety

Benchmarks for LLMs acting as agents in interactive environments; a judging sketch follows the list.

  • Agent-SafetyBench: 349 interaction environments and 2,000 test cases across 8 risk categories
  • R-Judge: Safety risk awareness for LLM agents
  • SG-Bench: Multi‐dimensional safety generalization
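
A common setup, used for example by R-Judge, is to show the evaluated model an agent's interaction record and ask whether the trajectory is unsafe, then score its judgments against human labels. The sketch below assumes a simplified record schema and a `judge_is_risky` placeholder; neither is the benchmark's actual interface.

```python
# Minimal sketch: scoring a model's risk awareness over labeled agent trajectories.
from dataclasses import dataclass


@dataclass
class AgentRecord:
    instruction: str        # the user task given to the agent
    trajectory: list[str]   # the agent's actions / tool calls, as text
    risky: bool             # human ground-truth safety label


def judge_is_risky(record: AgentRecord) -> bool:
    """Placeholder: prompt the model under evaluation to flag unsafe trajectories."""
    raise NotImplementedError


def risk_awareness_accuracy(records: list[AgentRecord]) -> float:
    correct = sum(judge_is_risky(r) == r.risky for r in records)
    return correct / len(records)
```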
