Benchmark of core capabilities

Taxonomy

This page organizes LLM knowledge-evaluation benchmarks into four major types (Breadth, Depth, Truthfulness, and Dynamic/Timely) and lists representative datasets for each, along with a selection of recent 2024–2025 works.

Reasoning benchmarks probe LLMs’ structured thought across multiple domains—mathematics, coding, commonsense, long-context comprehension, formal logic, hierarchical planning, and miscellaneous symbolic tasks.

Instruction-following benchmarks have evolved from single-task NLP sets to rich, real-world, and automated evaluations. Early datasets focused on mapping inputs to outputs on held-out tasks; these gave way to instruction-tuning collections and prompt-generalization tests. Modern evaluations incorporate human-written prompts, automated judges, style control, and constraint-based tests. We also highlight recent benchmarks targeting specialized domains, evaluator robustness, and long-context stability.

Safety evaluation benchmarks assess whether LLMs avoid harmful, unethical, or biased outputs, testing across four directions: Content Safety, Multi-Dimensional Trustworthiness, Adversarial Robustness, and Agentic Safety.
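
As a rough illustration, the taxonomy above can be encoded as a simple nested data structure. The sketch below is not part of this repository; the dataset names are well-known public benchmarks used only as illustrative placeholders, not the repository's curated lists.

```python
# Illustrative sketch only: one minimal way to encode the taxonomy described above.
# Dataset names are well-known public benchmarks used as placeholder examples,
# not the repository's actual curated lists.

TAXONOMY = {
    "knowledge": {
        "breadth": ["MMLU"],
        "depth": ["GPQA"],
        "truthfulness": ["TruthfulQA"],
        "dynamic_timely": ["FreshQA"],
    },
    "reasoning": {
        "mathematics": ["GSM8K", "MATH"],
        "coding": ["HumanEval"],
        "commonsense": ["HellaSwag"],
        # further domains: long-context comprehension, formal logic,
        # hierarchical planning, miscellaneous symbolic tasks
    },
    "instruction_following": ["IFEval"],
    "safety": {
        "content_safety": [],
        "multi_dimensional_trustworthiness": [],
        "adversarial_robustness": [],
        "agentic_safety": [],
    },
}


def list_benchmarks(category: str) -> list[str]:
    """Flatten all example benchmarks under a top-level category."""
    entry = TAXONOMY[category]
    if isinstance(entry, dict):
        return [name for names in entry.values() for name in names]
    return list(entry)


if __name__ == "__main__":
    print(list_benchmarks("knowledge"))  # ['MMLU', 'GPQA', 'TruthfulQA', 'FreshQA']
```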

