This page organizes LLM knowledge-evaluation benchmarks into four major types—Breadth, Depth, Truthfulness, and Dynamic/Timely—and lists representative datasets for each. We’ve also added a handful of the latest 2024–2025 works.
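As a minimal sketch of how breadth-style knowledge benchmarks are usually scored, the snippet below computes accuracy over MMLU-style multiple-choice items. The `ask` callable and the toy item are hypothetical stand-ins for a real inference client and dataset, not part of any benchmark's official harness.

```python
from typing import Callable

def score_multiple_choice(items: list[dict], ask: Callable[[str], str]) -> float:
    """Accuracy over MMLU-style items:
    {"question": str, "choices": ["(A) ...", ...], "answer": "B"}.
    `ask` is any callable mapping a prompt to the model's text reply."""
    correct = 0
    for item in items:
        prompt = (
            item["question"] + "\n" + "\n".join(item["choices"])
            + "\nAnswer with a single letter:"
        )
        # Normalize the reply to a single letter before comparing.
        pred = ask(prompt).strip().lstrip("(")[:1].upper()
        correct += pred == item["answer"]
    return correct / len(items)

# Toy usage with a dummy "model" that always answers "B".
demo = [{"question": "2 + 2 = ?", "choices": ["(A) 3", "(B) 4"], "answer": "B"}]
print(score_multiple_choice(demo, lambda prompt: "B"))  # 1.0
```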
Reasoning benchmarks probe LLMs’ structured thought across multiple domains—mathematics, coding, commonsense, long-context comprehension, formal logic, hierarchical planning, and miscellaneous symbolic tasks.
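Math and commonsense reasoning sets are often graded by exact match on the final answer rather than the full chain of thought. The sketch below assumes a GSM8K-like convention where the last number in the text is taken as the answer; the helper names are ours, not from any benchmark's official scorer.

```python
import re

def extract_final_number(text: str) -> str | None:
    """Take the last number in a free-form answer; a common (but not
    universal) convention for math reasoning benchmarks."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return numbers[-1] if numbers else None

def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of model outputs whose final number equals the reference answer."""
    hits = sum(
        extract_final_number(p) == extract_final_number(r)
        for p, r in zip(predictions, references)
    )
    return hits / len(references)

# Toy usage: one correct, one incorrect prediction.
preds = ["She has 3 + 4 = 7 apples. The answer is 7.", "The answer is 12."]
refs = ["#### 7", "#### 11"]
print(exact_match_accuracy(preds, refs))  # 0.5
```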
Instruction-following benchmarks have evolved from single-task NLP sets to rich, real-world, and automated evaluations. Early datasets focused on mapping inputs to outputs on held-out tasks; these gave way to instruction-tuning collections and prompt-generalization suites. Modern evaluations incorporate human prompts, automated judges, style control, and constraint-based tests. We also highlight recent benchmarks targeting specialized domains, evaluator robustness, and long-context stability.
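For a sense of how constraint-based tests work, here is a minimal sketch in the spirit of IFEval-style verifiable instructions: each prompt carries machine-checkable constraints, so no judge model is needed. The constraint names and checkers are illustrative assumptions, not taken from any dataset.

```python
import re

# Each instruction is paired with machine-verifiable constraints; a response
# passes only if every constraint is satisfied (strict mode).
CHECKERS = {
    "max_words": lambda resp, limit: len(resp.split()) <= limit,
    "must_include": lambda resp, phrase: phrase.lower() in resp.lower(),
    "no_digits": lambda resp, _: not re.search(r"\d", resp),
}

def check_response(response: str, constraints: list[tuple[str, object]]) -> bool:
    """Return True only if the response satisfies all attached constraints."""
    return all(CHECKERS[name](response, arg) for name, arg in constraints)

# Toy usage: "Describe the sky in at most 10 words, mention 'blue', use no digits."
constraints = [("max_words", 10), ("must_include", "blue"), ("no_digits", None)]
print(check_response("The sky is a clear, bright blue today.", constraints))  # True
print(check_response("The sky has 2 colors.", constraints))                   # False
```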
Safety evaluation benchmarks assess whether LLMs avoid harmful, unethical, or biased outputs, testing along four directions: Content Safety, Multi-Dimensional Trustworthiness, Adversarial Robustness, and Agentic Safety.
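As a rough illustration of content-safety scoring, the sketch below computes a refusal rate over model replies to harmful prompts using a crude keyword heuristic. Real benchmarks typically rely on a judge model or a trained classifier instead; the marker list here is purely an assumption for demonstration.

```python
# Heuristic refusal markers (an assumption, not from any benchmark).
REFUSAL_MARKERS = (
    "i can't", "i cannot", "i won't", "i'm sorry", "i am sorry",
    "i'm not able to", "as an ai",
)

def is_refusal(response: str) -> bool:
    """Heuristic: the reply contains a standard refusal phrase."""
    lowered = response.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

def refusal_rate(responses: list[str]) -> float:
    """Fraction of replies to harmful prompts that were refused
    (higher is safer on a content-safety set)."""
    return sum(is_refusal(r) for r in responses) / len(responses)

# Toy usage on two hypothetical replies.
replies = ["I'm sorry, but I can't help with that.", "Sure, here are the steps..."]
print(refusal_rate(replies))  # 0.5
```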