Reasoning benchmarks probe LLMs’ structured thought across multiple domains—mathematics, coding, commonsense, long-context comprehension, formal logic, hierarchical planning, and miscellaneous symbolic tasks.
Math benchmarks span structured problems scaling from elementary-school word problems through Olympiad-level contests; a minimal answer-scoring sketch follows the list.
- Math23K – elementary-school math word problems in Chinese.
- MathQA – large-scale dataset of math word problems.
- ASDiv – diverse English math word problems.
- GSM8K – grade-school math benchmark.
- MathVista – mathematical reasoning grounded in visual contexts (charts, figures, diagrams).
- ARB – college-level algebra.
- MATH – competition math with difficulty tiers.
- OmniMath – international math Olympiad tasks.
- OlympiadBench – high-difficulty contest problems.
- FrontierMath – expert-curated frontier math.
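Most of these math benchmarks are scored by extracting the final answer from the model's free-form solution and comparing it against the gold answer. The sketch below shows that scoring loop, assuming GSM8K's convention that gold solutions end with a `#### <answer>` line; the prompting and model call are left out, and the example prediction is hypothetical.

```python
import re
from typing import List, Optional


def extract_final_answer(solution: str) -> Optional[str]:
    """Pull the final numeric answer from a GSM8K-style solution string.

    GSM8K gold solutions end with a line of the form '#### <answer>'; many
    harnesses prompt the model to end its own output the same way.
    """
    match = re.search(r"####\s*(-?[\d,.]+)", solution)
    return match.group(1).replace(",", "") if match else None


def exact_match_accuracy(predictions: List[str], golds: List[str]) -> float:
    """Fraction of predictions whose extracted answer matches the gold answer."""
    correct = 0
    for pred, gold in zip(predictions, golds):
        gold_answer = extract_final_answer(gold)
        if gold_answer is not None and extract_final_answer(pred) == gold_answer:
            correct += 1
    return correct / max(len(golds), 1)


if __name__ == "__main__":
    # Tiny self-check with a hypothetical model prediction.
    gold = "Natalia sold 48 / 2 = 24 clips in May.\n#### 72"
    pred = "She sold 48 + 24 = 72 clips in total.\n#### 72"
    print(exact_match_accuracy([pred], [gold]))  # 1.0
```

Competition-level sets such as MATH typically need a more forgiving comparison (e.g., normalizing LaTeX expressions) rather than plain string equality.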
Benchmarks for LLM code understanding and generation span many programming languages and cover tasks across the software lifecycle; a pass@k scoring sketch follows the list.
- APPS – Python coding problems collected from competitive-programming sites.
- MathQA-Python – MathQA word problems paired with Python solution programs.
- DS-1000 – 1,000 data-science code problems over common Python libraries, sourced from StackOverflow.
- SWE-Bench – software-engineering tasks that require resolving real GitHub issues.
- BigCodeBench – large-scale code synthesis tests.
- EffiBench – efficiency-focused code tasks.
- MultiPL-E – cross-language code challenges built by translating HumanEval and MBPP into many programming languages.
- CodexEval – multilingual Codex evaluation.
- CodeScope – execution-based, multilingual, multi-task code evaluation.
- McEval – massively multilingual code benchmark covering roughly 40 programming languages.
- LiveCodeBench – continuously updated contest problems, date-stamped to control contamination.
- HumanEval – hand-written Python function-completion problems with unit tests.
- CodeXGLUE – suite of code understanding and generation tasks spanning code-to-code, text-to-code, and code-to-text settings.
- CodeEditorBench – code-editing tasks such as debugging, translating, polishing, and requirement switching.
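Execution-based code benchmarks such as HumanEval report pass@k: the probability that at least one of k sampled completions passes all unit tests. The sketch below reproduces the unbiased, numerically stable estimator introduced with HumanEval; it assumes the tests have already been run and the per-problem pass counts tallied (the counts shown here are made up).

```python
import numpy as np


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval evaluation.

    n: completions sampled for a problem
    c: completions that pass all unit tests
    k: evaluation budget
    Computes 1 - C(n - c, k) / C(n, k) as a stable running product.
    """
    if n - c < k:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))


# Hypothetical per-problem pass counts out of 200 samples each.
pass_counts = [12, 0, 200, 3]
scores = [pass_at_k(n=200, c=c, k=10) for c in pass_counts]
print(round(sum(scores) / len(scores), 4))  # dataset-level pass@10
```

The naive plug-in estimate 1 − (1 − c/n)^k is biased, which is why most harnesses report this unbiased form instead.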