Reasoning benchmarks probe LLMs’ structured thought across multiple domains—mathematics, coding, commonsense, long-context comprehension, formal logic, hierarchical planning, and miscellaneous symbolic tasks.
Math benchmarks span structured problems scaling from elementary-school word problems through Olympiad-level contests; a minimal answer-scoring sketch follows the list.
- Math23K – elementary-school math word problems in Chinese.
- MathQA – large-scale dataset of math word problems.
- ASDiv – diverse English math word problems.
- GSM8K – grade-school math benchmark.
- MathVista – mathematical reasoning grounded in visual contexts (charts, figures, diagrams).
- ARB – college-level algebra.
- MATH – competition math with difficulty tiers.
- OmniMath – international math Olympiad tasks.
- OlympiadBench – high-difficulty contest problems.
- FrontierMath – expert-curated frontier math.
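Most of these math benchmarks are scored by extracting the final answer from the model's free-form solution and comparing it against the gold answer. The sketch below shows that scoring loop, assuming GSM8K's convention that gold solutions end with a `#### <answer>` line; the prompting and model call are left out, and the example prediction is hypothetical.

```python
import re
from typing import List, Optional


def extract_final_answer(solution: str) -> Optional[str]:
    """Pull the final numeric answer from a GSM8K-style solution string.

    GSM8K gold solutions end with a line of the form '#### <answer>'; many
    harnesses prompt the model to end its own output the same way.
    """
    match = re.search(r"####\s*(-?[\d,.]+)", solution)
    return match.group(1).replace(",", "") if match else None


def exact_match_accuracy(predictions: List[str], golds: List[str]) -> float:
    """Fraction of predictions whose extracted answer matches the gold answer."""
    correct = 0
    for pred, gold in zip(predictions, golds):
        gold_answer = extract_final_answer(gold)
        if gold_answer is not None and extract_final_answer(pred) == gold_answer:
            correct += 1
    return correct / max(len(golds), 1)


if __name__ == "__main__":
    # Tiny self-check with a hypothetical model prediction.
    gold = "Natalia sold 48 / 2 = 24 clips in May.\n#### 72"
    pred = "She sold 48 + 24 = 72 clips in total.\n#### 72"
    print(exact_match_accuracy([pred], [gold]))  # 1.0
```

Competition-level sets such as MATH typically need a more forgiving comparison (e.g., normalizing LaTeX expressions) rather than plain string equality.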
Benchmarks for LLM code understanding and generation span many programming languages and cover tasks across the software lifecycle; a pass@k scoring sketch follows the list.
- APPS – Python coding problems collected from competitive-programming sites.
- MathQA-Python – MathQA word problems paired with Python solution programs.
- DS-1000 – 1,000 data-science code problems over common Python libraries, sourced from StackOverflow.
- SWE-Bench – software-engineering tasks that require resolving real GitHub issues.
- BigCodeBench – large-scale code synthesis tests.
- EffiBench – efficiency-focused code tasks.
- MultiPL-E – cross-language code challenges built by translating HumanEval and MBPP into many programming languages.
- CodexEval – multilingual Codex evaluation.
- CodeScope – execution-based, multilingual, multi-task code evaluation.
- McEval – massively multilingual code benchmark covering roughly 40 programming languages.
- LiveCodeBench – continuously updated contest problems, date-stamped to control contamination.
- HumanEval – hand-written Python function-completion problems with unit tests.
- CodeXGLUE – suite of code understanding and generation tasks spanning code-to-code, text-to-code, and code-to-text settings.
- CodeEditorBench – code-editing tasks such as debugging, translating, polishing, and requirement switching.
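Execution-based code benchmarks such as HumanEval report pass@k: the probability that at least one of k sampled completions passes all unit tests. The sketch below reproduces the unbiased, numerically stable estimator introduced with HumanEval; it assumes the tests have already been run and the per-problem pass counts tallied (the counts shown here are made up).

```python
import numpy as np


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval evaluation.

    n: completions sampled for a problem
    c: completions that pass all unit tests
    k: evaluation budget
    Computes 1 - C(n - c, k) / C(n, k) as a stable running product.
    """
    if n - c < k:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))


# Hypothetical per-problem pass counts out of 200 samples each.
pass_counts = [12, 0, 200, 3]
scores = [pass_at_k(n=200, c=c, k=10) for c in pass_counts]
print(round(sum(scores) / len(scores), 4))  # dataset-level pass@10
```

The naive plug-in estimate 1 − (1 − c/n)^k is biased, which is why most harnesses report this unbiased form instead.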