ALEX-nlp/Reasoning-Benchmark

Reasoning-Benchmark

Reasoning benchmarks probe LLMs’ structured reasoning across multiple domains: mathematics, coding, commonsense, long-context comprehension, formal logic, hierarchical planning, and miscellaneous symbolic tasks.

1. Mathematics Evaluation

Structured math problems ranging from elementary-school to Olympiad level.

Primary School

  • Math23K – foundational math problems in Chinese.
  • MathQA – large-scale dataset of math word problems.
  • ASDiv – diverse English math word problems.
  • GSM8K – grade-school math benchmark.

High School / University

  • MathVista – mathematical reasoning in visual contexts (figures, charts, diagrams).
  • ARB – advanced reasoning problems in mathematics and other expert domains.
  • MATH – competition math with difficulty tiers.

Olympiad

  • Omni-MATH – Olympiad-level math problems.
  • OlympiadBench – high-difficulty contest problems.
  • FrontierMath – expert-curated frontier math.
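
Because the benchmarks above are grouped by difficulty, results are often easier to read when accuracy is reported per tier. The following is a minimal sketch of that idea, not the scoring code of any listed benchmark: the record keys `tier`, `gold`, and `pred` and the `normalize` helper are hypothetical, and real evaluations usually need more careful answer matching (fractions, units, LaTeX).

```python
from collections import defaultdict


def normalize(answer: str) -> str:
    """Strip whitespace, thousands separators, and a trailing period."""
    return answer.strip().replace(",", "").rstrip(".")


def accuracy_by_tier(records):
    """Compute exact-match accuracy per difficulty tier.

    `records` is an iterable of dicts with the hypothetical keys
    'tier', 'gold' (reference answer), and 'pred' (model answer).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        total[record["tier"]] += 1
        if normalize(record["pred"]) == normalize(record["gold"]):
            correct[record["tier"]] += 1
    return {tier: correct[tier] / total[tier] for tier in total}


# Illustrative data only; not drawn from any benchmark.
records = [
    {"tier": "primary", "gold": "1,200", "pred": "1200"},
    {"tier": "primary", "gold": "7", "pred": "8"},
    {"tier": "olympiad", "gold": "42", "pred": " 42."},
]
print(accuracy_by_tier(records))
```

Exact string matching after light normalization is the simplest choice; it undercounts answers that are mathematically equal but written differently.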

2. Code Evaluation

Benchmarks for LLM code understanding and generation, spanning multiple programming languages and software-lifecycle tasks.

Python & Generation

  • APPS – Python programming problems collected from open coding-challenge websites.
  • MathQA-Python – Python word-problem dataset.
  • DS-1000 – 1,000 data-science code-generation problems built around Python libraries such as NumPy and Pandas.
  • SWE-Bench – Software engineering task suite.
  • BigCodeBench – Large-scale code synthesis tests.
  • EffiBench – Efficiency-focused code tasks.
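
Code-generation benchmarks like these are commonly scored by running a generated solution against test cases. Below is a minimal sketch of such a functional-correctness check, assuming a hypothetical `passes_tests` helper and a test-case format of `(args, expected)` pairs; it is not the harness of any benchmark listed here, and it runs code without sandboxing, which a real harness should not do.

```python
def passes_tests(candidate_source: str, entry_point: str, test_cases) -> bool:
    """Return True if the candidate function passes every test case.

    `candidate_source` is Python source text, `entry_point` is the name of
    the function to call, and `test_cases` is a list of (args, expected)
    pairs. All three are illustrative assumptions.
    WARNING: exec() runs untrusted code directly; isolate it in practice.
    """
    namespace = {}
    try:
        exec(candidate_source, namespace)
        func = namespace[entry_point]
        return all(func(*args) == expected for args, expected in test_cases)
    except Exception:
        # Syntax errors, missing functions, and runtime errors count as failures.
        return False


tests = [((1, 2), 3), ((0, 0), 0)]
good = "def add(a, b):\n    return a + b\n"
bad = "def add(a, b):\n    return a - b\n"
print(passes_tests(good, "add", tests))
print(passes_tests(bad, "add", tests))
```

Treating any exception as a failure keeps the check simple; a fuller harness would also enforce time and memory limits.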

Multi-lingual

  • MultiPL-E – Cross-language code challenges.
  • CodexEval – Multilingual Codex evaluation.
  • CodeScope – Scope-varied code tasks.
  • McEval – Multi-language benchmark suite.
  • LiveCodeBench – Time-aware contamination control.
  • HumanEval – Hand-written Python function tests.
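
The time-aware contamination control noted for LiveCodeBench can be illustrated by keeping only problems released after a model's training cutoff. This is a minimal sketch under assumed names: the `released` field, the `filter_uncontaminated` helper, and the dates are hypothetical and do not reflect LiveCodeBench's actual data format or tooling.

```python
from datetime import date


def filter_uncontaminated(problems, training_cutoff: date):
    """Keep problems released strictly after the model's training cutoff.

    `problems` is a list of dicts with a hypothetical 'released' date field.
    """
    return [p for p in problems if p["released"] > training_cutoff]


# Illustrative problems and cutoff only.
problems = [
    {"id": "p1", "released": date(2023, 3, 1)},
    {"id": "p2", "released": date(2024, 2, 15)},
    {"id": "p3", "released": date(2024, 6, 30)},
]
fresh = filter_uncontaminated(problems, training_cutoff=date(2023, 12, 31))
print([p["id"] for p in fresh])
```

Evaluating only on the filtered set reduces the chance that a problem appeared in the model's training data, at the cost of a smaller test pool.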

Software Development

  • CodeXGLUE – Full-lifecycle code tasks.
  • CodeEditorBench – IDE-style editing operations.
