This is the official repo for our ACL 2025 paper "FineReason: Evaluating and Improving LLMs' Deliberate Reasoning through Reflective Puzzle Solving".
FINEREASON is a novel logic-puzzle benchmark designed to comprehensively evaluate the reasoning capabilities of LLMs.
- Unlike existing benchmarks that focus on final-answer accuracy, FINEREASON delves into intermediate reasoning steps, specifically state checking and state transition actions, capturing abilities such as reflection, lookahead, and backtracking that are central to human-like System 2 reasoning.
- Experiments reveal significant limitations on deep reasoning tasks, even for leading models like Gemini-2.0-Flash-Thinking, highlighting substantial room for improvement.
- Training on puzzle-based data enhances performance on broader reasoning tasks, e.g., a 5.1% accuracy improvement on GSM8K, demonstrating the potential of puzzle data to boost general reasoning capabilities.
- Puzzle Solving as a Tree: Intermediate states are nodes; state checking and state transition are edges, allowing forward exploration and backtracking (see the sketch after this list).
- Comprehensive State Capture: We use depth-first search (DFS) to identify all valid states, validated by executable puzzle rules.
- Evaluation Tasks: State checking (solvability prediction) and state transition (next-move determination) expose reasoning processes in LLMs such as reflection, correction, and path exploration.
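To make the tree view concrete, here is a minimal Python sketch of the two tasks. This is illustrative only, not the repo's implementation; `candidate_moves` and `is_goal` are hypothetical hooks that a concrete puzzle such as Sudoku would supply.

```python
# Illustrative sketch of the puzzle tree (not the repo's actual code).
# A "state" is a partially solved puzzle and must be hashable
# (e.g., a tuple-of-tuples Sudoku grid) so results can be cached.
from functools import lru_cache

def candidate_moves(state):
    """Hypothetical hook: yield every rule-valid successor of `state`."""
    raise NotImplementedError

def is_goal(state):
    """Hypothetical hook: True if `state` is a complete, valid solution."""
    raise NotImplementedError

@lru_cache(maxsize=None)
def is_solvable(state):
    """State checking: can `state` still be extended to a full solution?
    Plain DFS; backtracking is the loop moving on after a child fails."""
    if is_goal(state):
        return True
    return any(is_solvable(child) for child in candidate_moves(state))

def next_moves(state):
    """State transition: the successors that keep the puzzle solvable."""
    return [child for child in candidate_moves(state) if is_solvable(child)]
```

An unsolvable state is exactly one where DFS exhausts every child, which is where a solver must backtrack; state checking asks a model to predict this property, and state transition asks it to pick a move that keeps the puzzle solvable.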
```bash
conda create -n fine-reason python=3.10 -y
conda activate fine-reason
pip install -r requirements.txt
```
- Insert your OpenAI API key into the file `openai_key.json`.
- Insert your Gemini API key into the file `gemini_key.json`.
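The exact schema of these key files is an assumption here (check how the repo loads them if your setup differs); a minimal example might look like:

```json
{
  "key": "<YOUR_API_KEY>"
}
```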
To run Sudoku state checking using Gemini-2.0-Flash-Thinking:
```bash
python main.py evaluate \
--data_name sudoku_states \
--prompter_name sudoku_state_checking \
--scorer_name state_checking_accuracy \
--model_name gemini_flash_thinking
```
To run Sudoku state transition using Qwen-2.5-72B-Instruct with a max_output_length of 2048:
```bash
python main.py evaluate \
--data_name sudoku_states \
--prompter_name sudoku_state_transition \
--scorer_name state_transition_accuracy \
--model_name qwen \
--path_model Qwen/Qwen2.5-72B-Instruct \
--max_output_length 2048
```
To run end-to-end evaluation using OpenAI's o1:
```bash
python main.py evaluate \
--data_name sudoku \
--prompter_name sudoku_e2e \
--model_name o1
```
Coming soon...
If you find our work helpful, please consider starring this repo and citing our work.
```bibtex
@misc{chen2025finereasonevaluatingimprovingllms,
  title={FINEREASON: Evaluating and Improving LLMs' Deliberate Reasoning through Reflective Puzzle Solving},
  author={Guizhen Chen and Weiwen Xu and Hao Zhang and Hou Pong Chan and Chaoqun Liu and Lidong Bing and Deli Zhao and Anh Tuan Luu and Yu Rong},
  year={2025},
  eprint={2502.20238},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2502.20238},
}
```