Double Checker: Enhancing Reasoning of Slow-Thinking LLMs via Self-Critical Fine-Tuning

Official PyTorch implementation of Double Checker: Enhancing Reasoning of Slow-Thinking LLMs via Self-Critical Fine-Tuning

Xin Xu*, Tianhao Chen*, Fan Zhang, Wanlong Liu, Pengxiang Li, Ajay Kumar Jaiswal, Yuchen Yan, Jishan Hu, Yang Wang, Hao Chen, Shiwei Liu, Shizhe Diao, Can Yang†, Lu Yin† * Equal contribution. † Corresponding author.

The Hong Kong University of Science and Technology, University of Electronic Science and Technology of China, Dalian University of Technology, University of Texas at Austin, Zhejiang University, University of Oxford, NVIDIA, University of Surrey.

Abstract

While slow-thinking large language models (LLMs) exhibit reflection-like reasoning, commonly referred to as the “aha moment”, their ability to generate informative critiques and refine prior solutions remains limited. In this paper, we introduce Double-Checker, a principled framework designed to enhance the reasoning capabilities of slow-thinking LLMs by fostering explicit self-critique and iterative refinement of their previous solutions. By fine-tuning on our curated 1,730 self-critical instances, Double-Checker empowers long-CoT LLMs to iteratively critique and refine their outputs during inference until they evaluate their solutions as correct under self-generated critiques. We validate the efficacy of Double-Checker across a comprehensive suite of reasoning benchmarks, demonstrating that iterative self-critique significantly enhances the reasoning capabilities of long-CoT LLMs. Notably, our Double-Checker increases the pass@1 performance on challenging AIME benchmarks from 4.4% to 18.2% compared to the original long-CoT LLMs. These results highlight a promising direction for developing more trustworthy and effective LLMs capable of structured self-critique.

Results

Installation

First, install torch 2.6.0 + cuda12.4 and vllm=0.8.5.post1

pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124
pip install vllm==0.8.5.post1

Then, install llama_factorcy:

git clone https://github.com/XinXU-USTC/DoubleChecker.git
cd DoubleChecker/train/LLaMA-Factory/
pip install -e .

Usage

Training

To train DeepSeek-Distill-7B model, please first revise train/LLaMA-Factory//scripts/double-checker-ds-qwen7b.yaml accordingly and then run the following command:

llamafactory-cli train ../scripts/double-checker-ds-qwen7b.yaml

Benchmark Evaluation

Data Preparation

All benchmark datasets for evaluation should be placed in the ./data directory.

To add a new test dataset, follow the format of existing benchmarks in the ./data directory.

Prompt Configuration

For mathematical problems, we use the Qwen-instruct template:

system_prompt = "Please reason step by step, and put your final answer within \\boxed{}."

few_shot_prompt = ""

question_format = """{question}"""

When adding a new mathematics benchmark, you can directly copy the above content to the corresponding ./prompts/qwen-instruct/xxx.py file.

Running Evaluation

You can run evaluation on "math" dataset using "qwen-instruct" prompt with greedy decoding, 32K max tokens, and 1 round of critique using four GPUs using the following command:

CUDA_VISIBLE_DEVICES='0,1,2,3' \
python eval.py \
--model_name_or_path "/path/to/model/weights" \  # Path to model weights
--data_name "math" \  # Benchmark name (corresponding to first-level directory in ./data)
--prompt_type "qwen-instruct" \  # Default chat template
--temperature 0.0 \  # Sampling temperature
--start_idx 0 \  # Starting index for evaluation data
--end_idx -1 \  # Ending index for evaluation data
--n_sampling 1 \  # Number of samples per question
--k 1 \  # k value for unbiased pass@k calculation
--split "test" \  # Benchmark subset partition
--max_tokens 32768 \  # Maximum output length
--seed 0 \  # Random seed
--top_p 1 \  # Top-p sampling parameter
--surround_with_messages \  # Enable this flag if using chat template
--num_rounds 1 # number of rounds to self-critique

Acknowledgments

Our evaluation code is modified from Qwen2.5-Math. We thank their team for their valuable contributions to the community.

Citation

if you find this repo is helpful, please cite

@article{xu2025double,
  title={Double-Checker: Enhancing Reasoning of Slow-Thinking LLMs via Self-Critical Fine-Tuning},
  author={Xu, Xin and Chen, Tianhao and Zhang, Fan and Liu, Wanlong and Li, Pengxiang and Jaiswal, Ajay Kumar and Yan, Yuchen and Hu, Jishan and Wang, Yang and Chen, Hao and others},
  journal={arXiv preprint arXiv:2506.21285},
  year={2025}
}

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
images		images
test		test
train		train
.gitattributes		.gitattributes
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Double Checker: Enhancing Reasoning of Slow-Thinking LLMs via Self-Critical Fine-Tuning

Table of contents

Abstract

Results

Installation

Usage

Training

Benchmark Evaluation

Data Preparation

Prompt Configuration

Running Evaluation

Acknowledgments

Citation

About

Uh oh!

Releases

Packages

Contributors 2

Uh oh!

Languages

XinXU-USTC/DoubleChecker

Folders and files

Latest commit

History

Repository files navigation

Double Checker: Enhancing Reasoning of Slow-Thinking LLMs via Self-Critical Fine-Tuning

Table of contents

Abstract

Results

Installation

Usage

Training

Benchmark Evaluation

Data Preparation

Prompt Configuration

Running Evaluation

Acknowledgments

Citation

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Uh oh!

Languages

Packages