NFT (Negative-aware Fine-Tuning) is a pure supervised learning method for improving LLMs' math-reasoning abilities with no external teachers.
- As an SL method, NFT outperforms leading RL algorithms like GRPO and DAPO in 7B model experiments and performs similarly to DAPO in 32B settings.
- NFT enables direct optimization of LLMs on negative data, thereby significantly outperforming other SL baselines such as Rejection sampling Fine-Tuning (RFT).
- NFT is equivalent to GRPO when training is strictly on-policy, despite their entirely different theoretical foundations.
NFT shows that self-reflective improvement is not an exclusive property of RL algorithms. Rather, the current gap between SL and RL methods stems largely from how effectively they can leverage negative data.
NFT bridges reinforcement learning and supervised learning by leveraging negative feedback through supervision.
The NFT pipeline consists of:
- Data Collection: The LLM generates answers to math questions, which are split into positive and negative groups based on correctness (see the sketch after this list)
- Implicit Negative Policy: Constructs a policy to model negative answers using the same parameters as the positive policy
- Policy Optimization: Both positive and negative answers are used to optimize the LLM via supervised learning
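A minimal sketch of the data-collection step above. The helpers `generate_answers` and `is_correct` are hypothetical placeholders, not functions from this repo:

```python
# Minimal sketch of NFT data collection (hypothetical helpers, not this repo's API).
def collect_nft_batch(questions, generate_answers, is_correct, n_samples=16):
    """Sample answers from the current LLM and split them by correctness."""
    batch = []
    for q in questions:
        answers = generate_answers(q, n=n_samples)        # sample from the current policy
        labels = [is_correct(q, a) for a in answers]      # verify against the reference answer
        accuracy = sum(labels) / len(labels)              # per-question accuracy
        batch.append({
            "question": q,
            "positives": [a for a, ok in zip(answers, labels) if ok],
            "negatives": [a for a, ok in zip(answers, labels) if not ok],
            "accuracy": accuracy,
        })
    return batch
```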
Comparison of NFT-7B with other zero-shot math models in the Qwen series.
NFT performs competitively against other algorithms. We report avg@32 for AIME24, AIME25, and AMC23, and avg@1 for the other benchmarks.
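Here avg@k denotes the mean accuracy over k sampled generations per problem; a tiny sketch of the computation (the nested-list layout is assumed for illustration only):

```python
# avg@k: mean accuracy over k sampled generations per problem.
# `results[i][j]` is 1 if the j-th sample for problem i is correct, else 0 (assumed layout).
def avg_at_k(results):
    per_problem = [sum(samples) / len(samples) for samples in results]
    return sum(per_problem) / len(per_problem)
```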
Validation accuracy curves showing NFT's ability to leverage negative data for continuous improvement.
We use exactly the same environment configuration as the official DAPO codebase.
```bash
pip install git+ssh://git@github.com/volcengine/verl.git@01ef7184821d0d7844796ec0ced17665c1f50673
```
Pretrained 7B and 32B models can be found on Hugging Face.
We provide an evaluation codebase integrated into the VeRL infrastructure. Please refer to `eval_local_7B.sh` and `eval_local_32B.sh` for the evaluation scripts.
We use the public DAPO-Math-17k dataset for training and 6 public math benchmarks for validation. Download the pre-sorted training and validation data by running:
```bash
bash download_data.sh
bash download_model.sh
```
Please see `train_7B.sh` and `train_32B.sh` for single-node training scripts. Note that we run 7B experiments on 4×8 H100s and 32B experiments on 16×8 H100s. Please refer to the VeRL instructions for launching distributed tasks.
Hyperparameters:
- `neg_weight`: The weight of negative data in NFT's objective. Set to 1.0 for the default NFT config; set to 0.0 to recover RFT by masking out all negative-data loss; set to -1.0 to run the DAPO algorithm for comparison.
- `normalize`: Controls the prompt weight in NFT's objective. Set to 0 to treat all question data equally; set to 1 (default) or 2 to prioritize harder questions. `normalize=1` matches the Dr. GRPO algorithm in on-policy training, while `normalize=2` matches standard GRPO.
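A simplified sketch of how these two switches could enter the per-question training loss. This only illustrates the behaviors described above, with placeholder loss terms and weighting; it is not the exact implicit-negative-policy objective or this repo's implementation:

```python
def question_weight(accuracy, normalize):
    """Placeholder per-question weight: normalize=0 treats all questions equally,
    higher values up-weight harder (low-accuracy) questions."""
    if normalize == 0:
        return 1.0
    return (1.0 - accuracy) ** normalize


def nft_style_loss(pos_loss, neg_loss, accuracy, neg_weight=1.0, normalize=1):
    """Illustrative combination of positive- and negative-answer losses for one question.

    pos_loss: mean supervised (MLE) loss on the question's correct answers.
    neg_loss: loss contribution of the incorrect answers (in NFT this comes from the
        implicit negative policy sharing the model's parameters; a placeholder here).
    accuracy: fraction of sampled answers that were correct for this question.
    """
    # neg_weight = 1.0 -> default NFT; 0.0 -> RFT (negative data masked out);
    # the repo uses -1.0 to switch to the DAPO baseline.
    combined = pos_loss + neg_weight * neg_loss
    return question_weight(accuracy, normalize) * combined
```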
We thank the VeRL team for providing the awesome open-source RL infrastructure.
If you find our project helpful, please consider citing:
@article{chen2025bridging,
title = {Bridging Supervised Learning and Reinforcement Learning in Math Reasoning},
author = {Huayu Chen and Kaiwen Zheng and Qinsheng Zhang and Ganqu Cui and Yin Cui and Haotian Ye and Tsung-Yi Lin and Ming-Yu Liu and Jun Zhu and Haoxiang Wang},
journal = {arXiv preprint arXiv:2505.18116},
year = {2025}
}