TTRL: Test-Time Reinforcement Learning

Paper | GitHub | W&B Log of AIME

🎉News

  • [2025-05-23] We updated both the paper and the code; the implementation is now based on verl.

  • [2025-04-24] We released the code and experimental logs. Check it out: Getting Started.

  • [2025-04-23] We present TTRL (Test-Time Reinforcement Learning), an open-source solution for online RL on data without ground-truth labels, in particular test data.

📖Introduction

We investigate Reinforcement Learning (RL) on data without explicit labels for reasoning tasks in Large Language Models (LLMs). The core challenge is estimating rewards during inference without access to ground-truth information. While this setting appears intractable, we find that common practices in Test-Time Scaling (TTS), such as majority voting, yield surprisingly effective rewards for driving RL training.
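
As a rough sketch of this idea (not the repository's exact training loop; policy.sample, majority_voting_reward_fn, and rl_update are hypothetical stand-ins for the sampler, the reward function sketched in the Pseudo-Code section below, and the RL optimizer), one TTRL step on an unlabeled question looks like:

def ttrl_step(policy, question, n=64):
    # Sample multiple rollouts for a single unlabeled test question.
    rollouts = [policy.sample(question) for _ in range(n)]
    # Majority voting over the rollouts yields rule-based rewards
    # (see the reward-function sketch in the Pseudo-Code section below).
    rewards = majority_voting_reward_fn(rollouts)
    # Any standard RL update (e.g., PPO/GRPO as provided by verl) consumes the rewards.
    rl_update(policy, question, rollouts, rewards)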

Figure: performance and settings of TTRL.

Figure: overview of TTRL.

📃Evaluation

Our experiments demonstrate that TTRL consistently improves performance across a variety of tasks and models. Notably, TTRL boosts the pass@1 of Qwen2.5-Math-7B on AIME 2024 by approximately 211%, using only unlabeled test data.

Furthermore, although TTRL is supervised only by the maj@n metric, it consistently surpasses the maj@n performance of the initial model (the apparent upper bound of its training signal) and approaches the performance of models trained directly on test data with ground-truth labels.
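
For reference, the two metrics can be computed as follows (a minimal sketch; answers holds the extracted final answers of n samples for one question, and gold is the ground-truth label, used here only for evaluation):

from collections import Counter

def pass_at_1(answers, gold):
    # Average per-sample accuracy over the n sampled answers.
    return sum(a == gold for a in answers) / len(answers)

def maj_at_n(answers, gold):
    # Accuracy of the single majority-voted answer.
    majority, _ = Counter(answers).most_common(1)[0]
    return float(majority == gold)

Because TTRL's training signal is derived from the majority vote, the initial model's maj@n looks like a natural ceiling; the observation above is that training pushes the policy beyond it.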


✨Getting Started

You can reproduce the results on AIME 2024 with the following commands:

git clone git@github.com:PRIME-RL/TTRL.git
cd TTRL/verl

pip install -r requirements.txt

bash examples/ttrl/aime.sh

We additionally conducted three independent runs using the preview version of our code: two achieved a pass@1 of 43.3, and one reached 46.7. Please refer to the Weights & Biases logs.

All experiments were conducted on 8 × NVIDIA A100 80GB GPUs.

Pseudo-Code

TTRL can be implemented quickly by modifying only the reward function. Please refer to the following code snippet for details:

Figure: pseudo-code of the majority-voting reward function.
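
A minimal Python sketch of that reward function (extract_answer is a hypothetical answer parser; the actual signature expected by verl may differ):

from collections import Counter

def majority_voting_reward_fn(completions):
    # Parse a final answer out of each sampled completion.
    answers = [extract_answer(c) for c in completions]

    # The most common answer is taken as the pseudo-label.
    majority, _ = Counter(answers).most_common(1)[0]

    # Reward 1.0 for completions that match the majority answer, 0.0 otherwise.
    return [1.0 if a == majority else 0.0 for a in answers]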

📨Contact

🎈Citation

If you find TTRL helpful, please cite us.

@article{zuo2025ttrl,
  title={TTRL: Test-Time Reinforcement Learning},
  author={Zuo, Yuxin and Zhang, Kaiyan and Qu, Shang and Sheng, Li and Zhu, Xuekai and Qi, Biqing and Sun, Youbang and Cui, Ganqu and Ding, Ning and Zhou, Bowen},
  journal={arXiv preprint arXiv:2504.16084},
  year={2025}
}

⭐️Star History

