We have witnessed the powerful capabilities of pure RL-based LLM reasoning. In this repository, we keep adding the newest papers, slides, and other interesting materials that enhance LLM reasoning with reinforcement learning, helping everyone learn quickly!
Star this repository to stay at the forefront of RL-based LLM reasoning, right in the teeth of the storm.
- Why do we need reasoning?
- Why do we use reinforcement learning to get reasoning ability? (What are the advantages compared to reasoning methods that do not use reinforcement learning?)
- [2502] Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning (Shanghai AI Lab)
- [2502] Demystifying Long Chain-of-Thought Reasoning in LLMs (Introduced cosine length-scaling reward with repetition penalty for stable CoT length growth) (IN.AI)
- [2501] SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training (HKU, Berkeley)
- [2501] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (DeepSeek)
- [2501] Kimi k1.5: Scaling Reinforcement Learning with LLMs (Kimi)
- [2502] S²R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning (Tencent)
- [2502] Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling (THU)
- [2502] QLASS: Boosting Language Agent Inference via Q-Guided Stepwise Search (UCLA-Yizhou Sun)
- [2312] Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations (PKU & Deepseek)
- [2305] Let's verify step by step (OpenAI)
- [2211] Solving math word problems with process- and outcome-based feedback (DeepMind)
- [2504] Process Reward Models That Think (UMich)
- [2506] RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics (BUAA & BAAI)
- [2504] Learning to Reason under Off-Policy Guidance (Shanghai AI Lab)
- [2504] THINKPRUNE: Pruning Long Chain-of-Thought of LLMs via Reinforcement Learning
- [2504] GPG: A Simple and Strong Reinforcement Learning Baseline for Model Reasoning
- [2504] When To Solve, When To Verify: Compute-Optimal Problem Solving and Generative Verification for LLM Reasoning | Explain
- [2503] SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks (agent & reasoning)
- [2503] DAST: Difficulty-Adaptive Slow-Thinking for Large Reasoning Models (shortens the length of thinking via RL)
- [2503] L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning (CMU)
- [2502] Scaling Test-Time Compute Without Verification or RL is Suboptimal (CMU, UC Berkeley) (verifier-based (VB) is better than verifier-free (VF))
- [2502] Reasoning with Reinforced Functional Token Tuning
- [2502] Provably Optimal Distributional RL for LLM Post-Training (Cornell, Harvard)
- [2502] On the Emergence of Thinking in LLMs I: Searching for the Right Intuition (Reinforcement Learning via Self-Play) (MIT)
- [2502] STP: Self-play LLM Theorem Provers with Iterative Conjecturing and Proving (sparse rewards from the scarcity of correct proofs make performance plateau quickly; to overcome this, the authors draw inspiration from mathematicians, who continuously develop new results, partly by proposing novel conjectures or exercises, often variants of known results, and attempting to solve them) (Stanford-Tengyu Ma)
- [2409] Training Language Models to Self-Correct via Reinforcement Learning (DeepMind)
- [2502] Don’t Get Lost in the Trees: Streamlining LLM Reasoning by Overcoming Tree Search Exploration Pitfalls (Tencent)
- [2408] DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search (DeepSeek)
- [2310] Solving olympiad geometry without human demonstrations (DeepMind)
- [2504] A Sober Look at Progress in Language Model Reasoning: Pitfalls and Paths to Reproducibility
- [2504] Recitation over Reasoning: How Cutting-Edge Language Models Can Fail on Elementary School-Level Reasoning Problems? (ByteDance Seed) (by changing one phrase in the condition, top models such as OpenAI-o1 and DeepSeek-R1 can suffer 60% performance loss on elementary school-level arithmetic and reasoning problems)
- [2503] Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad (results on the 2025 USA Math Olympiad were much worse; most models scored close to 0)
- [2504] DeepSeek-R1 Thoughtology: Let's <think> about LLM Reasoning
- [2504] Reasoning Models Know When They’re Right: Probing Hidden States for Self-Verification
- [2504] Do Larger Language Models Imply Better Reasoning? A Pretraining Scaling Law for Reasoning
- [2503] Effectively Controlling Reasoning Models through Thinking Intervention
- [2503] Landscape of Thoughts: Visualizing the Reasoning Process of Large Language Models (visualizes the reasoning process)
- [2503] Efficient Test-Time Scaling via Self-Calibration (WUSTL) (LLMs are known to be overconfident and to provide unreliable confidence estimates)
- [2503] Interpreting the Repeated Token Phenomenon in Large Language Models (DeepMind)
- [2503] Attentive Reasoning Queries: A Systematic Method for Optimizing Instruction-Following in Large Language Models (Emcie Co Ltd)
- [2502] Revisiting the Test-Time Scaling of o1-like Models: Do they Truly Possess Test-Time Scaling Capabilities?
- [2502] When More is Less: Understanding Chain-of-Thought Length in LLMs (I think this is also about overthinking) (PKU, MIT)
- [2502] Token Assorted: Mixing Latent and Text Tokens for Improved Language Model Reasoning (Meta-Yuandong Tian)
- [2502] CoT-Valve: Length-Compressible Chain-of-Thought Tuning (overthinking) (NUS)
- [2502] The Danger of Overthinking: Examining the Reasoning-Action Dilemma in Agentic Tasks (I think overthinking is a practical problem, interesting!) (Berkeley)
- [2502] ReasonFlux: Hierarchical LLM Reasoning via Scaling Thought Templates (Princeton)
- [2502] Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach (current approaches to improving LM capabilities rely heavily on increasing model size or specialized prompting) (Max Planck)
- [2502] LIMO: Less is More for Reasoning (LIMO offers a more principled and direct path to complex reasoning ability through explicit trajectory design) (SJTU)
- [2502] Confidence Improves Self-Consistency in LLMs (the quality of LLM outputs) (Google Research)
- [2502] LLMs Can Easily Learn to Reason from Demonstrations; Structure, not content, is what matters! (UC Berkeley)
- [2502] BOLT: Bootstrap Long Chain-of-Thought in Language Models without Distillation (Salesforce AI Research)
- [2502] LLMs Can Teach Themselves to Better Predict the Future (self-play to generate data) (LSE)
- [2501] Reasoning Language Models: A Blueprint
- [2501] s1: Simple test-time scaling (Stanford) (distillation plus appending 'wait' to extend the response at test time)
- [2412] Formal Mathematical Reasoning: A New Frontier in AI
- [2412] Efficiently Serving LLM Reasoning Programs with Certaindex (UCSD) (overthinking, probe in the middle)
- [2412] Training Large Language Model to Reason in a Continuous Latent Space (Meta-Yuandong Tian)
- [2412] Scaling of search and learning: A roadmap to reproduce o1 from reinforcement learning perspective
- [2408] Visual Agents as Fast and Slow Thinkers
- [2504] An Illusion of Progress? Assessing the Current State of Web Agents (OSU, Berkeley)
- [2503] Agentic Large Language Models, a survey (LeidenU)
- [2503] Harnessing the Reasoning Economy: A Survey of Efficient Reasoning for Large Language Models (CUHK)
- [2503] What, How, Where, and How Well? A Survey on Test-Time Scaling in Large Language Models (CityU)
- [2503] A Survey of Efficient Reasoning for Large Reasoning Models: Language, Multimodality, and Beyond
- [2503] Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models
- [2502] From System 1 to System 2: A Survey of Reasoning Large Language Models
- [2407] A comprehensive survey of LLM alignment techniques: RLHF, RLAIF, PPO, DPO and more
- Self-improvement of LLM agents through Reinforcement Learning at Scale
- A Visual Guide to Reasoning LLMs
- Understanding Reasoning LLMs: Methods and Strategies for Building and Refining Reasoning Models
- What is the difference between a large reasoning model and an LLM? (Zhihu)
- LLM Reasoning: Key Ideas and Limitations (Denny Zhou, DeepMind) (Video)
- Towards Reasoning in Large Language Models (Jie Huang, UIUC)
- Can LLMs Reason & Plan? (Subbarao Kambhampati, ASU)
- Inference-Time Techniques for LLM Reasoning (Xinyun Chen, DeepMind)
- Chain-of-Thought Reasoning In Language Models (Zhuosheng Zhang, SJTU)
- Learning to Self-Improve & Reason with LLMs (Jason Weston, Meta & NYU)
- Why, before DeepSeek-R1-Zero appeared, did no one try abandoning fine-tuning alignment and building chain-of-thought reasoning models purely through reinforcement learning? (Zhihu)
- Kimi (Flood Sung, Zhihu)
- A roundup of the DeepSeek article series (Zhihu)
- ChatGPT and the Art of Post-Training (Stanford, 2025/02/18)
- [LLM+RL] A guided reading of the R1 paper: SFT vs. RL, RL fundamentals and GRPO details, plus a discussion of a series of reproduction efforts
- [LLM+RL] Understanding the GRPO formula and the TRL GRPOTrainer implementation (advantage and loss computation)
- LLM-Based Reasoning: Opportunities and Pitfalls (LAVA Workshop in ACCV 2024)
- Reinforcement Learning in DeepSeek r1 Visualized (Chinese)
- EZ撸paper: DeepSeek-R1 paper explained, part 3: the history of GPT | scaling law | training paradigms | emergent ability
- EZ撸paper: DeepSeek-R1 paper explained, part 2: What is AGI? | A quick introduction to Reinforcement Learning | An introduction to AlphaGo
- EZ撸paper: DeepSeek-R1 paper explained, part 1: On par with OpenAI-o1, how was it done?
- [GRPO Explained] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
- DeepSeek R1 Explained to your grandma
- TinyZero (4×4090 GPUs are enough for a 0.5B LLM, but the "aha moment" cannot be observed)
- Open-r1
- Logic-RL
- Unsloth-GRPO (simplest R1 implementation)
- OpenR (An Open Source Framework for Advanced Reasoning)
- DeepSeek-RL-Qwen-0.5B-GRPO-gsm8k
- deepseek_r1_train
The core of reinforcement learning is how an agent chooses its next action within an environment so as to maximize the return; the environment's role is to provide states and rewards. Both of the classic methods below are illustrated with short code sketches after this list.
- Q-learning (value-based method): Actions are chosen epsilon-greedily: with a small probability epsilon a random action is taken for exploration, and otherwise the highest-valued action for the current state is read from the Q-table. Whichever branch is taken, the Q-table is updated after every action: the entry for the previous state is moved toward the received reward plus the discounted best value of the next state, so that the estimated return is maximized.
- REINFORCE (policy-based method): It is like playing Mario, where every action in a given playthrough is sampled from a policy network. After the game ends, we have the reward at each step and can compute the cumulative return (G) for each state; this G is then used to compute the policy-gradient loss and update the parameters of the policy network.
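To make the two descriptions above concrete, here is a minimal sketch of both update rules. The environment interface, state/action counts, and hyperparameter values are illustrative assumptions, not taken from any paper in this list.

```python
import numpy as np

# Q-learning (value-based): epsilon-greedy action selection plus a tabular update.
# Toy setting with made-up sizes; any gym-style discrete environment would fit.
n_states, n_actions = 16, 4
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.99, 0.1  # learning rate, discount factor, exploration rate

def select_action(state: int) -> int:
    # With probability epsilon explore randomly, otherwise act greedily from the Q-table.
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[state]))

def q_update(state: int, action: int, reward: float, next_state: int) -> None:
    # Move Q(s, a) toward the reward plus the discounted best value of the next state.
    td_target = reward + gamma * np.max(Q[next_state])
    Q[state, action] += alpha * (td_target - Q[state, action])
```

And a REINFORCE update after one finished episode. The rollout loop is omitted; `log_probs` is assumed to hold the log-probabilities (with gradients) of the actions actually taken by a policy network whose parameters are registered in `optimizer`.

```python
import torch

def reinforce_update(optimizer, log_probs, rewards, gamma=0.99):
    # Compute the discounted return G_t for every step of the finished episode.
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns)
    # Normalizing returns is a common variance-reduction trick (acts like a baseline).
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)
    # Policy-gradient loss: -sum_t log pi(a_t | s_t) * G_t
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```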
- [2501] (REINFORCE++) A Simple and Efficient Approach for Aligning Large Language Models (citations: 6) (REINFORCE++ is reported to be more stable in training than GRPO and faster than PPO in the OpenRLHF report)
- [2405] (SimPO) Simple Preference Optimization with a Reference-Free Reward (citations: 227)
- [2402] (KTO) Model Alignment as Prospect Theoretic Optimization (citations: 326)
- [2402] (GRPO) DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models (citations: 250) (see the group-relative advantage sketch after this list)
- [2305] (DPO) Direct Preference Optimization: Your Language Model is Secretly a Reward Model (citations: 2580)
- [2203] (InstructGPT/PPO+LLM) Training language models to follow instructions with human feedback (citations: 12443)
- [1707] (PPO) Proximal Policy Optimization Algorithms (citations: 23934)
- [1502] (TRPO) Trust Region Policy Optimization (citations: 9579)
- [1706] (RLHF) Deep Reinforcement Learning from Human Preferences (citations: 3571)
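Since GRPO recurs throughout this repository, here is a minimal sketch of its core idea, the group-relative advantage: sample several responses to the same prompt, score them with a reward, and standardize each reward by the group mean and standard deviation so that no learned critic is needed. Tensor shapes and the epsilon value are illustrative assumptions; the full GRPO objective in the DeepSeekMath paper additionally uses a clipped probability ratio and a KL penalty.

```python
import torch

def grpo_advantages(group_rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Group-relative advantages in the spirit of GRPO.

    group_rewards: shape (G,), rewards of G sampled responses to one prompt.
    Each response's advantage is its reward standardized by the group statistics,
    which replaces the value network (critic) used in PPO.
    """
    return (group_rewards - group_rewards.mean()) / (group_rewards.std() + eps)

# Example: 4 sampled answers to one math problem, only the first judged correct.
advantages = grpo_advantages(torch.tensor([1.0, 0.0, 0.0, 0.0]))
```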
- Compshare (after registration you get a 50-yuan credit, enough to run the R1 example with Unsloth)
- awesome-llm-reasoning-long2short-papers
- Awesome-Long2short-on-LRMs
- Awesome-Efficient-CoT-Reasoning-Summary
- Awesome RL-based Reasoning MLLMs
- DecryptPrompt (very comprehensive)
- Feel free to contribute more papers or any other resources!