We have witnessed the powerful capabilities of pure RL-based LLM reasoning. In this repository, we keep adding the newest papers, slides, and other interesting materials that enhance LLM reasoning with reinforcement learning, helping everyone learn quickly!
Star this repository to stay at the forefront of RL-based LLM reasoning, right in the teeth of the storm.
- Why do we need reasoning?
- Why do we use reinforcement learning to get reasoning ability? (What are the advantages compared to reasoning methods that do not use reinforcement learning?)
- [2502] Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning (Shanghai AI Lab)
- [2502] Demystifying Long Chain-of-Thought Reasoning in LLMs (Introduced cosine length-scaling reward with repetition penalty for stable CoT length growth) (IN.AI)
- [2501] SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training (HKU, Berkeley)
- [2501] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (DeepSeek)
- [2501] Kimi k1.5: Scaling Reinforcement Learning with LLMs (Kimi)
- [2502] S²R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning (Tencent)
- [2502] Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling (THU)
- [2502] QLASS: Boosting Language Agent Inference via Q-Guided Stepwise Search (UCLA-Yizhou Sun)
- [2312] Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations (PKU & Deepseek)
- [2305] Let's verify step by step (OpenAI)
- [2211] Solving math word problems with process- and outcome-based feedback (DeepMind)
- [2504] Process Reward Models That Think (UMich)
- [2506] RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics (BUAA & BAAI)
- [2504] Learning to Reason under Off-Policy Guidance (Shanghai AI Lab)
- [2504] THINKPRUNE: Pruning Long Chain-of-Thought of LLMs via Reinforcement Learning
- [2504] GPG: A Simple and Strong Reinforcement Learning Baseline for Model Reasoning
- [2504] When To Solve, When To Verify: Compute-Optimal Problem Solving and Generative Verification for LLM Reasoning | Explain
- [2503] SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks (agent & reasoning)
- [2503] DAST: Difficulty-Adaptive Slow-Thinking for Large Reasoning Models (shortens the length of thinking via RL)
- [2503] L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning (CMU)
- [2502] Scaling Test-Time Compute Without Verification or RL is Suboptimal (CMU, UC Berkeley) (verifier-based (VB) is better than verifier-free (VF))
- [2502] Reasoning with Reinforced Functional Token Tuning
- [2502] Provably Optimal Distributional RL for LLM Post-Training (Cornell, Harvard)
- [2502] On the Emergence of Thinking in LLMs I: Searching for the Right Intuition (Reinforcement Learning via Self-Play) (MIT)
- [2502] STP: Self-play LLM Theorem Provers with Iterative Conjecturing and Proving (sparse rewards from the scarcity of correct proofs make performance plateau quickly; to overcome this, the authors draw inspiration from mathematicians, who continuously develop new results, partly by proposing novel conjectures or exercises, often variants of known results, and attempting to solve them) (Stanford-Tengyu Ma)
- [2409] Training Language Models to Self-Correct via Reinforcement Learning (DeepMind)
- [2502] Don’t Get Lost in the Trees: Streamlining LLM Reasoning by Overcoming Tree Search Exploration Pitfalls (Tencent)
- [2408] DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search (DeepSeek)
- [2310] Solving olympiad geometry without human demonstrations (DeepMind)
- [2504] A Sober Look at Progress in Language Model Reasoning: Pitfalls and Paths to Reproducibility
- [2504] Recitation over Reasoning: How Cutting-Edge Language Models Can Fail on Elementary School-Level Reasoning Problems? (ByteDance Seed) (by changing one phrase in the condition, top models such as OpenAI-o1 and DeepSeek-R1 can suffer 60% performance loss on elementary school-level arithmetic and reasoning problems)
- [2503] Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad (results on the 2025 USA Math Olympiad were much worse; most models scored close to 0)
- [2504] DeepSeek-R1 Thoughtology: Let's <think> about LLM Reasoning
- [2504] Reasoning Models Know When They’re Right: Probing Hidden States for Self-Verification
- [2504] Do Larger Language Models Imply Better Reasoning? A Pretraining Scaling Law for Reasoning
- [2503] Effectively Controlling Reasoning Models through Thinking Intervention
- [2503] Landscape of Thoughts: Visualizing the Reasoning Process of Large Language Models (visualizes the reasoning process)
- [2503] Efficient Test-Time Scaling via Self-Calibration (WUSTL) (LLMs are known to be overconfident and to provide unreliable confidence estimates)
- [2503] Interpreting the Repeated Token Phenomenon in Large Language Models (DeepMind)
- [2503] Attentive Reasoning Queries: A Systematic Method for Optimizing Instruction-Following in Large Language Models (Emcie Co Ltd)
- [2502] Revisiting the Test-Time Scaling of o1-like Models: Do they Truly Possess Test-Time Scaling Capabilities?
- [2502] When More is Less: Understanding Chain-of-Thought Length in LLMs (I think this is also about overthinking) (PKU, MIT)
- [2502] Token Assorted: Mixing Latent and Text Tokens for Improved Language Model Reasoning (Meta-Yuandong Tian)
- [2502] CoT-Valve: Length-Compressible Chain-of-Thought Tuning (overthinking) (NUS)
- [2502] The Danger of Overthinking: Examining the Reasoning-Action Dilemma in Agentic Tasks (I think overthinking is a practical problem, interesting!) (Berkeley)
- [2502] ReasonFlux: Hierarchical LLM Reasoning via Scaling Thought Templates (Princeton)
- [2502] Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach (current approaches to improving LM capabilities rely heavily on increasing model size or specialized prompting) (Max Planck)
- [2502] LIMO: Less is More for Reasoning (LIMO offers a more principled and direct path to complex reasoning ability through explicit trajectory design) (SJTU)
- [2502] Confidence Improves Self-Consistency in LLMs (the quality of LLM outputs) (Google Research)
- [2502] LLMs Can Easily Learn to Reason from Demonstrations; Structure, not content, is what matters! (UC Berkeley)
- [2502] BOLT: Bootstrap Long Chain-of-Thought in Language Models without Distillation (Salesforce AI Research)
- [2502] LLMs Can Teach Themselves to Better Predict the Future (self-play to generate data) (LSE)
- [2501] Reasoning Language Models: A Blueprint
- [2501] s1: Simple test-time scaling (Stanford) (distillation plus appending 'wait' to extend the response at test time)
- [2412] Formal Mathematical Reasoning: A New Frontier in AI
- [2412] Efficiently Serving LLM Reasoning Programs with Certaindex (UCSD) (overthinking, probe in the middle)
- [2412] Training Large Language Model to Reason in a Continuous Latent Space (Meta-Yuandong Tian)
- [2412] Scaling of search and learning: A roadmap to reproduce o1 from reinforcement learning perspective
- [2408] Visual Agents as Fast and Slow Thinkers
- [2504] An Illusion of Progress? Assessing the Current State of Web Agents (OSU, Berkeley)
- [2503] Agentic Large Language Models, a survey (LeidenU)
- [2503] Harnessing the Reasoning Economy: A Survey of Efficient Reasoning for Large Language Models (CUHK)
- [2503] What, How, Where, and How Well? A Survey on Test-Time Scaling in Large Language Models (CityU)
- [2503] A Survey of Efficient Reasoning for Large Reasoning Models: Language, Multimodality, and Beyond
- [2503] Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models
- [2502] From System 1 to System 2: A Survey of Reasoning Large Language Models
- [2407] A comprehensive survey of LLM alignment techniques: RLHF, RLAIF, PPO, DPO and more
- Self-improvement of LLM agents through Reinforcement Learning at Scale
- A Visual Guide to Reasoning LLMs
- Understanding Reasoning LLMs: Methods and Strategies for Building and Refining Reasoning Models
- What is the difference between a large reasoning model and an LLM? (Zhihu)
- LLM Reasoning: Key Ideas and Limitations (Denny Zhou, DeepMind) (Video)
- Towards Reasoning in Large Language Models (Jie Huang, UIUC)
- Can LLMs Reason & Plan? (Subbarao Kambhampati, ASU)
- Inference-Time Techniques for LLM Reasoning (Xinyun Chen, DeepMind)
- Chain-of-Thought Reasoning In Language Models (Zhuosheng Zhang, SJTU)
- Learning to Self-Improve & Reason with LLMs (Jason Weston, Meta & NYU)
- Why, before DeepSeek-R1-Zero appeared, did no one try abandoning fine-tuning alignment and building chain-of-thought reasoning models purely through reinforcement learning? (Zhihu)
- Kimi (Flood Sung, Zhihu)
- A roundup of the DeepSeek article series (Zhihu)
- ChatGPT and the Art of Post-Training (Stanford, 2025/02/18)
- [LLM+RL] A guided reading of the R1 paper: SFT vs. RL, RL fundamentals and GRPO details, plus a discussion of a series of reproduction efforts
- [LLM+RL] Understanding the GRPO formula and the TRL GRPOTrainer implementation (advantage and loss computation)
- LLM-Based Reasoning: Opportunities and Pitfalls (LAVA Workshop in ACCV 2024)
- Reinforcement Learning in DeepSeek r1 Visualized (Chinese)
- EZ撸paper: DeepSeek-R1 paper explained, part 3: the history of GPT | scaling law | training paradigms | emergent ability
- EZ撸paper: DeepSeek-R1 paper explained, part 2: What is AGI? | A quick introduction to Reinforcement Learning | An introduction to AlphaGo
- EZ撸paper: DeepSeek-R1 paper explained, part 1: On par with OpenAI-o1, how was it done?
- [GRPO Explained] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
- DeepSeek R1 Explained to your grandma
- TinyZero (4×4090 GPUs are enough for a 0.5B LLM, but the "aha moment" cannot be observed)
- Open-r1
- Logic-RL
- Unsloth-GRPO (simplest R1 implementation)
- OpenR (An Open Source Framework for Advanced Reasoning)
- DeepSeek-RL-Qwen-0.5B-GRPO-gsm8k
- deepseek_r1_train
The core of reinforcement learning is how an agent chooses its next action within an environment so as to maximize the return; the environment's role is to provide states and rewards. Both of the classic methods below are illustrated with short code sketches after this list.
- Q-learning (value-based method): Actions are chosen epsilon-greedily: with a small probability epsilon a random action is taken for exploration, and otherwise the highest-valued action for the current state is read from the Q-table. Whichever branch is taken, the Q-table is updated after every action: the entry for the previous state is moved toward the received reward plus the discounted best value of the next state, so that the estimated return is maximized.
- REINFORCE (policy-based method): It is like playing Mario, where every action in a given playthrough is sampled from a policy network. After the game ends, we have the reward at each step and can compute the cumulative return (G) for each state; this G is then used to compute the policy-gradient loss and update the parameters of the policy network.
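To make the two descriptions above concrete, here is a minimal sketch of both update rules. The environment interface, state/action counts, and hyperparameter values are illustrative assumptions, not taken from any paper in this list.

```python
import numpy as np

# Q-learning (value-based): epsilon-greedy action selection plus a tabular update.
# Toy setting with made-up sizes; any gym-style discrete environment would fit.
n_states, n_actions = 16, 4
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.99, 0.1  # learning rate, discount factor, exploration rate

def select_action(state: int) -> int:
    # With probability epsilon explore randomly, otherwise act greedily from the Q-table.
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[state]))

def q_update(state: int, action: int, reward: float, next_state: int) -> None:
    # Move Q(s, a) toward the reward plus the discounted best value of the next state.
    td_target = reward + gamma * np.max(Q[next_state])
    Q[state, action] += alpha * (td_target - Q[state, action])
```

And a REINFORCE update after one finished episode. The rollout loop is omitted; `log_probs` is assumed to hold the log-probabilities (with gradients) of the actions actually taken by a policy network whose parameters are registered in `optimizer`.

```python
import torch

def reinforce_update(optimizer, log_probs, rewards, gamma=0.99):
    # Compute the discounted return G_t for every step of the finished episode.
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns)
    # Normalizing returns is a common variance-reduction trick (acts like a baseline).
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)
    # Policy-gradient loss: -sum_t log pi(a_t | s_t) * G_t
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```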
- [2501] (REINFORCE++) A Simple and Efficient Approach for Aligning Large Language Models (citations: 6) (REINFORCE++ is reported to be more stable in training than GRPO and faster than PPO in the OpenRLHF report)
- [2405] (SimPO) Simple Preference Optimization with a Reference-Free Reward (citations: 227)
- [2402] (KTO) Model Alignment as Prospect Theoretic Optimization (citations: 326)
- [2402] (GRPO) DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models (citations: 250) (see the group-relative advantage sketch after this list)
- [2305] (DPO) Direct Preference Optimization: Your Language Model is Secretly a Reward Model (citations: 2580)
- [2203] (InstructGPT/PPO+LLM) Training language models to follow instructions with human feedback (citations: 12443)
- [1707] (PPO) Proximal Policy Optimization Algorithms (citations: 23934)
- [1502] (TRPO) Trust Region Policy Optimization (citations: 9579)
- [1706] (RLHF) Deep Reinforcement Learning from Human Preferences (citations: 3571)
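Since GRPO recurs throughout this repository, here is a minimal sketch of its core idea, the group-relative advantage: sample several responses to the same prompt, score them with a reward, and standardize each reward by the group mean and standard deviation so that no learned critic is needed. Tensor shapes and the epsilon value are illustrative assumptions; the full GRPO objective in the DeepSeekMath paper additionally uses a clipped probability ratio and a KL penalty.

```python
import torch

def grpo_advantages(group_rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Group-relative advantages in the spirit of GRPO.

    group_rewards: shape (G,), rewards of G sampled responses to one prompt.
    Each response's advantage is its reward standardized by the group statistics,
    which replaces the value network (critic) used in PPO.
    """
    return (group_rewards - group_rewards.mean()) / (group_rewards.std() + eps)

# Example: 4 sampled answers to one math problem, only the first judged correct.
advantages = grpo_advantages(torch.tensor([1.0, 0.0, 0.0, 0.0]))
```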
- Compshare (after registration you get a 50-yuan credit, enough to run the R1 example with Unsloth)
- awesome-llm-reasoning-long2short-papers
- Awesome-Long2short-on-LRMs
- Awesome-Efficient-CoT-Reasoning-Summary
- Awesome RL-based Reasoning MLLMs
- DecryptPrompt (very comprehensive)
- Feel free to contribute more papers or any other resources!