
📚Awesome-Learning-from-Rewards-Papers


This repository accompanies our survey paper:
Sailing by the Stars: A Survey on Reward Models and Learning Strategies for Learning from Rewards.
We curate a comprehensive collection of papers on learning from rewards in the post-training and test-time scaling of LLMs, covering both reward models and learning strategies across the training, inference, and post-inference stages.

We welcome contributions from the community, so feel free to submit pull requests to add missing papers!


Figure 1: Scaling phases of LLMs.



Figure 2: Conceptual framework of learning from rewards.

🎯Training with Rewards

  • Proximal policy optimization algorithms. arXiv, 2017. paper

  • Fine-Tuning Language Models from Human Preferences. arXiv, 2019. paper

  • Constitutional AI: Harmlessness from AI feedback. arXiv, 2022. paper

  • Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv, 2022. paper

  • Improving alignment of dialogue agents via targeted human judgements. arXiv, 2022. paper

  • Training language models to follow instructions with human feedback. NeurIPS, 2022. paper

  • Safe rlhf: Safe reinforcement learning from human feedback. arXiv, 2023. paper

  • RLTF: Reinforcement Learning from Unit Test Feedback. TMLR, 2023. paper

  • Aligning large multimodal models with factually augmented rlhf. arXiv, 2023. paper

  • Fine-grained human feedback gives better rewards for language model training. arXiv, 2023. paper

  • Human preference score: Better aligning text-to-image models with human preference. arXiv, 2023. paper

  • Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis. arXiv, 2023. paper

  • Imagereward: Learning and evaluating human preferences for text-to-image generation. arXiv, 2023. paper

  • Tuning large multimodal models for videos using reinforcement learning from ai feedback. arXiv, 2024. paper

  • Stepcoder: Improve code generation with reinforcement learning from compiler feedback. arXiv, 2024. paper

  • RLEF: Grounding Code LLMs in Execution Feedback with Reinforcement Learning. arXiv, 2024. paper

  • Rich human feedback for text-to-image generation. arXiv, 2024. paper

  • Skywork-reward: Bag of tricks for reward modeling in llms. arXiv, 2024. paper

  • Lift: Leveraging human feedback for text-to-video model alignment. arXiv, 2024. paper

  • Self-taught evaluators. arXiv, 2024. paper

  • REINFORCE++: A Simple and Efficient Approach for Aligning Large Language Models. arXiv, 2025. paper

  • Learning to Reason under Off-Policy Guidance. arXiv, 2025. paper

  • VinePPO: Refining Credit Assignment in RL Training of LLMs. arXiv, 2025. paper

  • Judging llm-as-a-judge with mt-bench and chatbot arena. arXiv, 2023. paper

  • Generative judge for evaluating alignment. arXiv, 2023. paper

  • Critique-out-loud reward models. arXiv, 2024. paper

  • Compassjudger-1: All-in-one judge model helps model evaluation and evolution. arXiv, 2024. paper

  • Llm critics help catch bugs in mathematics: Towards a better mathematical verifier with natural language feedback. arXiv, 2024. paper

  • Direct judgement preference optimization. arXiv, 2024. paper

  • Llava-critic: Learning to evaluate multimodal models. arXiv, 2024. paper

  • Beyond Scalar Reward Model: Learning Generative Judge from Preference Data. arXiv, 2024. paper

  • Self-Generated Critiques Boost Reward Modeling for Language Models. arXiv, 2024. paper

  • Generative verifiers: Reward modeling as next-token prediction. arXiv, 2024. paper

  • Inference-Time Scaling for Generalist Reward Modeling. arXiv, 2025. paper

  • Think-J: Learning to Think for Generative LLM-as-a-Judge. arXiv, 2025. paper

  • Unified multimodal chain-of-thought reward model through reinforcement fine-tuning. arXiv, 2025. paper

  • Mm-rlhf: The next step forward in multimodal llm alignment. arXiv, 2025. paper

  • Improve LLM-as-a-Judge Ability as a General Ability. arXiv, 2025. paper

  • Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback. arXiv, 2025. paper

  • Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv, 2022. paper

  • Raft: Reward ranked finetuning for generative foundation model alignment. arXiv, 2023. paper

  • Reinforced Self-Training (ReST) for Language Modeling. arXiv, 2023. paper

  • Making language models better tool learners with execution feedback. arXiv, 2023. paper

  • Direct preference optimization: Your language model is secretly a reward model. arXiv, 2023. paper

  • RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback. arXiv, 2023. paper

  • Rrhf: Rank responses to align language models with human feedback without tears. arXiv, 2023. paper

  • Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization. arXiv, 2023. paper

  • KTO: Model Alignment as Prospect Theoretic Optimization. arXiv, 2024. paper

  • Preference Optimization for Reasoning with Pseudo Feedback. arXiv, 2024. paper

  • Step-DPO: Step-wise preference optimization for long-chain reasoning of llms. arXiv, 2024. paper

  • Flame: Factuality-aware alignment for large language models. arXiv, 2024. paper

  • Simpo: Simple preference optimization with a reference-free reward. arXiv, 2024. paper

  • Strengthening Multimodal Large Language Model with Bootstrapped Preference Optimization. arXiv, 2024. paper

  • Self-Consistency Preference Optimization. arXiv, 2024. paper

  • CodeLutra: Boosting LLM Code Generation via Preference-Guided Refinement. arXiv, 2024. paper

  • Diffusion model alignment using direct preference optimization. arXiv, 2024. paper

  • mdpo: Conditional preference optimization for multimodal large language models. arXiv, 2024. paper

  • Meta-rewarding language models: Self-improving alignment with llm-as-a-meta-judge. arXiv, 2024. paper

  • RLAIF-V: Aligning MLLMs through Open-Source AI Feedback for Super GPT-4V Trustworthiness. arXiv, 2024. paper

  • Self-Rewarding Language Models. arXiv, 2024. paper

  • Aligning Modalities in Vision Large Language Models via Preference Fine-tuning. arXiv, 2024. paper

  • Unified Reward Model for Multimodal Understanding and Generation. arXiv, 2025. paper

  • RefineCoder: Iterative Improving of Large Language Models via Adaptive Critique Refinement for Code Generation. arXiv, 2025. paper

  • Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv, 2024. paper

  • Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv, 2025. paper

  • Improving Generalization in Intent Detection: GRPO with Reward-Based Curriculum Sampling. arXiv, 2025. paper

  • Open r1: A fully open reproduction of deepseek-r1. GitHub, 2025. paper

  • CLS-RL: Image Classification with Rule-Based Reinforcement Learning. arXiv, 2025. paper

  • Visual-rft: Visual reinforcement fine-tuning. arXiv, 2025. paper

  • Logic-rl: Unleashing llm reasoning with rule-based reinforcement learning. arXiv, 2025. paper

  • Not All Rollouts are Useful: Down-Sampling Rollouts in LLM Reinforcement Learning. arXiv, 2025. paper

  • Dapo: An open-source llm reinforcement learning system at scale. arXiv, 2025. paper

  • R1-vl: Learning to reason with multimodal large language models via step-wise group relative policy optimization. arXiv, 2025. paper

  • Right Question is Already Half the Answer: Fully Unsupervised LLM Reasoning Incentivization. arXiv, 2025. paper

  • TTRL: Test-Time Reinforcement Learning. arXiv, 2025. paper

  • Spurious Rewards: Rethinking Training Signals in RLVR. arXiv, 2025. paper

  • WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct. arXiv, 2023. paper

  • Math-shepherd: Verify and reinforce llms step-by-step without human annotations. arXiv, 2023. paper

  • Process reward model with q-value rankings. arXiv, 2024. paper

  • Diving into Self-Evolving Training for Multimodal Reasoning. arXiv, 2024. paper

  • Improve mathematical reasoning in language models by automated process supervision. arXiv, 2024. paper

  • Process reinforcement through implicit rewards. arXiv, 2025. paper

  • Efficient Process Reward Model Training via Active Learning. arXiv, 2025. paper

  • Process Reward Models That Think. arXiv, 2025. paper

  • Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning. arXiv, 2025. paper

  • R-PRM: Reasoning-Driven Process Reward Modeling. arXiv, 2025. paper

  • Towards Hierarchical Multi-Step Reward Models for Enhanced Reasoning in Large Language Models. arXiv, 2025. paper

  • GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning. arXiv, 2025. paper

  • SCOPE: Compress Mathematical Reasoning Steps for Efficient Automated Process Annotation. arXiv, 2025. paper
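
Many of the training-time papers above optimize directly on preference pairs, most prominently via the DPO objective. As a rough orientation, here is a minimal PyTorch-style sketch of that loss, assuming per-sequence log-probabilities from the policy and a frozen reference model are already available; the function and variable names are illustrative, not taken from any specific codebase.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss over a batch of (chosen, rejected) preference pairs."""
    # Implicit rewards are the policy/reference log-probability ratios.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Bradley-Terry preference loss: the chosen response should out-score the rejected one.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Dummy batch of 4 preference pairs (in practice the log-probs come from the two models).
batch = 4
loss = dpo_loss(torch.randn(batch), torch.randn(batch), torch.randn(batch), torch.randn(batch))
print(loss.item())
```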

🎯Inference with Rewards

  • Self-consistency improves chain of thought reasoning in language models. arXiv, 2022. paper

  • Solving math word problems via cooperative reasoning induced language models. arXiv, 2022. paper

  • Lever: Learning to verify language-to-code generation with execution. ICML, 2023. paper

  • Math-shepherd: Verify and reinforce llms step-by-step without human annotations. arXiv, 2023. paper

  • V-star: Training verifiers for self-taught reasoners. arXiv, 2024. paper

  • Fast best-of-n decoding via speculative rejection. arXiv, 2024. paper

  • Generative verifiers: Reward modeling as next-token prediction. arXiv, 2024. paper

  • ViLBench: A Suite for Vision-Language Process Reward Modeling. arXiv, 2025. paper

  • VisualPRM: An Effective Process Reward Model for Multimodal Reasoning. arXiv, 2025. paper

  • The lessons of developing process reward models in mathematical reasoning. arXiv, 2025. paper

  • Reward-augmented decoding: Efficient controlled text generation with a unidirectional reward model. arXiv, 2023. paper

  • Reasoning with language model is planning with world model. arXiv, 2023. paper

  • Grace: Discriminator-guided chain-of-thought reasoning. arXiv, 2023. paper

  • Tree of thoughts: Deliberate problem solving with large language models. arXiv, 2023. paper

  • Ovm, outcome-supervised value models for planning in mathematical reasoning. arXiv, 2023. paper

  • Planning with large language models for code generation. arXiv, 2023. paper

  • Enhancing llm reasoning with reward-guided tree search. arXiv, 2024. paper

  • ARGS: Alignment as Reward-Guided Search. arXiv, 2024. paper

  • Cascade reward sampling for efficient decoding-time alignment. arXiv, 2024. paper

  • Ensembling Large Language Models with Process Reward-Guided Tree Search for Better Complex Reasoning. arXiv, 2024. paper

  • Mutual reasoning makes smaller llms stronger problem-solvers. arXiv, 2024. paper

  • Efficient Controlled Language Generation with Low-Rank Autoregressive Reward Models. arXiv, 2024. paper

  • Outcome-Refining Process Supervision for Code Generation. arXiv, 2024. paper

  • Rest-mcts*: Llm self-training via process reward guided tree search. arXiv, 2024. paper

  • rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking. arXiv, 2025. paper

  • Reward-Guided Speculative Decoding for Efficient LLM Reasoning. arXiv, 2025. paper

  • Towards Cost-Effective Reward Guided Text Generation. arXiv, 2025. paper

  • AlphaOne: Reasoning Models Thinking Slow and Fast at Test Time. arXiv, 2025. paper
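
Several of the inference-time papers above share the same best-of-N pattern: sample multiple candidates and keep the one an outcome reward model scores highest. A minimal sketch of that pattern follows, with `generate` and `score` as illustrative placeholders for a generator and a reward model rather than any specific API.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Candidate:
    text: str
    reward: float

def best_of_n(prompt: str,
              generate: Callable[[str], str],      # samples one completion
              score: Callable[[str, str], float],  # outcome reward model
              n: int = 8) -> Candidate:
    """Sample n candidates independently and return the highest-reward one."""
    candidates: List[Candidate] = []
    for _ in range(n):
        completion = generate(prompt)
        candidates.append(Candidate(completion, score(prompt, completion)))
    return max(candidates, key=lambda c: c.reward)
```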

🎯Post-Inference with Rewards

  • Chain-of-verification reduces hallucination in large language models. arXiv, 2023. paper

  • Self-refine: Iterative refinement with self-feedback. arXiv, 2023. paper

  • Reflexion: Language agents with verbal reinforcement learning. arXiv, 2023. paper

  • Training language models to self-correct via reinforcement learning. arXiv, 2024. paper

  • Recursive introspection: Teaching language model agents how to self-improve. arXiv, 2024. paper

  • Reward Is Enough: LLMs Are In-Context Reinforcement Learners. arXiv, 2025. paper

  • Rarr: Researching and revising what language models say, using language models. arXiv, 2022. paper

  • Coderl: Mastering code generation through pretrained models and deep reinforcement learning. arXiv, 2022. paper

  • Teaching large language models to self-debug. arXiv, 2023. paper

  • FacTool: Factuality Detection in Generative AI--A Tool Augmented Framework for Multi-Task and Multi-Domain Scenarios. arXiv, 2023. paper

  • Baldur: Whole-proof generation and repair with large language models. arXiv, 2023. paper

  • Critic: Large language models can self-correct with tool-interactive critiquing. arXiv, 2023. paper

  • Large language models cannot self-correct reasoning yet. arXiv, 2023. paper

  • Selfevolve: A code evolution framework via large language models. arXiv, 2023. paper

  • Language models can solve computer tasks. arXiv, 2023. paper

  • Encouraging divergent thinking in large language models through multi-agent debate. arXiv, 2023. paper

  • Self-refine: Iterative refinement with self-feedback. arXiv, 2023. paper

  • Automatically correcting large language models: Surveying the landscape of diverse self-correction strategies. arXiv, 2023. paper

  • Logic-lm: Empowering large language models with symbolic solvers for faithful logical reasoning. arXiv, 2023. paper

  • Refiner: Reasoning feedback on intermediate representations. arXiv, 2023. paper

  • Check your facts and try again: Improving large language models with external knowledge and automated feedback. arXiv, 2023. paper

  • Phenomenal yet puzzling: Testing inductive reasoning capabilities of language models with hypothesis refinement. arXiv, 2023. paper

  • Shepherd: A critic for language model generation. arXiv, 2023. paper

  • Improving language models via plug-and-play retrieval feedback. arXiv, 2023. paper

  • Self-edit: Fault-aware code editor for code generation. arXiv, 2023. paper

  • Dress: Instructing large vision-language models to align and interact with humans via natural language feedback. arXiv, 2024. paper

  • Cycle: Learning to self-refine the code generation. arXiv, 2024. paper

  • When can llms actually correct their own mistakes? a critical survey of self-correction of llms. arXiv, 2024. paper

  • Enhancing llm reasoning via critique models with test-time and training-time supervision. arXiv, 2024. paper

  • Two Heads Are Better Than One: Dual-Model Verbal Reflection at Inference-Time. arXiv, 2025. paper

  • Teaching Language Models to Critique via Reinforcement Learning. arXiv, 2025. paper

  • Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback. arXiv, 2025. paper
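
The post-inference papers above largely follow a generate-critique-refine loop, with the reward delivered as natural-language feedback. The sketch below captures that loop under the assumption that `llm` and `critic` are placeholder callables for a language model and a feedback source (a critic model, a tool, or the model itself).

```python
from typing import Callable, Tuple

def refine_loop(task: str,
                llm: Callable[[str], str],
                critic: Callable[[str, str], Tuple[bool, str]],
                max_rounds: int = 3) -> str:
    """Iteratively revise an answer using natural-language feedback as the reward signal."""
    answer = llm(f"Solve the task:\n{task}")
    for _ in range(max_rounds):
        accepted, feedback = critic(task, answer)  # verbal reward from a critic or tool
        if accepted:                               # stop once the feedback source accepts
            break
        answer = llm(f"Task:\n{task}\nPrevious answer:\n{answer}\n"
                     f"Feedback:\n{feedback}\nRevise the answer accordingly.")
    return answer
```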

📏Benchmarking Reward Models

  • Rewardbench: Evaluating reward models for language modeling. arXiv, 2024. paper

  • Criticbench: Benchmarking llms for critique-correct reasoning. arXiv, 2024. paper

  • Acemath: Advancing frontier math reasoning with post-training and reward modeling. arXiv, 2024. paper

  • RM-bench: Benchmarking reward models of language models with subtlety and style. arXiv, 2024. paper

  • The critique of critique. arXiv, 2024. paper

  • RMB: Comprehensively Benchmarking Reward Models in LLM Alignment. arXiv, 2024. paper

  • Evaluating Judges as Evaluators: The JETTS Benchmark of LLM-as-Judges as Test-Time Scaling Evaluators. arXiv, 2025. paper

  • Training verifiers to solve math word problems. arXiv, 2021. paper

  • Measuring mathematical problem solving with the math dataset. arXiv, 2021. paper

  • Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models. TMLR, 2023. paper

  • LLMs cannot find reasoning errors, but can correct them given the error location. arXiv, 2023. paper

  • Mr-gsm8k: A meta-reasoning benchmark for large language model evaluation. arXiv, 2023. paper

  • Omni-math: A universal olympiad level mathematic benchmark for large language models. arXiv, 2024. paper

  • Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems. arXiv, 2024. paper

  • Evaluating mathematical reasoning beyond accuracy. arXiv, 2024. paper

  • Mr-ben: A meta-reasoning benchmark for evaluating system-2 thinking in llms. arXiv, 2024. paper

  • Processbench: Identifying process errors in mathematical reasoning. arXiv, 2024. paper

  • Is your model really a good math reasoner? evaluating mathematical reasoning with checklist. arXiv, 2024. paper

  • PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models. arXiv, 2025. paper

  • MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation? arXiv, 2024. paper

  • Mllm-as-a-judge: Assessing multimodal llm-as-a-judge with vision-language benchmark. arXiv, 2024. paper

  • VLRewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models. arXiv, 2024. paper

  • Benchmarking Multimodal CoT Reward Model Stepwise by Visual Program. arXiv, 2025. paper

  • VLRMBench: A comprehensive and challenging benchmark for vision-language reward models. arXiv, 2025. paper

  • Multimodal rewardbench: Holistic evaluation of reward models for vision language models. arXiv, 2025. paper

  • How to Evaluate Reward Models for RLHF. arXiv, 2024. paper

  • M-RewardBench: Evaluating Reward Models in Multilingual Settings. arXiv, 2024. paper

  • RAG-RewardBench: Benchmarking Reward Models in Retrieval Augmented Generation for Preference Alignment. arXiv, 2024. paper
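
Most reward-model benchmarks listed above (e.g., RewardBench, RM-Bench) reduce to pairwise accuracy over (prompt, chosen, rejected) triples: the reward model is correct when it scores the chosen response higher. A minimal sketch of that metric follows, with `score` standing in for any reward model; the interface is an assumption for illustration.

```python
from typing import Callable, Iterable, Tuple

def pairwise_accuracy(triples: Iterable[Tuple[str, str, str]],
                      score: Callable[[str, str], float]) -> float:
    """Fraction of (prompt, chosen, rejected) triples where the chosen response scores higher."""
    correct = 0
    total = 0
    for prompt, chosen, rejected in triples:
        correct += int(score(prompt, chosen) > score(prompt, rejected))
        total += 1
    return correct / total if total else 0.0
```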

🚀Applications

  • Constitutional AI: Harmlessness from AI feedback. arXiv, 2022. paper

  • Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv, 2022. paper

  • Training language models to follow instructions with human feedback. NeurIPS, 2022. paper

  • Reward-augmented decoding: Efficient controlled text generation with a unidirectional reward model. arXiv, 2023. paper

  • Beavertails: Towards improved safety alignment of llm via a human-preference dataset. arXiv, 2023. paper

  • Check your facts and try again: Improving large language models with external knowledge and automated feedback. arXiv, 2023. paper

  • Aligning large multimodal models with factually augmented rlhf. arXiv, 2023. paper

  • Fine-tuning language models for factuality. arXiv, 2023. paper

  • Shepherd: A critic for language model generation. arXiv, 2023. paper

  • Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization. arXiv, 2023. paper

  • ARGS: Alignment as Reward-Guided Search. arXiv, 2024. paper

  • Flame: Factuality-aware alignment for large language models. arXiv, 2024. paper

  • On-policy fine-grained knowledge feedback for hallucination mitigation. arXiv, 2024. paper

  • Training verifiers to solve math word problems. arXiv, 2021. paper

  • Solving math word problems with process- and outcome-based feedback. arXiv, 2022. paper

  • Reasoning with language model is planning with world model. arXiv, 2023. paper

  • Let's verify step by step. arXiv, 2023. paper

  • WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct. arXiv, 2023. paper

  • Math-shepherd: Verify and reinforce llms step-by-step without human annotations. arXiv, 2023. paper

  • Step-DPO: Step-wise preference optimization for long-chain reasoning of llms. arXiv, 2024. paper

  • Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv, 2024. paper

  • DeepSeek-Prover-V2: Advancing Formal Mathematical Reasoning via Reinforcement Learning for Subgoal Decomposition. arXiv, 2025. paper

  • rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking. arXiv, 2025. paper

  • Full-step-dpo: Self-supervised preference optimization with step-wise rewards for mathematical reasoning. arXiv, 2025. paper

  • Teaching large language models to self-debug. arXiv, 2023. paper

  • RLTF: Reinforcement Learning from Unit Test Feedback. TMLR, 2023. paper

  • Lever: Learning to verify language-to-code generation with execution. ICML, 2023. paper

  • Reflexion: Language agents with verbal reinforcement learning. arXiv, 2023. paper

  • Self-edit: Fault-aware code editor for code generation. arXiv, 2023. paper

  • Stepcoder: Improve code generation with reinforcement learning from compiler feedback. arXiv, 2024. paper

  • V-star: Training verifiers for self-taught reasoners. arXiv, 2024. paper

  • Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv, 2024. paper

  • CodeLutra: Boosting LLM Code Generation via Preference-Guided Refinement. arXiv, 2024. paper

  • Outcome-Refining Process Supervision for Code Generation. arXiv, 2024. paper

  • Deepseek-coder-v2: Breaking the barrier of closed-source models in code intelligence. arXiv, 2024. paper

  • Teaching Language Models to Critique via Reinforcement Learning. arXiv, 2025. paper

  • RefineCoder: Iterative Improving of Large Language Models via Adaptive Critique Refinement for Code Generation. arXiv, 2025. paper

  • R1-V: Reinforcing Super Generalization Ability in Vision-Language Models with Less Than $3. arXiv, 2025. paper

  • Video-r1: Reinforcing video reasoning in mllms. arXiv, 2025. paper

  • Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step. arXiv, 2025. paper

  • Vision-r1: Incentivizing reasoning capability in multimodal large language models. arXiv, 2025. paper

  • Q-Insight: Understanding Image Quality via Visual Reinforcement Learning. arXiv, 2025. paper

  • Relation-R1: Cognitive Chain-of-Thought Guided Reinforcement Learning for Unified Relational Comprehension. arXiv, 2025. paper

  • VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning. arXiv, 2025. paper

  • OThink-MR1: Stimulating multimodal generalized reasoning capabilities through dynamic reinforcement learning. arXiv, 2025. paper

  • Seg-zero: Reasoning-chain guided segmentation via cognitive reinforcement. arXiv, 2025. paper

  • Video-t1: Test-time scaling for video generation. arXiv, 2025. paper

  • Visual-rft: Visual reinforcement fine-tuning. arXiv, 2025. paper

  • MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning. arXiv, 2025. paper

  • Vlm-r1: A stable and generalizable r1-style large vision-language model. arXiv, 2025. paper

  • Reason-rft: Reinforcement fine-tuning for visual reasoning. arXiv, 2025. paper

  • CrowdVLM-R1: Expanding R1 Ability to Vision Language Model for Crowd Counting using Fuzzy Group Relative Policy Reward. arXiv, 2025. paper

  • TinyLLaVA-Video-R1: Towards Smaller LMMs for Video Reasoning. arXiv, 2025. paper

  • R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model. arXiv, 2025. paper

  • Process reward models for llm agents: Practical framework and directions. arXiv, 2025. paper

  • InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners. arXiv, 2025. paper

  • UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement Learning. arXiv, 2025. paper

  • KBQA-o1: Agentic Knowledge Base Question Answering with Monte Carlo Tree Search. arXiv, 2025. paper

  • Introducing deep research. OpenAI, 2025. paper

  • RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning. arXiv, 2025. paper

  • AgentRM: Enhancing Agent Generalization with Reward Modeling. arXiv, 2025. paper

  • DeepResearcher: Scaling Deep Research via Reinforcement Learning in Real-world Environments. arXiv, 2025. paper

  • Cosmos-reason1: From physical common sense to embodied reasoning. arXiv, 2025. paper

  • ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning. arXiv, 2025. paper

  • ReTool: Reinforcement Learning for Strategic Tool Use in LLMs. arXiv, 2025. paper

  • Synthetic Data Generation & Multi-Step RL for Reasoning & Tool Use. arXiv, 2025. paper

  • Improving Vision-Language-Action Model with Online Reinforcement Learning. arXiv, 2025. paper

  • DeepRetrieval: Hacking Real Search Engines and Retrievers with Large Language Models via Reinforcement Learning. arXiv, 2025. paper

  • Search-r1: Training llms to reason and leverage search engines with reinforcement learning. arXiv, 2025. paper

  • Torl: Scaling tool-integrated rl. arXiv, 2025. paper

  • WebThinker: Empowering Large Reasoning Models with Deep Research Capability. arXiv, 2025. paper

  • Rec-R1: Bridging Generative Large Language Models and User-Centric Recommendation Systems via Reinforcement Learning. arXiv, 2025. paper

  • Fin-R1: A Large Language Model for Financial Reasoning through Reinforcement Learning. arXiv, 2025. paper

  • SQL-R1: Training Natural Language to SQL Reasoning Model By Reinforcement Learning. arXiv, 2025. paper

  • Medvlm-r1: Incentivizing medical reasoning capability of vision-language models (vlms) via reinforcement learning. arXiv, 2025. paper

  • ToolRL: Reward is All Tool Learning Needs. arXiv, 2025. paper

  • R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning. arXiv, 2025. paper

  • OTC: Optimal Tool Calls via Reinforcement Learning. arXiv, 2025. paper

  • Swe-rl: Advancing llm reasoning via reinforcement learning on open software evolution. arXiv, 2025. paper

  • Embodied-reasoner: Synergizing visual search, reasoning, and action for embodied interactive tasks. arXiv, 2025. paper

  • DianJin-R1: Evaluating and Enhancing Financial Reasoning in Large Language Models. arXiv, 2025. paper
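
Many of the reasoning and application papers above rely on rule-based, verifiable rewards rather than learned reward models. The sketch below illustrates one such reward for math answers; the \boxed{...} answer format and the string normalization are simplifying assumptions for illustration, not any paper's exact checker.

```python
import re
from typing import Optional

def extract_boxed(text: str) -> Optional[str]:
    """Return the content of the last \\boxed{...} expression in a model output."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

def math_reward(model_output: str, gold_answer: str) -> float:
    """1.0 if the extracted final answer matches the reference answer, else 0.0."""
    pred = extract_boxed(model_output)
    if pred is None:
        return 0.0
    def norm(s: str) -> str:
        return s.replace(" ", "").rstrip(".")
    return 1.0 if norm(pred) == norm(gold_answer) else 0.0

print(math_reward(r"... so the result is \boxed{42}.", "42"))  # 1.0
print(math_reward(r"... hence \boxed{41}.", "42"))             # 0.0
```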

💡Challenges and Future Directions

  • Avoiding tampering incentives in deep rl via decoupled approval. arXiv, 2020. paper

  • Reward tampering problems and solutions in reinforcement learning: A causal influence diagram perspective. arXiv, 2021. paper

  • Preprocessing reward functions for interpretability. arXiv, 2022. paper

  • The effects of reward misspecification: Mapping and mitigating misaligned models. arXiv, 2022. paper

  • A definition of continual reinforcement learning. arXiv, 2023. paper

  • Interpretable preferences via multi-objective reward modeling and mixture-of-experts. arXiv, 2024. paper

  • Sycophancy to subterfuge: Investigating reward-tampering in large language models. arXiv, 2024. paper

  • Feedback loops with language models drive in-context reward hacking. arXiv, 2024. paper

  • Spontaneous Reward Hacking in Iterative Self-Refinement. arXiv, 2024. paper

  • Reward Hacking in Reinforcement Learning. lilianweng.github.io, 2024. paper

  • CPPO: Continual learning for reinforcement learning with human feedback. arXiv, 2024. paper

  • Process Reward Models That Think. arXiv, 2025. paper

  • GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning. arXiv, 2025. paper

  • Inference-Time Scaling for Generalist Reward Modeling. arXiv, 2025. paper

  • Scaling LLM test-time compute optimally can be more effective than scaling parameters for reasoning. arXiv, 2025. paper

  • RRM: Robust Reward Model Training Mitigates Reward Hacking. arXiv, 2025. paper

  • Agentic reward modeling: Integrating human preferences with verifiable correctness signals for reliable reward systems. arXiv, 2025. paper

  • What makes a reward model a good teacher? an optimization perspective. arXiv, 2025. paper

  • Seal: Systematic error analysis for value alignment. arXiv, 2025. paper

  • Exploring data scaling trends and effects in reinforcement learning from human feedback. arXiv, 2025. paper

  • Welcome to the Era of Experience. Google AI, 2025. paper

  • Rethinking the Foundations for Continual Reinforcement Learning. arXiv, 2025. paper

📬Contact

  • We welcome your contributions to this project. Please feel free to submit pull requests.
  • If you encounter any issues, please either directly contact Xiaobao Wu (xiaobao002@e.ntu.edu.sg) or open an issue in the GitHub repo.

📖Citation

If you are interested in our survey paper, please cite it as:

@article{wu2025sailing,
    title    = {Sailing AI by the Stars: A Survey of Learning from Rewards in Post-Training and Test-Time Scaling of Large Language Models},
    author   = {Wu, Xiaobao},
    year     = 2025,
    journal  = {arXiv preprint arXiv:2505.02686},
    url      = {https://arxiv.org/pdf/2505.02686}
}
