
📚Awesome-Learning-from-Rewards-Papers


This repository accompanies our survey paper:
Sailing by the Stars: A Survey on Reward Models and Learning Strategies for Learning from Rewards.
We curate a comprehensive collection of papers on learning from rewards in the post-training and test-time scaling of LLMs, covering both reward models and learning strategies across the training, inference, and post-inference stages.

We welcome contributions from the community, so feel free to submit pull requests to add missing papers!


Figure 1: Scaling phases of LLMs.



Figure 2: Conceptual framework of learning from rewards.

🎯Training with Rewards

  • Proximal policy optimization algorithms. arXiv, 2017. paper

  • Fine-Tuning Language Models from Human Preferences. arXiv, 2019. paper

  • Constitutional AI: Harmlessness from AI feedback. arXiv, 2022. paper

  • Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv, 2022. paper

  • Improving alignment of dialogue agents via targeted human judgements. arXiv, 2022. paper

  • Training language models to follow instructions with human feedback. NeurIPS, 2022. paper

  • Safe rlhf: Safe reinforcement learning from human feedback. arXiv, 2023. paper

  • RLTF: Reinforcement Learning from Unit Test Feedback. TMLR, 2023. paper

  • Aligning large multimodal models with factually augmented rlhf. arXiv, 2023. paper

  • Fine-grained human feedback gives better rewards for language model training. arXiv, 2023. paper

  • Human preference score: Better aligning text-to-image models with human preference. arXiv, 2023. paper

  • Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis. arXiv, 2023. paper

  • Imagereward: Learning and evaluating human preferences for text-to-image generation. arXiv, 2023. paper

  • Tuning large multimodal models for videos using reinforcement learning from ai feedback. arXiv, 2024. paper

  • Stepcoder: Improve code generation with reinforcement learning from compiler feedback. arXiv, 2024. paper

  • RLEF: Grounding Code LLMs in Execution Feedback with Reinforcement Learning. arXiv, 2024. paper

  • Rich human feedback for text-to-image generation. arXiv, 2024. paper

  • Skywork-reward: Bag of tricks for reward modeling in llms. arXiv, 2024. paper

  • Lift: Leveraging human feedback for text-to-video model alignment. arXiv, 2024. paper

  • Self-taught evaluators. arXiv, 2024. paper

  • REINFORCE++: A Simple and Efficient Approach for Aligning Large Language Models. arXiv, 2025. paper

  • Learning to Reason under Off-Policy Guidance. arXiv, 2025. paper

  • VinePPO: Refining Credit Assignment in RL Training of LLMs. arXiv, 2025. paper

  • Judging llm-as-a-judge with mt-bench and chatbot arena. arXiv, 2023. paper

  • Generative judge for evaluating alignment. arXiv, 2023. paper

  • Critique-out-loud reward models. arXiv, 2024. paper

  • Compassjudger-1: All-in-one judge model helps model evaluation and evolution. arXiv, 2024. paper

  • Llm critics help catch bugs in mathematics: Towards a better mathematical verifier with natural language feedback. arXiv, 2024. paper

  • Direct judgement preference optimization. arXiv, 2024. paper

  • Llava-critic: Learning to evaluate multimodal models. arXiv, 2024. paper

  • Beyond Scalar Reward Model: Learning Generative Judge from Preference Data. arXiv, 2024. paper

  • Self-Generated Critiques Boost Reward Modeling for Language Models. arXiv, 2024. paper

  • Generative verifiers: Reward modeling as next-token prediction. arXiv, 2024. paper

  • Inference-Time Scaling for Generalist Reward Modeling. arXiv, 2025. paper

  • Think-J: Learning to Think for Generative LLM-as-a-Judge. arXiv, 2025. paper

  • Unified multimodal chain-of-thought reward model through reinforcement fine-tuning. arXiv, 2025. paper

  • Mm-rlhf: The next step forward in multimodal llm alignment. arXiv, 2025. paper

  • Improve LLM-as-a-Judge Ability as a General Ability. arXiv, 2025. paper

  • Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback. arXiv, 2025. paper

  • Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv, 2022. paper

  • Raft: Reward ranked finetuning for generative foundation model alignment. arXiv, 2023. paper

  • Reinforced Self-Training (ReST) for Language Modeling. arXiv, 2023. paper

  • Making language models better tool learners with execution feedback. arXiv, 2023. paper

  • Direct preference optimization: Your language model is secretly a reward model. arXiv, 2023. paper

  • RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback. arXiv, 2023. paper

  • Rrhf: Rank responses to align language models with human feedback without tears. arXiv, 2023. paper

  • Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization. arXiv, 2023. paper

  • KTO: Model Alignment as Prospect Theoretic Optimization. arXiv, 2024. paper

  • Preference Optimization for Reasoning with Pseudo Feedback. arXiv, 2024. paper

  • Step-DPO: Step-wise preference optimization for long-chain reasoning of llms. arXiv, 2024. paper

  • Flame: Factuality-aware alignment for large language models. arXiv, 2024. paper

  • Simpo: Simple preference optimization with a reference-free reward. arXiv, 2024. paper

  • Strengthening Multimodal Large Language Model with Bootstrapped Preference Optimization. arXiv, 2024. paper

  • Self-Consistency Preference Optimization. arXiv, 2024. paper

  • CodeLutra: Boosting LLM Code Generation via Preference-Guided Refinement. arXiv, 2024. paper

  • Diffusion model alignment using direct preference optimization. arXiv, 2024. paper

  • mdpo: Conditional preference optimization for multimodal large language models. arXiv, 2024. paper

  • Meta-rewarding language models: Self-improving alignment with llm-as-a-meta-judge. arXiv, 2024. paper

  • RLAIF-V: Aligning MLLMs through Open-Source AI Feedback for Super GPT-4V Trustworthiness. arXiv, 2024. paper

  • Self-Rewarding Language Models. arXiv, 2024. paper

  • Aligning Modalities in Vision Large Language Models via Preference Fine-tuning. arXiv, 2024. paper

  • Unified Reward Model for Multimodal Understanding and Generation. arXiv, 2025. paper

  • RefineCoder: Iterative Improving of Large Language Models via Adaptive Critique Refinement for Code Generation. arXiv, 2025. paper

  • Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv, 2024. paper

  • Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv, 2025. paper

  • Improving Generalization in Intent Detection: GRPO with Reward-Based Curriculum Sampling. arXiv, 2025. paper

  • Open r1: A fully open reproduction of deepseek-r1. GitHub, 2025. paper

  • CLS-RL: Image Classification with Rule-Based Reinforcement Learning. arXiv, 2025. paper

  • Visual-rft: Visual reinforcement fine-tuning. arXiv, 2025. paper

  • Logic-rl: Unleashing llm reasoning with rule-based reinforcement learning. arXiv, 2025. paper

  • Not All Rollouts are Useful: Down-Sampling Rollouts in LLM Reinforcement Learning. arXiv, 2025. paper

  • Dapo: An open-source llm reinforcement learning system at scale. arXiv, 2025. paper

  • R1-vl: Learning to reason with multimodal large language models via step-wise group relative policy optimization. arXiv, 2025. paper

  • Right Question is Already Half the Answer: Fully Unsupervised LLM Reasoning Incentivization. arXiv, 2025. paper

  • TTRL: Test-Time Reinforcement Learning. arXiv, 2025. paper

  • Spurious Rewards: Rethinking Training Signals in RLVR. arXiv, 2025. paper

  • WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct. arXiv, 2023. paper

  • Math-shepherd: Verify and reinforce llms step-by-step without human annotations. arXiv, 2023. paper

  • Process reward model with q-value rankings. arXiv, 2024. paper

  • Diving into Self-Evolving Training for Multimodal Reasoning. arXiv, 2024. paper

  • Improve mathematical reasoning in language models by automated process supervision. arXiv, 2024. paper

  • Process reinforcement through implicit rewards. arXiv, 2025. paper

  • Efficient Process Reward Model Training via Active Learning. arXiv, 2025. paper

  • Process Reward Models That Think. arXiv, 2025. paper

  • Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning. arXiv, 2025. paper

  • R-PRM: Reasoning-Driven Process Reward Modeling. arXiv, 2025. paper

  • Towards Hierarchical Multi-Step Reward Models for Enhanced Reasoning in Large Language Models. arXiv, 2025. paper

  • GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning. arXiv, 2025. paper

  • SCOPE: Compress Mathematical Reasoning Steps for Efficient Automated Process Annotation. arXiv, 2025. paper
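
Many of the training-time papers above optimize directly on preference pairs, most prominently via the DPO objective. As a rough orientation, here is a minimal PyTorch-style sketch of that loss, assuming per-sequence log-probabilities from the policy and a frozen reference model are already available; the function and variable names are illustrative, not taken from any specific codebase.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss over a batch of (chosen, rejected) preference pairs."""
    # Implicit rewards are the policy/reference log-probability ratios.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Bradley-Terry preference loss: the chosen response should out-score the rejected one.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Dummy batch of 4 preference pairs (in practice the log-probs come from the two models).
batch = 4
loss = dpo_loss(torch.randn(batch), torch.randn(batch), torch.randn(batch), torch.randn(batch))
print(loss.item())
```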

🎯Inference with Rewards

  • Self-consistency improves chain of thought reasoning in language models. arXiv, 2022. paper

  • Solving math word problems via cooperative reasoning induced language models. arXiv, 2022. paper

  • Lever: Learning to verify language-to-code generation with execution. ICML, 2023. paper

  • Math-shepherd: Verify and reinforce llms step-by-step without human annotations. arXiv, 2023. paper

  • V-star: Training verifiers for self-taught reasoners. arXiv, 2024. paper

  • Fast best-of-n decoding via speculative rejection. arXiv, 2024. paper

  • Generative verifiers: Reward modeling as next-token prediction. arXiv, 2024. paper

  • ViLBench: A Suite for Vision-Language Process Reward Modeling. arXiv, 2025. paper

  • VisualPRM: An Effective Process Reward Model for Multimodal Reasoning. arXiv, 2025. paper

  • The lessons of developing process reward models in mathematical reasoning. arXiv, 2025. paper

  • Reward-augmented decoding: Efficient controlled text generation with a unidirectional reward model. arXiv, 2023. paper

  • Reasoning with language model is planning with world model. arXiv, 2023. paper

  • Grace: Discriminator-guided chain-of-thought reasoning. arXiv, 2023. paper

  • Tree of thoughts: Deliberate problem solving with large language models. arXiv, 2023. paper

  • Ovm, outcome-supervised value models for planning in mathematical reasoning. arXiv, 2023. paper

  • Planning with large language models for code generation. arXiv, 2023. paper

  • Enhancing llm reasoning with reward-guided tree search. arXiv, 2024. paper

  • ARGS: Alignment as Reward-Guided Search. arXiv, 2024. paper

  • Cascade reward sampling for efficient decoding-time alignment. arXiv, 2024. paper

  • Ensembling Large Language Models with Process Reward-Guided Tree Search for Better Complex Reasoning. arXiv, 2024. paper

  • Mutual reasoning makes smaller llms stronger problem-solvers. arXiv, 2024. paper

  • Efficient Controlled Language Generation with Low-Rank Autoregressive Reward Models. arXiv, 2024. paper

  • Outcome-Refining Process Supervision for Code Generation. arXiv, 2024. paper

  • Rest-mcts*: Llm self-training via process reward guided tree search. arXiv, 2024. paper

  • rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking. arXiv, 2025. paper

  • Reward-Guided Speculative Decoding for Efficient LLM Reasoning. arXiv, 2025. paper

  • Towards Cost-Effective Reward Guided Text Generation. arXiv, 2025. paper

  • AlphaOne: Reasoning Models Thinking Slow and Fast at Test Time. arXiv, 2025. paper
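
Several of the inference-time papers above share the same best-of-N pattern: sample multiple candidates and keep the one an outcome reward model scores highest. A minimal sketch of that pattern follows, with `generate` and `score` as illustrative placeholders for a generator and a reward model rather than any specific API.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Candidate:
    text: str
    reward: float

def best_of_n(prompt: str,
              generate: Callable[[str], str],      # samples one completion
              score: Callable[[str, str], float],  # outcome reward model
              n: int = 8) -> Candidate:
    """Sample n candidates independently and return the highest-reward one."""
    candidates: List[Candidate] = []
    for _ in range(n):
        completion = generate(prompt)
        candidates.append(Candidate(completion, score(prompt, completion)))
    return max(candidates, key=lambda c: c.reward)
```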

🎯Post-Inference with Rewards

  • Chain-of-verification reduces hallucination in large language models. arXiv, 2023. paper

  • Self-refine: Iterative refinement with self-feedback. arXiv, 2023. paper

  • Reflexion: Language agents with verbal reinforcement learning. arXiv, 2023. paper

  • Training language models to self-correct via reinforcement learning. arXiv, 2024. paper

  • Recursive introspection: Teaching language model agents how to self-improve. arXiv, 2024. paper

  • Reward Is Enough: LLMs Are In-Context Reinforcement Learners. arXiv, 2025. paper

  • Rarr: Researching and revising what language models say, using language models. arXiv, 2022. paper

  • Coderl: Mastering code generation through pretrained models and deep reinforcement learning. arXiv, 2022. paper

  • Teaching large language models to self-debug. arXiv, 2023. paper

  • FacTool: Factuality Detection in Generative AI--A Tool Augmented Framework for Multi-Task and Multi-Domain Scenarios. arXiv, 2023. paper

  • Baldur: Whole-proof generation and repair with large language models. arXiv, 2023. paper

  • Critic: Large language models can self-correct with tool-interactive critiquing. arXiv, 2023. paper

  • Large language models cannot self-correct reasoning yet. arXiv, 2023. paper

  • Selfevolve: A code evolution framework via large language models. arXiv, 2023. paper

  • Language models can solve computer tasks. arXiv, 2023. paper

  • Encouraging divergent thinking in large language models through multi-agent debate. arXiv, 2023. paper

  • Self-refine: Iterative refinement with self-feedback. arXiv, 2023. paper

  • Automatically correcting large language models: Surveying the landscape of diverse self-correction strategies. arXiv, 2023. paper

  • Logic-lm: Empowering large language models with symbolic solvers for faithful logical reasoning. arXiv, 2023. paper

  • Refiner: Reasoning feedback on intermediate representations. arXiv, 2023. paper

  • Check your facts and try again: Improving large language models with external knowledge and automated feedback. arXiv, 2023. paper

  • Phenomenal yet puzzling: Testing inductive reasoning capabilities of language models with hypothesis refinement. arXiv, 2023. paper

  • Shepherd: A critic for language model generation. arXiv, 2023. paper

  • Improving language models via plug-and-play retrieval feedback. arXiv, 2023. paper

  • Self-edit: Fault-aware code editor for code generation. arXiv, 2023. paper

  • Dress: Instructing large vision-language models to align and interact with humans via natural language feedback. arXiv, 2024. paper

  • Cycle: Learning to self-refine the code generation. arXiv, 2024. paper

  • When can llms actually correct their own mistakes? a critical survey of self-correction of llms. arXiv, 2024. paper

  • Enhancing llm reasoning via critique models with test-time and training-time supervision. arXiv, 2024. paper

  • Two Heads Are Better Than One: Dual-Model Verbal Reflection at Inference-Time. arXiv, 2025. paper

  • Teaching Language Models to Critique via Reinforcement Learning. arXiv, 2025. paper

  • Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback. arXiv, 2025. paper
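
The post-inference papers above largely follow a generate-critique-refine loop, with the reward delivered as natural-language feedback. The sketch below captures that loop under the assumption that `llm` and `critic` are placeholder callables for a language model and a feedback source (a critic model, a tool, or the model itself).

```python
from typing import Callable, Tuple

def refine_loop(task: str,
                llm: Callable[[str], str],
                critic: Callable[[str, str], Tuple[bool, str]],
                max_rounds: int = 3) -> str:
    """Iteratively revise an answer using natural-language feedback as the reward signal."""
    answer = llm(f"Solve the task:\n{task}")
    for _ in range(max_rounds):
        accepted, feedback = critic(task, answer)  # verbal reward from a critic or tool
        if accepted:                               # stop once the feedback source accepts
            break
        answer = llm(f"Task:\n{task}\nPrevious answer:\n{answer}\n"
                     f"Feedback:\n{feedback}\nRevise the answer accordingly.")
    return answer
```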

📏Benchmarking Reward Models

  • Rewardbench: Evaluating reward models for language modeling. arXiv, 2024. paper

  • Criticbench: Benchmarking llms for critique-correct reasoning. arXiv, 2024. paper

  • Acemath: Advancing frontier math reasoning with post-training and reward modeling. arXiv, 2024. paper

  • RM-bench: Benchmarking reward models of language models with subtlety and style. arXiv, 2024. paper

  • The critique of critique. arXiv, 2024. paper

  • RMB: Comprehensively Benchmarking Reward Models in LLM Alignment. arXiv, 2024. paper

  • Evaluating Judges as Evaluators: The JETTS Benchmark of LLM-as-Judges as Test-Time Scaling Evaluators. arXiv, 2025. paper

  • Training verifiers to solve math word problems. arXiv, 2021. paper

  • Measuring mathematical problem solving with the math dataset. arXiv, 2021. paper

  • Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models. TMLR, 2023. paper

  • LLMs cannot find reasoning errors, but can correct them given the error location. arXiv, 2023. paper

  • Mr-gsm8k: A meta-reasoning benchmark for large language model evaluation. arXiv, 2023. paper

  • Omni-math: A universal olympiad level mathematic benchmark for large language models. arXiv, 2024. paper

  • Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems. arXiv, 2024. paper

  • Evaluating mathematical reasoning beyond accuracy. arXiv, 2024. paper

  • Mr-ben: A meta-reasoning benchmark for evaluating system-2 thinking in llms. arXiv, 2024. paper

  • Processbench: Identifying process errors in mathematical reasoning. arXiv, 2024. paper

  • Is your model really a good math reasoner? evaluating mathematical reasoning with checklist. arXiv, 2024. paper

  • PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models. arXiv, 2025. paper

  • MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation? arXiv, 2024. paper

  • Mllm-as-a-judge: Assessing multimodal llm-as-a-judge with vision-language benchmark. arXiv, 2024. paper

  • VLRewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models. arXiv, 2024. paper

  • Benchmarking Multimodal CoT Reward Model Stepwise by Visual Program. arXiv, 2025. paper

  • VLRMBench: A comprehensive and challenging benchmark for vision-language reward models. arXiv, 2025. paper

  • Multimodal rewardbench: Holistic evaluation of reward models for vision language models. arXiv, 2025. paper

  • How to Evaluate Reward Models for RLHF. arXiv, 2024. paper

  • M-RewardBench: Evaluating Reward Models in Multilingual Settings. arXiv, 2024. paper

  • RAG-RewardBench: Benchmarking Reward Models in Retrieval Augmented Generation for Preference Alignment. arXiv, 2024. paper
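
Most reward-model benchmarks listed above (e.g., RewardBench, RM-Bench) reduce to pairwise accuracy over (prompt, chosen, rejected) triples: the reward model is correct when it scores the chosen response higher. A minimal sketch of that metric follows, with `score` standing in for any reward model; the interface is an assumption for illustration.

```python
from typing import Callable, Iterable, Tuple

def pairwise_accuracy(triples: Iterable[Tuple[str, str, str]],
                      score: Callable[[str, str], float]) -> float:
    """Fraction of (prompt, chosen, rejected) triples where the chosen response scores higher."""
    correct = 0
    total = 0
    for prompt, chosen, rejected in triples:
        correct += int(score(prompt, chosen) > score(prompt, rejected))
        total += 1
    return correct / total if total else 0.0
```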

🚀Applications

  • Constitutional AI: Harmlessness from AI feedback. arXiv, 2022. paper

  • Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv, 2022. paper

  • Training language models to follow instructions with human feedback. NeurIPS, 2022. paper

  • Reward-augmented decoding: Efficient controlled text generation with a unidirectional reward model. arXiv, 2023. paper

  • Beavertails: Towards improved safety alignment of llm via a human-preference dataset. arXiv, 2023. paper

  • Check your facts and try again: Improving large language models with external knowledge and automated feedback. arXiv, 2023. paper

  • Aligning large multimodal models with factually augmented rlhf. arXiv, 2023. paper

  • Fine-tuning language models for factuality. arXiv, 2023. paper

  • Shepherd: A critic for language model generation. arXiv, 2023. paper

  • Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization. arXiv, 2023. paper

  • ARGS: Alignment as Reward-Guided Search. arXiv, 2024. paper

  • Flame: Factuality-aware alignment for large language models. arXiv, 2024. paper

  • On-policy fine-grained knowledge feedback for hallucination mitigation. arXiv, 2024. paper

  • Training verifiers to solve math word problems. arXiv, 2021. paper

  • Solving math word problems with process- and outcome-based feedback. arXiv, 2022. paper

  • Reasoning with language model is planning with world model. arXiv, 2023. paper

  • Let's verify step by step. arXiv, 2023. paper

  • WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct. arXiv, 2023. paper

  • Math-shepherd: Verify and reinforce llms step-by-step without human annotations. arXiv, 2023. paper

  • Step-DPO: Step-wise preference optimization for long-chain reasoning of llms. arXiv, 2024. paper

  • Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv, 2024. paper

  • DeepSeek-Prover-V2: Advancing Formal Mathematical Reasoning via Reinforcement Learning for Subgoal Decomposition. arXiv, 2025. paper

  • rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking. arXiv, 2025. paper

  • Full-step-dpo: Self-supervised preference optimization with step-wise rewards for mathematical reasoning. arXiv, 2025. paper

  • Teaching large language models to self-debug. arXiv, 2023. paper

  • RLTF: Reinforcement Learning from Unit Test Feedback. TMLR, 2023. paper

  • Lever: Learning to verify language-to-code generation with execution. ICML, 2023. paper

  • Reflexion: Language agents with verbal reinforcement learning. arXiv, 2023. paper

  • Self-edit: Fault-aware code editor for code generation. arXiv, 2023. paper

  • Stepcoder: Improve code generation with reinforcement learning from compiler feedback. arXiv, 2024. paper

  • V-star: Training verifiers for self-taught reasoners. arXiv, 2024. paper

  • Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv, 2024. paper

  • CodeLutra: Boosting LLM Code Generation via Preference-Guided Refinement. arXiv, 2024. paper

  • Outcome-Refining Process Supervision for Code Generation. arXiv, 2024. paper

  • Deepseek-coder-v2: Breaking the barrier of closed-source models in code intelligence. arXiv, 2024. paper

  • Teaching Language Models to Critique via Reinforcement Learning. arXiv, 2025. paper

  • RefineCoder: Iterative Improving of Large Language Models via Adaptive Critique Refinement for Code Generation. arXiv, 2025. paper

  • R1-V: Reinforcing Super Generalization Ability in Vision-Language Models with Less Than $3. arXiv, 2025. paper

  • Video-r1: Reinforcing video reasoning in mllms. arXiv, 2025. paper

  • Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step. arXiv, 2025. paper

  • Vision-r1: Incentivizing reasoning capability in multimodal large language models. arXiv, 2025. paper

  • Q-Insight: Understanding Image Quality via Visual Reinforcement Learning. arXiv, 2025. paper

  • Relation-R1: Cognitive Chain-of-Thought Guided Reinforcement Learning for Unified Relational Comprehension. arXiv, 2025. paper

  • VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning. arXiv, 2025. paper

  • OThink-MR1: Stimulating multimodal generalized reasoning capabilities through dynamic reinforcement learning. arXiv, 2025. paper

  • Seg-zero: Reasoning-chain guided segmentation via cognitive reinforcement. arXiv, 2025. paper

  • Video-t1: Test-time scaling for video generation. arXiv, 2025. paper

  • Visual-rft: Visual reinforcement fine-tuning. arXiv, 2025. paper

  • MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning. arXiv, 2025. paper

  • Vlm-r1: A stable and generalizable r1-style large vision-language model. arXiv, 2025. paper

  • Reason-rft: Reinforcement fine-tuning for visual reasoning. arXiv, 2025. paper

  • CrowdVLM-R1: Expanding R1 Ability to Vision Language Model for Crowd Counting using Fuzzy Group Relative Policy Reward. arXiv, 2025. paper

  • TinyLLaVA-Video-R1: Towards Smaller LMMs for Video Reasoning. arXiv, 2025. paper

  • R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model. arXiv, 2025. paper

  • Process reward models for llm agents: Practical framework and directions. arXiv, 2025. paper

  • InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners. arXiv, 2025. paper

  • UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement Learning. arXiv, 2025. paper

  • KBQA-o1: Agentic Knowledge Base Question Answering with Monte Carlo Tree Search. arXiv, 2025. paper

  • Introducing deep research. OpenAI, 2025. paper

  • RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning. arXiv, 2025. paper

  • AgentRM: Enhancing Agent Generalization with Reward Modeling. arXiv, 2025. paper

  • DeepResearcher: Scaling Deep Research via Reinforcement Learning in Real-world Environments. arXiv, 2025. paper

  • Cosmos-reason1: From physical common sense to embodied reasoning. arXiv, 2025. paper

  • ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning. arXiv, 2025. paper

  • ReTool: Reinforcement Learning for Strategic Tool Use in LLMs. arXiv, 2025. paper

  • Synthetic Data Generation & Multi-Step RL for Reasoning & Tool Use. arXiv, 2025. paper

  • Improving Vision-Language-Action Model with Online Reinforcement Learning. arXiv, 2025. paper

  • DeepRetrieval: Hacking Real Search Engines and Retrievers with Large Language Models via Reinforcement Learning. arXiv, 2025. paper

  • Search-r1: Training llms to reason and leverage search engines with reinforcement learning. arXiv, 2025. paper

  • Torl: Scaling tool-integrated rl. arXiv, 2025. paper

  • WebThinker: Empowering Large Reasoning Models with Deep Research Capability. arXiv, 2025. paper

  • Rec-R1: Bridging Generative Large Language Models and User-Centric Recommendation Systems via Reinforcement Learning. arXiv, 2025. paper

  • Fin-R1: A Large Language Model for Financial Reasoning through Reinforcement Learning. arXiv, 2025. paper

  • SQL-R1: Training Natural Language to SQL Reasoning Model By Reinforcement Learning. arXiv, 2025. paper

  • Medvlm-r1: Incentivizing medical reasoning capability of vision-language models (vlms) via reinforcement learning. arXiv, 2025. paper

  • ToolRL: Reward is All Tool Learning Needs. arXiv, 2025. paper

  • R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning. arXiv, 2025. paper

  • OTC: Optimal Tool Calls via Reinforcement Learning. arXiv, 2025. paper

  • Swe-rl: Advancing llm reasoning via reinforcement learning on open software evolution. arXiv, 2025. paper

  • Embodied-reasoner: Synergizing visual search, reasoning, and action for embodied interactive tasks. arXiv, 2025. paper

  • DianJin-R1: Evaluating and Enhancing Financial Reasoning in Large Language Models. arXiv, 2025. paper
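
Many of the reasoning and application papers above rely on rule-based, verifiable rewards rather than learned reward models. The sketch below illustrates one such reward for math answers; the \boxed{...} answer format and the string normalization are simplifying assumptions for illustration, not any paper's exact checker.

```python
import re
from typing import Optional

def extract_boxed(text: str) -> Optional[str]:
    """Return the content of the last \\boxed{...} expression in a model output."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

def math_reward(model_output: str, gold_answer: str) -> float:
    """1.0 if the extracted final answer matches the reference answer, else 0.0."""
    pred = extract_boxed(model_output)
    if pred is None:
        return 0.0
    def norm(s: str) -> str:
        return s.replace(" ", "").rstrip(".")
    return 1.0 if norm(pred) == norm(gold_answer) else 0.0

print(math_reward(r"... so the result is \boxed{42}.", "42"))  # 1.0
print(math_reward(r"... hence \boxed{41}.", "42"))             # 0.0
```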

💡Challenges and Future Directions

  • Avoiding tampering incentives in deep rl via decoupled approval. arXiv, 2020. paper

  • Reward tampering problems and solutions in reinforcement learning: A causal influence diagram perspective. arXiv, 2021. paper

  • Preprocessing reward functions for interpretability. arXiv, 2022. paper

  • The effects of reward misspecification: Mapping and mitigating misaligned models. arXiv, 2022. paper

  • A definition of continual reinforcement learning. arXiv, 2023. paper

  • Interpretable preferences via multi-objective reward modeling and mixture-of-experts. arXiv, 2024. paper

  • Sycophancy to subterfuge: Investigating reward-tampering in large language models. arXiv, 2024. paper

  • Feedback loops with language models drive in-context reward hacking. arXiv, 2024. paper

  • Spontaneous Reward Hacking in Iterative Self-Refinement. arXiv, 2024. paper

  • Reward Hacking in Reinforcement Learning. lilianweng.github.io, 2024. paper

  • CPPO: Continual learning for reinforcement learning with human feedback. arXiv, 2024. paper

  • Process Reward Models That Think. arXiv, 2025. paper

  • GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning. arXiv, 2025. paper

  • Inference-Time Scaling for Generalist Reward Modeling. arXiv, 2025. paper

  • Scaling LLM test-time compute optimally can be more effective than scaling parameters for reasoning. arXiv, 2025. paper

  • RRM: Robust Reward Model Training Mitigates Reward Hacking. arXiv, 2025. paper

  • Agentic reward modeling: Integrating human preferences with verifiable correctness signals for reliable reward systems. arXiv, 2025. paper

  • What makes a reward model a good teacher? an optimization perspective. arXiv, 2025. paper

  • Seal: Systematic error analysis for value alignment. arXiv, 2025. paper

  • Exploring data scaling trends and effects in reinforcement learning from human feedback. arXiv, 2025. paper

  • Welcome to the Era of Experience. Google AI, 2025. paper

  • Rethinking the Foundations for Continual Reinforcement Learning. arXiv, 2025. paper

📬Contact

  • We welcome your contributions to this project. Please feel free to submit pull requests.
  • If you encounter any issues, please either directly contact Xiaobao Wu (xiaobao002@e.ntu.edu.sg) or open an issue in the GitHub repo.

📖Citation

If you are interested in our survey paper, please cite it as:

@article{wu2025sailing,
    title    = {Sailing AI by the Stars: A Survey of Learning from Rewards in Post-Training and Test-Time Scaling of Large Language Models},
    author   = {Wu, Xiaobao},
    year     = 2025,
    journal  = {arXiv preprint arXiv:2505.02686},
    url      = {https://arxiv.org/pdf/2505.02686}
}
