With significant advances in Vision-Language-Action (VLA) models based on large-scale imitation learning, integrating VLA with Reinforcement Learning (RL) has emerged as a promising paradigm: it leverages trial-and-error interaction with environments or pre-collected sub-optimal data.
This repository summarizes recent advances in the VLA + RL paradigm and classifies relevant works into offline RL training (without an environment), online RL training (with an environment), test-time RL (during deployment), and RL alignment.
Contributions are welcome! Please feel free to submit an issue or reach out via email to add papers.
If you find this repository useful, please give this list a star ⭐ and feel free to share it with others!
Offline-RL-trained VLA models leverage both human demonstrations and autonomously collected data.
Method | Title | Venue | Date | Code/Project | Key feature/finding |
---|---|---|---|---|---|
Q-Transformer | Q-Transformer: Scalable Offline Reinforcement Learning via Autoregressive Q-Functions | Arxiv | 18/9/2023 | Github | Offline Q-learning with Transformer models: (1) autoregressive discrete Q-learning; (2) conservative Q-learning; (3) Monte Carlo and n-step returns |
Perceiver-Actor-Critic | Offline Actor-Critic Reinforcement Learning Scales to Large Models | ICML2024 | 8/2/2024 | Project | An offline actor-critic method that scales to models of up to 1B parameters and learns a wide variety of 132 control and robotics tasks |
GeRM | GeRM: A Generalist Robotic Model with Mixture-of-experts for Quadruped Robot | IROS2024 | 20/3/2024 | Github | Mixture-of-Experts structure; quadruped robot learning |
ReinboT | ReinboT: Amplifying Robot Visual-Language Manipulation with Reinforcement Learning | ICML2025 | 12/5/2025 | - | Max-return sequence modeling as in Reinformer; reward densification with heuristic methods |
MoRE | MoRE: Unlocking Scalability in Reinforcement Learning for Quadruped Vision-Language-Action Models | ICRA2025 | 11/3/2025 | - | Integrates multiple low-rank adaptation modules as distinct experts within a dense multi-modal large language model (MLLM), forming a sparse-activated mixture-of-experts model |
CO-RFT | CO-RFT: Efficient Fine-Tuning of Vision-Language-Action Models through Chunked Offline Reinforcement Learning | Arxiv | 4/8/2025 | - | Chunk-level offline RL fine-tuning; proposes chunked RL via n-step TD learning |
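Two entries above (Q-Transformer's n-step returns, CO-RFT's chunked n-step TD learning) build on the n-step bootstrapped TD target; a minimal sketch in plain Python (the function name and arguments are illustrative, not from either paper):

```python
def n_step_td_target(rewards, bootstrap_value, gamma=0.99):
    """n-step TD target: sum_k gamma^k * r_k + gamma^n * V(s_{t+n}).

    rewards: the n rewards observed after the current state.
    bootstrap_value: the value estimate V(s_{t+n}) used to bootstrap.
    """
    target = 0.0
    for k, r in enumerate(rewards):
        target += (gamma ** k) * r
    return target + (gamma ** len(rewards)) * bootstrap_value
```

With chunked actions (as in CO-RFT), n is the chunk length, so the target bootstraps only at chunk boundaries rather than after every primitive action.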
With trial-and-error interactions in online environments, VLA models can be further optimized to improve their performance.
Method | Title | Venue | Date | Code/Project | Key feature/finding |
---|---|---|---|---|---|
FLaRe | FLaRe: Achieving Masterful and Adaptive Robot Policies with Large-Scale Reinforcement Learning Fine-Tuning | ICRA 2025 Best Paper Finalist | 30/9/2024 | Code | For large-scale fine-tuning in simulation, it performs extensive domain randomization, extracts visual features through DINOv2, and uses the KV-cache technique during inference, along with a set of algorithmic choices that ensure stable RL fine-tuning |
PA-RL | Policy Agnostic RL: Offline RL and Online RL Fine-Tuning of Any Class and Backbone | Arxiv | 9/12/2024 | Project | A single method that fine-tunes multiple policy classes with varying architectures and sizes; enables sample-efficient improvement of diffusion and transformer-based autoregressive policies; sets a new state of the art for offline-to-online RL and, for the first time, improves OpenVLA |
iRe-VLA | Improving Vision-Language-Action Model with Online Reinforcement Learning | RAL2025 | 28/1/2025 | - | Adopts two-stage iterative SFT & RL optimization to stabilize the training process and manage the model-training burden |
RIPT-VLA | Interactive Post-Training for Vision-Language-Action Models | Arxiv | 22/5/2025 | Github | A critic-free optimization framework called Leave-One-Out Proximal Policy Optimization (LOOP); dynamic rollout sampling |
VLA-RL | VLA-RL: Towards Masterful and General Robotic Manipulation with Scalable Reinforcement Learning | Arxiv | 24/5/2025 | Github | Robotic process reward model and the VLA-RL system with (1) curriculum selection strategy, (2) critic warmup, (3) GPU-balanced vectorized environments, (4) PPO infrastructure |
RLVLA | What Can RL Bring to VLA Generalization? An Empirical Study | Arxiv | 26/5/2025 | Github | PPO consistently outperforms GRPO and DPO; shared actor-critic backbone; VLA warm-up |
RFTF | RFTF: Reinforcement Fine-tuning for Embodied Agents with Temporal Feedback | Arxiv | 26/5/2025 | - | For the sparse-reward problem, RFTF leverages a value model trained using temporal information to generate dense rewards |
SimpleVLA-RL | - | - | 5/2025 | Github | - |
TGRPO | TGRPO: Fine-tuning Vision-Language-Action Model via Trajectory-wise Group Relative Policy Optimization | Arxiv | 10/6/2025 | Github | Extends GRPO from LLMs to trajectory-wise group relative policy optimization for VLAs |
OctoNav | OctoNav: Towards Generalist Embodied Navigation | Arxiv | 11/6/2025 | Project | For navigation tasks, proposes a VLA+RL hybrid training paradigm with SFT, Nav-GRPO, and online RL stages; the VLA model also gains thinking-before-action ability |
RLRC | RLRC: Reinforcement Learning-based Recovery for Compressed Vision-Language-Action Models | Arxiv | 21/6/2025 | Project | An RL-based VLA compression paradigm: a three-stage pipeline of structured pruning, performance recovery via SFT and RL, and 4-bit quantization significantly reduces model size and boosts inference speed while preserving, and in some cases surpassing, the original model's ability to execute robotic tasks |
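RIPT-VLA's LOOP and the GRPO-style methods above share one idea: replace a learned critic with a group baseline computed from K rollouts of the same task. A minimal sketch of the leave-one-out advantage estimate (the function name is illustrative; the papers add clipping and other PPO machinery on top):

```python
def leave_one_out_advantages(returns):
    """Advantage of each rollout = its return minus the mean return of the
    other K-1 rollouts of the same task (a critic-free baseline)."""
    k = len(returns)
    assert k >= 2, "need at least two rollouts per task"
    total = sum(returns)
    return [r - (total - r) / (k - 1) for r in returns]
```

Because the baseline excludes the rollout being scored, the estimate stays unbiased while still requiring no value network.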
Method | Title | Venue | Date | Code/Project | Key feature/finding |
---|---|---|---|---|---|
RLDG | RLDG: Robotic Generalist Policy Distillation via Reinforcement Learning | RSS2025 | 12/2024 | Project | Pretrains task-specific RL policies with HIL-SERL; distills the RL policies into a VLA for knowledge transfer |
PA-RL | Policy Agnostic RL: Offline RL and Online RL Fine-Tuning of Any Class and Backbone | Arxiv | 9/12/2024 | Project | A single method that fine-tunes multiple policy classes with varying architectures and sizes; enables sample-efficient improvement of diffusion and transformer-based autoregressive policies; sets a new state of the art for offline-to-online RL and, for the first time, improves OpenVLA |
iRe-VLA | Improving Vision-Language-Action Model with Online Reinforcement Learning | RAL2025 | 28/1/2025 | - | Adopts two-stage iterative SFT & RL optimization to stabilize the training process and manage the model-training burden |
ConRFT | ConRFT: A Reinforced Fine-tuning Method for VLA Models via Consistency Policy | RSS2025 | 14/4/2025 | Github | Offline fine-tuning (Cal-QL + PA-RL) and online fine-tuning (CPQL + HIL-SERL + PA-RL) |
Test-time RL methods leverage a value function pre-trained via offline RL during deployment.
Method | Title | Venue | Date | Code/Project | Key feature/finding |
---|---|---|---|---|---|
Bellman-Guided Retrials | To Err is Robotic: Rapid Value-Based Trial-and-Error during Deployment | Arxiv | 22/6/2024 | Github | Pre-trains a value function to estimate task completion; on failure, recovers the robot and samples a new strategy |
V-GPS | Steering Your Generalists: Improving Robotic Foundation Models via Value Guidance | CoRL2024 | 17/10/2024 | Project | Re-ranks multiple action proposals from a generalist policy using a value function at test time |
Hume | Hume: Introducing System-2 Thinking in Visual-Language-Action Model | Arxiv | 2/6/2025 | Github | Pre-trains a value function and performs best-of-N selection over candidate action chunks using state-action value estimates |
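V-GPS and Hume both reduce to the same test-time recipe: score N candidate actions (or action chunks) sampled from the policy with a learned value function and execute the best one. A minimal sketch, where `value_fn` stands in for the learned Q-function (all names are illustrative):

```python
def best_of_n_select(state, candidates, value_fn):
    """Return the candidate action chunk with the highest estimated value
    Q(state, chunk), as in value-guided re-ranking / best-of-N selection."""
    return max(candidates, key=lambda chunk: value_fn(state, chunk))
```

For example, with a toy Q-function `lambda s, a: -abs(a - s)` and candidates `[0, 3, 6]` at state `4`, the selector returns `3`. The policy itself is never updated; only its outputs are filtered at deployment time.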
Method | Title | Venue | Date | Code/Project | Key feature/finding |
---|---|---|---|---|---|
GRAPE | GRAPE: Generalizing Robot Policy via Preference Alignment | ICLR2025 workshop | 4/2/2025 | Github | Trajectory-wise Preference Optimization aligns VLA policies at the trajectory level |
SafeVLA | SafeVLA: Towards Safety Alignment of Vision-Language-Action Model via Constrained Learning | Arxiv | 31/5/2025 | Project | Constrains VLA policies via safe reinforcement learning |
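GRAPE's Trajectory-wise Preference Optimization applies a DPO-style objective at the trajectory level: the policy's log-probability ratio against a reference policy for a preferred trajectory is pushed above that of a dispreferred one. A minimal sketch of such a loss on trajectory log-probabilities (the paper's exact formulation may differ; names are illustrative):

```python
import math

def trajectory_preference_loss(logp_win, logp_lose,
                               ref_logp_win, ref_logp_lose, beta=0.1):
    """-log sigmoid(beta * margin), where the margin compares the
    policy-vs-reference log-ratios of the preferred (win) and
    dispreferred (lose) trajectories."""
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the two log-ratios are equal the loss is log 2 (a coin-flip preference); increasing the preferred trajectory's ratio drives the loss toward zero.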
Method | Title | Venue | Date | Code/Project | Key feature/finding |
---|---|---|---|---|---|
RPD | Refined Policy Distillation: From VLA Generalists to RL Experts | Arxiv | 6/3/2025 | - | Leverages the VLA model as a policy prior to improve the sample efficiency of RL, in the spirit of Jump-Start RL |