Agents meet RL is an awesome list that summarizes open-source repositories for training LLM Agents using reinforcement learning:
- 🤖 A project counts as an agent project if it has at least one of the following: multi-turn interaction or tool use.
- ⚠️ This list is based on code analysis of open-source repositories using GitHub Copilot Agent, so it may contain inaccuracies. Although the entries were manually reviewed, there may still be errors or omissions. If you find any, please let us know through issues or PRs; we warmly welcome them!
- 🤗 We focus in particular on the reinforcement learning frameworks, RL algorithms, rewards, and environments that each project depends on, as a reference for how these open-source projects make their technical choices. Feel free to submit your own project at any time; contributions are welcome!

Enumerations used in the tables:
- Reward Type:
  - External Verifier: e.g., a compiler or math solver
  - Simple Rule: e.g., a LaTeX parser with exact-match scoring
  - Model Based: e.g., a trained verifier LLM or reward LLM
  - Custom
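
To make the reward-type labels concrete, here is a minimal sketch of how they might be represented, together with one "Simple Rule" reward that scores by exact match. The names `RewardType`, `normalize_answer`, and `exact_match_reward` are hypothetical and do not come from any listed repository. The normalization (stripping whitespace and a surrounding `\boxed{...}`) is an assumption standing in for a real LaTeX parser.

```python
from enum import Enum


class RewardType(Enum):
    """Reward-type labels from the enumeration above (values are illustrative)."""
    EXTERNAL_VERIFIER = "External"  # e.g., a compiler or math solver
    SIMPLE_RULE = "Rule"            # e.g., exact-match scoring
    MODEL_BASED = "Model"           # e.g., a trained verifier or reward LLM
    CUSTOM = "Custom"


def normalize_answer(text: str) -> str:
    """Hypothetical normalizer: trim, unwrap a surrounding \\boxed{...}, drop spaces."""
    text = text.strip()
    prefix = "\\boxed{"
    if text.startswith(prefix) and text.endswith("}"):
        text = text[len(prefix):-1]
    return text.replace(" ", "")


def exact_match_reward(prediction: str, reference: str) -> float:
    """Simple-rule reward: 1.0 if the normalized strings match exactly, else 0.0."""
    return 1.0 if normalize_answer(prediction) == normalize_answer(reference) else 0.0


if __name__ == "__main__":
    print(RewardType.SIMPLE_RULE.value, exact_match_reward("\\boxed{ 42 }", "42"))
```

An external-verifier or model-based reward would replace `exact_match_reward` with a call to a compiler, solver, or reward model, but would keep the same shape: a prediction in, a scalar score out.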

Github Repo | Date | Org | Paper Link | RL Framework | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool Usage |
---|---|---|---|---|---|---|---|---|---|---|---|
siiRL | 2025.7 | Shanghai Innovation Institute | Paper | Custom | PPO/GRPO/CPGD/MARFT | Multi | Both | Multi | LLM/VLM/LLM-MAS PostTraining | Model/Rule | Planned | |
agent-lightning | 2025.6 | Microsoft Research | paper | veRL | PPO/Custom/Automatic Prompt Optimization | Multi | Outcome | Multi | Calculator/SQL | Model/External/Rule | Yes | |
AReaL | 2025.6 | AntGroup/Tsinghua | paper | Custom | PPO | Both | Outcome | Both | Math/Code | External | Yes | |
ROLL | 2025.6 | Alibaba | paper | Custom | PPO/GRPO/Reinforce++/TOPR/RAFT++ | Multi | Both | Multi | Math/QA/Code/Alignment | All | Yes | |
MARTI | 2025.5 | Tsinghua | -- | Custom | PPO/GRPO/REINFORCE++/TTRL | Multi | Both | Multi | Math | All | Yes | |
RL2 | 2025.4 | Accio | -- | Custom | Dr. GRPO/PPO/DPO | Single | Both | Both | QA/Dialogue | Rule/Model/External | Yes | |
verifiers | 2025.3 | Individual | -- | HuggingFace | GRPO | Multi | Outcome | Both | Reasoning/Math/Code | All | Code | |
oat | 2024.11 | NUS/Sea AI | paper | Custom | PPO/GRPO | Single | Outcome | Multi | Math/Alignment | External | No | |
veRL | 2024.10 | ByteDance | paper | veRL | PPO/GRPO | Single | Outcome | Both | Math/QA/Reasoning/Search | All | Yes | |
OpenRLHF | 2023.7 | OpenRLHF | paper | OpenRLHF | PPO/REINFORCE++/GRPO/DPO/IPO/KTO/RLOO | Multi | Both | Both | Dialogue/Chat/Completion | Rule/Model/External | Yes | |
trl | 2019.11 | HuggingFace | -- | trl | PPO/GRPO/DPO | Single | Both | Single | QA | Custom | No |

Github Repo | Date | Org | Paper Link | RL Framework | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool Usage |
---|---|---|---|---|---|---|---|---|---|---|---|
ASearcher | 2025.8 | Ant Research RL Lab & Tsinghua University & UW | paper | RealHF/AReaL | PPO/GRPO + Decoupled PPO | Single | Outcome | Multi | Math/Code/SearchQA | External/Rule | Yes | |
Kimi-Researcher | 2025.6 | Moonshot AI | blog | Custom | REINFORCE | Single | Outcome | Multi | Research | Outcome | Search, Browse, Coding | |
TTI | 2025.6 | CMU | paper | Custom | REINFORCE/BC | Single | Outcome | Multi | Web | External | Web Browsing | |
R-Search | 2025.6 | Individual | -- | veRL | PPO/GRPO | Single | Both | Multi | QA/Search | All | Yes | |
R1-Searcher-plus | 2025.5 | RUC | paper | Custom | Custom | Single | Outcome | Multi | Search | Model | Search | |
StepSearch | 2025.5 | SenseTime | paper | veRL | PPO | Single | Process | Multi | QA | Model | Search | |
AutoRefine | 2025.5 | USTC | paper | veRL | PPO/GRPO | Multi | Both | Multi | RAG QA | Rule | Search | |
ZeroSearch | 2025.5 | Alibaba | paper | veRL | PPO/GRPO/REINFORCE | Single | Outcome | Multi | QA/Search | Rule | Yes | |
WebThinker | 2025.4 | RUC | paper | Custom | DPO | Single | Outcome | Multi | Reasoning/QA/Research | Model/External | Web Browsing | |
DeepResearcher | 2025.4 | SJTU | paper | veRL | PPO/GRPO | Multi | Outcome | Multi | Research | All | Yes | |
Search-R1 | 2025.3 | UIUC/Google | paper1, paper2 | veRL | PPO/GRPO | Single | Outcome | Multi | Search | All | Search | |
R1-Searcher | 2025.3 | RUC | paper | OpenRLHF | PPO/DPO | Single | Both | Multi | Search | All | Yes | |
C-3PO | 2025.2 | Alibaba | paper | OpenRLHF | PPO | Multi | Outcome | Multi | Search | Model | Yes | |
WebAgent | 2025.1 | Alibaba | paper1, paper2 | LLaMA-Factory | DAPO | Multi | Process | Multi | Web | Model | Yes |

Github Repo | Date | Org | Paper Link | RL Framework | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool Usage |
---|---|---|---|---|---|---|---|---|---|---|---|
Grounding-R1 | 2025.6 | Salesforce | blog | trl | GRPO | Single | Outcome | Multi | GUI Grounding | Model | Yes | |
AgentCPM-GUI | 2025.6 | OpenBMB/Tsinghua/RUC | paper | Huggingface | GRPO | Single | Outcome | Multi | Mobile GUI | Model | Yes | |
ARPO | 2025.5 | CUHK/HKUST | paper | veRL | GRPO | Single | Outcome | Multi | GUI | External | Computer Use | |
GUI-G1 | 2025.5 | RUC | paper | TRL | GRPO | Single | Outcome | Single | GUI | Rule/External | No | |
GUI-R1 | 2025.4 | CAS/NUS | paper | veRL | GRPO | Single | Outcome | Multi | GUI | Rule | No | |
UI-R1 | 2025.3 | vivo/CUHK | paper | TRL | GRPO | Single | Process | Both | GUI | Rule | Computer/Phone Use |

Github Repo | Date | Org | Paper Link | RL Framework | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool Usage |
---|---|---|---|---|---|---|---|---|---|---|---|
verl-tool | 2025.6 | TIGER-Lab | X | veRL | PPO/GRPO | Single | Both | Both | Math/Code | Rule/External | Yes | |
Multi-Turn-RL-Agent | 2025.5 | University of Minnesota | paper | Custom | GRPO | Single | Both | Multi | Tool-use/Math | Rule/External | Yes | |
Tool-N1 | 2025.5 | NVIDIA | paper | veRL | PPO | Single | Outcome | Multi | Math/Dialogue | All | Yes | |
Tool-Star | 2025.5 | RUC | paper | LLaMA-Factory | PPO/DPO/ORPO/SimPO/KTO | Single | Outcome | Multi | Multi-modal/Tool Use/Dialogue | Model/External | Yes | |
RL-Factory | 2025.5 | Simple-Efficient | model | veRL | GRPO | Multi | Both | Multi | Tool-use/NL2SQL | All | MCP | |
ReTool | 2025.4 | ByteDance | paper | veRL | PPO | Single | Outcome | Multi | Math | External | Code | |
Agent-R1 | 2025.3 | USTC | -- | veRL | PPO/GRPO | Single | Both | Multi | Tool-use/QA | Model | Yes | |
ReCall | 2025.3 | BaiChuan | paper | veRL | PPO/GRPO/RLOO/REINFORCE++/ReMax | Single | Outcome | Multi | Tool-use/Math/QA | All | Yes |

Github Repo | Date | Org | Paper Link | RL Framework | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool Usage |
---|---|---|---|---|---|---|---|---|---|---|---|
ARIA | 2025.6 | Fudan University | paper | Custom | REINFORCE | Both | Process | Multi | Negotiation/Bargaining | Other | No | |
AMPO | 2025.5 | Tongyi Lab, Alibaba | Paper | veRL | BC/AMPO(GRPO improvement) | Multi | Outcome | Multi | Social Interaction | Model-based | No | |
SPA-RL-Agent | 2025.5 | PolyU | paper | TRL | PPO | Single | Process | Multi | Navigation/Web/TextGame | Model | No | |
Trinity-RFT | 2025.5 | Alibaba | paper | veRL | PPO/GRPO | Single | Outcome | Both | Math/TextGame/Web | All | Yes | |
VAGEN | 2025.3 | RAGEN-AI | paper | veRL | PPO/GRPO | Single | Both | Multi | TextGame/Navigation | All | Yes | |
ART | 2025.3 | OpenPipe | paper | TRL | GRPO | Multi | Both | Multi | TextGame | All | Yes | |
OpenManus-RL | 2025.3 | UIUC/MetaGPT | -- | Custom | PPO/DPO/GRPO | Multi | Outcome | Multi | TextGame | All | Yes | |
RAGEN | 2025.1 | RAGEN-AI | paper | veRL | PPO/GRPO | Single | Both | Multi | TextGame | All | Yes |

Github Repo | Date | Org | Paper Link | RL Framework | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool Usage |
---|---|---|---|---|---|---|---|---|---|---|---|
MedAgentGym | 2025.6 | Emory/Georgia Tech | paper | Huggingface | SFT/DPO/PPO/GRPO | Single | Outcome | Multi | Medical/Code | External | Yes | |
CURE | 2025.6 | University of Chicago/Princeton/ByteDance | paper | Huggingface | PPO | Single | Outcome | Single | Code | External | No | |
verl-agent | 2025.5 | NTU/Skywork | paper | veRL | PPO/GRPO/GiGPO/DAPO/RLOO/REINFORCE++ | Multi | Both | Multi | Phone Use/Math/Code/Web/TextGame | All | Yes | |
MASLab | 2025.5 | MASWorks | paper | Custom | NO RL | Multi | Outcome | Multi | Code/Math/Reasoning | External | Yes | |
Time-R1 | 2025.5 | UIUC | paper | veRL | PPO/GRPO/DPO | Multi | Outcome | Multi | Temporal | All | Code | |
ML-Agent | 2025.5 | MASWorks | paper | Custom | Custom | Single | Process | Multi | Code | All | Yes | |
SkyRL | 2025.4 | NovaSky | -- | veRL | PPO/GRPO | Single | Outcome | Multi | Math/Code | All | Code | |
digitalhuman | 2025.4 | Tencent | paper | veRL | PPO/GRPO/ReMax/RLOO | Multi | Outcome | Multi | Empathy/Math/Code/MultimodalQA | Rule/Model/External | Yes | |
sweet_rl | 2025.3 | Meta/UCB | paper | OpenRLHF | DPO | Multi | Process | Multi | Design/Code | Model | Web Browsing | |
rllm | 2025.1 | Berkeley Sky Computing Lab / BAIR / Together AI | Notion Blog | veRL | PPO/GRPO | Single | Outcome | Multi | Code Edit | External | Yes | |
open-r1 | 2025.1 | HuggingFace | -- | TRL | GRPO | Single | Outcome | Single | Math/Code | All | Yes |

Github Repo | Date | Org | Paper Link | RL Framework | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool Usage |
---|---|---|---|---|---|---|---|---|---|---|---|
ARPO | 2025.7 | RUC, Kuaishou | paper | veRL | GRPO | Single | Outcome | Multi | Math/Coding | Model/Rule | Yes | |
terminal-bench-rl | 2025.7 | Individual (Danau5tin) | N/A | rLLM | GRPO | Single | Outcome | Multi | Coding/Terminal | Model+External Verifier | Yes | |
MOTIF | 2025.6 | University of Maryland | paper | trl | GRPO | Single | Outcome | Multi | QA | Rule | No | |
MemAgent | 2025.6 | Bytedance, Tsinghua-SIA | paper | veRL | PPO, GRPO, DPO | Multi | Outcome | Multi | Long-context QA | Rule/Model/External | Yes | |
cmriat/l0 | 2025.6 | China Merchants Research Institute of Advanced Technology | paper | veRL | PPO | Multi | Process | Multi | QA | All | Yes | |
agent-distillation | 2025.5 | KAIST | paper | Custom | PPO | Single | Process | Multi | QA/Math | External | Yes | |
VDeepEyes | 2025.5 | Xiaohongshu/XJTU | paper | veRL | PPO/GRPO | Multi | Process | Multi | VQA | All | Yes | |
EasyR1 | 2025.4 | Individual | repo1/paper2 | veRL | GRPO | Single | Process | Multi | Vision-Language | Model | Yes | |
AutoCoA | 2025.3 | BJTU | paper | veRL | GRPO | Multi | Outcome | Multi | Reasoning/Math/QA | All | Yes | |
ToRL | 2025.3 | SJTU | paper | veRL | GRPO | Single | Outcome | Single | Math | Rule/External | Yes | |
ReMA | 2025.3 | SJTU, UCL | paper | veRL | PPO | Multi | Outcome | Multi | Math | Rule | No | |
Agentic-Reasoning | 2025.2 | Oxford | paper | Custom | Custom | Single | Process | Multi | QA/Math | External | Web Browsing | |
SimpleTIR | 2025.2 | NTU, Bytedance | Notion Blog | veRL | PPO/GRPO (with extensions) | Single | Outcome | Multi | Math, Coding | All | Yes | |
openrlhf_async_pipline | 2024.5 | OpenRLHF | paper | OpenRLHF | PPO/REINFORCE++/DPO/RLOO | Single | Outcome | Multi | Dialogue/Reasoning/QA | All | No |

Github Repo | Date | Org | Task |
---|---|---|---|
Mind2Web-2 | 2025.6 | Ohio State University | Web | |
gem | 2025.5 | Sea AI Lab | Math/Code/Game/QA | |
MLE-Dojo | 2025.5 | GIT, Stanford | MLE | |
atropos | 2025.4 | Nous Research | Game/Code/Tool | |
InternBootcamp | 2025.4 | InternBootcamp | Coding/QA/Game | |
reasoning-gym | 2025.1 | open-thought | Algebra/Arithmetic/Computation/Cognition/Geometry/Graph/Logic/Game | |
llmgym | 2025.1 | tensorzero | TextGame/Tool | |
debug-gym | 2024.11 | Microsoft Research | Debugging/Game/Code | |
gym-llm | 2024.8 | Rodrigo Sánchez Molina | Control/Game | |
AgentGym | 2024.6 | Fudan | Web/Game | |
tau-bench | 2024.6 | Sierra | Tool | |
appworld | 2024.6 | Stony Brook University | Phone Use | |
android_world | 2024.5 | Google Research | Phone Use | |
TheAgentCompany | 2024.3 | CMU, Duke | Coding | |
LlamaGym | 2024.3 | Rohan Pandey | Game | |
visualwebarena | 2024.1 | CMU | Web | |
LMRL-Gym | 2023.12 | UC Berkeley | Game | |
OSWorld | 2023.10 | HKU, CMU, Salesforce, Waterloo | Computer Use | |
webarena | 2023.7 | CMU | Web | |
AgentBench | 2023.7 | Tsinghua University | Game/Web/QA/Tool | |
WebShop | 2022.7 | Princeton-NLP | Web | |
ScienceWorld | 2022.3 | AllenAI | TextGame/ScienceQA | |
alfworld | 2020.10 | Microsoft, CMU, UW | Embodied | |
factorio-learning-environment | 2021.6 | JackHopkins | Game | |
jericho | 2018.10 | Microsoft, GIT | TextGame | |
TextWorld | 2018.6 | Microsoft Research | TextGame |