v0.5.0: agentic RL rollout, prototypes for disaggregated async training & generative RM, better rollout load balancing & improved SGLang + Megatron / VLM support
Latest Highlights
Agentic RL rollout interface [beta]
verl v0.5 introduces the AgentLoop abstraction, which makes it easy to extend rollout with custom tool/agent interactions. Server-based asynchronous rollout is adopted to utilize GPUs efficiently. verl provides several example agent loop implementations, including:
- Multi-turn conversations and tool calls
- LangGraph-based Agent
Please check the documentation for the system architecture design.
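To give a flavor of the extension point, below is a minimal sketch of a custom loop that interleaves generation with tool execution. The class shape, the `generate` client call, and the message schema are illustrative assumptions, not verl's exact API; see the documentation for the real interface.

```python
# Illustrative sketch only: the concrete agent loop base class and server
# client live in verl's agent loop module; the names here are assumptions.
from typing import Any


class ToolAgentLoop:
    """Custom rollout loop: alternate model generation with tool calls."""

    def __init__(self, server_client: Any, tools: dict, max_turns: int = 4):
        self.server_client = server_client  # handle to the async rollout server
        self.tools = tools                  # tool name -> callable
        self.max_turns = max_turns

    async def run(self, prompt: list) -> list:
        messages = list(prompt)
        for _ in range(self.max_turns):
            # Ask the inference server (vLLM/SGLang) for the next assistant turn.
            reply = await self.server_client.generate(messages)
            messages.append({"role": "assistant", "content": reply.text})
            if not reply.tool_calls:  # no tool requested: the episode ends
                break
            for call in reply.tool_calls:  # execute each requested tool
                result = self.tools[call.name](**call.arguments)
                messages.append({"role": "tool", "content": str(result)})
        return messages
```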
Disaggregated placement & async training [prototype]
verl v0.5 includes a community-contributed one-step-off async training recipe, with the trainer and rollout deployed on disaggregated resources and off-policy model updates with staleness = 1. In a small-scale experiment, the reference recipe provides a 20-40% throughput gain over the on-policy baseline, depending on the configuration. Please check out the code and documentation for example configurations.
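The core idea is plain pipelining: while the trainer updates on the rollouts of step k, the rollout workers (on separate GPUs) are already generating step k+1's batch, so samples are at most one policy version stale. A toy sketch of that overlap (not the recipe's actual code):

```python
# Toy illustration of one-step-off training: generation of the next batch
# overlaps with the current model update (staleness = 1). The recipe's real
# disaggregated placement uses separate GPU pools for trainer and rollout.
from concurrent.futures import ThreadPoolExecutor


def generate(policy_version: int) -> str:
    return f"batch sampled from policy v{policy_version}"


def update(batch: str) -> None:
    print(f"updating policy using: {batch}")


with ThreadPoolExecutor(max_workers=1) as rollout_pool:
    pending = rollout_pool.submit(generate, 0)  # warm-up batch
    for step in range(1, 4):
        batch = pending.result()  # rollouts sampled from the previous policy
        # Kick off the next generation now; it runs while update() executes,
        # so its samples are one policy version behind.
        pending = rollout_pool.submit(generate, step)
        update(batch)
```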
Remote generative reward models [prototype]
A prototype recipe demonstrates the recommended way to use generative reward models in verl. See the documentation and code.
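In this pattern, the generative RM is served as a separate model endpoint and the trainer queries it to score rollouts. A hedged sketch of that pattern against an OpenAI-compatible server follows; the endpoint URL, model name, judge prompt, and score parsing are assumptions for illustration, not the recipe's exact code.

```python
# Sketch: score a (prompt, response) pair with a generative reward model
# served behind an OpenAI-compatible HTTP endpoint. The URL, model name,
# and prompt format are illustrative assumptions; see the recipe for the
# actual integration.
import re

import requests

RM_URL = "http://localhost:8000/v1/chat/completions"  # hypothetical endpoint

JUDGE_TEMPLATE = (
    "Rate the response to the question on a 0-10 scale. "
    "Reply with 'Score: <n>'.\n\nQuestion: {prompt}\n\nResponse: {response}"
)


def remote_rm_score(prompt: str, response: str) -> float:
    payload = {
        "model": "generative-rm",  # hypothetical served model name
        "messages": [
            {"role": "user", "content": JUDGE_TEMPLATE.format(prompt=prompt, response=response)}
        ],
        "temperature": 0.0,
    }
    reply = requests.post(RM_URL, json=payload, timeout=60).json()
    text = reply["choices"][0]["message"]["content"]
    match = re.search(r"Score:\s*(\d+(?:\.\d+)?)", text)
    return float(match.group(1)) / 10.0 if match else 0.0
```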
New features
- LoRA RL support for VLMs: #2182
- Better checkpoint manager support for SFT trainer #2292
- Support for rollout trajectory tracing and a RolloutViewer TUI, with improved debuggability and visualization
- Megatron integration with mbridge, which better supports loading HF models into Megatron #2064
Important fixes & improvements
- Fixed an FSDP2 state_dict memory-usage issue caused by torch 2.6; using either verl v0.5 or torch 2.7 avoids the OOMs #2606
- Significantly reduced the performance overhead of the vLLM async server (vs. the vLLM engine) #2246
- Fixed SGLang + Megatron TP16 weight updates #2336
- Improved SGLang + Megatron weight resharding by 10x #2418 and MoE weight resharding by 3x #2692
- Significantly improved rollout load balancing for GRPO-like algorithms by repeating samples before dispatching them #2324
Breaking changes and deprecations
Full list: #2270
Rollout
- When calling generate_sequences with sampling param n>1, the DataProto repeat behavior changes:
  - chunk-dispatch-repeat (old): DataProto is chunked and dispatched to rollout workers, then repeated in the rollout workers.
  - repeat-chunk-dispatch (new): DataProto is repeated by n in the driver, then chunked and dispatched to rollout workers.
  The switch from `chunk-dispatch-repeat` to `repeat-chunk-dispatch` may break almost all recipes and projects that use verl GRPO as a submodule (a toy illustration follows this list). #2324
- `verl.workers.rollout.sglang_rollout.AsyncSglangServer` is now renamed to `AsyncSGLangServer`
- vLLM <= v0.6 support is dropped
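To make the repeat-order change above concrete, here is a toy illustration of what each rollout worker receives under the two behaviors, with plain lists standing in for DataProto:

```python
# Toy illustration of the dispatch change for n=2 and 2 rollout workers.
prompts = ["p0", "p1", "p2", "p3"]
n, num_workers = 2, 2


def chunk(xs: list, k: int) -> list:
    size = len(xs) // k
    return [xs[i * size:(i + 1) * size] for i in range(k)]


# Old (chunk-dispatch-repeat): each worker receives B/W unique prompts and
# the repetition by n happens inside the worker.
old_worker_batches = chunk(prompts, num_workers)
print(old_worker_batches)  # [['p0', 'p1'], ['p2', 'p3']] -> repeated later

# New (repeat-chunk-dispatch): the driver repeats by n first, so each worker
# receives B*n/W rows that are already duplicated.
new_worker_batches = chunk([p for p in prompts for _ in range(n)], num_workers)
print(new_worker_batches)  # [['p0', 'p0', 'p1', 'p1'], ['p2', 'p2', 'p3', 'p3']]

# Code that sized buffers by B rather than B*n, or assumed worker batches
# hold unique prompts, must be updated -- hence the breaking change.
```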
Multi-turn
- We are moving multi-turn support from ChatScheduler to AgentLoop to improve usability. #2124
Megatron
- Megatron recomputation options are moved to `*.megatron.override_transformer_config` (#2651). Default values are:

  ```yaml
  override_transformer_config:
    recompute_granularity: null
    recompute_modules:
      - core_attn
    recompute_method: null
    recompute_num_layers: null
  ```

- Merged the configs `actor_rollout_ref.(actor, ref, rollout).profiler` into `actor_rollout_ref.profiler`
What's Changed
Trainer & FSDP
- [fsdp] fix: Change the data in the update_actor function from to('cpu') to to(get_device_id()) by @Keilo001 in #2477
- [fsdp] fix: vlm dynamic batch & unify dynamic batch api by @hiyouga in #2524
- [fsdp] fix: change geo3k model name from non-vl to vl by @nanjiangwill in #2555
- [trainer, recipe] feat: add support for external generative reward models by @yyDing1 in #2121
- [trainer] fix: fix split placement by @vermouth1992 in #2227
- [trainer, vllm] feat: add lora exclude_modules to support VL model lora training by @Cccei000 in #2182
- [trainer] fix: pre-commit broken by #2354 by @ETOgaosion in #2358
- [trainer, cfg] feat: add BaseConfig for all dataclass configs. Introduce dataclass for algorithm related configs by @eric-haibin-lin in https://github.com/
- [trainer] fix: Use safe masked mean/sum to handle NaN values outside the mask by @Yangruipis in #2377
- [trainer, data] feat: Dynamic Data Generation by @jwong8314 in #2312
- [trainer] fix: use .keys() to check 'response_mask' in TensorDict by @askender in #2491
- [trainer] fix: Allow FSDP2 when doing strategy check by @HollowMan6 in #2497
- [trainer] refactor: no need to call load_reward_manager in compute_reward_async by @eric-haibin-lin in #2557
- [trainer, fsdp, vllm, recipe] feat: one step off async training recipe by @imh966 in #2231
- [trainer] fix: maybe_filter_out_long_prompts on image and video by @firefighter-eric in #2553
- [trainer] refactor: Training Engine Interface and Development Plan by @ZihengJiang in #1977
- [trainer] feat: Add FSDPCheckpointManager for SFT trainer, support resume training, and manage the number of checkpoints to keep by @Pursuer-Hsf in #2292
Rollout & SGLang
- [rollout] feat: add agent loop by @wuxibin89 in #2124
- [rollout] feat: add zeromq vllm distributed executor by @wuxibin89 in #2246
- [BREAKING][rollout] refactor: drop vllm v0.5.4 and v0.6.3 support by @eric-haibin-lin in #2257
- [rollout] feat: Allow customization of async server class by @ultmaster in #2326
- [rollout] fix: fix hf rollout and add single gpu test by @eric-haibin-lin in #2371
- [BREAKING][rollout] feat: repeat DataProto when n>1 in driver instead of rollout workers by @wuxibin89 in #2324
- [misc] feat: trace rollout generation and tool calls using weave by @chenhaiq in #2345
- [cfg] refactor: make the rollout & ref configs more modular by @eric-haibin-lin in #2410
- [perf] feat: add range tag to start/stop profile; clean actor_rollout_ref.profiler by @davidmlw in #2456
- [rollout] feat: support mlflow in rollout trace by @chenhaiq in #2440
- [rollout] feat: add ReactAgentLoop based on LangGraph by @wuxibin89 in #2463
- [rollout] fix: fix bug for remax when the rollout mode is async by @none0663 in #2574
- [tool] chore: introduce RolloutViewer TUI tools by @Yangruipis in #2469
- [rollout,vllm] fix: A major issue in random sampling of vllm engine by @guanning03 in #2646
- [tool] chore: Add log for AsyncRolloutRequest ID, and rollout viewer to support request id display and search by @Hecate0821 in https://github.com/volcengine/
- [rollout] fix: use flashattn3 backend in sglang to avoid error in tool call by @chenhaiq in #2244
- [rollout] fix: Make `free_cache_engine` option workable in latest vLLM/SGLang by @HollowMan6 in #1464
- [rollout] fix: #1646 stop words for sglang rollout by @linxxx3 in #1991
- [sglang, rollout] refactor: use torch.Tensor in async rollout schemas by @nanjiangwill in #2362
- [rollout] fix: sglang async fail with Multi-stage Awake feature by @chenhaiq in #2365
- [sglang] feat: Add multi-interaction registry support and testing by @SwordFaith in #2184
- [sglang] feat: Repeat sampling parameter n into requests of GRPO in SGLang by @zhaochenyang20 in #2258
- [sglang,tool] feat: Add support for tools that generate multimodal data by @nanjiangwill in #2146
- [sglang] fix: only wake up weights on infer_tp 0 by @zhaochenyang20 in #2403
- [sglang] fix: Import Error in the latest sglang by @yyDing1 in #2275
- [sglang] fix: Fix qwen2vl weight keys issue by @hebiao064 in #2434
- [sglang] fix: Only flush cache on TP rank=0. by @SuperCB in #2455
- [sglang] feat: update weights in batch with FSDP by @zhaochenyang20 in #2559
- [sglang] fix: adding missing param for sgl async unit test by @zhaochenyang20 in #2561
- [sglang] fix: update response handling and scoring method in GSM8K interaction by @aaronyeeio in #2428
- [sglang] fix: rename Sglang to SGLang following SGLang's fashion by @zhaochenyang20 in #2672
- [sglang] fix: Bug in megatron+sglang TP16 update_weights. by @SuperCB in #2336
- [sglang, megatron, perf] feat: speed up megatron sglang weight update by 10x by @Yangruipis in #2418
- [megatron] fix: wrong response_mask for megatron + sglang multi-turn by @Yangruipis in #2543
Megatron
- [megatron] feat: add megatron memory log by @ETOgaosion in #2272
- [megatron] feat: use mbridge as megatron adaptor by @ISEEKYAN in #2064
- [megatron] fix: optimizer scheduler misalignment with FSDP by @ETOgaosion in #2303
- [cfg] refactor: split fsdp/megatron specific configs, consolidate shared ones for reward_model and critic by @eric-haibin-lin in https://github.com/volcengine
- [megatron] feat: fused kernel lightweight by @ISEEKYAN in #2210
- [megatron] feat: allow override DistributedDataParallelConfig by @ETOgaosion in #2523
- [data, megatron] feat: add dynamic batching computational workload balance by @conver334 in #2452
- [megatron] feat: support distributed megatron model converter and merger by @Yangruipis in #2281
- [cfg] refactor: add flatten megatron trainer config generation and verification script by @eric-haibin-lin in #2582
- [BREAKING][megatron] refactor: activation checkpointing APIs by @ETOgaosion in #2651
- [megatron] fix: CUDA_DEVICE_MAX_CONNECTIONS not taking effect by @ETOgaosion in #2687
Hardware
- [hardware] feat: support ray actor sharing situation on ASCEND NPU by @FightingZhen in #2341
- [hardware] feat: Support AMD (ROCm kernel) - Update Dockerfile/Docker Image by @yushengsu-thu in #2390
- [hardware] fix: enable sleep mode on ASCEND NPU by @as12138 in #2459
- [hardware] chore: Enable Generation of Wheel File During Docker Build by @rhiremat in #2332
Misc fixes
- [ckpt] feat: support esi by @plutoZZZZ in #2192
- [model] fix: separate minicpmo data by @hiyouga in #2212
- [misc] chore: pin transformers under 4.53 by @hiyouga in #2241
- [worker] fix: OOM on first iteration in multi-turn RL by @zTonyZhao in #2253
- [algo] fix: correctly aggregate kl metrics in PPO actor by @0x404 in #2259
- [recipe] feat: add retool recipe by @wuxibin89 in #2233
- [cfg] fix: Security Enhancement Block Dangerous Modules in Sandbox Environment by @none0663 in #2170
- [cfg] chore: add non-negative expected_len assertion by @LeavesLei in #2330
- [algo] feat: mask out observation token in GAE by @wuxibin89 in #2337
- [tool] fix: avoid exception when sandbox return None by @chenhaiq in #2346
- [perf] feat: support entropy checkpointing without rmpad or sp by @FightingZhen in #2342
- [ckpt] fix: edit esi doc by @plutoZZZZ in #2354
- [docker] refactor: Migrate images to verlai, support latest flash attention and newer CUDA versions in the future by @ETOgaosion in #2085, #2147
- [data] feat: add interface for user-defined curriculum sampler by @frrad in #2314
- [cfg] fix: pickling error in multiprocessing in the reward_fn by @none0663 in #2239
- [ray] refactor: Separate the constants into a different file by @YeonwooSung in #2025
- [misc] refactor: replace pkg_resources with importlib.metadata by @askender in #2392
- [tool] fix: Add MCP usage documentation by @AlecHenx in #2261
- [cfg] refactor: make actor config more modular by @eric-haibin-lin in #2379
- [misc] fix: huggingface model config max_position_embeddings assertion for model with extended context length by @Wangmerlyn in #737
- [data] refactor: move sampler api to experimental by @eric-haibin-lin in #2381
- [perf] feat: add npu profiler for FSDP backend by @tongtong0613 in #2194
- [misc] refactor: Replace deepcopy with tensor.clone by @ji-huazhong in #2442
- [misc] fix: add *.yaml to pyproject due to modular config by @nanjiangwill in #2468
- [misc] feat: add py.typed file to `verl/` by @frrad in #2467
- [env] feat: upgrade tensordict version by @vermouth1992 in #2460
- [docker] feat: provide images with deepep by @ETOgaosion in #2480
- [training_utils] feat: log_generations_to_swanlab use table by @Zeyi-Lin in #2489
- [env] feat: safely bump py version to 3.10 by @Tavish9 in #2421
- [BUG] fix bug for #2506, when passing as response_mask to policy_loss_fn by @none0663 in #2513
- [single_controller] fix: replace unittest.mock.patch with context manager for env var handling by @PeterSH6 in #2498
- [recipe] fix: DAPO rewards using sandbox fusion by @HollowMan6 in #2496
- [cfg] refactor: support +extra.any_key usage for the base dataclass config in verl by @eric-haibin-lin in #2502
- [ray] refactor: Use public method to get node IP by @kevin85421 in #2521
- [env] fix: bump tensordict to 0.9.1 by @ultmaster in #2541
- [data] fix: Add missing init files in verl experimental data folders by @JoostvDoorn in #2548
- [ray] fix: strip [] for ipv6 address by @wuxibin89 in #2545
- [tool] fix: correctly convert 'None' to null in sandbox fusion _process_single_case by @mathewjhan in #2409
- [training_utils] fix: uneven support in split by @ultmaster in #2560
- [perf] feat: Clip gsm8k solution string to optimize reward calculation by @PopSoda2002 in #2568
- set use_kl_in_reward=True in reinforce_plus_plus by @Titanpku in #2580
- [cfg] feat: add critic config class by @eric-haibin-lin in #2583
- [tool] fix: supports variable arguments for marked_timer by @tardis-key in #2576
- [single_controller] fix: padding for kwargs by @ShareLer in #2585
- [docker] fix: downgrade TransformerEngine version to 2.2.1 to allow the mcore image to use rope fusion, and provide another set of v0.5 images by @ETOgaosion in #2611, #2292
- [recipe] feat: add Qwen 30b moe dapo script that can run on a single 80GB node by @vermouth1992 in #2645, #2636
- [perf] feat: mistral and gemma3_text mfu compute support by @xihuai18 in #2622
- [misc] fix: fix prompt and response key in gemma7b example by @apeforest in #2610
- [data, recipe] fix: remove redundant json parsing by @zhxieml in #2671
New Contributors
Welcome new contributors to the verl community! @rhiremat @LeavesLei @diqiuzhuanzhuan @frrad @shuyhere @askender @Tavish9 @Wangmerlyn @SuperCB @tongtong0613 @jwong8314 @ji-huazhong @Keilo001 @conver334 @JoostvDoorn @mathewjhan @PopSoda2002 @rudeigerc @Titanpku @firefighter-eric @meituan-search @xihuai18 @tardis-key @ZihengJiang @Pursuer-Hsf @beep-bebop @aaronyeeio @Hecate0821 @apeforest @zhxieml
Full Changelog: v0.4.1...v0.5.0