
v0.5.0: agentic RL rollout, prototypes for disaggregated async training & GenerativeRM, better rollout load balance & improved sglang+megatron/vlm support

Released by @eric-haibin-lin on 23 Jul 18:20 · commit 8fdc4d3

Highlights

Agentic RL rollout interface [beta]

verl v0.5 introduces the AgentLoop abstraction, which makes it easy to extend rollout with custom tool/agent interactions. Server-based asynchronous rollout is adopted to utilize GPUs efficiently. verl provides several example agent loop implementations, including a LangGraph-based ReactAgentLoop (#2463).

Please check the documentation for the system architecture design.
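
For orientation, below is a minimal sketch of what a custom agent loop might look like. It is illustrative only: the class shape, `run` signature, `client.generate`, and `parse_tool_call` are assumptions based on these notes, not the authoritative AgentLoop interface; consult the documentation above for the real API.

```python
# Illustrative sketch of a custom agent loop; names and signatures here
# (run, client.generate, parse_tool_call) are assumptions, not verl's API.
class MyToolAgentLoop:
    """Alternate model calls and tool calls until a final answer is produced."""

    def __init__(self, server_client, tools: dict, max_turns: int = 4):
        self.client = server_client  # handle to the async rollout server
        self.tools = tools           # tool name -> callable
        self.max_turns = max_turns

    async def run(self, messages: list[dict], sampling_params: dict) -> dict:
        trajectory = list(messages)
        for _ in range(self.max_turns):
            reply = await self.client.generate(trajectory, **sampling_params)
            trajectory.append({"role": "assistant", "content": reply})
            call = self.parse_tool_call(reply)
            if call is None:  # no tool requested: the model gave a final answer
                break
            result = self.tools[call["name"]](**call["args"])
            trajectory.append({"role": "tool", "content": str(result)})
        return {"messages": trajectory}

    @staticmethod
    def parse_tool_call(text: str):
        """Placeholder: extract a {'name': ..., 'args': ...} tool call, if any."""
        return None
```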

Disaggregated placement & async training [prototype]

verl v0.5 includes a community-contributed one-step-off async training recipe, with the trainer and rollout deployed on disaggregated resources and off-policy model updates with staleness = 1. In a small-scale experiment, the reference recipe delivers a 20-40% throughput gain over the on-policy baseline, depending on the configuration. Please check out the code and documentation for example configurations.
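
The essence of the recipe is a two-stage pipeline: while the trainer consumes batch k, the rollout workers are already generating batch k+1 with weights that are one update behind. A conceptual sketch, assuming hypothetical `trainer`/`rollout` handles (`current_weights`, `update`, `generate`) rather than the recipe's actual classes:

```python
# Conceptual one-step-off loop (staleness = 1). `trainer` and `rollout` are
# hypothetical handles on disaggregated resources; not the recipe's real API.
import concurrent.futures as cf

def one_step_off_loop(trainer, rollout, prompt_batches, num_steps):
    pool = cf.ThreadPoolExecutor(max_workers=1)
    batches = iter(prompt_batches)
    weights = trainer.current_weights()                      # theta_0
    pending = pool.submit(rollout.generate, weights, next(batches))
    for _ in range(num_steps):
        data = pending.result()          # generated by a one-step-stale policy
        # Launch the next generation before updating, so rollout GPUs keep
        # working while trainer GPUs compute the gradient step (this overlap
        # is where the throughput gain comes from).
        pending = pool.submit(rollout.generate, weights, next(batches))
        weights = trainer.update(data)   # behavior policy lags by exactly one step
```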

Remote generative reward models [prototype]

A recipe is provided as a prototype to demonstrate the recommended way to use generative reward models in verl. See the documentation and code.
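
As a rough mental model, a generative RM turns reward computation into an inference call against a separately served judge model. A hedged sketch follows; the endpoint URL, model name, and prompt format are assumptions, not the recipe's actual configuration:

```python
# Hedged sketch of a generative reward function: score each response by
# querying a separately served judge model over an OpenAI-compatible HTTP
# API. Endpoint, model name, and prompt format below are assumptions.
import re
import requests

JUDGE_URL = "http://localhost:8000/v1/chat/completions"  # hypothetical server

def compute_score(prompt: str, response: str) -> float:
    judge_prompt = (
        "Rate the following answer from 0 to 10.\n"
        f"Question: {prompt}\nAnswer: {response}\nReply with a single number."
    )
    resp = requests.post(JUDGE_URL, json={
        "model": "judge-model",  # whatever model the RM server is serving
        "messages": [{"role": "user", "content": judge_prompt}],
        "temperature": 0.0,
    }, timeout=60)
    text = resp.json()["choices"][0]["message"]["content"]
    match = re.search(r"\d+(\.\d+)?", text)
    return float(match.group()) / 10.0 if match else 0.0  # normalize to [0, 1]
```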

New features

  • LoRA RL support for VLMs: #2182
  • Better checkpoint manager support for SFT trainer #2292
  • Support rollout trajectory tracing and RolloutViewer for improved debuggability and visualization
  • Megatron integration with mbridge, which improves loading HuggingFace models into Megatron #2064

Important fixes & improvements

  • Fixed an FSDP2 state_dict memory-usage issue caused by torch 2.6; using verl v0.5 or upgrading to torch 2.7 avoids the OOMs #2606
  • Significantly reduced the overhead of the vLLM async server (vs. the vLLM engine) #2246
  • Fixed SGLang + Megatron TP16 #2336
  • Sped up SGLang + Megatron weight resharding by 10x #2418 and MoE weight resharding by 3x #2692
  • Significantly improved rollout load balancing for GRPO-like algorithms by repeating samples before dispatching them #2324

Breaking changes and deprecations

Full list: #2270

Rollout

  • When calling generate_sequences with sampling param n>1, the DataProto repeat behavior changes:

    • chunk-dispatch-repeat (old): DataProto is chunked and dispatched to rollout workers, then repeated inside the workers.
    • repeat-chunk-dispatch (new): DataProto is repeated by n in the driver, then chunked and dispatched to rollout workers.
      The default switches from chunk-dispatch-repeat to repeat-chunk-dispatch; this may break almost all recipes and projects that use verl GRPO as a submodule (see the toy sketch after this list). #2324
  • verl.workers.rollout.sglang_rollout.AsyncSglangServer has been renamed to AsyncSGLangServer

  • vllm <= v0.6 support is dropped
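
To see the difference between the two orders, here is a toy illustration with plain Python lists (not verl's DataProto implementation; the exact interleaving verl applies when repeating is a detail of #2324):

```python
# Toy illustration of the two dispatch orders for n = 3 samples per prompt
# and 3 rollout workers. Plain lists only; not verl's DataProto.
def chunk(xs, k):
    """Split xs into k equal contiguous pieces."""
    step = len(xs) // k
    return [xs[i * step:(i + 1) * step] for i in range(k)]

prompts, n, workers = ["p0", "p1", "p2"], 3, 3

# chunk-dispatch-repeat (old): chunk prompts, each worker repeats locally,
# so all n samples of one prompt are pinned to a single worker.
old = [[p for p in shard for _ in range(n)] for shard in chunk(prompts, workers)]
assert old == [["p0"] * 3, ["p1"] * 3, ["p2"] * 3]

# repeat-chunk-dispatch (new): the driver repeats first, then chunks. With a
# non-interleaved repeat each worker sees a mix of prompts, so one prompt
# with very long generations no longer stalls a single worker.
repeated = prompts * n          # p0 p1 p2 p0 p1 p2 p0 p1 p2
new = chunk(repeated, workers)
assert new == [["p0", "p1", "p2"]] * 3
```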

Multi-turn

  • We are moving multi-turn support from ChatScheduler to AgentLoop to improve usability. #2124

Megatron

  • Megatron recomputation options have moved to *.megatron.override_transformer_config. #2651 The default values are (a usage example follows this list):

    ```yaml
    override_transformer_config:
      recompute_granularity: null
      recompute_modules:
        - core_attn
      recompute_method: null
      recompute_num_layers: null
    ```

  • Merged the per-role configs actor_rollout_ref.{actor,ref,rollout}.profiler into a single actor_rollout_ref.profiler
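
For example, to enable full uniform activation recomputation at the new location, an override like the following should work (the values follow standard Megatron-LM recompute semantics; treat the exact nesting as an assumption to verify against #2651):

```yaml
# Hedged example: enable full, uniform activation recomputation via the new
# override location. Values follow standard Megatron-LM semantics.
actor_rollout_ref:
  actor:
    megatron:
      override_transformer_config:
        recompute_granularity: full   # 'full' or 'selective'
        recompute_method: uniform     # 'uniform' or 'block'
        recompute_num_layers: 1       # checkpoint each transformer layer
```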

What's Changed

Trainer & FSDP

  • [fsdp] fix: Change the data in the update_actor function from .to('cpu') to .to(get_device_id()) by @Keilo001 in #2477
  • [fsdp] fix: vlm dynamic batch & unify dynamic batch api by @hiyouga in #2524
  • [fsdp] fix: change geo3k model name from non-vl to vl by @nanjiangwill in #2555
  • [trainer, recipe] feat: add support for external generative reward models by @yyDing1 in #2121
  • [trainer] fix: fix split placement by @vermouth1992 in #2227
  • [trainer, vllm] feat: add lora exclude_modules to support VL model lora training by @Cccei000 in #2182
  • [trainer] fix: pre-commit broken by #2354 by @ETOgaosion in #2358
  • [trainer, cfg] feat: add BaseConfig for all dataclass configs. Introduce dataclass for algorithm related configs by @eric-haibin-lin in https://github.com/
  • [trainer] fix: Use safe masked mean/sum to handle NaN values outside the mask by @Yangruipis in #2377
  • [trainer, data] feat: Dynamic Data Generation by @jwong8314 in #2312
  • [trainer] fix: use .keys() to check 'response_mask' in TensorDict by @askender in #2491
  • [trainer] fix: Allow FSDP2 when doing strategy check by @HollowMan6 in #2497
  • [trainer] refactor: no need to call load_reward_manager in compute_reward_async by @eric-haibin-lin in #2557
  • [trainer, fsdp, vllm, recipe] feat: one step off async training recipe by @imh966 in #2231
  • [trainer] fix: maybe_filter_out_long_prompts on image and video by @firefighter-eric in #2553
  • [trainer] refactor: Training Engine Interface and Development Plan by @ZihengJiang in #1977
  • [trainer] feat: Add FSDPCheckpointManager for the SFT trainer, supporting resume training and managing the number of checkpoints to keep by @Pursuer-Hsf in #2292

Rollout & SGLang

  • [rollout] feat: add agent loop by @wuxibin89 in #2124
  • [rollout] feat: add zeromq vllm distributed executor by @wuxibin89 in #2246
  • [BREAKING][rollout] refactor: drop vllm v0.5.4 and v0.6.3 support by @eric-haibin-lin in #2257
  • [rollout] feat: Allow customization of async server class by @ultmaster in #2326
  • [rollout] fix: fix hf rollout and add single gpu test by @eric-haibin-lin in #2371
  • [BREAKING][rollout] feat: repeat DataProto when n>1 in driver instead of rollout workers by @wuxibin89 in #2324
  • [misc] feat: trace rollout generation and tool calls using weave by @chenhaiq in #2345
  • [cfg] refactor: make the rollout & ref configs more modular by @eric-haibin-lin in #2410
  • [perf] feat: add range tag to start/stop profile; clean actor_rollout_ref.profiler by @davidmlw in #2456
  • [rollout] feat: support mlflow in rollout trace by @chenhaiq in #2440
  • [rollout] feat: add ReactAgentLoop based on LangGraph by @wuxibin89 in #2463
  • [rollout] fix: fix bug for remax when the rollout mode is async by @none0663 in #2574
  • [tool] chore: introduce RolloutViewer TUI tools by @Yangruipis in #2469
  • [rollout,vllm] fix: A major issue in random sampling of vllm engine by @guanning03 in #2646
  • [tool] chore: Add log for AsyncRolloutRequest ID, and RolloutViewer support for request id display and search by @Hecate0821 in https://github.com/volcengine/
  • [rollout] fix: use flashattn3 backend in sglang to avoid error in tool call by @chenhaiq in #2244
  • [rollout] fix: Make free_cache_engine option workable in latest vLLM/SGLang by @HollowMan6 in #1464
  • [rollout] fix: #1646 stop words for sglang rollout by @linxxx3 in #1991
  • [sglang, rollout] refactor: use torch.Tensor in async rollout schemas by @nanjiangwill in #2362
  • [rollout] fix: sglang async fail with Multi-stage Awake feature by @chenhaiq in #2365
  • [sglang] feat: Add multi-interaction registry support and testing by @SwordFaith in #2184
  • [sglang] feat: Repeat sampling parameter n into requests of GRPO in SGLang by @zhaochenyang20 in #2258
  • [sglang,tool] feat: Add support for tools that generate multimodal data by @nanjiangwill in #2146
  • [sglang] fix: only wake up weights on infer_tp 0 by @zhaochenyang20 in #2403
  • [sglang] fix: Import Error in the latest sglang by @yyDing1 in #2275
  • [sglang] fix: Fix qwen2vl weight keys issue by @hebiao064 in #2434
  • [sglang] fix: Only flush cache on TP rank=0. by @SuperCB in #2455
  • [sglang] feat: update weights in batch with FSDP by @zhaochenyang20 in #2559
  • [sglang] fix: adding missing param for sgl async unit test by @zhaochenyang20 in #2561
  • [sglang] fix: update response handling and scoring method in GSM8K interaction by @aaronyeeio in #2428
  • [sglang] fix: rename Sglang to SGLang, following SGLang's naming convention by @zhaochenyang20 in #2672
  • [sglang] fix: Bug in megatron+sglang TP16 update_weights. by @SuperCB in #2336
  • [sglang, megatron, perf] feat: speed up megatron sglang weight update by 10x by @Yangruipis in #2418
  • [megatron] fix: wrong response_mask for megatron + sglang multi-turn by @Yangruipis in #2543

Megatron

Hardware

  • [hardware] feat: support ray actor sharing situation on ASCEND NPU by @FightingZhen in #2341
  • [hardware] feat: Support AMD (ROCm kernel) - update Dockerfile/Docker image by @yushengsu-thu in #2390
  • [hardware] fix: enable sleep mode on ASCEND NPU by @as12138 in #2459
  • [hardware] chore: Enable generation of wheel file during Docker build by @rhiremat in #2332

Misc fixes

New Contributors

Welcome new contributors to the verl community! @rhiremat @LeavesLei @diqiuzhuanzhuan @frrad @shuyhere @askender @Tavish9 @Wangmerlyn @SuperCB @tongtong0613 @jwong8314 @ji-huazhong @Keilo001 @conver334 @JoostvDoorn @mathewjhan @PopSoda2002 @rudeigerc @Titanpku @firefighter-eric @meituan-search @xihuai18 @tardis-key @ZihengJiang @Pursuer-Hsf @beep-bebop @aaronyeeio @Hecate0821 @apeforest @zhxieml

Full Changelog: v0.4.1...v0.5.0