Highlights
Support for large MoE models: DeepSeek 671B & Qwen3 235B
Preview features enable large MoE RL training with the Megatron backend; see the DeepSeek 671B documentation. The Megatron backend now supports:
- expert parallelism, context parallelism, and gradient checkpointing
- DeepSeek-V3, Qwen3-235B, Mixtral, and Moonlight
- distributed checkpointing (dist-ckpt)
Tool-calling, multi-turn RL, SGLang rollout
Sample-level rollout with tool calling and multi-turn RL is supported via SGLang, and we provide the Search-R1 recipe built on top of it.
A prototype for sample-level async tool calling is also available with the vLLM AsyncLLM server.
SGLang rollout has received multiple enhancements, including multi-node and multimodal support.
Sandbox Fusion is also integrated.
Low-resource friendly
LoRA support is available, enabling training of 70B+ models on a single node with 8x A100 GPUs.
A fused cross-entropy kernel drastically reduces peak memory; enable it with actor_rollout_ref.model.use_fused_kernels=True
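As an illustrative sketch (the entry point and the data/model paths below are placeholders, not taken from this release note), the fused kernel is a single override appended to the usual launch command:

```shell
# Hypothetical PPO launch; only the last override is the new flag from this release.
python3 -m verl.trainer.main_ppo \
    data.train_files=$HOME/data/train.parquet \
    actor_rollout_ref.model.path=Qwen/Qwen2.5-7B-Instruct \
    actor_rollout_ref.model.use_fused_kernels=True
```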
New models, algorithms and recipes
- Documentation for PPO and GRPO
- Recipe: DAPO
- Recipe: Self-Play Fine-Tuning (SPIN)
- Recipe: Self-Play Preference Optimization (SPPO)
- Algorithms: OPO (On-Policy RL with Optimal Reward Baseline), Dr. GRPO, REINFORCE++, and Dual-Clip PPO
New models and training utils include:
- kimi_vl example
- qwen3 example
- video input support
- Warmup-Stable-Decay learning-rate scheduler
- RoPE scaling
- evals for GPQA and LiveCodeBench
- logging to ClearML
FSDP2 and training optimizations
FSDP2 is recommended as a replacement for FSDP1: it provides better throughput and memory usage, and it composes with other features (e.g. torch.compile). Enable it with:
actor_rollout_ref.ref.strategy=fsdp2
actor_rollout_ref.actor.strategy=fsdp2
critic.strategy=fsdp2
reward_model.strategy=fsdp2
Furthermore, FSDP2 CPU offloading is compatible with gradient accumulation; turn it on to save memory with actor_rollout_ref.actor.offload_policy=True.
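Putting the overrides above together, a sketch of a full launch command (the entry point is an assumption; the strategy and offload flags are the ones listed in this release note):

```shell
# Hypothetical launch enabling FSDP2 everywhere plus CPU offloading for the actor.
python3 -m verl.trainer.main_ppo \
    actor_rollout_ref.actor.strategy=fsdp2 \
    actor_rollout_ref.ref.strategy=fsdp2 \
    critic.strategy=fsdp2 \
    reward_model.strategy=fsdp2 \
    actor_rollout_ref.actor.offload_policy=True
```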
Other optimizations include:
- Activation offloading
- Ulysses sequence parallelism for VLMs
- Reward computation during log_prob for the PPO trainer
- Timeline view for Ray profiling
Deployment and hardware
- Easy deployment with dstack
- Enhancements for non-NVIDIA GPUs
Breaking changes and deprecations
- FSDPSFTTrainer now requires the dataset arguments #1282
- SFTDataset and RLHFDataset now take a config as the input #924
- entropy_coeff now defaults to 0 #1770
- FSDP1 support will be dropped in the next release.
- vllm v0.5.4 support will be dropped in the next release.
- A few options are now included in the default yaml files, so existing scripts that pass them with Hydra's + prefix (as +{config}={value}) may throw errors. Removing the + fixes such errors. Affected options:
  - ppo_trainer.yaml: trainer.val_before_train
  - sft_trainer.yaml: data.{prompt,response}_dict_keys
- verl.utils.reward_score._default_compute_score is deprecated. Use verl.utils.reward_score.default_compute_score instead.
- The names of Ray actors change from "WorkerDict_xxxx" to "FusedWorker_xxxx", and task names change from "{cls_name}_{method_name}" to "fuw_execute".
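For example (the entry point is a placeholder), an option that previously had to be added with the + prefix is now part of the default yaml and must be passed without it:

```shell
# Before (now errors, because trainer.val_before_train already exists in ppo_trainer.yaml):
#   python3 -m verl.trainer.main_ppo +trainer.val_before_train=False ...
# After (drop the leading +):
python3 -m verl.trainer.main_ppo trainer.val_before_train=False ...
```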
New Contributors
@zhao9797 @frederrx @dingyuan-shi @SwordFaith @CJReinforce @linjc16 @wkcn @hijkzzz @JustinTong0323 @mertunsall @Altair-Alpha @czczup @SparkJiao @sunjin-k @tsaoyu @XueruiSu @zhaochenyang20 @NascentAscension @corgilee @lei-lei @pengsun @silverriver @mingruimingrui @Ann-Qin @lilei199908 @YeonwooSung @himalalps @tao-githup @as12138 @thibautbar @aoshen524 @MantasBaksys @YangWang92 @patrik-bartak @mansicer @wangfuchun-fc @survivi @RainBowLuoCS @gzpan @HuaizhengZhang @HollowMan6 @zTonyZhao @lxg2015 @estsauver @jhinpan @yhyang201 @qingquansong @chenhaiq @ShareLer @Artessay @Jackory @swtheing @U-rara @Andrewzh112 @mansoor-s @Necolizer @llkn-2 @yuyuz @linxxx3 @gaokaiz2 @ccchow @ezyang @zw0610 @pavelgein @plutoZZZZ @jybsuper @hebiao064 @GaotangLi @zhangyongxin121 @spacegoing @cedricbeta @Geaming2002 @imh966 @zyzshishui @zzong2006 @langfengQ @zheliuyu @casper-hansen @Bihan @czx6858 @GHGmc2 @DtYXs @thelongestusernameofall @xichengpro @Irvingwangjr @shinytang6 @qyhfrank @mlpod @popomen @liyc-ai @leo-pony @LiuXTao @Lins-01 @yzlnew @vllbc @ZDJeffrey @sukrucildirr @Moyu-42 @YRdddream @jdf-prog @HUGHNew @ElliottYan @NileZhou @shizhediao @rj42 @Crispig @omahs @CurryRice233 @china10s
Thank you for your first contributions!
Full Changelog: v0.3.0.post1...v0.4.0