
v0.4.0 release: large MoEs, tool calling, and low resource friendly

@eric-haibin-lin released this 06 Jun 23:55

Highlights

Large MoE models support: DeepSeek 671b & Qwen3 235b

Preview features are provided to enable large MoE RL training with the Megatron backend; see the DeepSeek 671b documentation. The Megatron backend now supports:

  • expert parallelism, context parallelism, gradient checkpointing
  • DeepSeek-V3, Qwen3-235b, Mixtral, Moonlight
  • dist-ckpt support
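
As a rough sketch of how these pieces compose in a launch script (the Megatron option names below are assumptions and may differ from the shipped configs; follow the DeepSeek 671b documentation for the verified recipe):

# hypothetical Megatron-backend overrides for a large MoE run
actor_rollout_ref.actor.strategy=megatron
actor_rollout_ref.actor.megatron.expert_model_parallel_size=8
actor_rollout_ref.actor.megatron.context_parallel_size=2
actor_rollout_ref.actor.megatron.use_dist_checkpointing=True
actor_rollout_ref.model.enable_gradient_checkpointing=True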

Tool-calling, multi-turn RL, SGLang rollout

Sample-level rollout with tool calling and multi-turn RL is now supported via SGLang, and we provide the Search-R1 recipe built on top of it.
A prototype for sample-level async tool calling is also available with the vLLM AsyncLLM server.
Multiple enhancements have been made to the SGLang rollout, including multi-node and multimodal support.
Sandbox fusion is integrated.
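
A minimal sketch of turning on the SGLang multi-turn rollout with tools (the option names are assumptions and may not match the release exactly; the Search-R1 recipe is the verified reference):

# hypothetical overrides for SGLang multi-turn rollout with tool calling
actor_rollout_ref.rollout.name=sglang
actor_rollout_ref.rollout.multi_turn.enable=True
actor_rollout_ref.rollout.multi_turn.tool_config_path=./config/tool_config.yaml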

Low resource friendly

LoRA support is available, enabling 70B+ models on a single node with A100x8 GPUs.
A fused cross-entropy kernel drastically reduces peak memory usage; enable it with actor_rollout_ref.model.use_fused_kernels=True.
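
For example, a low-resource run might combine LoRA with the fused kernel; the LoRA option names below are assumptions and may differ from the actual config keys:

# hypothetical LoRA + fused-kernel overrides for a single-node A100x8 run
actor_rollout_ref.model.lora_rank=32
actor_rollout_ref.model.lora_alpha=16
actor_rollout_ref.model.target_modules=all-linear
actor_rollout_ref.model.use_fused_kernels=True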

New models, algorithms and recipes

New models and training utils include:

FSDP2 and training optimizations

FSDP2 is now recommended over FSDP1: it provides better throughput and memory usage, and is composable with other features (e.g. torch.compile):

actor_rollout_ref.ref.strategy=fsdp2
actor_rollout_ref.actor.strategy=fsdp2
critic.strategy=fsdp2 
reward_model.strategy=fsdp2 

Furthermore, FSDP2 CPU offloading is compatible with gradient accumulation. You can turn it on to save memory with actor_rollout_ref.actor.offload_policy=True.
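
Putting the two together, a memory-constrained actor could be configured as follows (only the options above are taken from this release; the combination itself is a sketch):

actor_rollout_ref.actor.strategy=fsdp2
actor_rollout_ref.actor.offload_policy=True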

Other optimizations include:

Deployment and hardware

  • Easy deployment with dstack
  • Enhancements to non-nvidia GPUs

Breaking changes and deprecations

  • FSDPSFTTrainer now requires the dataset arguments #1282
  • SFTDataset and RLHFDataset now take a config as the input #924
  • entropy_coeff now defaults to 0 #1770
  • FSDP1 support will be dropped in the next release.
  • vllm v0.5.4 support will be dropped in the next release.
  • A few options are now included in the default yaml files, so existing scripts that override them with +{config}={value} may throw errors. Please try removing the + prefix to fix such errors (see the example after this list).
    • ppo_trainer.yaml: trainer.val_before_train
    • sft_trainer.yaml: data.{prompt,response}_dict_keys
  • verl.utils.reward_score._default_compute_score is deprecated. Use verl.utils.reward_score.default_compute_score instead.
  • The name of the Ray actor will change from "WorkerDict_xxxx" to "FusedWorker_xxxx", and the name of tasks will change from "{cls_name}_{method_name}" to "fuw_execute".
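
For instance, an override that previously needed a leading + must now be passed without it (trainer.val_before_train from the list above is used as the example):

# before: now raises an error because the key already exists in ppo_trainer.yaml
python3 -m verl.trainer.main_ppo +trainer.val_before_train=False ...
# after
python3 -m verl.trainer.main_ppo trainer.val_before_train=False ...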

New Contributors

@zhao9797 @frederrx @dingyuan-shi @SwordFaith @CJReinforce @linjc16 @wkcn @hijkzzz @JustinTong0323 @mertunsall @Altair-Alpha @czczup @SparkJiao @sunjin-k @tsaoyu @XueruiSu @zhaochenyang20 @NascentAscension @corgilee @lei-lei @pengsun @silverriver @mingruimingrui @Ann-Qin @lilei199908 @YeonwooSung @himalalps @tao-githup @as12138 @thibautbar @aoshen524 @MantasBaksys @YangWang92 @patrik-bartak @mansicer @wangfuchun-fc @survivi @RainBowLuoCS @gzpan @HuaizhengZhang @HollowMan6 @zTonyZhao @lxg2015 @estsauver @jhinpan @yhyang201 @qingquansong @chenhaiq @ShareLer @Artessay @Jackory @swtheing @U-rara @Andrewzh112 @mansoor-s @Necolizer @llkn-2 @yuyuz @linxxx3 @gaokaiz2 @ccchow @ezyang @zw0610 @pavelgein @plutoZZZZ @jybsuper @hebiao064 @GaotangLi @zhangyongxin121 @spacegoing @cedricbeta @Geaming2002 @imh966 @zyzshishui @zzong2006 @langfengQ @zheliuyu @casper-hansen @Bihan @czx6858 @GHGmc2 @DtYXs @thelongestusernameofall @xichengpro @Irvingwangjr @shinytang6 @qyhfrank @mlpod @popomen @liyc-ai @leo-pony @LiuXTao @Lins-01 @yzlnew @vllbc @ZDJeffrey @sukrucildirr @Moyu-42 @YRdddream @jdf-prog @HUGHNew @ElliottYan @NileZhou @shizhediao @rj42 @Crispig @omahs @CurryRice233 @china10s
Thank you for your first contributions!

Full Changelog: v0.3.0.post1...v0.4.0