v0.4.0 release: large MoEs, tool calling, and low resource friendly #1903

eric-haibin-lin · 2025-06-07T05:49:57Z

eric-haibin-lin
Jun 7, 2025
Maintainer

Known issues

Known issues are summarized in "known issues in v0.4 & breaking changes after v0.4" (#1902).

Highlights

Large MoE models support: DeepSeek 671b & Qwen3 235b

Preview features enable RL training of large MoE models with the Megatron backend; see the DeepSeek 671b documentation for an example. The Megatron backend now supports:

expert parallelism, context parallelism, gradient checkpointing
DeepSeek-V3, Qwen3-235b, Mixtral, Moonlight
dist-ckpt support

Tool-calling, multi-turn RL, SGLang rollout

Sample-level rollout with tool calling and multi-turn RL is supported via SGLang. We provide the Search-R1 recipe built on top of that.
A prototype for sample-level async tool calling is also available with the vLLM AsyncLLM server.
SGLang rollout received multiple enhancements, including multi-node and multimodal support.
Sandbox fusion is integrated.

Low resource friendly

LoRA support is available, enabling 70B+ models on a single node with 8x A100 GPUs.
A fused cross-entropy kernel drastically reduces peak memory. Enable it with actor_rollout_ref.model.use_fused_kernels=True

New models, algorithms and recipes

Documentation for PPO and GRPO
Recipe: DAPO
Recipe: Self-Play Fine-Tuning (SPIN)
Recipe: Self-Play Preference Optimization (SPPO)
OPO: On-Policy RL with Optimal Reward Baseline, DrGRPO, REINFORCE++, Dual-Clip PPO

New models and training utils include:

kimi_vl example
qwen3 example
video inputs support
Warmup-Stable-Decay scheduler
rope scaling
evals for GPQA, livecodebench
logging to ClearML
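
For readers unfamiliar with the Warmup-Stable-Decay scheduler listed above, here is a minimal sketch of the general schedule shape. It is not verl's implementation: the function name `wsd_lr`, its parameters, and the choice of linear warmup and linear decay are all assumptions made for illustration, and the scheduler shipped in this release may use different shapes or defaults.

```python
def wsd_lr(step, total_steps, peak_lr, warmup_steps, decay_steps, min_lr=0.0):
    """Learning rate at a 0-indexed `step` for a warmup-stable-decay schedule.

    Hypothetical helper: linear warmup to `peak_lr`, a constant plateau,
    then a linear decay to `min_lr` over the final `decay_steps` steps.
    """
    if step < warmup_steps:
        # Warmup phase: ramp linearly up to the peak.
        return peak_lr * (step + 1) / warmup_steps
    decay_start = total_steps - decay_steps
    if step < decay_start:
        # Stable phase: hold the peak learning rate.
        return peak_lr
    # Decay phase: interpolate linearly from peak_lr to min_lr.
    progress = min(1.0, (step - decay_start) / max(1, decay_steps))
    return peak_lr + (min_lr - peak_lr) * progress


if __name__ == "__main__":
    for s in (0, 9, 50, 90, 100):
        print(s, wsd_lr(s, total_steps=100, peak_lr=1.0, warmup_steps=10, decay_steps=20))
```

The long plateau is what distinguishes this shape from a cosine schedule: the decay phase can be started from any checkpoint taken during the stable phase.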

FSDP2 and training optimizations

FSDP2 is recommended to replace FSDP1, providing better throughput and memory usage, and is composable with other features (e.g. torch.compile):

actor_rollout_ref.ref.strategy=fsdp2
actor_rollout_ref.actor.strategy=fsdp2
critic.strategy=fsdp2
reward_model.strategy=fsdp2

Furthermore, FSDP2 CPU offloading is compatible with gradient accumulation. To save memory, turn it on with actor_rollout_ref.actor.offload_policy=True.
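
The overrides above can be collected in one place when a launch script is generated programmatically. The sketch below is illustrative only: the helper name `fsdp2_overrides` is hypothetical, the override strings are copied verbatim from these notes, and the `verl.trainer.main_ppo` entry point is an assumption that should be checked against your own launch script.

```python
def fsdp2_overrides(offload_policy=False):
    """Return the command-line overrides that switch every role to FSDP2.

    Hypothetical helper; the override strings are taken from the release notes.
    """
    overrides = [
        "actor_rollout_ref.ref.strategy=fsdp2",
        "actor_rollout_ref.actor.strategy=fsdp2",
        "critic.strategy=fsdp2",
        "reward_model.strategy=fsdp2",
    ]
    if offload_policy:
        # Optional CPU offloading, using the key as written in the release notes.
        overrides.append("actor_rollout_ref.actor.offload_policy=True")
    return overrides


# Assumed entry point; append the rest of your usual arguments after these.
command = ["python3", "-m", "verl.trainer.main_ppo", *fsdp2_overrides(offload_policy=True)]

if __name__ == "__main__":
    print(" ".join(command))
```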

Other optimizations include:

Activation offloading
Ulysses sequence parallelism for VLMs
compute reward during log_prob for the PPO trainer
timeline for Ray profiling

Deployment and hardware

Easy deployment with dstack
Enhancements to non-NVIDIA GPUs

Breaking changes and deprecations

FSDPSFTTrainer now requires the dataset arguments (#1282, "[trainer] breaking: pass dataset as required args to SFTTrainer; also change ppo ray trainer to take custom datasets as inputs").
SFTDataset and RLHFDataset now take a config as the input (#924, "feat: support customized datasets for RayPPOTrainer").
entropy_coeff now defaults to 0 (#1770, "[BREAKING] config: set the default value of actor.entropy_coeff to 0").
FSDP1 support will be dropped in the next release.
vLLM v0.5.4 support will be dropped in the next release.
A few options are now included in the default YAML files, so existing scripts that pass them as +{config}={value} may throw errors. Remove the + to fix such errors. The affected options are:
- ppo_trainer.yaml: trainer.val_before_train
- sft_trainer.yaml: data.{prompt,response}_dict_keys
verl.utils.reward_score._default_compute_score is deprecated. Use verl.utils.reward_score.default_compute_score instead.
The name of Ray actors will change from "WorkerDict_xxxx" to "FusedWorker_xxxx", and the name of tasks will change from "{cls_name}_{method_name}" to "fuw_execute".
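
For the +{config}={value} errors listed above, the fix can be sketched as a small filter over a script's override list. This is a hedged illustration, not part of verl: the helper name `drop_stale_plus` and the `NOW_IN_DEFAULTS` set are hypothetical, and the set simply expands the keys named in these notes. It relies on the Hydra convention that a leading + appends a new key and fails when the key already exists in the config.

```python
# Keys that these release notes say are now part of the default YAML files.
NOW_IN_DEFAULTS = {
    "trainer.val_before_train",   # ppo_trainer.yaml
    "data.prompt_dict_keys",      # sft_trainer.yaml
    "data.response_dict_keys",    # sft_trainer.yaml
}


def drop_stale_plus(overrides):
    """Strip the leading '+' from overrides whose key now exists in the defaults.

    Hypothetical helper: other overrides, including '++key=value', are left untouched.
    """
    fixed = []
    for item in overrides:
        key = item.split("=", 1)[0]
        is_append = key.startswith("+") and not key.startswith("++")
        if is_append and key[1:] in NOW_IN_DEFAULTS:
            item = item[1:]
        fixed.append(item)
    return fixed


if __name__ == "__main__":
    old_args = ["+trainer.val_before_train=False", "+custom.flag=1"]
    print(drop_stale_plus(old_args))
```

Overrides for keys that are still absent from the defaults keep their +, so only the options affected by this release are rewritten.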

New Contributors

@zhao9797 @frederrx @dingyuan-shi @SwordFaith @CJReinforce @linjc16 @wkcn @hijkzzz @JustinTong0323 @mertunsall @Altair-Alpha @czczup @SparkJiao @sunjin-k @tsaoyu @XueruiSu @zhaochenyang20 @NascentAscension @corgilee @lei-lei @pengsun @silverriver @mingruimingrui @Ann-Qin @lilei199908 @YeonwooSung @himalalps @tao-githup @as12138 @thibautbar @aoshen524 @MantasBaksys @YangWang92 @patrik-bartak @mansicer @wangfuchun-fc @survivi @RainBowLuoCS @gzpan @HuaizhengZhang @HollowMan6 @zTonyZhao @lxg2015 @estsauver @jhinpan @yhyang201 @qingquansong @chenhaiq @ShareLer @Artessay @Jackory @swtheing @U-rara @Andrewzh112 @mansoor-s @Necolizer @llkn-2 @yuyuz @linxxx3 @gaokaiz2 @ccchow @ezyang @zw0610 @pavelgein @plutoZZZZ @jybsuper @hebiao064 @GaotangLi @zhangyongxin121 @spacegoing @cedricbeta @Geaming2002 @imh966 @zyzshishui @zzong2006 @langfengQ @zheliuyu @casper-hansen @Bihan @czx6858 @GHGmc2 @DtYXs @thelongestusernameofall @xichengpro @Irvingwangjr @shinytang6 @qyhfrank @mlpod @popomen @liyc-ai @leo-pony @LiuXTao @Lins-01 @yzlnew @vllbc @ZDJeffrey @sukrucildirr @Moyu-42 @YRdddream @jdf-prog @HUGHNew @ElliottYan @NileZhou @shizhediao @rj42 @Crispig @omahs @CurryRice233 @china10s
Thank you for your first contributions!

Full Changelog: v0.3.0.post1...v0.4.0
