[RFC] modular engine interface for training backends (FSDP, FSDP2, torchtitan, Megatron, PAI-Megatron, etc) #1560
Replies: 4 comments
cc @wwwjn
Hi @eric-haibin-lin, I would love to learn what gaps remain before FSDP2 can become the default choice for OSS users. We can provide support from the PyTorch side in case OSS users hit issues.
No obvious blockers for now. We'll promote FSDP2 and ask more users to try it, making sure it is stable before switching to FSDP2 by default.
Please help retweet :) https://x.com/verl_project/status/1920656559237198140
Motivation
Terms:
Currently, the training engines (Megatron/FSDP) are coupled with RL roles (e.g., actor/critic). This makes it hard to write unit tests for each module. Ideally, FSDP/Megatron engine development would rely on standalone tests that simply run forward/loss/backward/update. To achieve this separation of concerns, the engines should be standalone modules separated from ActorRolloutRefWorker. This would make it much easier for the community to integrate different training backends, such as internal Megatron forks, torchtitan, or other parallel training engines.
On the other hand, the codebase contains repeated code that wraps modules with FSDP or Megatron for each role (e.g. reward, critic). To stay DRY, we need to modularize distributed model creation and put the FSDP/Megatron model fwd/bwd code into a single class.
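As a rough illustration of the kind of standalone test meant here, the sketch below uses a hypothetical `ToyEngine` (not part of verl) that fits a single scalar weight in plain Python. It only shows the forward/loss/backward/update loop being exercised with no RL role involved; a real FSDP or Megatron engine would replace the arithmetic with distributed model calls.

```python
from dataclasses import dataclass


@dataclass
class ToyEngine:
    """Hypothetical stand-in for a training engine; fits y = w * x."""

    w: float = 0.0
    lr: float = 0.1
    grad: float = 0.0

    def forward(self, x: float) -> float:
        return self.w * x

    def loss(self, pred: float, target: float) -> float:
        return (pred - target) ** 2

    def backward(self, x: float, pred: float, target: float) -> None:
        # d/dw (w * x - target)^2 = 2 * (pred - target) * x
        self.grad = 2.0 * (pred - target) * x

    def update(self) -> None:
        # Plain gradient descent step, then clear the stored gradient.
        self.w -= self.lr * self.grad
        self.grad = 0.0


def run_step(engine: ToyEngine, x: float, target: float) -> float:
    """One forward/loss/backward/update step; returns the pre-update loss."""
    pred = engine.forward(x)
    value = engine.loss(pred, target)
    engine.backward(x, pred, target)
    engine.update()
    return value


def test_engine_reduces_loss() -> None:
    # No actor, critic, or rollout worker is needed to run this.
    engine = ToyEngine()
    losses = [run_step(engine, x=1.0, target=2.0) for _ in range(20)]
    assert losses[-1] < losses[0]


test_engine_reduces_loss()
```

The point of the sketch is the shape of the test: the engine is constructed and stepped on its own, so a regression in the training path would surface without launching a full RL job.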
Proposed Changes
Refactor actors (e.g. actor.py) to be agnostic to training engine backends, while the training engines expose the same interface to actors.
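One way to picture this split is sketched below. Every name in it (`BaseEngine`, `forward_backward`, `optimizer_step`, `register_engine`, the toy backends) is an assumption made for illustration and is not the interface proposed in this RFC or an existing verl API. The actor owns only the role-specific loss, and each backend implements the same two methods.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, List, Type

LossFn = Callable[[List[float]], float]


class BaseEngine(ABC):
    """Hypothetical backend-agnostic engine interface."""

    @abstractmethod
    def forward_backward(self, batch: List[float], loss_fn: LossFn) -> float:
        """Run the model on a batch and evaluate the role-specific loss."""

    @abstractmethod
    def optimizer_step(self) -> None:
        """Apply one optimizer update."""


# Registry mapping a backend name to its engine class.
ENGINES: Dict[str, Type[BaseEngine]] = {}


def register_engine(name: str):
    def decorator(cls: Type[BaseEngine]) -> Type[BaseEngine]:
        ENGINES[name] = cls
        return cls

    return decorator


@register_engine("toy_a")
class ToyEngineA(BaseEngine):
    """Toy backend; a real one would wrap the model with its parallelism here."""

    def __init__(self) -> None:
        self.steps = 0

    def forward_backward(self, batch: List[float], loss_fn: LossFn) -> float:
        outputs = [2.0 * x for x in batch]  # pretend model forward
        return loss_fn(outputs)

    def optimizer_step(self) -> None:
        self.steps += 1


@register_engine("toy_b")
class ToyEngineB(ToyEngineA):
    """Second toy backend with the same interface and a different forward."""

    def forward_backward(self, batch: List[float], loss_fn: LossFn) -> float:
        outputs = [x + 1.0 for x in batch]  # pretend model forward
        return loss_fn(outputs)


class Actor:
    """Role logic only: supplies the loss, never touches backend internals."""

    def __init__(self, engine: BaseEngine) -> None:
        self.engine = engine

    def update_policy(self, batch: List[float]) -> float:
        loss = self.engine.forward_backward(
            batch, loss_fn=lambda out: sum(out) / len(out)
        )
        self.engine.optimizer_step()
        return loss


# The same actor code runs against every registered backend.
for name, engine_cls in ENGINES.items():
    Actor(engine_cls()).update_policy([1.0, 2.0])
```

Under this arrangement, swapping backends is a matter of choosing a different registry key, and a critic or reward role could reuse the same engines by passing a different loss function.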
Engine Interfaces
FSDP
Megatron, following the same interface:
Execution plan:
The following plan involves gradual changes, which lowers the risk of regressions or conflicts.
and update all SFT example scripts to use main_sft.py
In theory, once (1) is done, parallel efforts can begin, such as integrating FSDP2 and other parallelism strategies into the engine code, or enabling engines for non-NVIDIA GPUs.
Feedback Period
5/2 - 5/9
CC List
Credit to ZR, @vermouth1992 @Frag17 and the verl team
cc @ETOgaosion @ccclyu @wconstab @lxg2015 @mori360 @weifengpy @PeterSH6 @yushengsu-thu @Chendong98 @as12138