Fix torch_dist checkpointing ETP replica_id #1770
base: main
Conversation
Ping on this when you have a chance @sbhavani
Hi @Skylion007, ETP is the TP splitting of the experts, and the expert weights should not be replicated along that dimension. Could you please paste the error log from the strict load reported in issue #1836? Does the TP here mean dense TP or expert TP?
It means ETP in this context. TP in this PR is ETP because the method is wrapped with expert_dist_ckpt_decorator. Increasing ETP to 2, 4, 8, etc. shows a reduction in checkpoint size by roughly the factor of ETP (for an MoE model that consists mostly of MoE layers). I do not have any relevant log files still around, since I last ran the broken code in August; honestly, I debugged it by round-tripping the checkpoint before I set log_all. Also a nit: that decorator should use a TypeVar and typing_extensions.ParamSpec with Callable to prevent type erasure of the method it decorates.
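To illustrate the typing nit above, here is a minimal sketch of a signature-preserving decorator using ParamSpec. The decorator name matches expert_dist_ckpt_decorator from the discussion, but the body is a hypothetical stand-in (the real decorator swaps parallel-state groups); only the typing pattern is the point.

```python
import functools
from typing import Callable, TypeVar

try:
    from typing import ParamSpec  # Python 3.10+
except ImportError:  # pragma: no cover
    from typing_extensions import ParamSpec

P = ParamSpec("P")
R = TypeVar("R")

def expert_dist_ckpt_decorator(fn: Callable[P, R]) -> Callable[P, R]:
    """Wrap fn without erasing its parameter types or return type.

    Annotating with Callable[P, R] (instead of bare Callable) lets type
    checkers see the wrapped method's exact signature through the wrapper.
    """
    @functools.wraps(fn)
    def wrapper(*args: P.args, **kwargs: P.kwargs) -> R:
        # Hypothetical placeholder: the real decorator would switch to
        # expert parallel groups around this call.
        return fn(*args, **kwargs)
    return wrapper

@expert_dist_ckpt_decorator
def sharded_state_dict(prefix: str, count: int) -> str:
    return f"{prefix}:{count}"
```

With a plain `Callable` annotation, a type checker would report the decorated method as accepting `...` and returning `Any`; with ParamSpec, calls like `sharded_state_dict(1, 2)` are flagged as type errors.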
@hxbai I'll try to replicate it on a more recent version of Megatron. Looking at the checkpoints I do have, they are a result of this bug: layers appear to be deleted from the model, while the TFLOPs calculation etc. still assumes the layers are present when decoder-last-layers is set and PP is set to 1. Some of the other checkpoints were on the legacy FSDP path and are no longer relevant.
The replica_id is not correct: with expert tensor parallelism (ETP), only one tensor replica of the expert weights gets saved. After this change, the checkpoint file sizes of FSDP ETP models and pipeline-parallel non-ETP models match.
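The failure mode described above can be sketched without the Megatron-Core API. This is a toy simulation, assuming the usual distributed-checkpoint deduplication rule that a tensor is written only from ranks whose replica_id is all zeros; the function and its arguments are hypothetical, chosen to show why putting the ETP rank into replica_id drops shards.

```python
def saved_shards(etp_size: int, dp_size: int,
                 etp_rank_in_replica_id: bool) -> set:
    """Return the set of ETP shard indices that actually reach disk.

    Each ETP rank holds a *distinct* shard of an expert weight, so the
    ETP rank must not appear in replica_id; only data parallelism (and
    true replication dims) duplicates data.
    """
    saved = set()
    for dp in range(dp_size):
        for etp in range(etp_size):
            # Buggy variant counts the ETP rank as a replica index.
            replica_id = (0, etp if etp_rank_in_replica_id else 0, dp)
            # Dedup rule: only the all-zeros replica writes its data.
            if all(r == 0 for r in replica_id):
                saved.add(etp)
    return saved

# Buggy replica_id: only shard 0 survives, so the checkpoint shrinks by
# roughly a factor of ETP and the other expert shards are silently lost.
print(saved_shards(4, 2, etp_rank_in_replica_id=True))   # {0}
# Fixed replica_id: all four ETP shards are written exactly once.
print(saved_shards(4, 2, etp_rank_in_replica_id=False))  # {0, 1, 2, 3}
```

This matches the symptoms in the thread: the broken checkpoints shrink by roughly the ETP factor, and layers appear to be missing on load.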