Fix torch_dist checkpointing ETP replica_id #1770
base: main
Conversation
Ping on this when you have a chance @sbhavani
Hi @Skylion007, ETP is the TP splitting of the experts, and the expert weights should not be replicated along that dimension. Could you please paste the error log from the strict load reported in issue #1836? Does the TP here mean dense TP or expert TP?
It means ETP in this context. TP in this PR is ETP because the method is wrapped with expert_dist_ckpt_decorator. Increasing ETP to 2, 4, 8, etc. shows a reduction in checkpoint size by roughly the factor of ETP (for an MoE model that consists mostly of MoE layers). I do not have any relevant log files still around, since I last ran the broken code in August; honestly, I debugged it by round-tripping the checkpoint before I set log_all. Also a nit: that decorator should use a TypeVar and typing_extensions.ParamSpec with Callable to prevent type erasure of the method it decorates.
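To illustrate the typing nit above, here is a minimal sketch of a signature-preserving decorator using ParamSpec. The decorator name matches expert_dist_ckpt_decorator from the discussion, but the body is a hypothetical stand-in (the real decorator swaps parallel-state groups); only the typing pattern is the point.

```python
import functools
from typing import Callable, TypeVar

try:
    from typing import ParamSpec  # Python 3.10+
except ImportError:  # pragma: no cover
    from typing_extensions import ParamSpec

P = ParamSpec("P")
R = TypeVar("R")

def expert_dist_ckpt_decorator(fn: Callable[P, R]) -> Callable[P, R]:
    """Wrap fn without erasing its parameter types or return type.

    Annotating with Callable[P, R] (instead of bare Callable) lets type
    checkers see the wrapped method's exact signature through the wrapper.
    """
    @functools.wraps(fn)
    def wrapper(*args: P.args, **kwargs: P.kwargs) -> R:
        # Hypothetical placeholder: the real decorator would switch to
        # expert parallel groups around this call.
        return fn(*args, **kwargs)
    return wrapper

@expert_dist_ckpt_decorator
def sharded_state_dict(prefix: str, count: int) -> str:
    return f"{prefix}:{count}"
```

With a plain `Callable` annotation, a type checker would report the decorated method as accepting `...` and returning `Any`; with ParamSpec, calls like `sharded_state_dict(1, 2)` are flagged as type errors.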
@hxbai I'll try to replicate it on a more recent version of Megatron. Looking at the checkpoints I do have, they are a result of this bug: layers appear to be deleted from the model, while the TFLOPs calculation etc. still assumes the layers are present when decoder-last-layers is set and PP is set to 1. Some of the other checkpoints were on the legacy FSDP path and are no longer relevant.
The replica_id is not correct: with expert tensor parallelism (ETP), only one tensor replica of the expert weights gets saved. After this change, the checkpoint file sizes of FSDP ETP models and pipeline-parallel non-ETP models match.
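The failure mode described above can be sketched without the Megatron-Core API. This is a toy simulation, assuming the usual distributed-checkpoint deduplication rule that a tensor is written only from ranks whose replica_id is all zeros; the function and its arguments are hypothetical, chosen to show why putting the ETP rank into replica_id drops shards.

```python
def saved_shards(etp_size: int, dp_size: int,
                 etp_rank_in_replica_id: bool) -> set:
    """Return the set of ETP shard indices that actually reach disk.

    Each ETP rank holds a *distinct* shard of an expert weight, so the
    ETP rank must not appear in replica_id; only data parallelism (and
    true replication dims) duplicates data.
    """
    saved = set()
    for dp in range(dp_size):
        for etp in range(etp_size):
            # Buggy variant counts the ETP rank as a replica index.
            replica_id = (0, etp if etp_rank_in_replica_id else 0, dp)
            # Dedup rule: only the all-zeros replica writes its data.
            if all(r == 0 for r in replica_id):
                saved.add(etp)
    return saved

# Buggy replica_id: only shard 0 survives, so the checkpoint shrinks by
# roughly a factor of ETP and the other expert shards are silently lost.
print(saved_shards(4, 2, etp_rank_in_replica_id=True))   # {0}
# Fixed replica_id: all four ETP shards are written exactly once.
print(saved_shards(4, 2, etp_rank_in_replica_id=False))  # {0, 1, 2, 3}
```

This matches the symptoms in the thread: the broken checkpoints shrink by roughly the ETP factor, and layers appear to be missing on load.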