
Conversation

@Skylion007

The replica_id is not correct: with expert tensor parallelism (ETP), only one tensor replica of the expert weights gets saved. After this change, the file sizes of FSDP ETP models and pipeline-parallel non-ETP models match.
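For reference, here is a minimal sketch (not Megatron-Core's actual implementation; names and logic are simplified) of how replica_id typically decides which rank writes a shard, and why a wrong replica_id along the ETP axis shrinks the checkpoint:

```python
# Illustrative only; a simplification, not the real
# megatron.core.dist_checkpointing code.
from typing import Tuple

def is_main_replica(replica_id: Tuple[int, ...]) -> bool:
    # Only the rank whose replica_id is all zeros writes its shard;
    # every other replica assumes an identical copy is saved elsewhere.
    return all(r == 0 for r in replica_id)

# Expert weights are *sharded* across ETP ranks, so the ETP rank must not
# be folded into replica_id as a replication coordinate. If it is, every
# ETP rank except rank 0 skips saving its distinct shard, and the expert
# weights in the checkpoint shrink by roughly a factor of ETP.
```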

@copy-pr-bot

copy-pr-bot bot commented Aug 25, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@Skylion007 Skylion007 changed the title Fixes torch_dist checkpointing ETP replica_id Fix torch_dist checkpointing ETP replica_id Aug 25, 2025
@sbhavani sbhavani added bug Something isn't working module: moe labels Sep 9, 2025
@Skylion007
Author

Ping on this when you have a chance @sbhavani

@maanug-nv maanug-nv requested a review from yanring October 1, 2025 21:38
@maanug-nv maanug-nv added this to the Core 0.15 milestone Oct 1, 2025
@hxbai
Contributor

hxbai commented Oct 8, 2025

Hi @Skylion007, ETP is for TP splitting of the experts, so the expert weights should not be replicated along it.

Could you please paste the error log reported with strict load in issue #1836?

they are half the size they should be with TP2, 1/4 with TP4, 1/8 with TP8 etc

Does the TP here mean dense TP or expert TP?

@Skylion007
Author

Hi @Skylion007, ETP is for TP splitting of the experts, so the expert weights should not be replicated along it.

Could you please paste the error log reported with strict load in issue #1836?

they are half the size they should be with TP2, 1/4 with TP4, 1/8 with TP8 etc

Does the TP here mean dense TP or expert TP?

It means ETP in this context. TP in this PR is ETP because the method is wrapped with expert_dist_ckpt_decorator. I would have renamed the variable here, but I wanted to be consistent and prevent code churn.

Increasing ETP to 2, 4, 8, etc. shows the checkpoint size shrinking by roughly the factor of ETP (for an MoE model that consists mostly of MoE layers). I do not have any relevant log files still around since I last ran the broken code in August, and honestly I debugged it by round-tripping the checkpoint before I set log_all.
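For what it's worth, a quick way to reproduce the size observation is to compare on-disk checkpoint sizes across ETP settings; the directory layout below is hypothetical:

```python
# Hypothetical checkpoint paths; adjust to your layout.
from pathlib import Path

def dir_size_gib(path: str) -> float:
    # Total size of all files under the checkpoint directory, in GiB.
    return sum(p.stat().st_size for p in Path(path).rglob("*") if p.is_file()) / 2**30

for etp in (1, 2, 4, 8):
    ckpt = f"checkpoints/moe_etp{etp}/iter_0001000"
    print(f"ETP={etp}: {dir_size_gib(ckpt):.1f} GiB")
# With the bug, the totals fall roughly as 1/ETP instead of staying
# approximately constant for an MoE-dominated model.
```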

Also, a nit: that decorator should use a TypeVar or typing_extensions.ParamSpec with Callable to prevent type erasure of the method it decorates.
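A minimal sketch of that typing nit, assuming typing_extensions is available (or typing.ParamSpec on Python >= 3.10); the decorator body is a placeholder, not the real expert_dist_ckpt_decorator:

```python
import functools
from typing import Callable, TypeVar
from typing_extensions import ParamSpec  # typing.ParamSpec on Python >= 3.10

P = ParamSpec("P")
R = TypeVar("R")

def expert_dist_ckpt_decorator(fn: Callable[P, R]) -> Callable[P, R]:
    """Placeholder body; only the annotations matter for this nit."""
    @functools.wraps(fn)
    def wrapper(*args: P.args, **kwargs: P.kwargs) -> R:
        # e.g. temporarily switch parallel state for expert checkpointing
        return fn(*args, **kwargs)
    return wrapper
```

With ParamSpec the wrapped method keeps its full signature for type checkers, instead of being erased to Callable[..., Any].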

@Skylion007 Skylion007 marked this pull request as draft October 9, 2025 17:56
@Skylion007
Author

Skylion007 commented Oct 9, 2025

@hxbai I'll try to replicate it on a more recent version of Megatron; the checkpoints I do have are affected by this bug:
#1778

where it seems to delete layers from the model, while the TFLOPs calculation etc. all assumes the layers are still there when decoder-last-layers is set and PP is set to 1. Some of the other checkpoints were on the legacy FSDP path and are no longer relevant.
