
Incorrectly saved checkpoints when ETP is enabled: replica_id accidentally overwrites ETP shards due to missing tp_rank in key #1836

@Skylion007

Description

Describe the bug

When expert tensor parallel (ETP) checkpointing is enabled, only the experts' first tensor-parallel rank is saved, because the tp_rank is missing from the replica_id key for the GroupedLinear class.

#1770 closes this issue.
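
The sketch below is a simplified, self-contained model of how distributed-checkpoint deduplication can silently drop ETP shards when the tp_rank is not reflected in the replica_id. The `Shard` dataclass and `save_shards` helper are hypothetical stand-ins for illustration only, not the actual TransformerEngine / Megatron-Core APIs.

```python
# Simplified model of distributed-checkpoint deduplication
# (hypothetical names; not the actual TransformerEngine / Megatron-Core code).
from dataclasses import dataclass


@dataclass(frozen=True)
class Shard:
    key: str            # parameter name in the checkpoint
    replica_id: tuple   # ranks sharing this id are assumed to hold identical data
    data: bytes         # shard contents (placeholder)


def save_shards(shards):
    """Keep one shard per (key, replica_id); extra copies are treated as replicas."""
    saved = {}
    for s in shards:
        saved.setdefault((s.key, s.replica_id), s)
    return saved


# Two expert-TP ranks hold *different* halves of the same expert weight.
# Buggy behaviour: replica_id omits the expert TP rank, so rank 1's shard
# is silently dropped as a presumed duplicate of rank 0's.
buggy = [
    Shard("experts.linear_fc1.weight", replica_id=(0,), data=b"tp0-half"),
    Shard("experts.linear_fc1.weight", replica_id=(0,), data=b"tp1-half"),
]
assert len(save_shards(buggy)) == 1  # half of the expert weights are lost

# Fixed behaviour: the expert TP rank participates in the replica_id (or key),
# so both halves survive deduplication.
fixed = [
    Shard("experts.linear_fc1.weight", replica_id=(0, 0), data=b"tp0-half"),
    Shard("experts.linear_fc1.weight", replica_id=(0, 1), data=b"tp1-half"),
]
assert len(save_shards(fixed)) == 2
```

This mirrors the observed symptom: with TP2 only one of the two distinct shards per expert weight survives, so the checkpoint is half the expected size.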

Steps/Code to reproduce bug

  • Train with ETP and try to load the checkpoint with high strictness; note the error. Also observe the checkpoint sizes when training a large MoE such as DeepSeek with ETP: they are half the size they should be with TP2, a quarter with TP4, an eighth with TP8, and so on.

Expected behavior

  • All weights are saved
