Describe the bug
When expert tensor parallelism is enabled, only the expert weight shards from the first tensor-parallel rank are saved in the checkpoint, due to a missing key in the replica_id for the GroupedLinear class.
#1770 closes this issue.
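For context, a minimal sketch of why this loses data (the `save_shards` function and shard dicts below are hypothetical illustrations, not the actual Megatron-Core code): in distributed checkpointing, shards with the same key and offsets are assumed to be byte-identical replicas, distinguished only by replica_id, and only the main replica is written to disk.

```python
# Hypothetical sketch (NOT the actual Megatron-Core code) of how
# replica_id-based deduplication can silently drop expert shards.
# Shards sharing a key/offset are assumed to be identical replicas;
# only the "main" replica (replica_id == 0) is saved.

def save_shards(shards):
    """Write only main replicas, mirroring dedup-by-replica_id."""
    return [s for s in shards if s["replica_id"] == 0]

# Bug: GroupedLinear weights are genuinely different on each expert
# tensor-parallel (ETP) rank, but are registered as if replicated
# across ETP, with the ETP rank folded into replica_id.
shards = [
    {"key": "experts.weight1", "global_offset": (0,), "replica_id": etp_rank}
    for etp_rank in range(4)  # ETP=4: four distinct shards
]

print(len(save_shards(shards)))  # 1 -> 3/4 of the expert weights are lost
```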
Steps/Code to reproduce bug
- Train with expert tensor parallelism (ETP) enabled, then try to load the checkpoint with strict validation and note the error. Also inspect the saved checkpoints when training a large MoE such as DeepSeek with ETP: the expert weights are half the size they should be with ETP=2, a quarter with ETP=4, an eighth with ETP=8, and so on (see the size-check sketch below).
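A rough way to observe the size symptom (a sketch; the checkpoint directory layout below is a hypothetical placeholder): train the same model with ETP 1, 2, 4, 8 and compare the on-disk checkpoint sizes. With the bug, the expert portion shrinks by roughly 1/ETP.

```python
# Sketch: compare checkpoint sizes across ETP degrees.
# The "checkpoints/etp{N}/iter_0001000" paths are placeholders.
import os

def dir_size_gb(path):
    """Total size of all files under `path`, in GiB."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for f in files:
            total += os.path.getsize(os.path.join(root, f))
    return total / 1024**3

for etp in (1, 2, 4, 8):
    ckpt = f"checkpoints/etp{etp}/iter_0001000"  # hypothetical layout
    if os.path.isdir(ckpt):
        print(f"ETP={etp}: {dir_size_gb(ckpt):.2f} GiB")
```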
Expected behavior
All expert tensor-parallel ranks of the GroupedLinear weights should be saved, so the checkpoint contains the full expert weights regardless of the ETP degree and loads cleanly under strict validation.