
Incorrectly saved checkpoints when ETP is enabled: replica_id accidentally overwrites ETP shards due to missing tp_rank in key #1836

@Skylion007

Description

Describe the bug

When expert tensor parallel (ETP) checkpointing is enabled, only the experts' first tensor-parallel rank is saved, because the tp_rank is missing from the replica_id key for the GroupedLinear class.

#1770 closes this issue.
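
The sketch below is a simplified, self-contained model of how distributed-checkpoint deduplication can silently drop ETP shards when the tp_rank is not reflected in the replica_id. The `Shard` dataclass and `save_shards` helper are hypothetical stand-ins for illustration only, not the actual TransformerEngine / Megatron-Core APIs.

```python
# Simplified model of distributed-checkpoint deduplication
# (hypothetical names; not the actual TransformerEngine / Megatron-Core code).
from dataclasses import dataclass


@dataclass(frozen=True)
class Shard:
    key: str            # parameter name in the checkpoint
    replica_id: tuple   # ranks sharing this id are assumed to hold identical data
    data: bytes         # shard contents (placeholder)


def save_shards(shards):
    """Keep one shard per (key, replica_id); extra copies are treated as replicas."""
    saved = {}
    for s in shards:
        saved.setdefault((s.key, s.replica_id), s)
    return saved


# Two expert-TP ranks hold *different* halves of the same expert weight.
# Buggy behaviour: replica_id omits the expert TP rank, so rank 1's shard
# is silently dropped as a presumed duplicate of rank 0's.
buggy = [
    Shard("experts.linear_fc1.weight", replica_id=(0,), data=b"tp0-half"),
    Shard("experts.linear_fc1.weight", replica_id=(0,), data=b"tp1-half"),
]
assert len(save_shards(buggy)) == 1  # half of the expert weights are lost

# Fixed behaviour: the expert TP rank participates in the replica_id (or key),
# so both halves survive deduplication.
fixed = [
    Shard("experts.linear_fc1.weight", replica_id=(0, 0), data=b"tp0-half"),
    Shard("experts.linear_fc1.weight", replica_id=(0, 1), data=b"tp1-half"),
]
assert len(save_shards(fixed)) == 2
```

This mirrors the observed symptom: with TP2 only one of the two distinct shards per expert weight survives, so the checkpoint is half the expected size.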

Steps/Code to reproduce bug

  • Train with ETP and try to load the checkpoint with high strictness; note the error. Also observe the checkpoint sizes when training a large MoE such as DeepSeek with ETP: they are half the size they should be with TP2, a quarter with TP4, an eighth with TP8, and so on.

Expected behavior

  • All weights are saved
