Conversation

@zhengchenyu
Contributor

The current train_ddp.py has two problems:

  • It cannot guarantee that every sample is read. For example, if the replica group world size is 3 but only 2 replicas are actually participating, the samples assigned to the missing replica are never read (see the sketch after this list).
  • When the replica group world size changes, the total batch size used for gradient aggregation changes with it, so the computation cannot be kept idempotent across restarts.
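A toy illustration of the first problem, using a plain DistributedSampler configured for three replicas while only two of them iterate; the dataset and variable names here are made up purely for illustration:

```python
import torch
from torch.utils.data import DistributedSampler, TensorDataset

dataset = TensorDataset(torch.arange(9))  # samples 0..8

# The sampler is built for a replica group of 3, so it splits the data 3 ways...
samplers = [
    DistributedSampler(dataset, num_replicas=3, rank=r, shuffle=False)
    for r in range(3)
]

# ...but only replicas 0 and 1 are actually training, so rank 2's shard
# (samples 2, 5, 8) is never read by anyone.
for r in (0, 1):
    print(f"rank {r} reads indices:", list(samplers[r]))
# rank 0 reads indices: [0, 3, 6]
# rank 1 reads indices: [1, 4, 7]
```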

The following modifications were made:

  • A SkipDistributedSampler is provided so that training can resume from any sample offset (sketched after this list).
  • The dataloader is reconfigured whenever the quorum changes.
  • For training rounds that run right after initialization or after a quorum change, the commit is abandoned by setting a dirty flag.
  • An example, train_ddp2.py, is added.
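A minimal, self-contained sketch of how the skipping sampler and the dataloader reconfiguration could fit together. The class name SkipDistributedSampler comes from the description above, but the skip_samples parameter, the make_dataloader helper, and the concrete numbers are assumptions and may differ from the actual train_ddp2.py:

```python
import torch
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset


class SkipDistributedSampler(DistributedSampler):
    """DistributedSampler that skips the first `skip_samples` indices of this
    rank's shard, so training can resume from an arbitrary offset."""

    def __init__(self, dataset, skip_samples=0, **kwargs):
        super().__init__(dataset, **kwargs)
        self.skip_samples = skip_samples

    def __iter__(self):
        indices = list(super().__iter__())
        return iter(indices[self.skip_samples:])

    def __len__(self):
        return max(super().__len__() - self.skip_samples, 0)


def make_dataloader(dataset, num_replicas, rank, consumed, batch_size):
    # Called after every quorum change: the sampler is rebuilt with the new
    # replica world size and skips the samples this rank already processed.
    sampler = SkipDistributedSampler(
        dataset,
        num_replicas=num_replicas,
        rank=rank,
        shuffle=False,
        skip_samples=consumed,
    )
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)


if __name__ == "__main__":
    data = TensorDataset(torch.arange(12))
    # Quorum shrank to 2 replicas after rank 0 had already consumed 2 samples.
    loader = make_dataloader(data, num_replicas=2, rank=0, consumed=2, batch_size=2)
    for batch in loader:
        print(batch)  # two batches covering samples 4, 6 and 8, 10
```

On a quorum change the loader would be rebuilt this way with the new replica world size and the count of samples already consumed, and the dirty flag described above would suppress the commit for that first rebuilt round.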
