Conversation

@zhengchenyu
Contributor

The current train_ddp.py has two problems:

  • It cannot guarantee that every sample is read. For example, if the replica group world size is 3 but only 2 replicas are actually participating, the samples assigned to the missing replica are never read (see the sketch after this list).
  • When the replica group world size changes, the total batch size used for gradient aggregation changes with it, so the computation cannot be kept idempotent across restarts.
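A toy illustration of the first problem, using a plain DistributedSampler configured for three replicas while only two of them iterate; the dataset and variable names here are made up purely for illustration:

```python
import torch
from torch.utils.data import DistributedSampler, TensorDataset

dataset = TensorDataset(torch.arange(9))  # samples 0..8

# The sampler is built for a replica group of 3, so it splits the data 3 ways...
samplers = [
    DistributedSampler(dataset, num_replicas=3, rank=r, shuffle=False)
    for r in range(3)
]

# ...but only replicas 0 and 1 are actually training, so rank 2's shard
# (samples 2, 5, 8) is never read by anyone.
for r in (0, 1):
    print(f"rank {r} reads indices:", list(samplers[r]))
# rank 0 reads indices: [0, 3, 6]
# rank 1 reads indices: [1, 4, 7]
```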

The following modifications were made:

  • A SkipDistributedSampler is provided so that training can resume from any sample offset (sketched after this list).
  • The dataloader is reconfigured whenever the quorum changes.
  • For training rounds that run right after initialization or after a quorum change, the commit is abandoned by setting a dirty flag.
  • An example, train_ddp2.py, is added.
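A minimal, self-contained sketch of how the skipping sampler and the dataloader reconfiguration could fit together. The class name SkipDistributedSampler comes from the description above, but the skip_samples parameter, the make_dataloader helper, and the concrete numbers are assumptions and may differ from the actual train_ddp2.py:

```python
import torch
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset


class SkipDistributedSampler(DistributedSampler):
    """DistributedSampler that skips the first `skip_samples` indices of this
    rank's shard, so training can resume from an arbitrary offset."""

    def __init__(self, dataset, skip_samples=0, **kwargs):
        super().__init__(dataset, **kwargs)
        self.skip_samples = skip_samples

    def __iter__(self):
        indices = list(super().__iter__())
        return iter(indices[self.skip_samples:])

    def __len__(self):
        return max(super().__len__() - self.skip_samples, 0)


def make_dataloader(dataset, num_replicas, rank, consumed, batch_size):
    # Called after every quorum change: the sampler is rebuilt with the new
    # replica world size and skips the samples this rank already processed.
    sampler = SkipDistributedSampler(
        dataset,
        num_replicas=num_replicas,
        rank=rank,
        shuffle=False,
        skip_samples=consumed,
    )
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)


if __name__ == "__main__":
    data = TensorDataset(torch.arange(12))
    # Quorum shrank to 2 replicas after rank 0 had already consumed 2 samples.
    loader = make_dataloader(data, num_replicas=2, rank=0, consumed=2, batch_size=2)
    for batch in loader:
        print(batch)  # two batches covering samples 4, 6 and 8, 10
```

On a quorum change the loader would be rebuilt this way with the new replica world size and the count of samples already consumed, and the dirty flag described above would suppress the commit for that first rebuilt round.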
