Commit f5ee704

Added example training scripts for localsgd, DiLoCo, Live Checkpoint Recovery, and proactive failure detection with DDP, along with CI (#198)
1 parent 7b550aa commit f5ee704

File tree

19 files changed: +1600 −51 lines


.github/workflows/examples.yaml

Lines changed: 58 additions & 0 deletions
New file, 58 lines:

```yaml
name: Examples

on:
  push:
    branches:
      - main
  pull_request:

jobs:
  unittest:
    strategy:
      fail-fast: false
      matrix:
        include:
          - runs-on: "linux.2xlarge"
            gpu-arch-type: "cpu"
            gpu-arch-version: ""
            torch-version: "stable"
          - runs-on: "linux.g5.12xlarge.nvidia.gpu"
            gpu-arch-type: "cuda"
            gpu-arch-version: "12.4"
            torch-version: "stable"
          - runs-on: "linux.g5.12xlarge.nvidia.gpu"
            gpu-arch-type: "cuda"
            gpu-arch-version: "12.4"
            torch-version: "nightly"

    uses: pytorch/test-infra/.github/workflows/linux_job_v2.yml@main
    with:
      timeout: 120
      runner: ${{ matrix.runs-on }}
      gpu-arch-type: ${{ matrix.gpu-arch-type }}
      gpu-arch-version: ${{ matrix.gpu-arch-version }}
      script: |
        set -ex

        # install python and protobuf
        conda create -n venv python=3.12 libprotobuf -y
        conda activate venv
        python -m pip install --upgrade pip

        # install recent version of Rust via rustup
        curl https://sh.rustup.rs -sSf | sh -s -- --default-toolchain=stable --profile=default -y
        . "$HOME/.cargo/env"

        # Optionally install torch nightly, pulls latest CUDA from pip otherwise
        if [ "${{ matrix.torch-version }}" = "nightly" ]; then
          pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu128
        fi
        if [ "${{ matrix.torch-version }}" = "test" ]; then
          pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/test/cu128
        fi

        # Install dependencies
        pip install -e .[dev] -v

        # Run tests
        pytest examples/test_examples.py
```
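The CI's final step runs `pytest examples/test_examples.py`, but that file is not shown in this diff. As a rough sketch only, a harness along these lines could smoke-test each example under `QUICK_RUN` — the script paths and timeout here are hypothetical, not taken from the commit:

```python
# Hypothetical sketch of examples/test_examples.py (not shown in this diff).
# Assumes each example can be smoke-tested by launching its train script
# as a subprocess with QUICK_RUN=1 (few steps, synthetic data).
import os
import subprocess
import sys
from pathlib import Path

import pytest

EXAMPLES_DIR = Path(__file__).parent
EXAMPLE_SCRIPTS = [  # assumed names, one per example directory
    "localsgd/train_localsgd.py",
    "diloco/train_diloco.py",
    "live_checkpoint_recovery/train_live_checkpoint_recovery.py",
    "ddp_proactive/train_ddp_proactive.py",
]


@pytest.mark.parametrize("script", EXAMPLE_SCRIPTS)
def test_example_runs(script: str) -> None:
    env = dict(os.environ, QUICK_RUN="1")
    proc = subprocess.run(
        [sys.executable, str(EXAMPLES_DIR / script)],
        env=env,
        timeout=600,
    )
    assert proc.returncode == 0
```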

README.md

Lines changed: 6 additions & 39 deletions
````diff
@@ -79,15 +79,14 @@ We have a minimal DDP train loop that highlights all of the key components in torchft.
 
 See [train_ddp.py](./train_ddp.py) for more info.
 
+### Advanced Examples
 
-### DiLoCo
-
-LocalSGD and DiLoCo are currently experimental.
-
-See
-[the diloco_train_loop/local_sgd_train_loop tests](./torchft/local_sgd_integ_test.py)
-for an example on how to integrate these algorithms into your training loop.
+See the [examples/README.md](./examples/README.md) for advanced examples. Currently, the following examples are available:
 
+- [DDP with proactive failure recovery](./examples/ddp_proactive/README.md): Demonstrates DDP with proactive failure recovery mode
+- [DiLoCo](./examples/diloco/README.md): Demonstrates Distributed Local Convergence training
+- [LocalSGD](./examples/localsgd/README.md): Demonstrates Local SGD with periodic synchronization
+- [Live Checkpoint Recovery](./examples/live_checkpoint_recovery/README.md): Demonstrates live checkpoint recovery
 
 ## Design
 
@@ -246,38 +245,6 @@ CUDA_VISIBLE_DEVICES=1 TORCHFT_LIGHTHOUSE=http://localhost:29510 torchrun --mast
 
 By observing the outputs from both shells, you should observe process group reconfiguration and live checkpoint recovery.
 
-### Proactive Failure Recovery Mode (Experimental)
-
-You can experiment with proactive failure recovery mode by:
-
-```sh
-export TORCHFT_PROACTIVE_RECOVERY=1
-```
-
-With this enabled, the manager will listen to the Lighthouse server for heartbeat failures of other replica groups and break from a hanging allreduce.
-
-You can test this out by running `train_ddp_proactive.py`
-
-On shell 1 (one replica group starts initial training):
-```sh
-export REPLICA_GROUP_ID=0
-export NUM_REPLICA_GROUPS=2
-export TORCHFT_PROACTIVE_RECOVERY=1
-
-CUDA_VISIBLE_DEVICES=0 TORCHFT_LIGHTHOUSE=http://localhost:29510 torchrun --master_port=29600 --nnodes=1 --nproc_per_node=1 -- train_ddp_proactive.py
-```
-
-On shell 2 (a second replica group joins):
-```sh
-export REPLICA_GROUP_ID=1
-export NUM_REPLICA_GROUPS=2
-export TORCHFT_PROACTIVE_RECOVERY=1
-
-CUDA_VISIBLE_DEVICES=1 TORCHFT_LIGHTHOUSE=http://localhost:29510 torchrun --master_port=29601 --nnodes=1 --nproc_per_node=1 -- train_ddp_proactive.py
-```
-
-You should observe that the process with replica group id 1 will exit early, and the process with replica group id 0 will quickly resume training. If the same script is run after setting `export TORCHFT_PROACTIVE_RECOVERY=0`, you should observe that the process with replica group id 1 will hang for dozens of seconds before continuing.
-
 ### Example Parameter Server
 
 torchft has a fault tolerant parameter server implementation built on its
````
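For context on what the moved proactive-recovery section exercises: `train_ddp_proactive.py` builds on the same `Manager`-wrapped DDP loop as the basic example, with proactive recovery toggled via the environment. The sketch below follows torchft's documented wrapper API but is not the example script itself; exact constructor arguments may differ.

```python
# Minimal sketch of a torchft-managed DDP loop (not train_ddp_proactive.py
# itself). Manager/DistributedDataParallel/Optimizer follow torchft's
# documented quick-start API; treat details as approximate.
import os

import torch
from torch import nn, optim

from torchft import DistributedDataParallel, Manager, Optimizer, ProcessGroupGloo

# Proactive recovery is enabled via the environment, per the section above.
os.environ.setdefault("TORCHFT_PROACTIVE_RECOVERY", "1")

model = nn.Linear(2, 3)
manager = Manager(
    pg=ProcessGroupGloo(),
    load_state_dict=model.load_state_dict,  # used for live checkpoint recovery
    state_dict=model.state_dict,
)
model = DistributedDataParallel(manager, model)
optimizer = Optimizer(manager, optim.AdamW(model.parameters()))

for step in range(1000):
    batch = torch.rand(2, 2)
    optimizer.zero_grad()  # also begins the quorum for this step
    loss = model(batch).sum()
    loss.backward()
    optimizer.step()  # only commits if the fault tolerant step succeeded
```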

examples/README.md

Lines changed: 37 additions & 0 deletions
New file, 37 lines:

````markdown
# TorchFT Examples

This directory contains advanced examples demonstrating various fault tolerance features and training approaches in TorchFT beyond the basic `train_ddp.py` example in the [README](../README.md).

Each directory contains a README with more detailed instructions, as well as extensive documentation on the feature being showcased and how to interpret the outputs.

## List of Examples

- [DDP with proactive failure recovery](./ddp_proactive/README.md): Demonstrates DDP with proactive failure recovery mode
- [DiLoCo](./diloco/README.md): Demonstrates Distributed Local Convergence training
- [LocalSGD](./localsgd/README.md): Demonstrates Local SGD with periodic synchronization
- [Live Checkpoint Recovery](./live_checkpoint_recovery/README.md): Demonstrates live checkpoint recovery

## Running the examples

First, start the lighthouse server:

```sh
RUST_BACKTRACE=1 torchft_lighthouse --min_replicas 1 --quorum_tick_ms 100 --join_timeout_ms 10000
```

Then `cd` into the example directory:

```sh
cd examples/[example_directory]
```

and launch the example with torchX:

```sh
export QUICK_RUN=1
torchx run
```

The `QUICK_RUN` environment variable runs the examples for far fewer steps and uses a synthetic rather than downloaded dataset; it is useful for testing the examples quickly.

See the `.torchxconfig` file in each example directory for configuration details, and [torchx.py](../torchft/torchx.py) and the [torchX documentation](https://pytorch.org/torchx/latest/) to understand how DDP is run.
````
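The example scripts themselves are not part of this excerpt; the snippet below is a hypothetical illustration of how a training script might branch on `QUICK_RUN` (the dataset helper name is made up):

```python
# Hypothetical QUICK_RUN handling; the real examples' flag handling may differ.
import os

import torch
from torch.utils.data import DataLoader, TensorDataset

QUICK_RUN = os.environ.get("QUICK_RUN") == "1"

if QUICK_RUN:
    # A tiny synthetic dataset and a handful of steps, for fast smoke tests.
    num_steps = 10
    dataset = TensorDataset(
        torch.randn(64, 3, 32, 32), torch.randint(0, 10, (64,))
    )
else:
    num_steps = 10_000
    dataset = download_real_dataset()  # placeholder for the example's real dataset

loader = DataLoader(dataset, batch_size=8)
```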

examples/ddp_proactive/.torchxconfig

Lines changed: 7 additions & 0 deletions
New file, 7 lines:

```ini
[cli:run]
component=../../torchft/torchx.py:hsdp
scheduler=local_cwd


[component:../../torchft/torchx.py:hsdp]
script=train_ddp_proactive.py
```
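The `hsdp` component referenced above lives in `torchft/torchx.py`, which is not shown in this commit excerpt. A torchX component is just a function that returns a `specs.AppDef`; the `[component:...]` section supplies default arguments to it (here, `script`). The sketch below is a hypothetical illustration of that shape, not the real component:

```python
# Hypothetical sketch of a torchX component like torchft/torchx.py:hsdp.
# The real component is not shown in this diff; names and defaults are
# assumptions for illustration.
import torchx.specs as specs


def hsdp(script: str = "train.py", replicas: int = 2) -> specs.AppDef:
    return specs.AppDef(
        name="torchft-example",
        roles=[
            specs.Role(
                name="trainer",
                image=".",  # local_cwd scheduler runs from the repo checkout
                entrypoint="torchrun",
                args=["--nnodes=1", "--nproc_per_node=1", script],
                num_replicas=replicas,
            )
        ],
    )
```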
