Commit f5ee704

Added example training scripts for localsgd, DiLoCo, Live Checkpoint Recovery, and proactive failure detection with DDP, along with CI (#198)
1 parent 7b550aa commit f5ee704

File tree

19 files changed: +1600 −51 lines


.github/workflows/examples.yaml

Lines changed: 58 additions & 0 deletions
New file, 58 lines:

```yaml
name: Examples

on:
  push:
    branches:
      - main
  pull_request:

jobs:
  unittest:
    strategy:
      fail-fast: false
      matrix:
        include:
          - runs-on: "linux.2xlarge"
            gpu-arch-type: "cpu"
            gpu-arch-version: ""
            torch-version: "stable"
          - runs-on: "linux.g5.12xlarge.nvidia.gpu"
            gpu-arch-type: "cuda"
            gpu-arch-version: "12.4"
            torch-version: "stable"
          - runs-on: "linux.g5.12xlarge.nvidia.gpu"
            gpu-arch-type: "cuda"
            gpu-arch-version: "12.4"
            torch-version: "nightly"

    uses: pytorch/test-infra/.github/workflows/linux_job_v2.yml@main
    with:
      timeout: 120
      runner: ${{ matrix.runs-on }}
      gpu-arch-type: ${{ matrix.gpu-arch-type }}
      gpu-arch-version: ${{ matrix.gpu-arch-version }}
      script: |
        set -ex

        # install python and protobuf
        conda create -n venv python=3.12 libprotobuf -y
        conda activate venv
        python -m pip install --upgrade pip

        # install recent version of Rust via rustup
        curl https://sh.rustup.rs -sSf | sh -s -- --default-toolchain=stable --profile=default -y
        . "$HOME/.cargo/env"

        # Optionally install torch nightly, pulls latest CUDA from pip otherwise
        if [ "${{ matrix.torch-version }}" = "nightly" ]; then
          pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu128
        fi
        if [ "${{ matrix.torch-version }}" = "test" ]; then
          pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/test/cu128
        fi

        # Install dependencies
        pip install -e .[dev] -v

        # Run tests
        pytest examples/test_examples.py
```
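The CI's final step runs `pytest examples/test_examples.py`, but that file is not shown in this diff. As a rough sketch only, a harness along these lines could smoke-test each example under `QUICK_RUN` — the script paths and timeout here are hypothetical, not taken from the commit:

```python
# Hypothetical sketch of examples/test_examples.py (not shown in this diff).
# Assumes each example can be smoke-tested by launching its train script
# as a subprocess with QUICK_RUN=1 (few steps, synthetic data).
import os
import subprocess
import sys
from pathlib import Path

import pytest

EXAMPLES_DIR = Path(__file__).parent
EXAMPLE_SCRIPTS = [  # assumed names, one per example directory
    "localsgd/train_localsgd.py",
    "diloco/train_diloco.py",
    "live_checkpoint_recovery/train_live_checkpoint_recovery.py",
    "ddp_proactive/train_ddp_proactive.py",
]


@pytest.mark.parametrize("script", EXAMPLE_SCRIPTS)
def test_example_runs(script: str) -> None:
    env = dict(os.environ, QUICK_RUN="1")
    proc = subprocess.run(
        [sys.executable, str(EXAMPLES_DIR / script)],
        env=env,
        timeout=600,
    )
    assert proc.returncode == 0
```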

README.md

Lines changed: 6 additions & 39 deletions
````diff
@@ -79,15 +79,14 @@ We have a minimal DDP train loop that highlights all of the key components in torchft.
 
 See [train_ddp.py](./train_ddp.py) for more info.
 
+### Advanced Examples
 
-### DiLoCo
-
-LocalSGD and DiLoCo are currently experimental.
-
-See
-[the diloco_train_loop/local_sgd_train_loop tests](./torchft/local_sgd_integ_test.py)
-for an example on how to integrate these algorithms into your training loop.
+See the [examples/README.md](./examples/README.md) for advanced examples. Currently, the following examples are available:
 
+- [DDP with proactive failure recovery](./examples/ddp_proactive/README.md): Demonstrates DDP with proactive failure recovery mode
+- [DiLoCo](./examples/diloco/README.md): Demonstrates Distributed Local Convergence training
+- [LocalSGD](./examples/localsgd/README.md): Demonstrates Local SGD with periodic synchronization
+- [Live Checkpoint Recovery](./examples/live_checkpoint_recovery/README.md): Demonstrates live checkpoint recovery
 
 ## Design
 
@@ -246,38 +245,6 @@ CUDA_VISIBLE_DEVICES=1 TORCHFT_LIGHTHOUSE=http://localhost:29510 torchrun --mast
 
 By observing the outputs from both shells, you should observe process group reconfiguration and live checkpoint recovery.
 
-### Proactive Failure Recovery Mode (Experimental)
-
-You can experiment with proactive failure recovery mode by:
-
-```sh
-export TORCHFT_PROACTIVE_RECOVERY=1
-```
-
-With this enabled, the manager will listen to the Lighthouse server for heartbeat failures of other replica groups and break from a hanging allreduce.
-
-You can test this out by running `train_ddp_proactive.py`
-
-On shell 1 (one replica group starts initial training):
-```sh
-export REPLICA_GROUP_ID=0
-export NUM_REPLICA_GROUPS=2
-export TORCHFT_PROACTIVE_RECOVERY=1
-
-CUDA_VISIBLE_DEVICES=0 TORCHFT_LIGHTHOUSE=http://localhost:29510 torchrun --master_port=29600 --nnodes=1 --nproc_per_node=1 -- train_ddp_proactive.py
-```
-
-On shell 2 (a second replica group joins):
-```sh
-export REPLICA_GROUP_ID=1
-export NUM_REPLICA_GROUPS=2
-export TORCHFT_PROACTIVE_RECOVERY=1
-
-CUDA_VISIBLE_DEVICES=1 TORCHFT_LIGHTHOUSE=http://localhost:29510 torchrun --master_port=29601 --nnodes=1 --nproc_per_node=1 -- train_ddp_proactive.py
-```
-
-You should observe that the process with replica group id 1 will exit early, and the process with replica group id 0 will quickly resume training. If the same script is run after setting `export TORCHFT_PROACTIVE_RECOVERY=0`, you should observe that the process with replica group id 1 will hang for dozens of seconds before continuing.
-
 ### Example Parameter Server
 
 torchft has a fault tolerant parameter server implementation built on its
````
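For context on what the moved proactive-recovery section exercises: `train_ddp_proactive.py` builds on the same `Manager`-wrapped DDP loop as the basic example, with proactive recovery toggled via the environment. The sketch below follows torchft's documented wrapper API but is not the example script itself; exact constructor arguments may differ.

```python
# Minimal sketch of a torchft-managed DDP loop (not train_ddp_proactive.py
# itself). Manager/DistributedDataParallel/Optimizer follow torchft's
# documented quick-start API; treat details as approximate.
import os

import torch
from torch import nn, optim

from torchft import DistributedDataParallel, Manager, Optimizer, ProcessGroupGloo

# Proactive recovery is enabled via the environment, per the section above.
os.environ.setdefault("TORCHFT_PROACTIVE_RECOVERY", "1")

model = nn.Linear(2, 3)
manager = Manager(
    pg=ProcessGroupGloo(),
    load_state_dict=model.load_state_dict,  # used for live checkpoint recovery
    state_dict=model.state_dict,
)
model = DistributedDataParallel(manager, model)
optimizer = Optimizer(manager, optim.AdamW(model.parameters()))

for step in range(1000):
    batch = torch.rand(2, 2)
    optimizer.zero_grad()  # also begins the quorum for this step
    loss = model(batch).sum()
    loss.backward()
    optimizer.step()  # only commits if the fault tolerant step succeeded
```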

examples/README.md

Lines changed: 37 additions & 0 deletions
New file, 37 lines:

````markdown
# TorchFT Examples

This directory contains advanced examples demonstrating various fault tolerance features and training approaches in TorchFT beyond the basic `train_ddp.py` example in the [README](../README.md).

Each directory contains a README with more detailed instructions, as well as extensive documentation on the feature being showcased and how to interpret the outputs.

## List of Examples

- [DDP with proactive failure recovery](./ddp_proactive/README.md): Demonstrates DDP with proactive failure recovery mode
- [DiLoCo](./diloco/README.md): Demonstrates Distributed Local Convergence training
- [LocalSGD](./localsgd/README.md): Demonstrates Local SGD with periodic synchronization
- [Live Checkpoint Recovery](./live_checkpoint_recovery/README.md): Demonstrates live checkpoint recovery

## Running the examples

First, start the lighthouse server:

```sh
RUST_BACKTRACE=1 torchft_lighthouse --min_replicas 1 --quorum_tick_ms 100 --join_timeout_ms 10000
```

Then `cd` into the example directory:

```sh
cd examples/[example_directory]
```

and launch the example with torchX:

```sh
export QUICK_RUN=1
torchx run
```

The `QUICK_RUN` environment variable runs the examples for far fewer steps and uses a synthetic rather than downloaded dataset; it is useful for testing the examples quickly.

See the `.torchxconfig` file in each example directory for configuration details, and [torchx.py](../torchft/torchx.py) and the [torchX documentation](https://pytorch.org/torchx/latest/) to understand how DDP is run.
````
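The example scripts themselves are not part of this excerpt; the snippet below is a hypothetical illustration of how a training script might branch on `QUICK_RUN` (the dataset helper name is made up):

```python
# Hypothetical QUICK_RUN handling; the real examples' flag handling may differ.
import os

import torch
from torch.utils.data import DataLoader, TensorDataset

QUICK_RUN = os.environ.get("QUICK_RUN") == "1"

if QUICK_RUN:
    # A tiny synthetic dataset and a handful of steps, for fast smoke tests.
    num_steps = 10
    dataset = TensorDataset(
        torch.randn(64, 3, 32, 32), torch.randint(0, 10, (64,))
    )
else:
    num_steps = 10_000
    dataset = download_real_dataset()  # placeholder for the example's real dataset

loader = DataLoader(dataset, batch_size=8)
```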

examples/ddp_proactive/.torchxconfig

Lines changed: 7 additions & 0 deletions
New file, 7 lines:

```ini
[cli:run]
component=../../torchft/torchx.py:hsdp
scheduler=local_cwd


[component:../../torchft/torchx.py:hsdp]
script=train_ddp_proactive.py
```
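The `hsdp` component referenced above lives in `torchft/torchx.py`, which is not shown in this commit excerpt. A torchX component is just a function that returns a `specs.AppDef`; the `[component:...]` section supplies default arguments to it (here, `script`). The sketch below is a hypothetical illustration of that shape, not the real component:

```python
# Hypothetical sketch of a torchX component like torchft/torchx.py:hsdp.
# The real component is not shown in this diff; names and defaults are
# assumptions for illustration.
import torchx.specs as specs


def hsdp(script: str = "train.py", replicas: int = 2) -> specs.AppDef:
    return specs.AppDef(
        name="torchft-example",
        roles=[
            specs.Role(
                name="trainer",
                image=".",  # local_cwd scheduler runs from the repo checkout
                entrypoint="torchrun",
                args=["--nnodes=1", "--nproc_per_node=1", script],
                num_replicas=replicas,
            )
        ],
    )
```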
