This repository provides a step-by-step guide and examples for implementing distributed training in PyTorch, ranging from single GPU setups to multi-node clusters. The goal is to demonstrate the progression from simple to complex distributed training scenarios.
- Single GPU Training
  - Basic PyTorch training on a single GPU
  - Demonstrates fundamental concepts and serves as a baseline
- Multi-GPU Training on a Single Machine
  - Utilizes PyTorch's DistributedDataParallel (DDP)
  - Shows how to parallelize training across multiple GPUs on one machine
- Multi-Node Training on a Slurm Cluster
  - Advanced setup for distributed training across multiple machines
  - Demonstrates integration with the Slurm workload manager

Across the scenarios you'll find:
- Detailed setup instructions for each scenario
- Example scripts with clear explanations
- Practical tips for efficient distributed training
- Demonstrations on both synthetic data and a realistic recommendation system use case
Start with the single GPU example and progress through the more complex setups as you become comfortable with the concepts.
- PyTorch (with CUDA support for GPU training)
- torchrun (for launching distributed jobs)
- Access to GPU resources (local or cloud-based)
- Slurm cluster access (for multi-node examples)
~ still in progress ~
I'll demo each scenario on synthetic data, and on a more realistic use case (a recommendation system).
OTHER TO DO:
- Add optimizer state to the save/load functions (see the sketch below)
- Demo sharding of data for cases where the dataset doesn't fit in memory
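For the first TODO, a checkpoint helper that also captures optimizer state might look roughly like the sketch below. The function names and file path are illustrative, not the repo's existing save/load functions:

```python
import torch

def save_checkpoint(model, optimizer, epoch, path="checkpoint.pt"):
    # Store both model and optimizer state so training can resume exactly
    torch.save({
        "epoch": epoch,
        "model_state": model.state_dict(),
        "optimizer_state": optimizer.state_dict(),
    }, path)

def load_checkpoint(model, optimizer, path="checkpoint.pt"):
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model_state"])
    optimizer.load_state_dict(ckpt["optimizer_state"])
    return ckpt["epoch"]  # resume from the next epoch
```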
- Single GPU
This is non-distributed training and is how most people will get started.

a) Use your local machine's GPU. Run `nvidia-smi` in a terminal to view its details, then simply run `python single_gpu.py`.

If you face issues, see the troubleshooting notes below. The first check is to uninstall torch and make sure you install the CUDA-enabled version.
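A quick way to confirm that your install is a CUDA-enabled build and can actually see the GPU:

```python
import torch

print(torch.__version__)                  # pip CUDA wheels usually show a +cuXXX suffix
print(torch.cuda.is_available())          # should print True
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # should match the GPU reported by nvidia-smi
```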
b) Run on a remote GPU, e.g. on RunPod. I'll come back to that in the Multi GPU section below...
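For reference, a single-GPU baseline loop typically has the shape sketched below. The model, data, and hyperparameters are placeholders, not the contents of the actual `single_gpu.py`:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Toy regression data and model, just to show the loop structure
dataset = TensorDataset(torch.randn(1024, 20), torch.randn(1024, 1))
loader = DataLoader(dataset, batch_size=32, shuffle=True)
model = nn.Linear(20, 1).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for epoch in range(5):
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()
```

This is the baseline that the multi-GPU version below parallelizes.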
- Multi GPU
Most will have to use GPU-as-a-service platforms. I use RunPod.
High-level steps:

- Sign up and add some funds to your account, e.g. a few pounds/dollars for now.
- Ensure you have the Remote SSH extension installed in VS Code.
- Next we need an SSH key. If you don't have one, these are the steps:
  - Create a key pair in a terminal or PowerShell window:
    ```
    ssh-keygen -t ed25519
    ```
  - Get your public key (you can use the following command if you accepted the defaults):
    - Windows: `type %USERPROFILE%\.ssh\id_ed25519.pub`
    - Mac/Linux: `cat ~/.ssh/id_ed25519.pub`
The next steps and following RunPod setup are shown well in this video
You could of course select a pod and provision only one GPU, which is a good option if your model fits and you just need more horsepower. However, for multi-GPU training, simply provision more GPUs for your chosen pod.
Once your pod is running, follow the instructions in the video above to connect through VSCode.
Once you have your remote session running, you need to `git clone` the repo. Next, open the folder the repo was cloned into, then open a terminal and run:
```
torchrun --standalone --nproc_per_node=4 multi_gpu_torchrun.py 50 10
```

`--nproc_per_node=4` refers to four GPUs; if I had two, I'd put `--nproc_per_node=2`. torchrun launches one process per GPU, so the script runs on all of the GPUs you specified.
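If you want to see what torchrun is driving under the hood, here is a minimal sketch of the DDP pattern it launches, one process per GPU. The model, data, and hyperparameters are placeholders rather than the contents of `multi_gpu_torchrun.py`:

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def main():
    # torchrun sets RANK, LOCAL_RANK and WORLD_SIZE for every process it spawns
    local_rank = int(os.environ["LOCAL_RANK"])
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(local_rank)

    # Toy data and model; replace with your own
    dataset = TensorDataset(torch.randn(2048, 20), torch.randn(2048, 1))
    sampler = DistributedSampler(dataset)        # gives each rank a distinct shard
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    model = nn.Linear(20, 1).to(local_rank)
    model = DDP(model, device_ids=[local_rank])  # gradients are all-reduced across GPUs
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()

    for epoch in range(5):
        sampler.set_epoch(epoch)                 # reshuffle differently each epoch
        for x, y in loader:
            x, y = x.to(local_rank), y.to(local_rank)
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```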
- Multi-node on a Slurm Cluster
--Work-in-Progress--
Create and activate a virtual environment (Windows shown):

```
python -m venv .venv
.venv\Scripts\activate
```

Press Ctrl+Shift+P, type "Python: Select Interpreter", and choose the interpreter in the .venv folder.
Sign up with this link and add funds to your account. Then create a new API key, copy it, and save it somewhere, e.g. a `.env` file (make sure to add it to your `.gitignore`).
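If you keep the key in a `.env` file, one way to read it is sketched below. This assumes the python-dotenv package is installed, and the variable name `RUNPOD_API_KEY` is just an example; match whatever you put in your `.env`:

```python
import os
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads key=value pairs from .env into the environment
api_key = os.getenv("RUNPOD_API_KEY")  # illustrative variable name
```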
Check your CUDA toolkit and driver:

```
nvcc --version
nvidia-smi
```

Check your Python version (I had to use 3.9 to ensure compatibility with CUDA):

```
python --version
```
Install a CUDA-enabled PyTorch build using the selector at https://pytorch.org/get-started/locally/
In PowerShell, unzip and rename the MovieLens dataset:

```
Expand-Archive -Path ml-latest-small.zip -DestinationPath .
Remove-Item ml-latest-small.zip
Rename-Item -Path ml-latest-small -NewName movielens_small
```
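As a quick sanity check on the dataset, assuming the standard MovieLens layout (a `ratings.csv` with userId, movieId, rating, and timestamp columns) and that pandas is installed:

```python
import pandas as pd

ratings = pd.read_csv("movielens_small/ratings.csv")
print(ratings.shape)                               # (number of ratings, 4)
print(ratings[["userId", "movieId", "rating"]].head())
```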