This repository implements a general process for building recent versions of `pytorch` (circa 2024) on Derecho from source. The purpose is to build a version of `pytorch` for use in distributed ML-training workflows that makes optimal use of the Cray-EX Slingshot 11 (SS11) interconnect. Distributed ML in general, and SS11 in particular, pose challenges that drive us to build from source rather than choose any of the available `pytorch` builds from e.g. `conda-forge`. Specifically:
- We want to enable a CUDA-Aware MPI backend using `cray-mpich`. (Currently, any level of MPI support in `pytorch` requires building from source.)
- We want to use an SS11-optimized NCCL. As of this writing, this requires compiling NCCL from source, along with the AWS OFI NCCL plugin, at specific versions and with specific runtime environment variable settings.
  - Note that when installing `pytorch` from `conda-forge`, a non-optimal NCCL will generally be installed. The application may appear functional, but performance will be much degraded for distributed training.
  - Therefore, the approach taken here is to install the desired NCCL + plugin and point `pytorch` to this version at build time, to minimize the likelihood of using a non-optimal version. (The snippet below shows one way to inspect an installed build for these capabilities.)
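As a quick sanity check, an installed `pytorch` can be queried for which distributed backends it was built with and which NCCL it was compiled against. The sketch below is illustrative only (the file name is hypothetical); run it inside the activated environment. A stock `conda-forge` build will typically report the MPI backend as unavailable.

```python
# check_build.py -- minimal sketch: inspect a pytorch build's distributed support.
# (Illustrative only; exact output depends on how pytorch was built.)
import torch
import torch.distributed as dist

print("pytorch version :", torch.__version__)
print("built with CUDA :", torch.version.cuda)
print("MPI backend     :", dist.is_mpi_available())   # requires a from-source build
print("NCCL backend    :", dist.is_nccl_available())
if dist.is_nccl_available():
    # NCCL version this pytorch was compiled against; confirm it matches the
    # SS11-optimized NCCL + plugin installed by this repository
    print("NCCL version    :", torch.cuda.nccl.version())
```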
- Clone this repo:
  ```
  git clone https://github.com/benkirk/derecho-pytorch-mpi.git
  cd derecho-pytorch-mpi
  ```
- On a Derecho login node:
  ```
  export PBS_ACCOUNT=<my_project_ID>

  # build default version of pytorch (currently v2.3.1):
  make build-pytorch-v2.3.1-pbs

  # build pytorch-v2.4.0, also supported:
  export PYTORCH_VERSION=v2.4.0
  make build-pytorch-v2.4.0-pbs
  ```
- Run a sample `pytorch.dist` + MPI backend test on 2 GPU nodes (a sketch of such a test follows this list):
  ```
  # (from a login node)
  # (1) request an interactive PBS session with 2 GPU nodes:
  qsub -I -l select=2:ncpus=64:mpiprocs=4:ngpus=4 -A ${PBS_ACCOUNT} -q main -l walltime=00:30:00

  # (inside PBS)
  # (2) activate the conda environment:
  module load conda
  conda activate ./env-pytorch-v2.4.0-derecho-gcc-12.2.0-cray-mpich-8.1.27

  # (3) run a minimal torch.dist program with the MPI backend:
  mpiexec -n 8 -ppn 4 --cpu-bind numa ./tests/all_reduce_test.py
  ```
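For reference, a minimal MPI-backend all-reduce test along these lines might look like the sketch below. This is illustrative only; the actual `tests/all_reduce_test.py` shipped in this repository may differ in detail. It assumes one GPU per MPI rank, as in the `mpiexec` launch above.

```python
#!/usr/bin/env python3
# Minimal torch.distributed all-reduce using the MPI backend (illustrative
# sketch; see tests/all_reduce_test.py in this repository for the real test).
import torch
import torch.distributed as dist

# rank and world size are provided by mpiexec / cray-mpich
dist.init_process_group(backend="mpi")
rank = dist.get_rank()
world_size = dist.get_world_size()

# bind each rank to one GPU (4 ranks per 4-GPU node in the example launch)
device = torch.device("cuda", rank % torch.cuda.device_count())
torch.cuda.set_device(device)

# each rank contributes its rank number; the sum over all ranks is known
t = torch.full((1,), float(rank), device=device)
dist.all_reduce(t, op=dist.ReduceOp.SUM)

expected = world_size * (world_size - 1) / 2.0
print(f"rank {rank}/{world_size}: all_reduce result = {t.item()} (expected {expected})")

dist.destroy_process_group()
```

Note that the tensor passed to `dist.all_reduce` lives on the GPU; exercising that path directly is what requires the CUDA-Aware MPI support enabled by this build (see the patch note at the end of this README).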
The process outlined above will create a minimal `conda` environment in the current directory containing the `pytorch` build dependencies and the installed version of `pytorch` itself. The package list is defined in `config_env.sh`; users may elect to add packages to the embedded `conda.yaml` file, or later through the typical `conda install` command from within the environment.
- `config_env.sh`: Must be sourced to properly build `pytorch`.
  - Sourcing this file will activate the appropriate `env-pytorch-${PYTORCH_VERSION}-[...]` `conda` environment from the same directory.
    - If the `conda` environment does not exist, it will be created. This in turn requires checking out the `pytorch` source tree, as it is needed to properly define the required `conda` build environment.
    - Therefore this script controls the packages added initially to the `conda` environment.
  - Defines environment variables required to build `pytorch`.
  - Creates an activation script `env-pytorch-${PYTORCH_VERSION}-[...]/etc/conda/activate.d/derecho-env_vars.sh` with preferred runtime settings.
  - After installation, the resulting `conda` environments can be activated directly, without the need for `config_env.sh`, and should be compatible with the default module environment on Derecho.
    - Re-sourcing this script is not a problem if desired, and will result in the same module environment used to build `pytorch`.
- `Makefile`: Contains convenient rules for automation and a reproducible process. Uses the environment variables `PYTORCH_VERSION` and `PBS_ACCOUNT`, with sensible defaults for each.
- `patches/${PYTORCH_VERSION}/*`: Any required version-specific patches are located in this directory tree and are applied in `*`-wildcard order.
- `utils/build_nccl-ofi-plugin.sh`: Builds a compatible NCCL + AWS OFI plugin for use on Derecho with Cray's `libfabric`. Must be updated periodically as the underlying `libfabric` version changes.
The `pytorch-v2` source only supports a CUDA-Aware MPI backend when running under OpenMPI. This is due to some overzealous configuration settings that probe for CUDA support using `MPIX_...` extensions not available with Cray-MPICH, and which are implemented inside `#ifdef OPEN_MPI ...` guards anyway. Where these tests occur, when they fail they fall back to assuming the MPI is not CUDA-Aware. Fortunately the fix is fairly straightforward: find all the places these checks occur and instead fall back to assuming MPI is CUDA-Aware. For example, see `patches/v2.3.1/01-cuda-aware-mpi`.