
Building PyTorch on NCAR's Derecho supercomputer with a CUDA-Aware Cray-MPICH backend

This repository implements a general process for building recent versions of PyTorch (circa 2024) on Derecho from source. The purpose is to produce a PyTorch build for distributed ML-training workflows that makes optimal use of the Cray-EX Slingshot 11 (SS11) interconnect.

Distributed ML in general, and SS11 in particular, pose challenges that drive us to build from source rather than use any of the prebuilt PyTorch packages from, e.g., conda-forge. Specifically:

  • We want to enable a CUDA-aware MPI backend using cray-mpich. (Currently, any level of MPI support in PyTorch requires building from source.)
  • We want to use an SS11-optimized NCCL. As of this writing, this requires compiling NCCL from source, along with using the AWS OFI NCCL Plugin at specific versions and with specific runtime environment variable settings.
    • Note that when installing PyTorch from conda-forge, a non-optimal NCCL will generally be installed. The application may appear functional, but distributed-training performance will be much degraded.
    • Therefore the approach taken here is to install the desired NCCL and plugin, and point PyTorch at this version at build time to minimize the likelihood of using a non-optimal one. (A quick runtime check is sketched after this list.)
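
One quick way to sanity-check which NCCL a given PyTorch install was built against (and whether the MPI backend made it in) is shown below. This is an illustrative check using standard PyTorch APIs, not something shipped by this repository; the version you should expect depends on what utils/build_nccl-ofi-plugin.sh built.

    # Illustrative check (not part of this repository): report the NCCL version
    # and distributed backends the installed PyTorch was compiled with.
    import torch
    import torch.distributed as dist

    print("PyTorch:", torch.__version__)
    print("CUDA available:", torch.cuda.is_available())
    print("MPI backend available:", dist.is_mpi_available())
    print("NCCL backend available:", dist.is_nccl_available())

    # torch.cuda.nccl.version() reports the NCCL version PyTorch was built with;
    # compare it against the SS11-optimized NCCL installed by this repository.
    print("NCCL version:", torch.cuda.nccl.version())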

User Installation

Quick start

  1. Clone this repo.
    git clone https://github.com/benkirk/derecho-pytorch-mpi.git
    cd derecho-pytorch-mpi
  2. On a Derecho login node:
    export PBS_ACCOUNT=<my_project_ID>
    
    # build default version of pytorch (currently v2.3.1):
    make build-pytorch-v2.3.1-pbs
    
    # build pytorch-v2.4.0, also supported:
    export PYTORCH_VERSION=v2.4.0
    make build-pytorch-v2.4.0-pbs
  3. Run a sample torch.distributed + MPI backend test on 2 GPU nodes (an illustrative sketch of such a program follows these steps):
    # (from a login node)
    # (1) request an interactive PBS session with 2 GPU nodes:
    qsub -I -l select=2:ncpus=64:mpiprocs=4:ngpus=4 -A ${PBS_ACCOUNT} -q main -l walltime=00:30:00
    
    # (inside PBS)
    # (2) activate the conda environment:
    module load conda
    conda activate ./env-pytorch-v2.4.0-derecho-gcc-12.2.0-cray-mpich-8.1.27
    
    # (3) run a minimal torch.dist program with the MPI backend:
    mpiexec -n 8 -ppn 4 --cpu-bind numa ./tests/all_reduce_test.py
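
For reference, the repository's actual test is tests/all_reduce_test.py. A minimal torch.distributed all_reduce program using the MPI backend looks roughly like the sketch below (illustrative only; the shipped script may differ in detail):

    #!/usr/bin/env python3
    # Illustrative sketch of a minimal torch.distributed all_reduce using the
    # MPI backend; ranks and world size come from the MPI launcher (mpiexec).
    import torch
    import torch.distributed as dist

    def main():
        dist.init_process_group(backend="mpi")
        rank = dist.get_rank()
        world_size = dist.get_world_size()

        # Each rank contributes (rank + 1); the sum over all ranks should be
        # world_size * (world_size + 1) / 2.
        t = torch.ones(1) * (rank + 1)
        dist.all_reduce(t, op=dist.ReduceOp.SUM)

        expected = world_size * (world_size + 1) / 2
        print(f"rank {rank}/{world_size}: result={t.item()} expected={expected}")

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()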

Customizing the resulting conda environment

The process outlined above will create a minimal conda environment in the current directory, containing the PyTorch build dependencies and the installed version of PyTorch itself. The package list is defined in config_env.sh; users may add packages to the embedded conda.yaml file, or install them later with the usual conda install command from within the environment.

Developer Details

Important files

  1. config_env.sh: Must be sourced to build PyTorch properly.
    • Sourcing this file activates the appropriate env-pytorch-${PYTORCH_VERSION}-[...] conda environment from the same directory.
    • If the conda environment does not exist, it will be created. This in turn requires checking out the PyTorch source tree, which is needed to define the conda build environment correctly.
    • Therefore this script controls the packages added initially to the conda environment.
    • Defines environment variables required to build pytorch.
    • Creates an activation script env-pytorch-${PYTORCH_VERSION}-[...]/etc/conda/activate.d/derecho-env_vars.sh with preferred runtime settings.
    • After installation, the resulting conda environments can be activated directly without the need for config_env.sh, and should be compatible with the default module environment on Derecho.
      • Re-sourcing this script is harmless, and will reproduce the same module environment used to build PyTorch.
  2. Makefile contains convenient rules for automation and a reproducible process. It uses the environment variables PYTORCH_VERSION and PBS_ACCOUNT, with sensible defaults for each.
  3. patches/${PYTORCH_VERSION}/*: Any required version-specific patches are located in this directory tree, and are applied in *-wildcard order.
  4. utils/build_nccl-ofi-plugin.sh: builds a compatible NCCL + AWS OFI plugin for use on Derecho with Cray's libfabric. It must be updated periodically as the underlying libfabric version changes.

PyTorch, CUDA-Awareness, and Cray-MPICH

The PyTorch v2 source only supports a CUDA-aware MPI backend when running under Open MPI. This is due to some overzealous configuration checks that probe for CUDA support using MPIX_... extensions, which are not available with Cray-MPICH and are implemented inside #ifdef OPEN_MPI guards anyway. Wherever these checks occur, a failure falls back to assuming the MPI is not CUDA-aware.

Fortunately the fix is fairly straightforward: find all the places these checks occur and instead fall back to assuming the MPI is CUDA-aware. For example, see patches/v2.3.1/01-cuda-aware-mpi.
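
With the patch applied, one hedged way to confirm end-to-end CUDA-aware behavior is to all_reduce a GPU-resident tensor through the MPI backend, so that Cray-MPICH is handed device pointers directly. The sketch below is illustrative only (not part of the repository) and assumes 4 ranks per node, as in the mpiexec example above:

    # Illustrative CUDA-aware MPI check (not part of this repository).
    import torch
    import torch.distributed as dist

    dist.init_process_group(backend="mpi")
    rank = dist.get_rank()

    # Bind each rank to a GPU; assumes 4 ranks per node (local rank = rank % 4).
    device = torch.device(f"cuda:{rank % 4}")
    torch.cuda.set_device(device)

    # The tensor stays on the GPU, so the MPI backend passes a device pointer
    # to Cray-MPICH rather than staging through host memory.
    x = torch.ones(8, device=device) * (rank + 1)
    dist.all_reduce(x, op=dist.ReduceOp.SUM)
    torch.cuda.synchronize()
    print(f"rank {rank}: sum on {device} = {x[0].item()}")

    dist.destroy_process_group()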
