Throughout this course, two guiding principles have been kept in mind:
- Complete Pipeline Approach. Getting to the point where you can simply run `import cupy as cp` is often far harder than writing the GPU code itself.
- Focus on practical GPU use. This is a course about using GPUs, not about the low-level details of programming GPUs.
By the end, you'll have everything in place to use GPU acceleration immediately. We'll walk you through installing the tools, configuring your environment, and running your first CUDA-powered code.
If you have not used an HPC platform before, then you may benefit from going through the material in "Helpful Auxiliary Software" on this page, as it will guide you through the process of connecting to an HPC platform, after which you can continue moving through these set-up instructions.
To engage with all of the content within this GPU Training course, you will need to clone the repo, which can be done with
cd /lustre/projects/Research_Project-RSATeam #This is the directory that the RSA Team will do the course in.
mkdir $USER # Create a directory for you within the project space.
cd $USER
git clone https://github.com/UniExeterRSE/GPU_Training.git
cd GPU_Training
If you prefer to work interactively, follow these steps:
Request an interactive session:
srun \
--partition=gpu \
-A Research_Project-RSATeam \
--time=12:00:00 \
--nodes=1 \
--ntasks=1 \
--gres=gpu:1 \
--cpus-per-task=4 \
--pty /bin/bash
Load required modules:
module load nvidia-cuda/12.1.1
module load Python/3.11.3
Install the Python requirements:
poetry install
Once your environment is ready, you can invoke any of the project’s entry points via Poetry. For example:
poetry run cuda_check
All of the key Slurm submission scripts live in the `exeter_isca_slurm_submission_scripts/` directory. You can submit a job with:
cd exeter_isca_slurm_submission_scripts
sbatch <script-name>.slurm
The following provides the steps that are required to install the necessary compilers and packages to engage with the material in this course.
Please keep in mind that nearly all of the commands used in this section will be covered in detail within the course itself. They are included here to make sure you have all of the necessary resources (e.g. a GPU and relevant compilers) to complete the whole course. **The intention is for you to run these commands and confirm the output based on the contents of this page, not to completely understand each step you are taking.** If you do get stuck and are unsure of how to proceed, please reach out to the authors, and we can help you debug.
If you are self-studying, then please read up to the section "Project: Conway's Game of Life - CPU vs GPU Implementation" to understand more about the commands that are being used. If you are taking the workshop, then these commands are here to make sure that you are able to run code on the designated platform to save time in the workshop and identify any permission errors when accessing the needed resources.
Within this course, Spack is used to manage system-level requirements, such as drivers. Installing system-level requirements normally needs privileged permissions, such as access to `sudo`. However, many of the platforms that have GPUs available are HPC platforms, where users typically lack such permissions, and `spack` allows us to install software that would otherwise require privileged access. There is also a range of other benefits to using `spack` that will be discussed in this course.
First, you will need to clone the `spack` repo in your user home directory at a recent stable version (the extra config and depth flags are suggested in Spack's README):
git clone -c feature.manyFiles=true --depth=2 -b v0.23.1 https://github.com/spack/spack.git
You will then need to activate `spack` with:
source spack/share/spack/setup-env.sh
You can check that `spack` has been successfully installed by running `spack --version`, which should return the version of spack that you have available.
You will need to create a `spack` environment. The following command creates one named "gpu_course":
spack env create gpu_course
which can then be activated with
spack env activate -p gpu_course
In this course, spack is being used to install system-level requirements, and so the required version of Python and the needed driver of CUDA are installed via spack with the following two commands.
spack add [email protected]
spack add cuda
This step will simply say that you intend to install these packages; at this time, `spack` is still waiting for more packages to be added to the environment specification. We can check what the current specification is (e.g. package list, dependencies, compilers to be used etc.) with `spack spec`.
Finally, we are able to install all of the packages into our `spack` environment with
spack install
On an HPC environment, we would want to put the above spack commands into a shell script and run this with the scheduler, such as `sbatch` for ISCA/Archer2. The `install` can take on the order of hours for the above specifications.
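As a rough illustration of wrapping the install in a scheduler job, the Python sketch below renders a Slurm batch script as text. The function name `render_spack_install_job`, the log file names, and the choice of `#SBATCH` directives are assumptions made for this example rather than part of the course material. It also assumes that Spack was cloned into your home directory and that the `spack add` steps above have already been run in the `gpu_course` environment.

```python
def render_spack_install_job(account: str, partition: str, time_limit: str) -> str:
    """Return the text of a Slurm batch script that runs `spack install`.

    Hypothetical helper: the directives below are illustrative and may need
    adjusting for your cluster.
    """
    lines = [
        "#!/bin/bash",
        f"#SBATCH --account={account}",
        f"#SBATCH --partition={partition}",
        f"#SBATCH --time={time_limit}",
        "#SBATCH --nodes=1",
        "#SBATCH --ntasks=1",
        "#SBATCH --output=spack_install.out.log",  # assumed log file name
        "#SBATCH --error=spack_install.err.log",   # assumed log file name
        "",
        "# Assumes Spack was cloned into the home directory.",
        "source $HOME/spack/share/spack/setup-env.sh",
        "spack env activate gpu_course",
        "spack install",
    ]
    return "\n".join(lines) + "\n"


# Example values taken from the interactive session request earlier on this page.
script = render_spack_install_job("Research_Project-RSATeam", "gpu", "12:00:00")
print(script)
```

The printed text could be saved to a file such as `spack_install.slurm` (a hypothetical name) and submitted with `sbatch`. The install does not itself need a GPU, so a CPU partition may be a better choice where one is available.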
The `.spack` directory is a hidden folder in your home directory that stores user-level configuration data, caches, and environment settings for Spack. It helps Spack remember things like what packages you have installed, which mirrors you have configured, and any custom settings you have applied. Sometimes, these configuration files or caches can become outdated or inconsistent, especially if you have been experimenting with different environments, modifying package recipes, or changing `spack` versions. When a "weird" or hard-to-troubleshoot error occurs, one way to rule out bad configuration or cache data is to remove the `.spack` directory. By doing so, you essentially give Spack a clean slate: it will recreate the directory and its necessary files the next time it runs, which often resolves mysterious issues stemming from old or corrupted data. If you try to get a clean slate by removing only the non-hidden `spack` directory, it will likely not be a clean slate, because the data from previous experiments will still be present in `.spack`.
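If you would rather not delete the `.spack` directory outright, one cautious alternative is to rename it so that it can be restored later. The sketch below is a minimal example of that idea. The function name `set_aside_spack_user_dir` and the `.spack.bak-<timestamp>` naming scheme are hypothetical choices for illustration and are not part of Spack.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def set_aside_spack_user_dir(home: Optional[Path] = None) -> Optional[Path]:
    """Rename <home>/.spack to a timestamped backup directory.

    Returns the backup path, or None if there was no .spack directory.
    Hypothetical helper for illustration only.
    """
    home = Path.home() if home is None else Path(home)
    spack_dir = home / ".spack"
    if not spack_dir.is_dir():
        return None  # nothing to set aside
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    backup = home / f".spack.bak-{stamp}"
    spack_dir.rename(backup)  # Spack recreates .spack the next time it runs
    return backup


# Usage (left commented out so that nothing is renamed by accident):
# print(set_aside_spack_user_dir())
```

Once Spack is working again, the backup directory can be removed. If the reset did not help, it can be renamed back to `.spack`.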
Within this course, Poetry is used to manage the user-level requirements.
The following command will install poetry:
curl -sSL https://install.python-poetry.org | python3 -
Poetry can be uninstalled with `curl -sSL https://install.python-poetry.org | python3 - --uninstall`.
`poetry install` needs to be run from within the training course repo. If you haven't already done so, clone this repo with `git clone https://github.com/UniExeterRSE/GPU_Training.git` and then navigate to its root with `cd GPU_Training`.
All of the user-level requirements can be installed via Poetry with the command:
poetry install
You can check that the installation has been successful by running `poetry run cuda_check`, which should return the number of CUDA devices that are currently available, such as `Number of CUDA devices: 1`. If you want to find out more information about the device that is connected, you can run a command such as `nvidia-smi` for an NVIDIA GPU.
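To give a feel for what such a check involves, here is a minimal sketch of a device count using CuPy. It is not the course's actual `cuda_check` implementation, which lives in the repo. The function name `count_cuda_devices` is hypothetical, and the sketch assumes that treating "CuPy missing" and "no usable device" as zero devices is acceptable.

```python
def count_cuda_devices() -> int:
    """Return the number of visible CUDA devices, or 0 if none can be queried.

    Hypothetical sketch; the real `cuda_check` entry point may differ.
    """
    try:
        import cupy as cp  # only importable once the environment is set up
    except ImportError:
        return 0
    try:
        return cp.cuda.runtime.getDeviceCount()
    except cp.cuda.runtime.CUDARuntimeError:
        # Raised, for example, when no driver or no device is present.
        return 0


if __name__ == "__main__":
    print(f"Number of CUDA devices: {count_cuda_devices()}")
```

A result of zero on a GPU node usually points to a set-up problem (missing modules, or a job that did not request a GPU) rather than a problem with the code.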
If you are working on an HPC cluster via Slurm, submit the `cuda_check.slurm` script instead of running the commands directly. The script runs the same commands as above (e.g. `poetry run cuda_check` and `nvidia-smi`) and stores the output and errors in the files `out.log` and `err.log` respectively. It can be submitted with the command `sbatch slurm_submission_scripts/cuda_check.slurm`.
For the RSA Team Day the data files are available on the shared ISCA file-system.
To download the dataset, follow these steps:
- Create a Copernicus Marine Account:
  - You will need an account to access the data. Register for an account with the Copernicus Marine Data Service.
- Run the CLI Command to Download the Dataset:
  - Use the following command to download the subset of data:
    poetry run download_data
  - This command will prompt you to enter your username and password. Once authenticated, the data file will download to the data directory. Please note that the download may take some time as the file size is approximately 250 MB.
The dataset used during the course is based on 3-dimensional ocean temperatures. The dataset is described in detail on the Copernicus Marine Data Service.
Filename: cmems_mod_glo_phy-thetao_anfc_0.083deg_PT6H-i_1730799065517.nc
Description: This dataset was downloaded from the Global Ocean Physics Analysis and Forecast service. It provides data for global ocean physics, focusing on seawater potential temperature.
- Product Identifier: `GLOBAL_ANALYSISFORECAST_PHY_001_024`
- Product Name: Global Ocean Physics Analysis and Forecast
- Dataset Identifier: `cmems_mod_glo_phy-thetao_anfc_0.083deg_PT6H-i`
Variable Visualized:
- Sea Water Potential Temperature (thetao): Measured in degrees Celsius [°C].
Geographical Area of Interest:
- Region: Around the United Kingdom
- Coordinates:
- Northern Latitude: 65.312
- Eastern Longitude: 6.1860
- Southern Latitude: 46.829
- Western Longitude: -13.90
Depth Range:
- Minimum Depth: 0.49 meters
- Maximum Depth: 5727.9 meters
File Size:
- 267.5 MB
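The geographical bounds listed above can be captured in a small helper, which is handy for checking whether a location of interest falls inside the downloaded subset. The sketch below uses only the Python standard library. The `BoundingBox` class and the `UK_REGION` name are hypothetical and introduced purely for illustration, and the test coordinates are approximate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    """Latitude/longitude box in decimal degrees (hypothetical helper)."""

    south: float
    north: float
    west: float
    east: float

    def contains(self, lat: float, lon: float) -> bool:
        """Return True if the point lies inside the box (edges included)."""
        return self.south <= lat <= self.north and self.west <= lon <= self.east


# Bounds of the dataset subset, as listed in the description above.
UK_REGION = BoundingBox(south=46.829, north=65.312, west=-13.90, east=6.1860)

print(UK_REGION.contains(50.72, -3.53))  # roughly Exeter: True
print(UK_REGION.contains(40.0, -3.7))    # south of the region: False
```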
This section details a number of useful pieces of software that make the development of GPU code easier. Notably, a lot of these sit within Visual Studio Code, chosen as these are what the author was exposed to when first starting in GPU development.
Visual Studio Code (VSCode) can be installed from its website.
This guide walks you through setting up and using Remote-SSH in Visual Studio Code (VSCode) to connect to a remote machine.
Install the Remote - SSH extension from its extension page or via the following steps:
- Open VSCode.
- Go to the Extensions view by clicking on the square icon in the sidebar or pressing `Ctrl+Shift+X` (Windows/Linux) or `Cmd+Shift+X` (Mac).
- Search for "Remote - SSH" and install the extension from Microsoft.
Ensure you can SSH into the remote machine from your terminal. If SSH is not already configured:
- Generate SSH Keys (if not already done):
  - Open a terminal on your local machine.
  - Run the command `ssh-keygen` and follow the prompts to generate a key pair. This will create keys in `~/.ssh/` by default.
- Copy Your Public Key to the Remote Machine:
  - Run the command `ssh-copy-id user@hostname`, replacing `user` and `hostname` with your remote machine's username and IP address or hostname.
  - Enter your password when prompted. This step ensures you can log in without repeatedly typing your password.
- Open VSCode.
- Press `Ctrl+Shift+P` (Windows/Linux) or `Cmd+Shift+P` (Mac) to open the command palette.
- Type and select Remote-SSH: Open Configuration File.
- Choose the SSH configuration file (usually located at `~/.ssh/config`).
- Add a new SSH configuration to the file, specifying the remote machine's details. Comments are placed on their own lines because older OpenSSH versions do not accept a comment at the end of an option line. Here's an example configuration:
    Host my-remote-machine
        HostName <remote-ip-or-hostname>
        User <your-username>
        # Path to your SSH private key (adjust to match the key you generated)
        IdentityFile ~/.ssh/id_rsa
        # Default SSH port; change if needed
        Port 22
You should now be able to connect to the remote machine from within VSCode by pressing `Ctrl+Shift+P` (Windows/Linux) or `Cmd+Shift+P` (Mac) and then selecting Remote-SSH: Connect to Host..., which should present a list containing the name you gave the machine in the config file, in the above case `my-remote-machine`. You will then be asked for a passphrase if you protected your SSH key. Once connected, a new VSCode window will open, and you should have a fully functioning IDE on the remote machine.
As this course produces 3D outputs, some supporting code will generate interactive HTML dashboards to make exploring the output data easier. The VSCode Live Server extension makes the process of viewing these dashboards with your local web browser easier.
Install the Live Server extension from its extension page or via the following steps:
- Open VSCode.
- Go to the Extensions view by clicking on the square icon in the sidebar or pressing `Ctrl+Shift+X` (Windows/Linux) or `Cmd+Shift+X` (Mac).
- Search for "Live Server" and install the extension by Ritwick Dey.
- Right-click on the HTML file in the editor and select Open with Live Server.
- As you edit and save your HTML, CSS, or JavaScript files, the browser will automatically refresh to display your changes.
- This eliminates the need to refresh the browser manually, speeding up development.