
@ma595 (Collaborator) commented Jul 23, 2025

This adds the derecho-gpu machine to the ccs_config/machines external, which is a tagged release of https://github.com/ESMCI/ccs_config_cesm corresponding to https://github.com/ESMCI/ccs_config_cesm/releases/tag/ccs_config_cesm0.0.82.

This modifies:

  • config_batch.xml
  • config_machines.xml
  • Adds config_compilers.xml, with hardcoded paths for NETCDF_PATH and PNETCDF_PATH.

The trick was to put the derecho-gpu machine configuration above the derecho machine entry. This is because of the NODENAME_REGEX in the derecho entry, which auto-detects the machine being run on; commenting out that line might also suffice.
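
For illustration, the resulting ordering in config_machines.xml is roughly as follows (a sketch only: entry contents are abbreviated and the regex value is not reproduced here):

<machine MACH="derecho-gpu">
  <DESC>Derecho GPU nodes, selected explicitly with --machine derecho-gpu</DESC>
  ...
</machine>
<machine MACH="derecho">
  <DESC>Derecho CPU nodes</DESC>
  <NODENAME_REGEX>...</NODENAME_REGEX>  <!-- matches derecho hostnames, so this entry wins auto-detection if listed first -->
  ...
</machine>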

There is clearly some mismatch between the GPU configuration *.xml files provided by Will Chapman and the CIME build system we are using, and more investigation is required. I copied only the relevant content from those files, and the derecho GPU configuration seems to build, with a few caveats:

Even after loading the following module set:

module --force purge
module load cesmdev/1.0 ncarenv/23.06 craype/2.7.20 intel/2023.0.0 mkl/2023.0.0 ncarcompilers/1.0.0
module load cmake/3.26.3 cray-mpich/8.1.25 hdf5-mpi/1.12.2 netcdf-mpi/4.9.2 parallel-netcdf/1.12.3
module load parallelio/2.6.2 esmf/8.6.0b04 cuda

NETCDF could not be found. This is because, in the case (jobs) directory:

cmake_macros/CNL.cmake requires NETCDF_DIR to be set.

This is fixed by:

export NETCDF_DIR=/glade/u/apps/derecho/23.06/spack/opt/spack/netcdf/4.9.2/cray-mpich/8.1.25/oneapi/2023.0.0/wzol
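
A quick way to confirm the connection from the case directory (assuming, as the error suggests, that the macro builds the NetCDF path from the NETCDF_DIR environment variable):

grep -n NETCDF_DIR cmake_macros/CNL.cmake   # should show where $ENV{NETCDF_DIR} feeds into the NetCDF path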

I also hardcoded some paths for debugging purposes.

Before merging:

  • Fix hardcoded paths
  • Understand more about build system
  • Check whether config_compilers.xml is actually used; this might be the way to fix the issue below.
  • Test execution (not currently possible because the project's GPU allocation is exhausted).

Test as follows:

source ./env.sh # with above environment modules loaded
cd ~/CAM/cime/scripts
./create_newcase --case ~/jobs/CAM_column_GPU-1 --compset FMTHIST --res ne30pg3_ne30pg3_mg17 --project USTN0009 --machine derecho-gpu
cd ~/jobs/CAM_column_GPU-1
export NETCDF_DIR=/glade/u/apps/derecho/23.06/spack/opt/spack/netcdf/4.9.2/cray-mpich/8.1.25/oneapi/2023.0.0/wzol 
./case.setup
./case.build
grep -ri "config_compilers.xml" .

./env_case.xml:    <entry id="COMPILERS_SPEC_FILE" value="$SRCROOT/ccs_config/machines/config_compilers.xml">
./LockedFiles/env_case.xml:    <entry id="COMPILERS_SPEC_FILE" value="$SRCROOT/ccs_config/machines/config_compilers.xml">

@ma595 self-assigned this Jul 23, 2025
@ma595 (Collaborator, Author) commented Jul 29, 2025

Quick update.

The new derecho-gpu machine works with case.submit. During testing I ran the following in the case (jobs) directory:

./xmlchange DEBUG=False # False is the default anyway
./xmlchange JOB_QUEUE=develop
./xmlchange JOB_WALLCLOCK_TIME=00:30:00

The fix involved removing the compilers file and adding the cmake/derecho-gpu.cmake file. config_batch.xml has also been given some sensible defaults (CAM was initially failing silently with OOM errors: if the memory request is unspecified, its default is too low). In theory, the number of GPUs can also be set with ./xmlchange NGPUS_PER_NODE=4, but this is not yet tested. I initially modified gw_nlgw.F90 to use 4 GPUs (a hardcoded value), and this ran for 30 minutes.
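
For illustration, the kind of defaults involved look something like this; the exact select line, memory value, and GPU count below are assumptions rather than the committed values:

<directives>
  <directive> -l select={{ num_nodes }}:ncpus={{ max_tasks_per_node }}:mpiprocs={{ tasks_per_node }}:mem=480GB:ngpus=4</directive>  <!-- explicit mem/ngpus so PBS does not fall back to a tiny default -->
  <directive> -l place=scatter:excl</directive>
</directives>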

  • Test ./xmlchange NGPUS_PER_NODE=4

I suspect that this xmlchange doesn't make sense.

@ma595 (Collaborator, Author) commented Aug 18, 2025

Returning to this after a break:

Some useful tips for the future:

During development I tested with the following:

qinteractive --nchunks 1 --ncpus 64 --ngpus 2 --mem 235G @derecho -A USTN0009 -l walltime=01:00:00

  • Checked how to run resubmit jobs; I believe Jack ran 6 monthly jobs?
  • The main queue should give access to GPUs.
  • It also turns out I was previously running across 64 nodes with only 1 process per node!

Trying again:
Hardcoded -ngpus=4 (as Will has done previously). I tried to make NGPUS_PER_NODE a configurable option, but this led to issues: setting it in config_machines.xml leads to an error.

Further observations

  • Negative task counts refer to nodes: NTASKS=-4 means 4 whole nodes.
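
For reference, this is standard CIME behavior rather than something specific to this PR:

./xmlchange NTASKS=-4   # a negative value requests 4 whole nodes rather than 4 MPI tasks
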
./create_newcase --case ~/jobs/CAM_column_GPU-12 --compset FMTHIST --res ne30pg3_ne30pg3_mg17 --project USTN0009 --machine derecho
./xmlchange NTASKS=64
./xmlchange JOB_WALLCLOCK_TIME=00:45:00
./xmlchange JOB_QUEUE=main
./case.setup
./case.build # about 5 mins

Approach

  • Run an interactive job on 64 processors on a GPU node with single-column GW ML emulation enabled.
  • Run a single-node job (64 processors) using case.submit.
  • Run a 4-node job for 1 hour to get timings.
  • Restart for x months (4-hour jobs?); see the resubmit sketch after this list.
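
A possible resubmit setup using standard CIME variables (the month count and wallclock below are assumptions, not values taken from a working case):

./xmlchange STOP_OPTION=nmonths,STOP_N=1   # one simulated month per job segment
./xmlchange RESUBMIT=5                     # resubmit 5 more times, 6 months in total
./xmlchange JOB_WALLCLOCK_TIME=04:00:00
./case.submit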

Timings

Single node run

65 timesteps in 40 mins using 64 processors on a single gpudev node.

Debug

Attempting approach 2 (the single-node job via case.submit): we are not actually running on a GPU, and get a "cannot find id 0 when count is 0" error. Attempting to change NGPUS_PER_NODE led to no change. We also get a division by zero because MAX_CPUTASKS_PER_GPU_NODE is set to 0 (error at line 279 of /glade/u/home/matta/CAM/cime/CIME/case/case.py). The underlying issue is that NGPUS_PER_NODE is not set, and therefore MAX_CPUTASKS_PER_GPU_NODE is unset too. I forced it manually by setting MAX_CPUTASKS_PER_GPU_NODE in env_mach_pes.xml, after which NGPUS_PER_NODE did get set.
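
The same manual workaround expressed as xmlchange calls rather than editing env_mach_pes.xml by hand (the value 64 is an assumption based on one full Derecho socket):

./xmlchange MAX_CPUTASKS_PER_GPU_NODE=64
./xmlquery MAX_CPUTASKS_PER_GPU_NODE,NGPUS_PER_NODE   # check that NGPUS_PER_NODE now gets set as well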

Debug 2

  • Inspecting the job confirms that we have 4 GPUs per node:
qstat -x -f 2449906 | sed -n '/exec_vnode/,/Resource_List/p' | tr '+' '\n' | sed 's/[()]//g'
  • Things tried: -l place=scatter:excl, MPS, and passing the batch flags explicitly (these could also be set in config_batch.xml):
./xmlchange --subgroup case.run BATCH_COMMAND_FLAGS="-q main -l walltime=01:00:00 -A USTN0009 -l job_priority=regular -l place=scatter:excl"
  • In the end, unsetting CUDA_VISIBLE_DEVICES seemed to do the trick.
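
One way to apply that last point in an interactive session (a sketch; the notes above do not say exactly where the variable was unset):

unset CUDA_VISIBLE_DEVICES                # stop restricting which GPUs the runtime can see
echo ${CUDA_VISIBLE_DEVICES:-"(unset)"}   # confirm it is no longer set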

@ma595 (Collaborator, Author) commented Aug 19, 2025

From https://ncar-hpc-docs.readthedocs.io/en/latest/compute-systems/derecho/#job-scripts

Single-socket nodes, 64 cores per socket
512 GB DDR4 memory per node
4 NVIDIA 1.41 GHz A100 Tensor Core GPUs per node
600 GB/s NVIDIA NVLink GPU interconnect
