
@ma595 (Collaborator) commented Jul 23, 2025

This adds the derecho-gpu machine to the ccs_config/machines external, which is a tagged release of https://github.com/ESMCI/ccs_config_cesm corresponding to https://github.com/ESMCI/ccs_config_cesm/releases/tag/ccs_config_cesm0.0.82.

This modifies:

  • config_batch.xml
  • config_machines.xml
  • Adds config_compilers.xml, with hardcoded paths for NETCDF_PATH and PNETCDF_PATH.

The trick was to put the derecho-gpu machine configuration above the derecho machine entry. This is because of the NODENAME_REGEX in the derecho entry, which auto-detects the machine being run on; commenting out that line might also suffice.
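
For illustration, the resulting ordering in config_machines.xml is roughly as follows (a sketch only: entry contents are abbreviated and the regex value is not reproduced here):

<machine MACH="derecho-gpu">
  <DESC>Derecho GPU nodes, selected explicitly with --machine derecho-gpu</DESC>
  ...
</machine>
<machine MACH="derecho">
  <DESC>Derecho CPU nodes</DESC>
  <NODENAME_REGEX>...</NODENAME_REGEX>  <!-- matches derecho hostnames, so this entry wins auto-detection if listed first -->
  ...
</machine>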

There is clearly some mismatch between the GPU configuration *.xml files provided by Will Chapman and the CIME build system we are using, and more investigation is required. I copied only the relevant content from those files, and the derecho GPU configuration seems to build, with a few caveats:

Even after loading the following module set:

module --force purge
module load cesmdev/1.0 ncarenv/23.06 craype/2.7.20 intel/2023.0.0 mkl/2023.0.0 ncarcompilers/1.0.0
module load cmake/3.26.3 cray-mpich/8.1.25 hdf5-mpi/1.12.2 netcdf-mpi/4.9.2 parallel-netcdf/1.12.3
module load parallelio/2.6.2 esmf/8.6.0b04 cuda

NETCDF could not be found. This is because, in the case (jobs) directory:

cmake_macros/CNL.cmake requires NETCDF_DIR to be set.

This is fixed by:

export NETCDF_DIR=/glade/u/apps/derecho/23.06/spack/opt/spack/netcdf/4.9.2/cray-mpich/8.1.25/oneapi/2023.0.0/wzol
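
A quick way to confirm the connection from the case directory (assuming, as the error suggests, that the macro builds the NetCDF path from the NETCDF_DIR environment variable):

grep -n NETCDF_DIR cmake_macros/CNL.cmake   # should show where $ENV{NETCDF_DIR} feeds into the NetCDF path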

I also hardcoded some paths for debugging purposes.

Before merging:

  • Fix hardcoded paths
  • Understand more about build system
  • Check whether config_compilers.xml is actually used; this might be the way to fix the issue below.
  • Test execution (not currently possible because the project's GPU allocation is exhausted).

Test as follows:

source ./env.sh # with above environment modules loaded
cd ~/CAM/cime/scripts
./create_newcase --case ~/jobs/CAM_column_GPU-1 --compset FMTHIST --res ne30pg3_ne30pg3_mg17 --project USTN0009 --machine derecho-gpu
cd ~/jobs/CAM_column_GPU-1
export NETCDF_DIR=/glade/u/apps/derecho/23.06/spack/opt/spack/netcdf/4.9.2/cray-mpich/8.1.25/oneapi/2023.0.0/wzol 
./case.setup
./case.build
grep -ri "config_compilers.xml" .

./env_case.xml:    <entry id="COMPILERS_SPEC_FILE" value="$SRCROOT/ccs_config/machines/config_compilers.xml">
./LockedFiles/env_case.xml:    <entry id="COMPILERS_SPEC_FILE" value="$SRCROOT/ccs_config/machines/config_compilers.xml">

@ma595 self-assigned this Jul 23, 2025
@ma595 (Collaborator, Author) commented Jul 29, 2025

Quick update.

The new derecho-gpu machine works with case.submit. During testing I ran the following in the case (jobs) directory:

./xmlchange DEBUG=False # False is the default anyway
./xmlchange JOB_QUEUE=develop
./xmlchange JOB_WALLCLOCK_TIME=00:30:00

The fix involved removing the compilers file and adding the cmake/derecho-gpu.cmake file. config_batch.xml has also been given some sensible defaults (CAM was initially failing silently with OOM errors: if the memory request is unspecified, its default is too low). In theory, the number of GPUs can also be set with ./xmlchange NGPUS_PER_NODE=4, but this is not yet tested. I initially modified gw_nlgw.F90 to use 4 GPUs (a hardcoded value), and this ran for 30 minutes.
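
For illustration, the kind of defaults involved look something like this; the exact select line, memory value, and GPU count below are assumptions rather than the committed values:

<directives>
  <directive> -l select={{ num_nodes }}:ncpus={{ max_tasks_per_node }}:mpiprocs={{ tasks_per_node }}:mem=480GB:ngpus=4</directive>  <!-- explicit mem/ngpus so PBS does not fall back to a tiny default -->
  <directive> -l place=scatter:excl</directive>
</directives>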

  • Test ./xmlchange NGPUS_PER_NODE=4

I suspect that this xmlchange doesn't make sense.

@ma595 (Collaborator, Author) commented Aug 18, 2025

Returning to this after a break:

Some useful tips for the future:

During development I tested with the following:

qinteractive --nchunks 1 --ncpus 64 --ngpus 2 --mem 235G @derecho -A USTN0009 -l walltime=01:00:00

  • Checked how to run resubmit jobs; I believe Jack ran 6 monthly jobs?
  • The main queue should give access to GPUs.
  • It also turns out I was previously running across 64 nodes with only 1 process per node!

Trying again:
Hardcoded -ngpus=4 (as Will has done previously). I tried to make NGPUS_PER_NODE a configurable option, but this led to issues: setting it in config_machines.xml leads to an error.

Further observations

  • Negative task counts refer to nodes: NTASKS=-4 means 4 whole nodes.
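
For reference, this is standard CIME behavior rather than something specific to this PR:

./xmlchange NTASKS=-4   # a negative value requests 4 whole nodes rather than 4 MPI tasks
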
./create_newcase --case ~/jobs/CAM_column_GPU-12 --compset FMTHIST --res ne30pg3_ne30pg3_mg17 --project USTN0009 --machine derecho
./xmlchange NTASKS=64
./xmlchange JOB_WALLCLOCK_TIME=00:45:00
./xmlchange JOB_QUEUE=main
./case.setup
./case.build # about 5 mins

Approach

  • Run an interactive job on 64 processors on a GPU node with single-column GW ML emulation enabled.
  • Run a single-node job (64 processors) using case.submit.
  • Run a 4-node job for 1 hour to get timings.
  • Restart for x months (4-hour jobs?); see the resubmit sketch after this list.
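
A possible resubmit setup using standard CIME variables (the month count and wallclock below are assumptions, not values taken from a working case):

./xmlchange STOP_OPTION=nmonths,STOP_N=1   # one simulated month per job segment
./xmlchange RESUBMIT=5                     # resubmit 5 more times, 6 months in total
./xmlchange JOB_WALLCLOCK_TIME=04:00:00
./case.submit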

Timings

Single node run

65 timesteps in 40 mins using 64 processors on a single gpudev node.

Debug

Attempting approach 2 (the single-node job via case.submit): we are not actually running on a GPU, and get a "cannot find id 0 when count is 0" error. Attempting to change NGPUS_PER_NODE led to no change. We also get a division by zero because MAX_CPUTASKS_PER_GPU_NODE is set to 0 (error at line 279 of /glade/u/home/matta/CAM/cime/CIME/case/case.py). The underlying issue is that NGPUS_PER_NODE is not set, and therefore MAX_CPUTASKS_PER_GPU_NODE is unset too. I forced it manually by setting MAX_CPUTASKS_PER_GPU_NODE in env_mach_pes.xml, after which NGPUS_PER_NODE did get set.
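
The same manual workaround expressed as xmlchange calls rather than editing env_mach_pes.xml by hand (the value 64 is an assumption based on one full Derecho socket):

./xmlchange MAX_CPUTASKS_PER_GPU_NODE=64
./xmlquery MAX_CPUTASKS_PER_GPU_NODE,NGPUS_PER_NODE   # check that NGPUS_PER_NODE now gets set as well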

Debug 2

  • Inspecting the job confirms that we have 4 GPUs per node:
qstat -x -f 2449906 | sed -n '/exec_vnode/,/Resource_List/p' | tr '+' '\n' | sed 's/[()]//g'
  • Things tried: -l place=scatter:excl, MPS, and passing the batch flags explicitly (these could also be set in config_batch.xml):
./xmlchange --subgroup case.run BATCH_COMMAND_FLAGS="-q main -l walltime=01:00:00 -A USTN0009 -l job_priority=regular -l place=scatter:excl"
  • In the end, unsetting CUDA_VISIBLE_DEVICES seemed to do the trick.
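
One way to apply that last point in an interactive session (a sketch; the notes above do not say exactly where the variable was unset):

unset CUDA_VISIBLE_DEVICES                # stop restricting which GPUs the runtime can see
echo ${CUDA_VISIBLE_DEVICES:-"(unset)"}   # confirm it is no longer set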

@ma595 (Collaborator, Author) commented Aug 19, 2025

From https://ncar-hpc-docs.readthedocs.io/en/latest/compute-systems/derecho/#job-scripts

Single-socket nodes, 64 cores per socket
512 GB DDR4 memory per node
4 NVIDIA 1.41 GHz A100 Tensor Core GPUs per node
600 GB/s NVIDIA NVLink GPU interconnect
