Add derecho-gpu machine config #1
base: current-0.082
Quick update. New
The fix involved removing the compilers file and adding the
I suspect that this xmlchange doesn't make sense.
Returning to this after a break:
Some useful tips for future:
During development I tested with the following:
Checked how to run resubmit jobs - I believe Jack ran 6 monthly jobs?
Trying again:
Further observations
Approach
Timings
Single node run
65 timesteps in 40 mins using 64 processors on a single gpudev node.
Debug
Attempting 2. we are not running on a GPU - cannot find id 0 when count is 0 error. Attempting to change
Debug 2
qstat -x -f 2449906 | sed -n '/exec_vnode/,/Resource_List/p' | tr '+' '\n' | sed 's/[()]//g' |
From https://ncar-hpc-docs.readthedocs.io/en/latest/compute-systems/derecho/#job-scripts
This adds the derecho-GPU machine to the ccs-config/machines external, which is a tagged release of https://github.com/ESMCI/ccs_config_cesm corresponding to https://github.com/ESMCI/ccs_config_cesm/releases/tag/ccs_config_cesm0.0.82

This modifies:
config_batch.xml
config_machine.xml
config_compilers.xml
Hardcoded paths to NETCDF_PATH and PNETCDF_PATH.

The trick was to put the derecho-gpu machine configuration above the derecho machine. This was because of the NODENAME_REGEX in derecho, which autodetects the machine that is being run on. I suspect that commenting out this line might suffice.

It's quite clear that there's a bit of a mismatch between the GPU configuration *.xml files provided (from Will Chapman) and the CIME build system we're using. More investigation is required. I copied only the relevant content from the files, and the configuration for derecho's GPUs seems to build, with a few caveats:

Even after loading the following module set:
NETCDF could not be found. This is because, in the jobs dir, cmake_macros/CNL.cmake requires NETCDF_DIR.

This is fixed by:
export NETCDF_DIR=/glade/u/apps/derecho/23.06/spack/opt/spack/netcdf/4.9.2/cray-mpich/8.1.25/oneapi/2023.0.0/wzol
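A minimal sketch of a guard around that export, in case it helps catch the problem before the build gets as far as cmake_macros/CNL.cmake. The helper name require_netcdf_dir is hypothetical and not part of CIME; it only checks that NETCDF_DIR is set and points at an existing directory.

```shell
#!/bin/sh
# Hypothetical helper (not part of CIME): fail early if NETCDF_DIR is unusable.
require_netcdf_dir() {
    # Unset or empty variable: the export was never run in this shell.
    if [ -z "${NETCDF_DIR:-}" ]; then
        echo "NETCDF_DIR is not set" >&2
        return 1
    fi
    # Set, but the path is not a directory on this node.
    if [ ! -d "${NETCDF_DIR}" ]; then
        echo "NETCDF_DIR does not exist: ${NETCDF_DIR}" >&2
        return 1
    fi
    echo "NETCDF_DIR OK: ${NETCDF_DIR}"
}
```

Calling require_netcdf_dir after the export and before building should make a wrong or stale path show up as a one-line error rather than a CMake failure.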
I also hardcoded some paths for debugging purposes.
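The machine-ordering issue with NODENAME_REGEX mentioned earlier can be illustrated with a small sketch. It assumes, as the behaviour seen in this PR suggests, that the first machine entry whose regex matches the hostname is the one selected. The function name, the regex patterns and the hostname are all made up for illustration and are not copied from config_machine.xml.

```shell
#!/bin/sh
# Hypothetical first-match lookup: each entry is "machine=regex", tried in order.
detect_machine() {
    dm_host="$1"; shift
    for dm_entry in "$@"; do
        dm_name="${dm_entry%%=*}"    # text before the first '='
        dm_regex="${dm_entry#*=}"    # text after the first '='
        if printf '%s\n' "$dm_host" | grep -Eq -- "$dm_regex"; then
            echo "$dm_name"
            return 0
        fi
    done
    return 1                         # no entry matched
}

# Illustrative patterns only (assumed, not the real NODENAME_REGEX values).
host="deg0001.example"
detect_machine "$host" 'derecho=^de.*' 'derecho-gpu=^deg.*'   # prints: derecho
detect_machine "$host" 'derecho-gpu=^deg.*' 'derecho=^de.*'   # prints: derecho-gpu
```

With a broad pattern on derecho, the more specific derecho-gpu entry is only reached when it is listed first, which would be consistent with what I saw.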
Before merging:
config_compilers.xml is used? This might be the way to fix the below.

Test as follows: