Add derecho-gpu machine config #1
base: current-0.082
| 
           Quick update. New  The fix involved removing the compilers file and adding the  
 I suspect that this xmlchange doesn't make sense.  | 
    
| 
           Returning to this after a break.

           Some useful tips for future:

           During development I tested with the following:

           Checked how to run resubmit jobs - I believe Jack ran 6 monthly jobs?

           Trying again:

           Further observations

           Approach

           Timings

           Single node run: 65 timesteps in 40 mins using 64 processors on a single gpudev node.

           Debug

           Attempting 2. We are not running on a GPU - "cannot find id 0 when count is 0" error. Attempting to change

           Debug 2

           qstat -x -f 2449906 | sed -n '/exec_vnode/,/Resource_List/p' | tr '+' '\n' | sed 's/[()]//g'
  | 
    
| 
           From https://ncar-hpc-docs.readthedocs.io/en/latest/compute-systems/derecho/#job-scripts  | 
    
| 
           I'd missed a few additional steps:

           Also noticed that ./xmlquery NGPUS_PER_NODE was set to 0.  | 
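For anyone repeating this check, here is a small sketch of querying the setting from a case directory. The helper name `show_gpu_setting` is hypothetical, and the guard exists only so the snippet fails cleanly when run outside a CIME case. The value `4` in the commented `xmlchange` line is an assumption about the GPU count per Derecho GPU node, and whether changing it resolves the GPU error is not confirmed here.

```shell
# Hypothetical helper: report NGPUS_PER_NODE for the case in the current directory.
show_gpu_setting() {
    if [ ! -x ./xmlquery ]; then
        echo "not in a CIME case directory (no ./xmlquery found)"
        return 1
    fi
    ./xmlquery NGPUS_PER_NODE
}

# Possible follow-up, left commented out because the right value is an assumption:
# ./xmlchange NGPUS_PER_NODE=4
```

Run `show_gpu_setting` from the case directory, then decide whether the `xmlchange` line applies.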
    
           | 
    
This adds the derecho-gpu machine to the ccs-config/machines external, which is a tagged release of https://github.com/ESMCI/ccs_config_cesm corresponding to https://github.com/ESMCI/ccs_config_cesm/releases/tag/ccs_config_cesm0.0.82

This modifies:
config_batch.xml
config_machine.xml
config_compilers.xml

Hardcoded paths to NETCDF_PATH and PNETCDF_PATH.

The trick was to put the derecho-gpu machine configuration above the derecho machine. This was because of the NODENAME_REGEX in derecho which autodetects the machine that is being run on. I suspect that commenting out this line might suffice.

It's quite clear that there's a bit of a mismatch between the GPU configuration *.xml files provided (from Will Chapman) and the CIME build system we're using. More investigation is required. I copied only the relevant content from the files and the configuration for derecho's GPUs seems to build, with a few caveats:

Even after loading the following module set:
NETCDF could not be found. This is because, in the jobs dir, cmake_macros/CNL.cmake requires NETCDF_DIR.

This is fixed by:
export NETCDF_DIR=/glade/u/apps/derecho/23.06/spack/opt/spack/netcdf/4.9.2/cray-mpich/8.1.25/oneapi/2023.0.0/wzol

I also hardcoded some paths for debugging purposes.
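Since the build depends on NETCDF_DIR being set in the environment, a pre-flight check before building can catch a missing or stale value early. This is a minimal sketch; the function name `check_netcdf_dir` is hypothetical and is not part of CIME or this PR.

```shell
# Hypothetical pre-flight check: confirm NETCDF_DIR is set and points at a directory.
check_netcdf_dir() {
    if [ -z "${NETCDF_DIR:-}" ]; then
        echo "NETCDF_DIR is not set"
        return 1
    fi
    if [ ! -d "$NETCDF_DIR" ]; then
        echo "NETCDF_DIR does not exist: $NETCDF_DIR"
        return 1
    fi
    echo "NETCDF_DIR=$NETCDF_DIR"
}
```

Calling `check_netcdf_dir` after the `export` and before the case build would surface the problem without waiting for the CMake step to fail.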
Before merging:
config_compilers.xml is used? This might be the way to fix the below.

Test as follows: