This repo is a tutorial on running tasks on Rockfish. Rockfish is a community-shared cluster, i.e., everyone can use any node within the limits of their allocated utilization. Every quarter you request a certain number of GPU hours (i.e., number of cores x wall time) and provide justification for them. These assigned limits are reset on a quarterly basis (use it or lose it).
To start, create an account here with your JHED.
IMPORTANT:
- It is important to use your JHED; otherwise the system will have difficulty authenticating you.
- If you're a student, you need to get access through your PI (usually your advisor).
- `test_gpus` shows how to run a simple script on the GPU processor. You can run `sbatch test_gpus.sh` to submit the job and make sure that you have access to GPU nodes.
- `classification-example` shows how to run a simple classification task on the GPU processor. You can run `sbatch train_on_rockfish.sh` to train a classifier on Rockfish.
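For orientation, here is a minimal sketch of what a GPU batch script submitted with `sbatch` might look like. It is not the repo's actual `test_gpus.sh`: the file layout, the job name `hello-gpu`, and the helper `describe_gpu_env` are hypothetical, and the partition, GPU request, and time format simply mirror the `salloc` example in the FAQ below. Your allocation may require additional options (for example an account) that are not shown here.

```shell
#!/bin/bash
# Hypothetical job name, for illustration only.
#SBATCH --job-name=hello-gpu
# GPU partition used elsewhere in this README.
#SBATCH --partition=a100
# Request one GPU.
#SBATCH --gres=gpu:1
# Wall-time limit (HH:MM:SS).
#SBATCH --time=00:05:00
# %j expands to the job ID.
#SBATCH --output=hello-gpu-%j.out

# Print the host and the GPUs visible to this job.
describe_gpu_env() {
  echo "Running on host: $(uname -n)"
  # Slurm typically sets CUDA_VISIBLE_DEVICES for GPU jobs; fall back to "none".
  echo "Visible GPUs: ${CUDA_VISIBLE_DEVICES:-none}"
  # nvidia-smi is only present on nodes with the NVIDIA driver installed.
  if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --list-gpus || echo "nvidia-smi could not query GPUs"
  else
    echo "nvidia-smi not found on this node"
  fi
}

describe_gpu_env
```

Saved as, say, `hello_gpu.sh` (a hypothetical name), it would be submitted with `sbatch hello_gpu.sh`. Running it directly with `bash` on a login node also works as a dry run, because the `#SBATCH` lines are plain comments to the shell.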
- How do I find the queue of a partition?
  - Type the command `sinfo -s` to get a list of the partitions. `sinfo -p partition-name` will display the utilization for this partition.
- How do I know what nodes are available?
  - Type the command `sinfo -N` to get a list of the nodes. `sinfo -N -p partition-name` will display the utilization for this partition.
- How do I interpret "states"?
  - `idle`: The node is available for use.
  - `alloc`: The node is currently being used by a job.
  - `mix`: Some of the node's processors are currently being used by a job.
  - Further details here: https://slurm.schedmd.com/sinfo.html
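- How do I see how many GPUs are available on each node?
  - Type the command `sinfo -N -p a100` to get a list of the nodes in the `a100` partition. Adding an output format, e.g. `sinfo -N -p a100 -o "%N %G %t"`, also prints the generic resources (GPUs) configured on each node together with the node state.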
- How do I submit multiple jobs with different parameters?
  - See this page: https://www.osc.edu/book/export/html/4046
- How do I submit a job to a specific node?
  - You can submit a job to a specific node by using the `--nodelist` flag.
  - For example, `sbatch --nodelist=node1 job.sh` will submit the job to node1.
  - You can also use the `--exclude` flag to exclude a node.
  - For example, `sbatch --exclude=node1 job.sh` will submit the job to any node except node1.
- How do I create an interactive session?
  - You can create an interactive session by using the `salloc` command.
  - For example, `salloc -p a100 --gres=gpu:1 --time=00:30:00` will create an interactive session with 1 GPU for 30 minutes.
  - You can also use the `interact` command, which internally calls `salloc`.
- How do I see the queue of my jobs?
  - You can see the queue of your jobs by using the `squeue` command.
  - You can specialize it for a partition: `squeue -p a100`
  - For example, `squeue -u userid` will show the queue of your jobs.
- What do status labels mean in the output of `squeue`?
  - `PD`: Pending. The job is awaiting resource allocation.
  - `R`: Running. The job currently has an allocation.
  - `CG`: Completing. The job is in the process of completing. Some processes on some nodes may still be active.
  - `CD`: Completed. The job has terminated all processes on all nodes.
  - `F`: Failed. The job terminated with non-zero exit code or other failure condition.
  - `TO`: Timeout. The job terminated upon reaching its time limit.
  - `NF`: Node Failure. The job terminated due to failure of one or more allocated nodes.
  - `CA`: Canceled. The job was explicitly canceled by the user or system administrator. The job may or may not have been initiated.
  - `SE`: Special Exit. The job was requeued in a special state by the scheduler; it may or may not have been initiated.
  - `ST`: Stopped. The job has an allocation, but execution has been stopped with a SIGSTOP signal; the job keeps its CPUs.
  - `S`: Suspended. The job has an allocation, but execution has been suspended and CPUs have been released for other jobs.
  - `PR`: Preempted. The job was preempted.
- How do I cancel a job?
  - You can cancel a job by using the `scancel` command.
  - For example, `scancel jobid` will cancel the job with the given jobid.
- How do I check the statistics of a finished job?
  - You can check the statistics of a finished job by using the `sacct` command.
  - For example, `sacct -j jobid` will show the statistics of the job with the given jobid.
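The `squeue` status labels covered in the FAQ above can be turned into a small lookup that summarizes how many of your jobs are in each state. The sketch below is one way to do it, not an official tool: the helper names `describe_state` and `summarize_states` are hypothetical, and the labels follow the `squeue` man page (where `ST` is Stopped and `S` is Suspended). It relies on `squeue -h` to drop the header and `-o "%t"` to print only the compact state code.

```shell
#!/bin/bash

# Map a compact squeue state code to a short label.
describe_state() {
  case "$1" in
    PD) echo "Pending" ;;
    R)  echo "Running" ;;
    CG) echo "Completing" ;;
    CD) echo "Completed" ;;
    F)  echo "Failed" ;;
    TO) echo "Timeout" ;;
    NF) echo "Node Failure" ;;
    CA) echo "Canceled" ;;
    SE) echo "Special Exit" ;;
    ST) echo "Stopped" ;;
    S)  echo "Suspended" ;;
    PR) echo "Preempted" ;;
    *)  echo "Unknown" ;;
  esac
}

# Read one state code per line on stdin and print "<count> <code> (<label>)".
summarize_states() {
  sort | uniq -c | while read -r count code; do
    printf '%s %s (%s)\n' "$count" "$code" "$(describe_state "$code")"
  done
}

# Only query the scheduler when squeue is actually available.
if command -v squeue >/dev/null 2>&1; then
  squeue -u "${USER:-$(id -un)}" -h -o "%t" | summarize_states
fi
```

Because `summarize_states` reads from standard input, it can also be tried without a cluster, for example `printf 'R\nPD\nR\n' | summarize_states`.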
- Tracking the accounts/usage: https://coldfront.rockfish.jhu.edu/
- Login node: `ssh [email protected]`
- Help desk: [email protected]. If you face any issues, email these folks! :)
- User Guide: https://www.arch.jhu.edu/access/user-guide/
- System configuration: https://www.arch.jhu.edu/about-rockfish/system-configuration/
- Tutorials: https://marcc.readthedocs.io/
- FAQs: https://www.arch.jhu.edu/access/faq/
- A useful collection of slides on Rockfish.
- Debugging with TotalView
If you struggle with SLURM, these are useful cheatsheets: