Retry SLURM job submission #58
base: v1_18_bosco
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -85,8 +85,17 @@ bls_add_job_wrapper | |
| ############################################################### | ||
|
|
||
| datenow=`date +%Y%m%d` | ||
| jobID=`${slurm_binpath}/sbatch $bls_tmp_file` # actual submission | ||
| retcode=$? | ||
| retry=0 | ||
| MAX_RETRY=3 | ||
| until [ $retry -eq $MAX_RETRY ] ; do | ||
|
Member
This looks like the number of attempts to submit is 3 but the number of retries is actually 2, right?
Author
The first attempt is
Member
Yea, that's 3 tries total but only 2 retries so we should call the variable
||
| jobID=$(${slurm_binpath}/sbatch $bls_tmp_file) | ||
| retcode=$? | ||
| if [ "$retcode" == "0" ] ; then | ||
| break | ||
| fi | ||
| retry=$[$retry+1] | ||
| sleep 10 | ||
|
Member
Could we make the sleep back off exponentially?
Author
Can do.
||
| done | ||
|
|
||
| if [ "$retcode" != "0" ] ; then | ||
| rm -f $bls_tmp_file | ||
|
|
||
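The exponential backoff requested in the review could look roughly like the sketch below. It is not the code that was merged: `submit_with_backoff`, `max_retries`, and `base_delay` are hypothetical names introduced here for illustration, and the doubling factor is an assumption. The function runs the given command once, then retries up to `max_retries` more times, doubling the sleep after each failure.

```shell
# Hypothetical helper (not part of the PR): run a command, retrying on failure
# with a sleep that doubles after every failed attempt.
#   $1 = max_retries (retries after the first attempt)
#   $2 = base_delay  (seconds to sleep before the first retry)
#   remaining args = the command to run
submit_with_backoff() {
    local max_retries=$1 base_delay=$2
    shift 2
    local attempt=0 delay=$base_delay output retcode
    while : ; do
        output=$("$@")      # capture stdout, e.g. the job ID printed by sbatch
        retcode=$?
        # Stop on success, or once all retries have been used up.
        if [ "$retcode" -eq 0 ] || [ "$attempt" -ge "$max_retries" ]; then
            break
        fi
        sleep "$delay"
        delay=$((delay * 2))
        attempt=$((attempt + 1))
    done
    printf '%s\n' "$output"
    return "$retcode"
}
```

With this helper, `jobID=$(submit_with_backoff 2 10 "${slurm_binpath}/sbatch" "$bls_tmp_file")` would make at most three submission attempts, sleeping 10 s and then 20 s between them, and `$?` would carry the exit status of the last attempt.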
Could we add this as a config variable `slurm_max_submit_retries` here, defaulting to 0, and reference it via `${slurm_max_submit_retries}`?
OK
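One way to read the suggested config variable is sketched below, assuming the configuration is exposed to the submit script as shell variables. Only the name `slurm_max_submit_retries` and its default of 0 come from the review; `resolve_max_retries`, `max_retries`, and `total_attempts` are hypothetical names. The sketch also makes the attempts-versus-retries distinction from the earlier thread explicit: a retry count of 0 still means one submission attempt.

```shell
# Hypothetical helper (not part of the PR): echo a usable retry count.
# Empty, non-numeric, or leading-zero values fall back to the default of 0.
resolve_max_retries() {
    case "$1" in
        ''|*[!0-9]*|0?*) echo 0 ;;
        *) echo "$1" ;;
    esac
}

# slurm_max_submit_retries is the name proposed in the review; it may be unset.
max_retries=$(resolve_max_retries "${slurm_max_submit_retries:-}")

# Retries are counted after the first try, so attempts = retries + 1.
total_attempts=$((max_retries + 1))
```

Defaulting to 0 keeps the pre-PR behaviour (a single `sbatch` call) unless a site opts in to retries.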