feat: Azure Batch eagerly terminates jobs after all tasks have been submitted #6159
base: master
Conversation
Azure Batch "job leak" is still an issue. This commit fixes #5839 by allowing Nextflow to set jobs to auto-terminate once all tasks have been submitted. This means that jobs will eventually move into the terminated state even if something prevents Nextflow from reaching a graceful shutdown. Very early implementation and needs some refinement. Signed-off-by: adamrtalbot <[email protected]>
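For reference, the behaviour is gated by the existing Azure Batch cleanup options in nextflow.config; a minimal sketch of that configuration (values shown are illustrative, not new options added by this PR):

```groovy
// nextflow.config -- existing nf-azure cleanup options referenced in this PR
azure {
    batch {
        // when true, jobs are marked as terminated so they stop counting
        // against the active-job quota
        terminateJobsOnCompletion = true

        // optionally delete jobs entirely after completion
        deleteJobsOnCompletion = false
    }
}
```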
Example Nextflow pipeline:

```groovy
process GREETING {
    executor 'azurebatch'
    container 'ubuntu:22.04'

    input:
    val greeting

    output:
    path "output.txt"

    script:
    """
    echo $greeting > output.txt
    """
}

process SLEEP {
    executor 'azurebatch'
    container 'ubuntu:22.04'

    input:
    path inputfile

    output:
    stdout

    script:
    """
    sleep 360
    cat ${inputfile}
    """
}

workflow {
    Channel.of("hello", "bonjour", "gutentag")
        | GREETING
        | SLEEP
}
```
```groovy
final executor = processor.executor as AzBatchExecutor
final batchService = executor.batchService

// Check if auto-termination is enabled
if( !batchService?.config?.batch()?.terminateJobsOnCompletion ) {
    log.trace "Azure Batch job auto-termination is disabled, skipping eager termination for process: ${processor.name}"
    return
}

// Find and set auto-termination for all jobs associated with this processor
batchService.allJobIds.findAll { key, jobId ->
    key.processor == processor
}.values().each { jobId ->
```
This feels a bit convoluted to me...is there an easier way?
Integration tests failing, looks unrelated.
Retry now
Done!
Force-pushed from b4b321e to 069653d.
```groovy
 * Sets Azure Batch jobs to auto-terminate when all tasks complete.
 */
@Override
void onProcessTerminate(TaskProcessor processor) {
```
Why use an observer instead of having this logic in the task handler?
nextflow/plugins/nf-azure/src/main/nextflow/cloud/azure/batch/AzBatchTaskHandler.groovy, line 127 in 08c0027:

```groovy
deleteTask(taskKey, task)
```
It's not a task, it's a job!
In Azure, job = queue.
See comment here: #3927 (comment)
Basically, it needs to wait until the last task of a process has been submitted to Az Batch, then make the job terminate after completion.
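To make that concrete, here is a rough Groovy sketch of the observer approach, mirroring the diff fragments above. The `setJobAutoTermination` helper name is an assumption (the PR itself mentions several names for it), and the class is assumed to live in the `nextflow.cloud.azure.batch` package so it can reach the executor's batch service:

```groovy
package nextflow.cloud.azure.batch

import groovy.util.logging.Slf4j
import nextflow.processor.TaskProcessor
import nextflow.trace.TraceObserver

/**
 * Sketch: when a process will submit no more tasks, flag its Azure Batch
 * job(s) so the service terminates them once all tasks complete.
 */
@Slf4j
class AzBatchProcessObserver implements TraceObserver {

    @Override
    void onProcessTerminate(TaskProcessor processor) {
        // only act on processes that actually run on Azure Batch
        if( !(processor.executor instanceof AzBatchExecutor) )
            return
        final batchService = (processor.executor as AzBatchExecutor).batchService

        // respect the existing opt-out
        if( !batchService?.config?.batch()?.terminateJobsOnCompletion )
            return

        // flag every job created for this processor for auto-termination
        batchService.allJobIds
            .findAll { key, jobId -> key.processor == processor }
            .values()
            .each { jobId ->
                log.trace "Requesting auto-termination of Azure Batch job: ${jobId}"
                batchService.setJobAutoTermination(jobId)   // hypothetical helper name
            }
    }
}
```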
Fair, but I still don't see big value in using the trace observer. Would it not make more sense to keep this in the cleanup logic here?
All related metadata should be accessible in the AzBatchService object.
That is how it currently works; see terminateJobsOnCompletion and deleteJobsOnCompletion. However, it's causing us problems: if Nextflow dies, all jobs are left in the active state, which consumes a very limited quota. For some context, I had to argue with Azure support a lot to get a quota of 1000 active jobs, and with 1 process equalling 1 job you can use that up in a matter of days. Once you have, the only way to run Nextflow again is to go into your Azure Batch account and manually remove some jobs.
By aggressively setting jobs to terminate automatically, we can reduce the active jobs to only those with running tasks when a Nextflow process dies. This relieves the pressure on the active-job quota as much as possible. Between this and the 30-day cooldown, we should see effectively zero active jobs during normal operation, which is what we should aim for.
Then I'm not sure how much this approach will improve things: most of the time processes terminate on pipeline completion, so if the execution is killed abruptly the behaviour will be more or less the same.
I wonder instead if it could be a problem with the cleanup execution. Are you able to replicate the problem? Do you have any execution logs for affected pipelines?
> most of the time processes terminate on pipeline completion
Arrgghh that's frustrating.
> Do you have any execution logs for affected pipelines?
Attached one from last night.
It sounds like the earliest point at which we could safely terminate the Azure job is right after all tasks for the corresponding process have been submitted to Azure Batch.
The TaskProcessor can signal when all tasks are "pending" (submitted to Nextflow but not to Azure) and when all tasks have completed, but not when all tasks are submitted to Azure. I think the trace observer is the only way to do this, because it needs to watch the task submissions to figure out when all tasks have been submitted.
In fact, as I write this, even that isn't enough because you could have task retries. If I submit all the tasks, terminate the job, then a task fails and I need to retry it, I assume that wouldn't work if the job was already terminated? Therefore we actually do have to wait until all tasks are completed.
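As a throwaway illustration of that point (not code from this PR), any process with a retry error strategy can resubmit tasks to its job well after the initial round of submissions, so a job terminated at "all submitted" would have nowhere to put the retry:

```groovy
process FLAKY {
    executor 'azurebatch'
    container 'ubuntu:22.04'

    // a failed task is resubmitted, which adds a new task to the same
    // Azure Batch job long after the first submissions were made
    errorStrategy 'retry'
    maxRetries 3

    input:
    val x

    script:
    """
    # fail roughly half the time to force retries
    [ \$((RANDOM % 2)) -eq 0 ] && exit 1
    echo ${x}
    """
}
```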
Damn, you're right. Back to the drawing board.
Azure Batch "job leak" is still an issue. This commit fixes #5839 which allows Nextflow to set jobs to auto terminate when all tasks have been submitted. This means that eventually jobs will move into terminated state even if something prevents nextflow reaching a graceful shutdown. Very early implementation and needs some refinement.
Code wise, this replaces
terminateJobs
, withsetJobTermination
, which sets the auto termination status of the jobs. This is called in a few places:terminateJobs
is called at graceful shutdown by Nextflow (old behaviour)setAutoJobTermination
is called on a specific jobThen a TraceObserver is added for Azure Batch which will call
setJobAutoTerminate
whenonProcessTerminate
is called.Very out of my depth in this part of the code base, so expect things to be wrong.
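For readers unfamiliar with how a plugin adds such an observer, Nextflow discovers them through the TraceObserverFactory extension point; a hedged sketch of what that wiring might look like (class names are illustrative and reuse the sketch observer from earlier, not necessarily what this PR does):

```groovy
package nextflow.cloud.azure.batch

import nextflow.Session
import nextflow.trace.TraceObserver
import nextflow.trace.TraceObserverFactory

/**
 * Sketch: hand the observer to Nextflow when the session starts.
 * A real plugin would also declare this class as an extension in
 * its plugin metadata so it gets picked up at runtime.
 */
class AzBatchObserverFactory implements TraceObserverFactory {

    @Override
    Collection<TraceObserver> create(Session session) {
        // register unconditionally; the observer itself checks the executor
        // type and the terminateJobsOnCompletion setting before acting
        return [ new AzBatchProcessObserver() ] as Collection<TraceObserver>
    }
}
```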