
feat: Azure Batch eagerly terminates jobs after all tasks have been submitted #6159


Open · adamrtalbot wants to merge 6 commits into master from 5839_azure_batch_jobs_terminate_upon_completion

Conversation

@adamrtalbot (Collaborator) commented Jun 4, 2025

The Azure Batch "job leak" is still an issue. This commit fixes #5839 by allowing Nextflow to set jobs to auto-terminate once all of their tasks have been submitted. This means jobs will eventually move into the terminated state even if something prevents Nextflow from reaching a graceful shutdown. This is a very early implementation and needs some refinement.

Code-wise, this replaces terminateJobs with setJobTermination, which sets the auto-termination status of the jobs. It is called in a few places:

  • When terminateJobs is called at graceful shutdown by Nextflow (old behaviour)
  • When setAutoJobTermination is called on a specific job

A TraceObserver is then added for Azure Batch which calls setJobAutoTerminate when onProcessTerminate fires.

Very out of my depth in this part of the code base, so expect things to be wrong.
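For reference, a minimal sketch of the wiring described above (not the PR's actual code; the setAutoJobTermination helper name is taken from the description and is an assumption):

package nextflow.cloud.azure.batch

import nextflow.processor.TaskProcessor
import nextflow.trace.TraceObserver

/*
 * Sketch only: when a process terminates, flag every Azure Batch job created
 * for it so the Batch service auto-terminates the job, provided the user has
 * enabled the existing opt-in setting.
 */
class AzBatchJobTerminationObserver implements TraceObserver {

    @Override
    void onProcessTerminate(TaskProcessor processor) {
        final executor = processor.executor as AzBatchExecutor
        final batchService = executor.batchService

        // respect the existing opt-in flag
        if( !batchService?.config?.batch()?.terminateJobsOnCompletion )
            return

        // flag every job belonging to this processor; the helper name is an assumption
        batchService.allJobIds
            .findAll { key, jobId -> key.processor == processor }
            .values()
            .each { jobId -> batchService.setAutoJobTermination(jobId) }
    }
}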

netlify bot commented Jun 4, 2025

Deploy Preview for nextflow-docs-staging ready!

🔨 Latest commit: 46c0b63
🔍 Latest deploy log: https://app.netlify.com/projects/nextflow-docs-staging/deploys/684fe813ad7e920008fe201a
😎 Deploy Preview: https://deploy-preview-6159--nextflow-docs-staging.netlify.app

@adamrtalbot (Collaborator, Author) commented:

Example Nextflow pipeline:

process GREETING {
    executor 'azurebatch'
    container 'ubuntu:22.04'

    input:
        val greeting

    output:
        path "output.txt"

    script:
    """
    echo $greeting > output.txt
    """
}

process SLEEP {
    executor 'azurebatch'
    container 'ubuntu:22.04'

    input:
        path inputfile

    output:
        stdout

    script:
    """
    sleep 360
    cat ${inputfile}
    """
}

workflow {
    Channel.of("hello", "bonjour", "gutentag")
    | GREETING
    | SLEEP
}
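
A nextflow.config sketch that could accompany the pipeline above (account, pool and container names are placeholders; terminateJobsOnCompletion / deleteJobsOnCompletion are the options discussed in this PR):

// Placeholder values -- adjust to your own Batch/Storage accounts and pool.
process.queue = 'my-pool'                     // Azure Batch pool to submit tasks to
workDir = 'az://my-container/work'            // work directory in Blob storage

azure {
    batch {
        location    = 'eastus'
        accountName = 'mybatchaccount'
        accountKey  = '<batch-account-key>'
        terminateJobsOnCompletion = true      // opt-in flag checked by this PR
        deleteJobsOnCompletion    = false
    }
    storage {
        accountName = 'mystorageaccount'
        accountKey  = '<storage-account-key>'
    }
}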

Comment on lines +52 to +64
final executor = processor.executor as AzBatchExecutor
final batchService = executor.batchService

// Check if auto-termination is enabled
if( !batchService?.config?.batch()?.terminateJobsOnCompletion ) {
    log.trace "Azure Batch job auto-termination is disabled, skipping eager termination for process: ${processor.name}"
    return
}

// Find and set auto-termination for all jobs associated with this processor
batchService.allJobIds.findAll { key, jobId ->
    key.processor == processor
}.values().each { jobId ->
@adamrtalbot (Collaborator, Author) commented:

This feels a bit convoluted to me...is there an easier way?

@adamrtalbot (Collaborator, Author) commented:

Integration tests failing, looks unrelated.

@pditommaso (Member) commented:

Retry now

@adamrtalbot (Collaborator, Author) commented:

Retry now

Done!

@pditommaso force-pushed the master branch 2 times, most recently from b4b321e to 069653d on June 4, 2025 18:54
/**
 * Sets Azure Batch jobs to auto-terminate when all tasks complete.
 */
@Override
void onProcessTerminate(TaskProcessor processor) {
Member commented:

Why use an observer instead of having this logic in the task handler?

@adamrtalbot (Collaborator, Author) commented:

It's not a task, it's a job!

In Azure, job = queue.

@adamrtalbot (Collaborator, Author) commented:

See comment here: #3927 (comment)

Basically, it needs to wait until the last task of a process has been submitted to Azure Batch, then set the job to terminate after completion.
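
For context, a sketch of what that could map to at the Azure Batch level: switching the job's onAllTasksComplete behaviour so the service terminates the job once its tasks finish (the helper name and the exact SDK calls here are assumptions, not the PR's code):

import com.azure.compute.batch.BatchClient
import com.azure.compute.batch.models.BatchJobUpdateContent
import com.azure.compute.batch.models.OnAllTasksComplete

// Assumed helper: flag an existing job so Azure Batch terminates it automatically
// once all of its tasks have completed.
void setAutoJobTermination(BatchClient client, String jobId) {
    final update = new BatchJobUpdateContent()
            .setOnAllTasksComplete(OnAllTasksComplete.TERMINATE_JOB)
    client.updateJob(jobId, update)
}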

Member commented:

Fair, but I still don't see much value in using the trace observer. Would it not make more sense to keep this in the cleanup logic here:

https://github.com/nextflow-io/nextflow/blob/5839_azure_batch_jobs_terminate_upon_completion/plugins/nf-azure/src/main/nextflow/cloud/azure/batch/AzBatchService.groovy#L1065-L1080

All related metadata should be accessible in the AzBatchService object.

@adamrtalbot (Collaborator, Author) commented Jun 16, 2025:

That is how it currently works; see terminateJobsOnCompletion and deleteJobsOnCompletion. However, it's causing us problems. If Nextflow dies, all jobs are left in the active state, which consumes a very limited quota. For context, I had to argue with Azure support a lot to get a quota of 1000 active jobs, and with one process equaling one job you can use that up in a matter of days. Once you have, the only way to run Nextflow again is to go into your Azure Batch account and manually remove some jobs.

By aggressively setting jobs to auto-terminate, we can reduce the active jobs to only those with running tasks when a Nextflow process dies, relieving pressure on the active-job quota as much as possible. Between this and the 30-day cooldown, we should be at effectively zero active jobs during normal running, which is what we should aim for.

Member commented:

Then I'm not sure how much this approach will improve things: most of the time processes terminate at pipeline completion, so if the execution is killed abruptly the behaviour will be more or less the same.

I wonder instead if it could be a problem with the cleanup execution. Are you able to replicate the problem? Do you have any execution logs for affected pipelines?

@adamrtalbot (Collaborator, Author) commented:

most of the time processes terminate at pipeline completion

Arrgghh that's frustrating.

Do you have any execution logs for affected pipelines?

Attached one from last night.

nf-5ZBk4BU4tGxbqU.log

Member commented:

It sounds like the earliest point at which we could safely terminate the Azure Batch job is right after all tasks for the corresponding process have been submitted to Azure Batch.

The TaskProcessor can signal when all tasks are "pending" (submitted to Nextflow but not to Azure) and when all tasks have completed, but not when all tasks have been submitted to Azure. I think the trace observer is the only way to do this, because it needs to watch the task submissions to figure out when all of them have been submitted.

In fact, as I write this, even that isn't enough because you could have task retries. If I submit all the tasks, terminate the job, then a task fails and I need to retry it, I assume that wouldn't work if the job was already terminated? Therefore we actually do have to wait until all tasks are completed.

@adamrtalbot (Collaborator, Author) commented:

Damn, you're right. Back to the drawing board.

@bentsherman self-requested a review on June 16, 2025 12:34
Development

Successfully merging this pull request may close these issues.

Nextflow eagerly terminates Azure Batch jobs during execution
3 participants