
feat: Azure Batch eagerly terminates jobs after all tasks have been submitted #6159


Open · adamrtalbot wants to merge 6 commits into master from 5839_azure_batch_jobs_terminate_upon_completion

Conversation

@adamrtalbot (Collaborator) commented Jun 4, 2025

The Azure Batch "job leak" is still an issue. This commit fixes #5839 by allowing Nextflow to set jobs to auto-terminate once all of their tasks have been submitted. This means jobs will eventually move into the terminated state even if something prevents Nextflow from reaching a graceful shutdown. This is a very early implementation and needs some refinement.

Code-wise, this replaces terminateJobs with setJobTermination, which sets the auto-termination status of the jobs. It is called in a few places:

  • When terminateJobs is called at graceful shutdown by Nextflow (old behaviour)
  • When setAutoJobTermination is called on a specific job

A TraceObserver is then added for Azure Batch which calls setJobAutoTerminate when onProcessTerminate fires.

Very out of my depth in this part of the code base, so expect things to be wrong.
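For reference, a minimal sketch of the wiring described above (not the PR's actual code; the setAutoJobTermination helper name is taken from the description and is an assumption):

package nextflow.cloud.azure.batch

import nextflow.processor.TaskProcessor
import nextflow.trace.TraceObserver

/*
 * Sketch only: when a process terminates, flag every Azure Batch job created
 * for it so the Batch service auto-terminates the job, provided the user has
 * enabled the existing opt-in setting.
 */
class AzBatchJobTerminationObserver implements TraceObserver {

    @Override
    void onProcessTerminate(TaskProcessor processor) {
        final executor = processor.executor as AzBatchExecutor
        final batchService = executor.batchService

        // respect the existing opt-in flag
        if( !batchService?.config?.batch()?.terminateJobsOnCompletion )
            return

        // flag every job belonging to this processor; the helper name is an assumption
        batchService.allJobIds
            .findAll { key, jobId -> key.processor == processor }
            .values()
            .each { jobId -> batchService.setAutoJobTermination(jobId) }
    }
}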

netlify bot commented Jun 4, 2025

Deploy Preview for nextflow-docs-staging ready!

🔨 Latest commit: 46c0b63
🔍 Latest deploy log: https://app.netlify.com/projects/nextflow-docs-staging/deploys/684fe813ad7e920008fe201a
😎 Deploy Preview: https://deploy-preview-6159--nextflow-docs-staging.netlify.app

@adamrtalbot (Collaborator, Author) commented:

Example Nextflow pipeline:

process GREETING {
    executor 'azurebatch'
    container 'ubuntu:22.04'

    input:
        val greeting

    output:
        path "output.txt"

    script:
    """
    echo $greeting > output.txt
    """
}

process SLEEP {
    executor 'azurebatch'
    container 'ubuntu:22.04'

    input:
        path inputfile

    output:
        stdout

    script:
    """
    sleep 360
    cat ${inputfile}
    """
}

workflow {
    Channel.of("hello", "bonjour", "gutentag")
    | GREETING
    | SLEEP
}
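
A nextflow.config sketch that could accompany the pipeline above (account, pool and container names are placeholders; terminateJobsOnCompletion / deleteJobsOnCompletion are the options discussed in this PR):

// Placeholder values -- adjust to your own Batch/Storage accounts and pool.
process.queue = 'my-pool'                     // Azure Batch pool to submit tasks to
workDir = 'az://my-container/work'            // work directory in Blob storage

azure {
    batch {
        location    = 'eastus'
        accountName = 'mybatchaccount'
        accountKey  = '<batch-account-key>'
        terminateJobsOnCompletion = true      // opt-in flag checked by this PR
        deleteJobsOnCompletion    = false
    }
    storage {
        accountName = 'mystorageaccount'
        accountKey  = '<storage-account-key>'
    }
}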

Comment on lines +52 to +64
final executor = processor.executor as AzBatchExecutor
final batchService = executor.batchService

// Check if auto-termination is enabled
if( !batchService?.config?.batch()?.terminateJobsOnCompletion ) {
    log.trace "Azure Batch job auto-termination is disabled, skipping eager termination for process: ${processor.name}"
    return
}

// Find and set auto-termination for all jobs associated with this processor
batchService.allJobIds.findAll { key, jobId ->
    key.processor == processor
}.values().each { jobId ->
@adamrtalbot (Collaborator, Author) commented:

This feels a bit convoluted to me...is there an easier way?

@adamrtalbot (Collaborator, Author) commented:

Integration tests failing, looks unrelated.

@pditommaso (Member) commented:

Retry now

@adamrtalbot (Collaborator, Author) commented:

Retry now

Done!

@pditommaso force-pushed the master branch 2 times, most recently from b4b321e to 069653d on June 4, 2025 18:54
/**
 * Sets Azure Batch jobs to auto-terminate when all tasks complete.
 */
@Override
void onProcessTerminate(TaskProcessor processor) {
Member commented:

Why use an observer instead of having this logic in the task handler?

@adamrtalbot (Collaborator, Author) commented:

It's not a task, it's a job!

In Azure, job = queue.

@adamrtalbot (Collaborator, Author) commented:

See comment here: #3927 (comment)

Basically, it needs to wait until the last task of a process has been submitted to Azure Batch, then set the job to terminate after completion.
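
For context, a sketch of what that could map to at the Azure Batch level: switching the job's onAllTasksComplete behaviour so the service terminates the job once its tasks finish (the helper name and the exact SDK calls here are assumptions, not the PR's code):

import com.azure.compute.batch.BatchClient
import com.azure.compute.batch.models.BatchJobUpdateContent
import com.azure.compute.batch.models.OnAllTasksComplete

// Assumed helper: flag an existing job so Azure Batch terminates it automatically
// once all of its tasks have completed.
void setAutoJobTermination(BatchClient client, String jobId) {
    final update = new BatchJobUpdateContent()
            .setOnAllTasksComplete(OnAllTasksComplete.TERMINATE_JOB)
    client.updateJob(jobId, update)
}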

Member commented:

Fair, but I still don't see much value in using the trace observer. Would it not make more sense to keep this in the cleanup logic here:

https://github.com/nextflow-io/nextflow/blob/5839_azure_batch_jobs_terminate_upon_completion/plugins/nf-azure/src/main/nextflow/cloud/azure/batch/AzBatchService.groovy#L1065-L1080

All related metadata should be accessible in the AzBatchService object.

@adamrtalbot (Collaborator, Author) commented Jun 16, 2025:

That is how it currently works; see terminateJobsOnCompletion and deleteJobsOnCompletion. However, it's causing us problems. If Nextflow dies, all jobs are left in the active state, which consumes a very limited quota. For context, I had to argue with Azure support a lot to get a quota of 1000 active jobs, and with one process equaling one job you can use that up in a matter of days. Once you have, the only way to run Nextflow again is to go into your Azure Batch account and manually remove some jobs.

By aggressively setting jobs to auto-terminate, we can reduce the active jobs to only those with running tasks when a Nextflow process dies, relieving pressure on the active-job quota as much as possible. Between this and the 30-day cooldown, we should be at effectively zero active jobs during normal running, which is what we should aim for.

Member commented:

Then I'm not sure how much this approach will improve things: most of the time processes terminate at pipeline completion, so if the execution is killed abruptly the behaviour will be more or less the same.

I wonder instead if it could be a problem with the cleanup execution. Are you able to replicate the problem? Do you have any execution logs for affected pipelines?

@adamrtalbot (Collaborator, Author) commented:

most of the time processes terminate at pipeline completion

Arrgghh that's frustrating.

Do you have any execution logs for affected pipelines?

Attached one from last night.

nf-5ZBk4BU4tGxbqU.log

Member commented:

It sounds like the earliest point at which we could safely terminate the Azure Batch job is right after all tasks for the corresponding process have been submitted to Azure Batch.

The TaskProcessor can signal when all tasks are "pending" (submitted to Nextflow but not to Azure) and when all tasks have completed, but not when all tasks have been submitted to Azure. I think the trace observer is the only way to do this, because it needs to watch the task submissions to figure out when all of them have been submitted.

In fact, as I write this, even that isn't enough because you could have task retries. If I submit all the tasks, terminate the job, then a task fails and I need to retry it, I assume that wouldn't work if the job was already terminated? Therefore we actually do have to wait until all tasks are completed.

@adamrtalbot (Collaborator, Author) commented:

Damn, you're right. Back to the drawing board.

@bentsherman self-requested a review on June 16, 2025 12:34
Development

Successfully merging this pull request may close these issues.

Nextflow eagerly terminates Azure Batch jobs during execution
3 participants