Azure can use preempted nodes within pools #5545

Open
@adamrtalbot

Description

New feature

See https://nextflow.slack.com/archives/C02TPRRLCF4/p1732549166359449 for context

This relates to my earlier issue of accumulating pre-empted nodes (https://nextflow.slack.com/archives/C02TPRRLCF4/p1729612651927509). With Microsoft's help I have fixed it with an amended scale formula; I'm surprised no one else seems to have hit this, since it is a common scaling pattern. Below is what they sent through, in case it helps anyone:

The issue you are seeing is by design. The reason you are not seeing the nodes come back is that they are not yet available for recovery to your pool. I can see from the graph you shared that some nodes were recovered during those dips, which indicates that nodes are indeed being recovered, but not fast enough, so more nodes are being pre-empted than are recovering.
We can leverage the service variable $PreemptedNodeCount, which gives the total number of pre-empted nodes, and then scale those nodes in: https://learn.microsoft.com/en-us/azure/batch/batch-automatic-scaling#preempted-nodes
I’ve modified your autoscale formula slightly:

// Percentage of $ActiveTasks samples available over the last 15 minutes.
$samples = $ActiveTasks.GetSamplePercent(TimeInterval_Minute * 15);
// If fewer than 70% of samples are available, use the last sample; otherwise use the larger of the last sample and the 15-minute average.
$tasks = $samples < 70 ? max(0, $ActiveTasks.GetSample(1)) : max($ActiveTasks.GetSample(1), avg($ActiveTasks.GetSample(TimeInterval_Minute * 15)));
// One task per node; $round makes the division below round up when tasks per node > 1.
$maxTasksPerNode = 1;
$round = $maxTasksPerNode - 1;
// Nodes needed for the active tasks, or half the current dedicated target (plus 0.5) when there are no active tasks.
$targetVMs = $tasks > 0 ? (($tasks + $round) / $maxTasksPerNode) : max(0, $TargetDedicated / 2) + 0.5;
// Average pre-empted node count over the last 3 minutes, rounded down.
$preemptedVMs = floor(avg($PreemptedNodeCount.GetSample(180 * TimeInterval_Second)));
// Keep the dedicated node count at zero (low-priority-only pool).
$TargetDedicatedNodes = max(0, min($targetVMs, 0));
// Cap low-priority nodes at 80 and subtract the pre-empted ones so they are scaled in.
$TargetLowPriorityNodes = max(0, (min($targetVMs, 80) - $preemptedVMs));
// Only remove nodes once their running tasks have completed.
$NodeDeallocationOption = taskcompletion;

Basically, the changes take the current pre-empted node count and store it in the $preemptedVMs variable. Then, when calculating the number of low-priority nodes to allocate, the formula takes the minimum of $targetVMs and 80 and subtracts $preemptedVMs; in your previous runs this value would simply have been 80.
Let's say you have 10 pre-empted nodes: the target low-priority node count will be 70, and your pool will deallocate 10 nodes during that autoscale evaluation.
Which nodes are deallocated is governed by $NodeDeallocationOption. Since you have it set to taskcompletion, Batch will deallocate nodes that have no tasks scheduled, regardless of whether they are pre-empted or not.
So if you have low-priority nodes with no tasks scheduled, those will be deallocated as well. In your previous example all active low-priority nodes had tasks running, so the pre-empted nodes would be the first to go.
At the next autoscale evaluation, if no additional nodes have been pre-empted, $preemptedVMs should be 0 and the target low-priority node count should be back to 80, so Batch will allocate another 10 low-priority nodes.
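
To make the two-evaluation cycle concrete, here is a small worked example of the low-priority target calculation as a Groovy sketch; the demand of 100 node-equivalents, the cap of 80 and the 10 pre-empted nodes are illustrative numbers, not values taken from the pool above:

// Illustrative numbers only: mirrors max(0, min($targetVMs, 80) - $preemptedVMs)
int cap = 80               // hard cap on low-priority nodes in the formula
int targetVMs = 100        // demand exceeds the cap in this example
int preempted = 10         // $preemptedVMs at the first evaluation

// First evaluation: the target shrinks by the pre-empted count, so Batch scales in 10 nodes.
int firstTarget = Math.max(0, Math.min(targetVMs, cap) - preempted)
assert firstTarget == 70

// The pre-empted nodes have been deallocated, so the next evaluation sees none.
preempted = 0
int secondTarget = Math.max(0, Math.min(targetVMs, cap) - preempted)
assert secondTarget == 80  // Batch allocates 10 fresh low-priority nodes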

Usage scenario

Use low-priority/spot nodes by default, since they are cheaper than dedicated nodes, and scale in pre-empted nodes automatically so the pool does not accumulate them.

Suggested implementation

Modify the autoscale formula to reflect the changes above.
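
As a sketch of how this could look today, the amended formula can be supplied per pool through the existing azure.batch.pools.<name>.scaleFormula config option; the pool name, VM type and the 80-node cap below are illustrative, not defaults:

azure {
    batch {
        pools {
            mypool {                          // illustrative pool name
                vmType = 'Standard_E8d_v5'    // illustrative VM size
                lowPriority = true            // use low-priority (spot) VMs
                autoScale = true
                scaleFormula = '''
                    $samples = $ActiveTasks.GetSamplePercent(TimeInterval_Minute * 15);
                    $tasks = $samples < 70 ? max(0, $ActiveTasks.GetSample(1)) : max($ActiveTasks.GetSample(1), avg($ActiveTasks.GetSample(TimeInterval_Minute * 15)));
                    $maxTasksPerNode = 1;
                    $round = $maxTasksPerNode - 1;
                    $targetVMs = $tasks > 0 ? (($tasks + $round) / $maxTasksPerNode) : max(0, $TargetDedicated / 2) + 0.5;
                    $preemptedVMs = floor(avg($PreemptedNodeCount.GetSample(180 * TimeInterval_Second)));
                    $TargetDedicatedNodes = max(0, min($targetVMs, 0));
                    $TargetLowPriorityNodes = max(0, (min($targetVMs, 80) - $preemptedVMs));
                    $NodeDeallocationOption = taskcompletion;
                '''
            }
        }
    }
}

// Point processes at the pool above.
process.queue = 'mypool'

The built-in change proposed here would fold the $preemptedVMs term into the default formula that Nextflow generates for autoscaled pools, so users would not need to override scaleFormula themselves.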
