Feature (opt-in): Gang Aware Preemption #4637

mvinchoo · 2025-09-24T06:04:28Z

What type of PR is this?

Feature: Gang-aware preemption modes for AI/ML training and other gang-scheduled workloads (PodGroup/VCJob).

What this PR does / why we need it:

Today, preemption picks nodes first and only then decides which tasks to evict, without considering gang (PodGroup/VCJob) boundaries. As a result, victims can span multiple jobs, and the Gang plugin ends up acting as a guard-rail (via minMember) rather than guiding which gang should be preempted. This can yield partial/fragmented evictions that are sub-optimal for AI/ML jobs.

This PR introduces an opt-in gang-aware preemption strategy that selects victims at the gang level. It reduces cross-job churn, preserves job atomicity when desired, and better aligns with workloads that need “all-or-nothing” capacity.

Modes (policy):

disabled (default): Existing behavior. Preemption remains node-first and task-level; no gang semantics applied.
minimal: Preemption selects victims in whole-gang units on each candidate node:
- On each candidate node, group preemptable tasks by gang.
- Prefer a single gang whose eviction satisfies the preemptor task’s request (most cases).
- If no single gang suffices, fall back to a greedy gang selection (still within gang units) until the preemptor fits (rare).
atomic: Same selection as minimal, but if a victim gang is chosen, evict all of its tasks across all nodes (cluster-wide gang eviction). This matches AI/ML expectations where partial eviction often dooms the job anyway.

Benefits:

Fewer cross-job evictions: Victims come from a coherent gang, not scattered tasks across jobs.
Predictability for gang jobs: Respects “start/stop together” intent; atomic ensures clean state.
Backwards compatible: Fully opt-in; disabled preserves current behavior.
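
The difference between the minimal and atomic modes can be illustrated with a small sketch. This is not the code in this PR: the `task` type and the `victims` helper are hypothetical, and the preemptable filter is left out for brevity. It only shows that, once a victim gang is chosen on a node, minimal keeps the eviction on that node while atomic extends it to every node.

```go
package main

import "fmt"

// task is a hypothetical, simplified stand-in for a scheduler task.
type task struct {
	name string
	gang string // PodGroup/VCJob the task belongs to
	node string
}

// victims returns the tasks to evict once `gang` has been chosen on `node`.
// With atomic == false (minimal mode) only the gang's tasks on that node are
// returned; with atomic == true every task of the gang is returned, on any node.
func victims(tasks []task, gang, node string, atomic bool) []string {
	var out []string
	for _, t := range tasks {
		if t.gang != gang {
			continue
		}
		if atomic || t.node == node {
			out = append(out, t.name)
		}
	}
	return out
}

func main() {
	tasks := []task{
		{"a-0", "job-a", "n1"},
		{"a-1", "job-a", "n2"},
		{"b-0", "job-b", "n1"},
	}
	fmt.Println(victims(tasks, "job-a", "n1", false)) // [a-0]
	fmt.Println(victims(tasks, "job-a", "n1", true))  // [a-0 a-1]
}
```

In both calls `job-b` is untouched, which is the "fewer cross-job evictions" property described above.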

Which issue(s) this PR fixes:

Partially fixes #4607.
In addition to the changes discussed in the above issue, this PR also adds:

- Handling of node fragmentation.
- An atomic mode.

Happy to create a new issue if needed.

Special notes for your reviewer:

Objective:
Minimize the number of gangs disrupted on a node without hurting scheduler throughput.
Why not try all combinations of k gangs (k = 1..J, where J is the number of candidate gangs on a node):
Per-node combinations = sum_{k=1..J} C(J, k) = 2^J − 1 (exponential)
Exponential time is unacceptable in a high-throughput scheduler like Volcano.
Two-phase heuristic (fast, practical):
- Phase 1 – Single-gang fit (minimal overage):
  If any single gang can satisfy the preemptor’s need on a node, choose the best one (least overage).
  Rationale: in most cases, preempting one gang frees enough capacity for the incoming task.
- Phase 2 – Greedy multi-gang (minimize count, then overage):
  If Phase 1 fails (e.g., preemptor is large), sort candidate gangs by size (largest first) and accumulate until the need is met.
  Complexity: O(J log J) per node (sort) + O(J) accumulate — suitable for high QPS.
Trade-off example (1D, non-negative):
Preemptor need: 80
Node gangs: [50, 50, 50, 50, 20, 20, 20, 20]
Greedy (fast): 50 + 50 = 100 → meets need with +20 overage
Optimal overage (exhaustive search): 20 + 20 + 20 + 20 = 80 → 0 overage
Conclusion:
We accept small overage to preserve throughput and avoid exponential search.
Avoids the 2^J blow-up while keeping decisions near-optimal in practice.
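
The two-phase heuristic above can be sketched for the one-dimensional case as follows. This is a minimal illustration, not the implementation in this PR: the `gang` type, `selectGangs`, and `total` are hypothetical names, and real resources are multi-dimensional.

```go
package main

import (
	"fmt"
	"sort"
)

// gang is a hypothetical, simplified view of one victim gang on a node:
// a name plus the single-dimension amount its eviction would free.
type gang struct {
	name  string
	frees int
}

// selectGangs picks gangs to evict on one node so that at least `need`
// units are freed. ok is false when even evicting every gang is not enough.
func selectGangs(gangs []gang, need int) (picked []gang, ok bool) {
	if need <= 0 {
		return nil, true // nothing to evict
	}
	// Phase 1: the single gang that covers the need with the least overage.
	best := -1
	for i, g := range gangs {
		if g.frees >= need && (best == -1 || g.frees < gangs[best].frees) {
			best = i
		}
	}
	if best >= 0 {
		return []gang{gangs[best]}, true
	}
	// Phase 2: largest-first greedy accumulation; the sort dominates at O(J log J).
	sorted := append([]gang(nil), gangs...)
	sort.SliceStable(sorted, func(i, j int) bool { return sorted[i].frees > sorted[j].frees })
	freed := 0
	for _, g := range sorted {
		picked = append(picked, g)
		freed += g.frees
		if freed >= need {
			return picked, true
		}
	}
	return nil, false
}

// total sums the capacity freed by a set of gangs.
func total(gs []gang) int {
	sum := 0
	for _, g := range gs {
		sum += g.frees
	}
	return sum
}

func main() {
	var gangs []gang
	for i, s := range []int{50, 50, 50, 50, 20, 20, 20, 20} {
		gangs = append(gangs, gang{name: fmt.Sprintf("g%d", i), frees: s})
	}
	picked, ok := selectGangs(gangs, 80)
	fmt.Println(ok, len(picked), total(picked)-80) // true 2 20
}
```

On the trade-off example, Phase 1 finds no single gang of at least 80, so Phase 2 takes two gangs of 50 and accepts an overage of 20, matching the greedy result listed above.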

Does this PR introduce a user-facing change?

Yes

The following block needs to be added to `volcano-scheduler-configmap`:
data:
  volcano-scheduler.conf: |
    actions: "...preempt..."
    configurations:
    - name: preempt
      arguments:
        gangPreemptionMode: off  # Accepted values: off (default) | minimal | atomic
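
A sketch of how the `gangPreemptionMode` argument might be read is shown below. It is an illustration under assumptions, not the parsing code in this PR: `gangMode`, `parseGangMode`, and the plain `map[string]interface{}` argument map are hypothetical. Two details are handled defensively. The description names the default mode `disabled` while the config comment lists `off`, so both spellings are accepted here. Depending on the YAML library, an unquoted `off` may also be decoded as the boolean `false` (YAML 1.1 rules), so that case is mapped to the default as well.

```go
package main

import (
	"fmt"
	"strings"
)

// gangMode is a hypothetical enum for the three policies.
type gangMode int

const (
	modeOff gangMode = iota
	modeMinimal
	modeAtomic
)

// parseGangMode reads gangPreemptionMode from a generic argument map.
// A missing key yields the default (off).
func parseGangMode(args map[string]interface{}) (gangMode, error) {
	raw, found := args["gangPreemptionMode"]
	if !found {
		return modeOff, nil
	}
	switch v := raw.(type) {
	case bool:
		// Some YAML parsers turn an unquoted `off` into false.
		if !v {
			return modeOff, nil
		}
		return modeOff, fmt.Errorf("gangPreemptionMode: unsupported value true")
	case string:
		switch strings.ToLower(strings.TrimSpace(v)) {
		case "", "off", "disabled":
			return modeOff, nil
		case "minimal":
			return modeMinimal, nil
		case "atomic":
			return modeAtomic, nil
		}
		return modeOff, fmt.Errorf("gangPreemptionMode: unknown value %q", v)
	default:
		return modeOff, fmt.Errorf("gangPreemptionMode: unexpected type %T", raw)
	}
}

func main() {
	mode, err := parseGangMode(map[string]interface{}{"gangPreemptionMode": "atomic"})
	fmt.Println(mode == modeAtomic, err) // true <nil>
}
```

Falling back to the default on any error keeps a misconfigured value from silently enabling gang-level eviction.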

volcano-sh-bot · 2025-09-24T06:04:32Z

Welcome @mvinchoo!

It looks like this is your first PR to volcano-sh/volcano.

Thank you, and welcome to Volcano. 😃

gemini-code-assist · 2025-09-24T06:04:41Z

Summary of Changes

Hello @mvinchoo, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces an optional, lightweight strict gang preemption feature to the scheduler. This enhancement addresses issues in AI/ML workloads where partial evictions can lead to job stalls or failures. By enabling this feature, if a single task within a gang is chosen for preemption, the entire gang will be evicted, ensuring that resources are fully cleared for higher-priority tasks. This approach aims to improve overall cluster efficiency and job progress while maintaining a minimal impact on scheduling latency and code risk.

Highlights

Strict Gang Preemption: Introduces an opt-in mechanism for strict gang preemption, where if any task from a gang (PodGroup/Job) is selected as a victim, the entire gang is preempted.
Improved Workload Stability: Aims to prevent cascading failures in AI/ML gang workloads caused by partial evictions and improves the chance of higher-priority jobs making progress by freeing coherent capacity.
Configuration: The feature is opt-in and can be enabled by setting enableStrictGangPreemption: true in the volcano-scheduler-configmap under the preempt configuration.
Minimal Impact: The implementation is intentionally lightweight, preserves current behavior by default, and adapts gang semantics to the existing node-based preemption pipeline.

gemini-code-assist

Code Review

This pull request introduces an opt-in feature for strict gang preemption, which is a valuable addition for workloads where partial preemption is undesirable. The implementation correctly adds the configuration and integrates it into the preemption action. My review identified a critical issue in the gang preemption logic that could lead to double-counting of resources, as well as a minor redundancy and an issue with evicting tasks in terminal states. I've provided detailed comments and suggestions to address these points. With these fixes, the feature should work as intended.

pkg/scheduler/actions/preempt/preempt.go

volcano-sh-bot · 2025-09-24T06:18:05Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign wangyang0616
You can assign the PR to them by writing /assign @wangyang0616 in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

pkg/scheduler/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

mvinchoo · 2025-09-25T01:35:43Z

/assign @wangyang0616

JesseStutler · 2025-09-26T07:08:21Z

/cc Will take a look after v1.13.0 released

mvinchoo · 2025-09-26T07:28:40Z

> /cc Will take a look after v1.13.0 released

Thank you! @JesseStutler

mvinchoo · 2025-10-17T02:44:51Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces a valuable gang-aware preemption feature with minimal and atomic modes. The implementation is well-structured and follows a sensible two-phase heuristic for victim selection. The accompanying tests are comprehensive and cover various scenarios for the new modes.

My review focuses on ensuring correctness and improving maintainability. I've identified a critical issue where gangPreempt may return an incorrect status, potentially misleading the scheduler. I've also found opportunities to simplify logic, remove redundant code, and improve readability. Additionally, there's a duplicated test case that should be removed.

Overall, this is a great addition. Addressing these points will make the implementation more robust and easier to maintain.

gemini-code-assist · 2025-10-17T02:47:17Z

pkg/scheduler/actions/preempt/preempt.go

+	// If this check fails, it implies some Evictions failed.
+	// Since we are optimizing for gangs per node we should try again in next session
+	if ssn.Allocatable(currentQueue, preemptor) && preemptor.InitResreq.LessEqual(targetNode.FutureIdle(), api.Zero) {
+		if err := stmt.Pipeline(preemptor, targetNode.Name, !preempted.IsEmpty()); err != nil {
+			klog.Errorf("Failed to pipeline Task <%s/%s> on Node <%s>",
+				preemptor.Namespace, preemptor.Name, targetNode.Name)
+			if rollbackErr := stmt.UnPipeline(preemptor); rollbackErr != nil {
+				klog.Errorf("Failed to unpipeline Task %v on %v in Session %v for %v.",
+					preemptor.UID, targetNode.Name, ssn.UID, rollbackErr)
+			}
+		}
+	}
+	return true, nil
+}


The function gangPreempt may return true even when the preemptor task is not successfully pipelined. This is inconsistent with normalPreempt and could cause the scheduler to behave incorrectly, for example by not retrying to schedule a job that could have been scheduled. The function should only return true if the preemption leads to a successful allocation of the preemptor task.

To fix this, you can introduce an assigned boolean variable, similar to how it's used in normalPreempt, to track the outcome of the Pipeline operation and return its value at the end.

	assigned := false
	// If this check fails, it implies some Evictions failed.
	// Since we are optimizing for gangs per node we should try again in next session
	if ssn.Allocatable(currentQueue, preemptor) && preemptor.InitResreq.LessEqual(targetNode.FutureIdle(), api.Zero) {
		if err := stmt.Pipeline(preemptor, targetNode.Name, !preempted.IsEmpty()); err != nil {
			klog.Errorf("Failed to pipeline Task <%s/%s> on Node <%s>",
				preemptor.Namespace, preemptor.Name, targetNode.Name)
			if rollbackErr := stmt.UnPipeline(preemptor); rollbackErr != nil {
				klog.Errorf("Failed to unpipeline Task %v on %v in Session %v for %v.",
					preemptor.UID, targetNode.Name, ssn.UID, rollbackErr)
			}
		} else {
			assigned = true
		}
	}
	return assigned, nil
}

gemini-code-assist · 2025-10-17T02:47:17Z

pkg/scheduler/actions/preempt/preempt.go

+			if filter == nil {
+				preemptees = append(preemptees, task.Clone())
+			} else if filter(task) {
+				preemptees = append(preemptees, task.Clone())
+			}


This conditional logic can be simplified. Both branches append task.Clone() to preemptees. You can combine them into a single if statement with an || operator for better readability.

			if filter == nil || filter(task) {
				preemptees = append(preemptees, task.Clone())
			}

gemini-code-assist · 2025-10-17T02:47:17Z

pkg/scheduler/actions/preempt/preempt.go

+				if t.Job == preemptor.Job {
+					continue
+				}


This check is redundant. The nodeJobVictimsMap is built from preemptees, which are already filtered in gangPreempt to exclude tasks from the preemptor's own job. Removing this unnecessary check will make the code cleaner.

gemini-code-assist · 2025-10-17T02:47:18Z

pkg/scheduler/actions/preempt/preempt.go

+		sort.Slice(cand, func(i, j int) bool {
+			return !cand[i].res.LessEqual(cand[j].res, api.Zero)
+		})


The sorting logic !cand[i].res.LessEqual(cand[j].res, api.Zero) is functionally correct for a descending sort, but it's not very readable. Using cand[j].res.Less(cand[i].res, api.Zero) would be more idiomatic and easier to understand, as it clearly expresses "is j less than i?".

		sort.Slice(cand, func(i, j int) bool {
			return cand[j].res.Less(cand[i].res, api.Zero)
		})
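
Whether the two comparators are interchangeable depends on how the comparison methods treat multi-dimensional resources. The sketch below uses a hypothetical two-dimensional `res` type in which both `lessEqual` and `less` must hold in every dimension; that is an assumption made for illustration, not a statement about `api.Resource`. Under that assumption the two forms disagree for incomparable values, which may be worth checking before swapping one for the other.

```go
package main

import "fmt"

// res is a hypothetical two-dimensional resource (not api.Resource).
type res struct{ cpu, mem int }

// lessEqual reports whether a <= b in every dimension (assumed semantics).
func (a res) lessEqual(b res) bool { return a.cpu <= b.cpu && a.mem <= b.mem }

// less reports whether a < b in every dimension (assumed semantics).
func (a res) less(b res) bool { return a.cpu < b.cpu && a.mem < b.mem }

// byNotLessEqual mirrors the original comparator: !cand[i].LessEqual(cand[j]).
func byNotLessEqual(a, b res) bool { return !a.lessEqual(b) }

// byLess mirrors the suggested comparator: cand[j].Less(cand[i]).
func byLess(a, b res) bool { return b.less(a) }

func main() {
	a, b := res{cpu: 4, mem: 1}, res{cpu: 1, mem: 4} // incomparable pair

	// The original form places each element before the other.
	fmt.Println(byNotLessEqual(a, b), byNotLessEqual(b, a)) // true true

	// The suggested form treats the pair as equivalent.
	fmt.Println(byLess(a, b), byLess(b, a)) // false false
}
```

For one-dimensional resources the two forms agree, so the difference only matters when gangs differ in the shape of what they free, for example CPU-heavy versus memory-heavy.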

gemini-code-assist · 2025-10-17T02:47:18Z

pkg/scheduler/actions/preempt/preempt_test.go

+		{
+			Name: "pick the gang with the least overage for minimal mode deterministic nodes for tied overage",
+			PodGroups: []*schedulingv1beta1.PodGroup{
+				util.BuildPodGroupWithPrio("pg1", "c1", "q1", 0, map[string]int32{}, schedulingv1beta1.PodGroupInqueue, "low-priority"),
+				util.BuildPodGroupWithPrio("pg2", "c1", "q1", 0, map[string]int32{}, schedulingv1beta1.PodGroupInqueue, "low-priority"),
+				util.BuildPodGroupWithPrio("pg3", "c1", "q1", 1, map[string]int32{}, schedulingv1beta1.PodGroupInqueue, "high-priority"),
+			},
+			Pods: []*v1.Pod{
+				util.BuildPod("c1", "preemptee1", "n1", v1.PodRunning, api.BuildResourceList("1", "1G"), "pg1", map[string]string{schedulingv1beta1.PodPreemptable: "true"}, make(map[string]string)),
+				util.BuildPod("c1", "preemptee2", "n1", v1.PodRunning, api.BuildResourceList("1", "1G"), "pg1", map[string]string{schedulingv1beta1.PodPreemptable: "false"}, make(map[string]string)),
+				util.BuildPod("c1", "preemptee3", "n1", v1.PodRunning, api.BuildResourceList("1", "1G"), "pg1", map[string]string{schedulingv1beta1.PodPreemptable: "true"}, make(map[string]string)),
+				util.BuildPod("c1", "preemptee4", "n2", v1.PodRunning, api.BuildResourceList("1", "1G"), "pg2", map[string]string{schedulingv1beta1.PodPreemptable: "true"}, make(map[string]string)),
+				util.BuildPod("c1", "preemptee5", "n2", v1.PodRunning, api.BuildResourceList("1", "1G"), "pg2", map[string]string{schedulingv1beta1.PodPreemptable: "true"}, make(map[string]string)),
+				util.BuildPod("c1", "preemptor1", "", v1.PodPending, api.BuildResourceList("2", "2G"), "pg3", make(map[string]string), make(map[string]string)),
+			},
+			Nodes: []*v1.Node{
+				util.BuildNode("n1", api.BuildResourceList("3", "3G", []api.ScalarResource{{Name: "pods", Value: "10"}}...), make(map[string]string)),
+				util.BuildNode("n2", api.BuildResourceList("2", "2G", []api.ScalarResource{{Name: "pods", Value: "10"}}...), make(map[string]string)),
+			},
+			Queues: []*schedulingv1beta1.Queue{
+				util.BuildQueue("q1", 1, nil),
+			},
+			ExpectEvicted:  []string{"c1/preemptee1", "c1/preemptee3"},
+			ExpectEvictNum: 2,
+		},


This test case, "pick the gang with the least overage for minimal mode deterministic nodes for tied overage", is a duplicate of the one defined at lines 650-674. Please remove this redundant test case.

volcano-sh-bot requested review from wangyang0616 and william-wang September 24, 2025 06:04

volcano-sh-bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Sep 24, 2025

mvinchoo mentioned this pull request Sep 24, 2025

Volcano scheduler doesn’t reclaim/preempt jobs as a gang. #4607

Open

gemini-code-assist bot reviewed Sep 24, 2025

mvinchoo force-pushed the mvinchoo/gangpreempt branch 2 times, most recently from 2f5f616 to 4370c83 Compare September 24, 2025 06:17

volcano-sh-bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Sep 24, 2025

volcano-sh-bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Sep 24, 2025

mvinchoo force-pushed the mvinchoo/gangpreempt branch 2 times, most recently from 9f63988 to 0a5592a Compare September 24, 2025 06:32

volcano-sh-bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Sep 24, 2025

Mohit Vinchoo added 6 commits September 23, 2025 23:59

Preempt: Add lightweight strict gang preemption (opt-in)

caf3e95

Signed-off-by: Mohit Vinchoo <[email protected]>

Remove double check

232ef0d

Signed-off-by: Mohit Vinchoo <[email protected]>

Fix potential double eviction

c752853

Signed-off-by: Mohit Vinchoo <[email protected]>

Add basic unit test with feature flag off

495eec4

Signed-off-by: Mohit Vinchoo <[email protected]>

Merge arguments map

069143c

Signed-off-by: Mohit Vinchoo <[email protected]>

Add basic happy path unittest

f563df6

Signed-off-by: Mohit Vinchoo <[email protected]>

mvinchoo force-pushed the mvinchoo/gangpreempt branch from ba679c2 to f563df6 Compare September 24, 2025 06:59

Add test

211ab87

Signed-off-by: Mohit Vinchoo <[email protected]>

volcano-sh-bot assigned wangyang0616 Sep 25, 2025

hajnalmt mentioned this pull request Sep 26, 2025

Reclaim action may evict more tasks when we close gang reclaimble #4648

Open

Add support for plugin level overrides

6d082c2

Signed-off-by: Mohit Vinchoo <[email protected]>

mvinchoo force-pushed the mvinchoo/gangpreempt branch from 58ddc5b to 6d082c2 Compare September 28, 2025 11:27

Add full feature with multiple policies

952d858

Signed-off-by: Mohit Vinchoo <[email protected]>

volcano-sh-bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 5, 2025

mvinchoo changed the title ~~Preempt: Add lightweight strict gang preemption (opt-in)~~ Feature: Implement Gang based preemption Oct 5, 2025

mvinchoo changed the title ~~Feature: Implement Gang based preemption~~ Feature (opt-in): Implement Gang based preemption Oct 5, 2025

mvinchoo changed the title ~~Feature (opt-in): Implement Gang based preemption~~ Feature (opt-in): Implement Gang Aware Preemption Oct 5, 2025

mvinchoo changed the title ~~Feature (opt-in): Implement Gang Aware Preemption~~ Feature (opt-in): Gang Aware Preemption Oct 5, 2025

Set scalar pods to 0 for idle res, fix flakey unit test

93b8c29

Signed-off-by: Mohit Vinchoo <[email protected]>

mvinchoo force-pushed the mvinchoo/gangpreempt branch from 3e2f4da to 93b8c29 Compare October 5, 2025 02:59

restart tests

e66559c

Signed-off-by: Mohit Vinchoo <[email protected]>

gemini-code-assist bot reviewed Oct 17, 2025
