Skip to content

Conversation

mvinchoo
Copy link

@mvinchoo mvinchoo commented Sep 24, 2025

What type of PR is this?

Feature: Gang-aware preemption modes for AI/ML training and other gang-scheduled workloads (PodGroup/VCJob).

What this PR does / why we need it:

Today, preemption picks nodes first and only then decides which tasks to evict, without considering gang (PodGroup/VCJob) boundaries. As a result, victims can span multiple jobs, and the Gang plugin ends up acting as a guard-rail (via minMember) rather than guiding which gang should be preempted. This can yield partial/fragmented evictions that are sub-optimal for AI/ML jobs.

This PR introduces an opt-in gang-aware preemption strategy that selects victims at the gang level. It reduces cross-job churn, preserves job atomicity when desired, and better aligns with workloads that need “all-or-nothing” capacity.

Modes (policy):

  • disabled (default): Existing behavior. Preemption remains node-first and task-level; no gang semantics applied.

  • minimal: Preemption will group preemptable tasks on each node into gangs then

    • On each candidate node, group preemptable tasks by gang.
    • Prefer a single gang whose eviction satisfies the preemptor task’s request (most cases).
    • If no single gang suffices, fall back to a greedy gang selection (still within gang units) until the preemptor fits (rare).
  • atomic: Same selection as minimal, but if a victim gang is chosen, evict all of its tasks across all nodes (cluster-wide gang eviction). This matches AI/ML expectations where partial eviction often dooms the job anyway.

  • Fewer cross-job evictions: Victims come from a coherent gang, not scattered tasks across jobs.

  • Predictability for gang jobs: Respects “start/stop together” intent; atomic ensures clean state.

  • Backwards compatible: Fully opt-in; disabled preserves current behavior.

Which issue(s) this PR fixes:

Partially fixes #4607.
In addition to the changes discussed in the above issue, this PR also adds:

  • Handles node fragmentation.
  • It also adds atomic mode.

Happy to create a new issue if needed.

Special notes for your reviewer:

  • Objective:
    Minimize the number of gangs disrupted on a node without hurting scheduler throughput.
  • Why not try all k (1..J) combinations:
    Per node combos = sum_{t=1..J} C(J, t) = 2^J − 1 (exponential)
    Exponential time is unacceptable in a high-throughput scheduler like Volcano.
  • Two-phase heuristic (fast, practical):
    • Phase 1 – Single-gang fit (minimal overage):
      If any single gang can satisfy the preemptor’s need on a node, choose the best one (least overage).
      Rationale: in most cases, preempting one gang frees enough capacity for the incoming task.
    • Phase 2 – Greedy multi-gang (minimize count, then overage):
      If Phase 1 fails (e.g., preemptor is large), sort candidate gangs by size (largest first) and accumulate until the need is met.
      Complexity: O(J log J) per node (sort) + O(J) accumulate — suitable for high QPS.
  • Trade-off example (1D, non-negative):
    Preemptor need: 80
    Node gangs: [50, 50, 50, 50, 20, 20, 20, 20]
    Greedy (fast): 50 + 50 = 100 → meets need with +20 overage
    Optimal overage (exhaustive search): 20 + 20 + 20 + 20 = 80 → 0 overage
  • Conclusion:
    We accept small overage to preserve throughput and avoid exponential search.
    Avoids the 2^J blow-up while keeping decisions near-optimal in practice.

Does this PR introduce a user-facing change?

Yes

The following block needs to be added to `volcano-scheduler-configmap`
data:
  volcano-scheduler.conf: |
    actions: "...preempt..."
    configurations:
    - name: preempt
      arguments:
        gangPreemptionMode: off  # Accepted values: off (default) | minimal | atomic

@volcano-sh-bot
Copy link
Contributor

Welcome @mvinchoo!

It looks like this is your first PR to volcano-sh/volcano.

Thank you, and welcome to Volcano. 😃

Copy link

Summary of Changes

Hello @mvinchoo, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces an optional, lightweight strict gang preemption feature to the scheduler. This enhancement addresses issues in AI/ML workloads where partial evictions can lead to job stalls or failures. By enabling this feature, if a single task within a gang is chosen for preemption, the entire gang will be evicted, ensuring that resources are fully cleared for higher-priority tasks. This approach aims to improve overall cluster efficiency and job progress while maintaining a minimal impact on scheduling latency and code risk.

Highlights

  • Strict Gang Preemption: Introduces an opt-in mechanism for strict gang preemption, where if any task from a gang (PodGroup/Job) is selected as a victim, the entire gang is preempted.
  • Improved Workload Stability: Aims to prevent cascading failures in AI/ML gang workloads caused by partial evictions and improves the chance of higher-priority jobs making progress by freeing coherent capacity.
  • Configuration: The feature is opt-in and can be enabled by setting enableStrictGangPreemption: true in the volcano-scheduler-configmap under the preempt configuration.
  • Minimal Impact: The implementation is intentionally lightweight, preserves current behavior by default, and adapts gang semantics to the existing node-based preemption pipeline.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@volcano-sh-bot volcano-sh-bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Sep 24, 2025
Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces an opt-in feature for strict gang preemption, which is a valuable addition for workloads where partial preemption is undesirable. The implementation correctly adds the configuration and integrates it into the preemption action. My review identified a critical issue in the gang preemption logic that could lead to double-counting of resources, as well as a minor redundancy and an issue with evicting tasks in terminal states. I've provided detailed comments and suggestions to address these points. With these fixes, the feature should work as intended.

@mvinchoo mvinchoo force-pushed the mvinchoo/gangpreempt branch 2 times, most recently from 2f5f616 to 4370c83 Compare September 24, 2025 06:17
@volcano-sh-bot volcano-sh-bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Sep 24, 2025
@volcano-sh-bot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign wangyang0616
You can assign the PR to them by writing /assign @wangyang0616 in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot volcano-sh-bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Sep 24, 2025
@mvinchoo mvinchoo force-pushed the mvinchoo/gangpreempt branch 2 times, most recently from 9f63988 to 0a5592a Compare September 24, 2025 06:32
@volcano-sh-bot volcano-sh-bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Sep 24, 2025
Mohit Vinchoo added 6 commits September 23, 2025 23:59
Signed-off-by: Mohit Vinchoo <[email protected]>
Signed-off-by: Mohit Vinchoo <[email protected]>
Signed-off-by: Mohit Vinchoo <[email protected]>
Signed-off-by: Mohit Vinchoo <[email protected]>
@mvinchoo mvinchoo force-pushed the mvinchoo/gangpreempt branch from ba679c2 to f563df6 Compare September 24, 2025 06:59
Signed-off-by: Mohit Vinchoo <[email protected]>
@mvinchoo
Copy link
Author

/assign @wangyang0616

@JesseStutler
Copy link
Member

/cc Will take a look after v1.13.0 released

@mvinchoo
Copy link
Author

/cc Will take a look after v1.13.0 released

Thank you! @JesseStutler

@mvinchoo mvinchoo force-pushed the mvinchoo/gangpreempt branch from 58ddc5b to 6d082c2 Compare September 28, 2025 11:27
@volcano-sh-bot volcano-sh-bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 5, 2025
@mvinchoo mvinchoo changed the title Preempt: Add lightweight strict gang preemption (opt-in) Feature: Implement Gang based preemption Oct 5, 2025
@mvinchoo mvinchoo changed the title Feature: Implement Gang based preemption Feature (opt-in): Implement Gang based preemption Oct 5, 2025
@mvinchoo mvinchoo changed the title Feature (opt-in): Implement Gang based preemption Feature (opt-in): Implement Gang Aware Preemption Oct 5, 2025
@mvinchoo mvinchoo changed the title Feature (opt-in): Implement Gang Aware Preemption Feature (opt-in): Gang Aware Preemption Oct 5, 2025
@mvinchoo mvinchoo force-pushed the mvinchoo/gangpreempt branch from 3e2f4da to 93b8c29 Compare October 5, 2025 02:59
Signed-off-by: Mohit Vinchoo <[email protected]>
@mvinchoo
Copy link
Author

/gemini review

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a valuable gang-aware preemption feature with minimal and atomic modes. The implementation is well-structured and follows a sensible two-phase heuristic for victim selection. The accompanying tests are comprehensive and cover various scenarios for the new modes.

My review focuses on ensuring correctness and improving maintainability. I've identified a critical issue where gangPreempt may return an incorrect status, potentially misleading the scheduler. I've also found opportunities to simplify logic, remove redundant code, and improve readability. Additionally, there's a duplicated test case that should be removed.

Overall, this is a great addition. Addressing these points will make the implementation more robust and easier to maintain.

Comment on lines +513 to +526
// If this check fails, it implies some Evictions failed.
// Since we are optimizing for gangs per node we should try again in next session
if ssn.Allocatable(currentQueue, preemptor) && preemptor.InitResreq.LessEqual(targetNode.FutureIdle(), api.Zero) {
if err := stmt.Pipeline(preemptor, targetNode.Name, !preempted.IsEmpty()); err != nil {
klog.Errorf("Failed to pipeline Task <%s/%s> on Node <%s>",
preemptor.Namespace, preemptor.Name, targetNode.Name)
if rollbackErr := stmt.UnPipeline(preemptor); rollbackErr != nil {
klog.Errorf("Failed to unpipeline Task %v on %v in Session %v for %v.",
preemptor.UID, targetNode.Name, ssn.UID, rollbackErr)
}
}
}
return true, nil
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The function gangPreempt may return true even when the preemptor task is not successfully pipelined. This is inconsistent with normalPreempt and could cause the scheduler to behave incorrectly, for example by not retrying to schedule a job that could have been scheduled. The function should only return true if the preemption leads to a successful allocation of the preemptor task.

To fix this, you can introduce an assigned boolean variable, similar to how it's used in normalPreempt, to track the outcome of the Pipeline operation and return its value at the end.

    assigned := false
    // If this check fails, it implies some Evictions failed.
    // Since we are optimizing for gangs per node we should try again in next session
    if ssn.Allocatable(currentQueue, preemptor) && preemptor.InitResreq.LessEqual(targetNode.FutureIdle(), api.Zero) {
        if err := stmt.Pipeline(preemptor, targetNode.Name, !preempted.IsEmpty()); err != nil {
            klog.Errorf("Failed to pipeline Task <%s/%s> on Node <%s>",
                preemptor.Namespace, preemptor.Name, targetNode.Name)
            if rollbackErr := stmt.UnPipeline(preemptor); rollbackErr != nil {
                klog.Errorf("Failed to unpipeline Task %v on %v in Session %v for %v.",
                    preemptor.UID, targetNode.Name, ssn.UID, rollbackErr)
            }
        } else {
            assigned = true
        }
    }
    return assigned, nil
}

Comment on lines +448 to +452
if filter == nil {
preemptees = append(preemptees, task.Clone())
} else if filter(task) {
preemptees = append(preemptees, task.Clone())
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This conditional logic can be simplified. Both branches append task.Clone() to preemptees. You can combine them into a single if statement with an || operator for better readability.

            if filter == nil || filter(task) {
                preemptees = append(preemptees, task.Clone())
            }

Comment on lines +555 to +557
if t.Job == preemptor.Job {
continue
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This check is redundant. The nodeJobVictimsMap is built from preemptees, which are already filtered in gangPreempt to exclude tasks from the preemptor's own job. Removing this unnecessary check will make the code cleaner.

Comment on lines +606 to +608
sort.Slice(cand, func(i, j int) bool {
return !cand[i].res.LessEqual(cand[j].res, api.Zero)
})

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The sorting logic !cand[i].res.LessEqual(cand[j].res, api.Zero) is functionally correct for a descending sort, but it's not very readable. Using cand[j].res.Less(cand[i].res, api.Zero) would be more idiomatic and easier to understand, as it clearly expresses "is j less than i?".

        sort.Slice(cand, func(i, j int) bool {
            return cand[j].res.Less(cand[i].res, api.Zero)
        })

Comment on lines +700 to +724
{
Name: "pick the gang with the least overage for minimal mode deterministic nodes for tied overage",
PodGroups: []*schedulingv1beta1.PodGroup{
util.BuildPodGroupWithPrio("pg1", "c1", "q1", 0, map[string]int32{}, schedulingv1beta1.PodGroupInqueue, "low-priority"),
util.BuildPodGroupWithPrio("pg2", "c1", "q1", 0, map[string]int32{}, schedulingv1beta1.PodGroupInqueue, "low-priority"),
util.BuildPodGroupWithPrio("pg3", "c1", "q1", 1, map[string]int32{}, schedulingv1beta1.PodGroupInqueue, "high-priority"),
},
Pods: []*v1.Pod{
util.BuildPod("c1", "preemptee1", "n1", v1.PodRunning, api.BuildResourceList("1", "1G"), "pg1", map[string]string{schedulingv1beta1.PodPreemptable: "true"}, make(map[string]string)),
util.BuildPod("c1", "preemptee2", "n1", v1.PodRunning, api.BuildResourceList("1", "1G"), "pg1", map[string]string{schedulingv1beta1.PodPreemptable: "false"}, make(map[string]string)),
util.BuildPod("c1", "preemptee3", "n1", v1.PodRunning, api.BuildResourceList("1", "1G"), "pg1", map[string]string{schedulingv1beta1.PodPreemptable: "true"}, make(map[string]string)),
util.BuildPod("c1", "preemptee4", "n2", v1.PodRunning, api.BuildResourceList("1", "1G"), "pg2", map[string]string{schedulingv1beta1.PodPreemptable: "true"}, make(map[string]string)),
util.BuildPod("c1", "preemptee5", "n2", v1.PodRunning, api.BuildResourceList("1", "1G"), "pg2", map[string]string{schedulingv1beta1.PodPreemptable: "true"}, make(map[string]string)),
util.BuildPod("c1", "preemptor1", "", v1.PodPending, api.BuildResourceList("2", "2G"), "pg3", make(map[string]string), make(map[string]string)),
},
Nodes: []*v1.Node{
util.BuildNode("n1", api.BuildResourceList("3", "3G", []api.ScalarResource{{Name: "pods", Value: "10"}}...), make(map[string]string)),
util.BuildNode("n2", api.BuildResourceList("2", "2G", []api.ScalarResource{{Name: "pods", Value: "10"}}...), make(map[string]string)),
},
Queues: []*schedulingv1beta1.Queue{
util.BuildQueue("q1", 1, nil),
},
ExpectEvicted: []string{"c1/preemptee1", "c1/preemptee3"},
ExpectEvictNum: 2,
},

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This test case, "pick the gang with the least overage for minimal mode deterministic nodes for tied overage", is a duplicate of the one defined at lines 650-674. Please remove this redundant test case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Volcano scheduler doesn’t reclaim/preempt jobs as a gang.

4 participants