Feature (opt-in): Gang Aware Preemption #4637
base: master
Conversation
Welcome @mvinchoo!
Summary of Changes

Hello @mvinchoo, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces an optional, lightweight strict gang preemption feature to the scheduler. This enhancement addresses issues in AI/ML workloads where partial evictions can lead to job stalls or failures. By enabling this feature, if a single task within a gang is chosen for preemption, the entire gang will be evicted, ensuring that resources are fully cleared for higher-priority tasks. This approach aims to improve overall cluster efficiency and job progress while maintaining a minimal impact on scheduling latency and code risk.
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
Code Review
This pull request introduces an opt-in feature for strict gang preemption, which is a valuable addition for workloads where partial preemption is undesirable. The implementation correctly adds the configuration and integrates it into the preemption action. My review identified a critical issue in the gang preemption logic that could lead to double-counting of resources, as well as a minor redundancy and an issue with evicting tasks in terminal states. I've provided detailed comments and suggestions to address these points. With these fixes, the feature should work as intended.
Force-pushed from 2f5f616 to 4370c83
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
Force-pushed from 9f63988 to 0a5592a
Signed-off-by: Mohit Vinchoo <[email protected]>
Signed-off-by: Mohit Vinchoo <[email protected]>
Signed-off-by: Mohit Vinchoo <[email protected]>
Signed-off-by: Mohit Vinchoo <[email protected]>
Signed-off-by: Mohit Vinchoo <[email protected]>
Signed-off-by: Mohit Vinchoo <[email protected]>
Force-pushed from ba679c2 to f563df6
Signed-off-by: Mohit Vinchoo <[email protected]>
/assign @wangyang0616
/cc

Will take a look after v1.13.0 is released.
Thank you! @JesseStutler
Signed-off-by: Mohit Vinchoo <[email protected]>
Force-pushed from 58ddc5b to 6d082c2
Signed-off-by: Mohit Vinchoo <[email protected]>
Signed-off-by: Mohit Vinchoo <[email protected]>
Force-pushed from 3e2f4da to 93b8c29
Signed-off-by: Mohit Vinchoo <[email protected]>
/gemini review
Code Review
This pull request introduces a valuable gang-aware preemption feature with `minimal` and `atomic` modes. The implementation is well-structured and follows a sensible two-phase heuristic for victim selection. The accompanying tests are comprehensive and cover various scenarios for the new modes.

My review focuses on ensuring correctness and improving maintainability. I've identified a critical issue where `gangPreempt` may return an incorrect status, potentially misleading the scheduler. I've also found opportunities to simplify logic, remove redundant code, and improve readability. Additionally, there's a duplicated test case that should be removed.

Overall, this is a great addition. Addressing these points will make the implementation more robust and easier to maintain.
```go
	// If this check fails, it implies some Evictions failed.
	// Since we are optimizing for gangs per node we should try again in next session
	if ssn.Allocatable(currentQueue, preemptor) && preemptor.InitResreq.LessEqual(targetNode.FutureIdle(), api.Zero) {
		if err := stmt.Pipeline(preemptor, targetNode.Name, !preempted.IsEmpty()); err != nil {
			klog.Errorf("Failed to pipeline Task <%s/%s> on Node <%s>",
				preemptor.Namespace, preemptor.Name, targetNode.Name)
			if rollbackErr := stmt.UnPipeline(preemptor); rollbackErr != nil {
				klog.Errorf("Failed to unpipeline Task %v on %v in Session %v for %v.",
					preemptor.UID, targetNode.Name, ssn.UID, rollbackErr)
			}
		}
	}
	return true, nil
}
```
The function `gangPreempt` may return `true` even when the preemptor task is not successfully pipelined. This is inconsistent with `normalPreempt` and could cause the scheduler to behave incorrectly, for example by not retrying to schedule a job that could have been scheduled. The function should only return `true` if the preemption leads to a successful allocation of the preemptor task.

To fix this, you can introduce an `assigned` boolean variable, similar to how it's used in `normalPreempt`, to track the outcome of the `Pipeline` operation and return its value at the end.
```go
	assigned := false
	// If this check fails, it implies some Evictions failed.
	// Since we are optimizing for gangs per node we should try again in next session
	if ssn.Allocatable(currentQueue, preemptor) && preemptor.InitResreq.LessEqual(targetNode.FutureIdle(), api.Zero) {
		if err := stmt.Pipeline(preemptor, targetNode.Name, !preempted.IsEmpty()); err != nil {
			klog.Errorf("Failed to pipeline Task <%s/%s> on Node <%s>",
				preemptor.Namespace, preemptor.Name, targetNode.Name)
			if rollbackErr := stmt.UnPipeline(preemptor); rollbackErr != nil {
				klog.Errorf("Failed to unpipeline Task %v on %v in Session %v for %v.",
					preemptor.UID, targetNode.Name, ssn.UID, rollbackErr)
			}
		} else {
			assigned = true
		}
	}
	return assigned, nil
}
```
```go
	if filter == nil {
		preemptees = append(preemptees, task.Clone())
	} else if filter(task) {
		preemptees = append(preemptees, task.Clone())
	}
```
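The earlier review summary mentions a minor redundancy, which appears to be this spot: both branches append the same clone. A minimal sketch of the folded form (behavior unchanged, assuming `filter` may be nil):

```go
	// Append when there is no filter, or when the filter accepts the task;
	// both original branches appended task.Clone(), so one condition suffices.
	if filter == nil || filter(task) {
		preemptees = append(preemptees, task.Clone())
	}
```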
```go
	// Never consider the preemptor's own gang as a victim.
	if t.Job == preemptor.Job {
		continue
	}
```
```go
	sort.Slice(cand, func(i, j int) bool {
		return !cand[i].res.LessEqual(cand[j].res, api.Zero)
	})
```
The sorting logic `!cand[i].res.LessEqual(cand[j].res, api.Zero)` is functionally correct for a descending sort, but it's not very readable. Using `cand[j].res.Less(cand[i].res, api.Zero)` would be more idiomatic and easier to understand, as it clearly expresses "is j less than i?".

```go
	sort.Slice(cand, func(i, j int) bool {
		return cand[j].res.Less(cand[i].res, api.Zero)
	})
```
```go
	{
		Name: "pick the gang with the least overage for minimal mode deterministic nodes for tied overage",
		PodGroups: []*schedulingv1beta1.PodGroup{
			util.BuildPodGroupWithPrio("pg1", "c1", "q1", 0, map[string]int32{}, schedulingv1beta1.PodGroupInqueue, "low-priority"),
			util.BuildPodGroupWithPrio("pg2", "c1", "q1", 0, map[string]int32{}, schedulingv1beta1.PodGroupInqueue, "low-priority"),
			util.BuildPodGroupWithPrio("pg3", "c1", "q1", 1, map[string]int32{}, schedulingv1beta1.PodGroupInqueue, "high-priority"),
		},
		Pods: []*v1.Pod{
			util.BuildPod("c1", "preemptee1", "n1", v1.PodRunning, api.BuildResourceList("1", "1G"), "pg1", map[string]string{schedulingv1beta1.PodPreemptable: "true"}, make(map[string]string)),
			util.BuildPod("c1", "preemptee2", "n1", v1.PodRunning, api.BuildResourceList("1", "1G"), "pg1", map[string]string{schedulingv1beta1.PodPreemptable: "false"}, make(map[string]string)),
			util.BuildPod("c1", "preemptee3", "n1", v1.PodRunning, api.BuildResourceList("1", "1G"), "pg1", map[string]string{schedulingv1beta1.PodPreemptable: "true"}, make(map[string]string)),
			util.BuildPod("c1", "preemptee4", "n2", v1.PodRunning, api.BuildResourceList("1", "1G"), "pg2", map[string]string{schedulingv1beta1.PodPreemptable: "true"}, make(map[string]string)),
			util.BuildPod("c1", "preemptee5", "n2", v1.PodRunning, api.BuildResourceList("1", "1G"), "pg2", map[string]string{schedulingv1beta1.PodPreemptable: "true"}, make(map[string]string)),
			util.BuildPod("c1", "preemptor1", "", v1.PodPending, api.BuildResourceList("2", "2G"), "pg3", make(map[string]string), make(map[string]string)),
		},
		Nodes: []*v1.Node{
			util.BuildNode("n1", api.BuildResourceList("3", "3G", []api.ScalarResource{{Name: "pods", Value: "10"}}...), make(map[string]string)),
			util.BuildNode("n2", api.BuildResourceList("2", "2G", []api.ScalarResource{{Name: "pods", Value: "10"}}...), make(map[string]string)),
		},
		Queues: []*schedulingv1beta1.Queue{
			util.BuildQueue("q1", 1, nil),
		},
		ExpectEvicted:  []string{"c1/preemptee1", "c1/preemptee3"},
		ExpectEvictNum: 2,
	},
```
What type of PR is this?
Feature: Gang-aware preemption modes for AI/ML training and other gang-scheduled workloads (PodGroup/VCJob).
What this PR does / why we need it:
Today, preemption picks nodes first and only then decides which tasks to evict, without considering gang (PodGroup/VCJob) boundaries. As a result, victims can span multiple jobs, and the Gang plugin ends up acting as a guard-rail (via minMember) rather than guiding which gang should be preempted. This can yield partial/fragmented evictions that are sub-optimal for AI/ML jobs.
This PR introduces an opt-in gang-aware preemption strategy that selects victims at the gang level. It reduces cross-job churn, preserves job atomicity when desired, and better aligns with workloads that need “all-or-nothing” capacity.
Modes (policy):

- `disabled` (default): Existing behavior. Preemption remains node-first and task-level; no gang semantics applied.
- `minimal`: Preemption groups the preemptable tasks on each node into gangs, then selects the smallest set of victim gangs whose eviction frees enough capacity for the preemptor (see the heuristic under "Special notes" below).
- `atomic`: Same selection as minimal, but if a victim gang is chosen, evict all of its tasks across all nodes (cluster-wide gang eviction). This matches AI/ML expectations where partial eviction often dooms the job anyway. A sketch of this expansion follows the list below.

Benefits:

- Fewer cross-job evictions: Victims come from a coherent gang, not scattered tasks across jobs.
- Predictability for gang jobs: Respects "start/stop together" intent; `atomic` ensures clean state.
- Backwards compatible: Fully opt-in; `disabled` preserves current behavior.
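To make the `atomic` semantics concrete, here is a minimal sketch of the cluster-wide expansion; the helper name `expandGangAtomic` is hypothetical and not this PR's actual API, though it builds on Volcano's real `framework.Session` and `api.TaskInfo` types:

```go
import (
	"volcano.sh/volcano/pkg/scheduler/api"
	"volcano.sh/volcano/pkg/scheduler/framework"
)

// expandGangAtomic is a hypothetical helper: once any task of a gang is
// selected as a victim, return every bound task of that job across all
// nodes, so the whole gang is evicted together rather than partially.
func expandGangAtomic(victim *api.TaskInfo, ssn *framework.Session) []*api.TaskInfo {
	job, found := ssn.Jobs[victim.Job]
	if !found {
		return nil
	}
	var victims []*api.TaskInfo
	for _, t := range job.Tasks {
		if t.NodeName != "" { // only tasks bound to a node need eviction
			victims = append(victims, t)
		}
	}
	return victims
}
```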
Which issue(s) this PR fixes:
Partially fixes #4607.
In addition to the changes discussed in the above issue, this PR also adds the `atomic` mode.

Happy to create a new issue if needed.
Special notes for your reviewer:

Goal: minimize the number of gangs disrupted on a node without hurting scheduler throughput.

Exhaustive victim search is exponential: with J candidate gangs on a node, the number of non-empty gang combinations is sum_{t=1..J} C(J, t) = 2^J − 1. Exponential time is unacceptable in a high-throughput scheduler like Volcano, so victim selection uses a two-phase heuristic:

- Phase 1: If any single gang can satisfy the preemptor's need on a node, choose the best one (least overage). Rationale: in most cases, preempting one gang frees enough capacity for the incoming task.
- Phase 2: If Phase 1 fails (e.g., the preemptor is large), sort candidate gangs by size (largest first) and accumulate until the need is met.

Complexity: O(J log J) per node for the sort plus O(J) to accumulate, suitable for high QPS.
Worked example:

- Preemptor need: 80
- Node gangs: [50, 50, 50, 50, 20, 20, 20, 20]
- Greedy (fast): 50 + 50 = 100 → meets the need with +20 overage
- Optimal overage (exhaustive search): 20 + 20 + 20 + 20 = 80 → 0 overage

We accept a small overage to preserve throughput and avoid the exponential search; this sidesteps the 2^J blow-up while keeping decisions near-optimal in practice. A runnable sketch of the heuristic follows.
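For concreteness, a self-contained sketch of the two-phase selection reduced to a single resource dimension; the PR itself compares full `api.Resource` vectors, so the scalar `size`/`need` here is a simplification for illustration:

```go
package main

import (
	"fmt"
	"sort"
)

type gang struct {
	name string
	size int // preemptable resources the gang holds on this node (one dimension for simplicity)
}

// pickVictimGangs implements the two-phase heuristic described above.
// Phase 1: if a single gang covers the need, pick the one with the least overage.
// Phase 2: otherwise sort gangs largest-first and accumulate until the need is met.
func pickVictimGangs(gangs []gang, need int) []gang {
	// Phase 1: best single gang (least overage).
	best := -1
	for i, g := range gangs {
		if g.size >= need && (best < 0 || g.size < gangs[best].size) {
			best = i
		}
	}
	if best >= 0 {
		return []gang{gangs[best]}
	}

	// Phase 2: largest-first greedy accumulation.
	sorted := append([]gang(nil), gangs...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].size > sorted[j].size })
	var victims []gang
	freed := 0
	for _, g := range sorted {
		if freed >= need {
			break
		}
		victims = append(victims, g)
		freed += g.size
	}
	if freed < need {
		return nil // node cannot satisfy the preemptor even if every gang is evicted
	}
	return victims
}

func main() {
	// The worked example above: need 80, gangs [50 50 50 50 20 20 20 20].
	gangs := []gang{
		{"a", 50}, {"b", 50}, {"c", 50}, {"d", 50},
		{"e", 20}, {"f", 20}, {"g", 20}, {"h", 20},
	}
	// Phase 1 fails (no single gang >= 80), so Phase 2 greedily picks
	// two 50-size gangs, freeing 100 with +20 overage.
	fmt.Println(pickVictimGangs(gangs, 80))
}
```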
Does this PR introduce a user-facing change?
Yes