@shchennu shchennu commented May 1, 2025

Add Parent Queue Quota and Limit Checking to KAI Scheduler

Description

This PR enhances the KAI Scheduler with comprehensive parent queue resource management, implementing both quota and limit checking at parent queue levels. The changes improve scheduler efficiency by preventing unnecessary scheduling attempts and provide better resource management for different job types.

Changes

  • Added checkParentQueueLimits method to CapacityPolicy to check both limits and quotas at parent queue levels
  • Added getFirstPendingPod helper function to identify the first pending pod in a job
  • Modified OnSessionOpen to include early validation of resource limits
  • Added comprehensive test cases for parent queue limit checking
  • Added support for elastic jobs with minimum required resources
  • Separated job priority from preemptibility handling

Implementation Details

  • Resource checks are performed for GPU, CPU, and Memory resources
  • Both limits and quotas are checked, using the more restrictive value
  • Checks are done at each parent queue level in the hierarchy
  • Early validation prevents unnecessary scheduling attempts
  • Proper error messages indicate which resource limit/quota was exceeded
  • Special handling for preemptible jobs (PriorityInferenceNumber)
  • Support for elastic jobs with minimum resource requirements
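
As a rough illustration of "using the more restrictive value" from the list above, the helper below picks the tighter of a limit and a quota. It is only a sketch: `effectiveCap` is a hypothetical name, and treating non-positive values as "not set" is an assumption based on the `> 0` guard visible in the reviewed diff, not a statement about how the scheduler represents unlimited queues.

```go
package main

import "math"

// effectiveCap returns the tighter of a queue's limit and quota.
// Assumption: a non-positive value means "not set". The second return
// value is false when neither value is set, i.e. there is no cap at all.
func effectiveCap(limit, quota float64) (float64, bool) {
	switch {
	case limit > 0 && quota > 0:
		return math.Min(limit, quota), true
	case limit > 0:
		return limit, true
	case quota > 0:
		return quota, true
	default:
		return 0, false
	}
}
```

Note that this mirrors the description only; whether quota should cap allocation at all is a separate question for review.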

Test Coverage

  • Basic limit/quota checks for each resource type (GPU, CPU, Memory)
  • Multi-level queue hierarchy checks
  • Elastic job handling with minimum required resources
  • Preemptible vs non-preemptible job behavior
  • Edge cases (zero limits/quotas, missing values)
  • Error message formatting
  • Helper function validation

Files Changed

  • Modified: pkg/scheduler/plugins/proportion/capacity_policy/capacity_policy.go
  • Modified: pkg/scheduler/plugins/proportion/capacity_policy/parent_queue_test.go

Testing

All tests are passing:

  • 37 specs from the original test suite
  • 10 new test cases for parent queue limit checking
  • 3 test cases for the helper function

Impact

  • Improves scheduler efficiency by preventing unnecessary scheduling attempts
  • Provides clear error messages when limits/quotas are exceeded
  • Better resource management for different job types (preemptible, elastic)

Collaborator

@romanbaron romanbaron left a comment

Hi @shchennu 👋
Thanks for suggesting this change. I want to better understand the motivation here; more specifically, what is the gap with IsJobOverQueueCapacity? From what I see, it is called before PrePredicateFn, so there is probably something that you identified as a gap in the implementation?

Another question: what is the problem with allocating a job above the parent queue's quota? If the job is preemptible, it can be allowed.

@shchennu shchennu requested a review from romanbaron May 4, 2025 17:09
Author

shchennu commented May 4, 2025

> Hi @shchennu 👋 Thanks for suggesting this change, I want to better understand the motivation here, more specifically what is the gap with IsJobOverQueueCapacity? From what I see it is called before PrePredicateFn so there is probably something that you identified as a gap in the implementation?
>
> Another question, what is the problem with job allocation above parent's queue quota? If the job is preemptible it can be allowed.

Hi @romanbaron 👋

Thank you for your questions! Let me address them:

  1. Gap with IsJobOverQueueCapacity:

    • IsJobOverQueueCapacity only checks the immediate queue's capacity
    • Our new checkParentQueueQuotas function checks ALL parent queues up the hierarchy
    • This is important because a job might fit in its immediate queue but exceed parent queue quotas
    • By checking in PrePredicateFn, we fail fast and avoid unnecessary scheduling attempts
  2. Preemptible Jobs and Parent Queue Quotas:

    • You're absolutely right! Preemptible jobs should be allowed to exceed parent queue quotas
    • I've updated the implementation to handle this:
      • Added a check for PriorityTrainNumber at the start of checkParentQueueQuotas
      • Preemptible jobs now skip quota checks entirely
      • Non-preemptible jobs still maintain strict quota enforcement
    • This change is reflected in the test cases:
      • preemptible_job_can_exceed_parent_queue_GPU_quota
      • non-preemptible_job_cannot_exceed_parent_queue_GPU_quota

The changes ensure that:

  • Preemptible jobs can utilize resources beyond parent queue quotas
  • Non-preemptible jobs maintain strict quota enforcement
  • We still get early validation for jobs that would exceed quotas

@shchennu shchennu marked this pull request as ready for review May 4, 2025 17:16
Collaborator

@enoodle enoodle left a comment

Hi,
I left a few comments, mostly about style.

I do have a question about the logic: how will this handle elastic jobs that don't have to schedule all of their pods in order to run? Such a job has a minimal number of pods that it needs to run, and you don't know how many will actually fit within the cluster / queue limits.

// OnSessionOpen is called when a new scheduling session begins. It registers
// the early quota checking function that prevents jobs from being considered
// for scheduling if they would exceed their parent queues' quotas.
func (cp *CapacityPolicy) OnSessionOpen(ssn *framework.Session) {

Maybe I am missing something, but I don't think this is called by the changes in the current PR because this is not a plugin that is registered.


// capacityCheckFn is a function type that checks if a job's requested resources
// exceed capacity limits. It returns a SchedulableResult indicating whether the
// job can be scheduled.

It's a style comment, since this is a new project and all, but we don't have many obvious comments here (this comment really doesn't add anything that isn't already written in the next line).

Comment on lines 170 to 177
// Only check parent queues, not the job's direct queue
currentQueueID := queue.ParentQueue

for currentQueueID != "" {
	parentQueue, found := ssn.Queues[currentQueueID]
	if !found {
		break
	}

Together with line 239, this can be a single for statement with initialization, condition, and step.
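
One way to read this suggestion is sketched below. It is a minimal illustration rather than the PR's code: the `Queue` type, its fields, and `parentChain` are hypothetical stand-ins for the scheduler's queue map, and the sketch assumes the map holds no nil entries and the hierarchy has no cycles.

```go
package main

// Queue is a hypothetical, simplified stand-in for the scheduler's queue info.
type Queue struct {
	Name        string
	ParentQueue string // empty for a top-level queue
}

// parentChain returns the names of all ancestors of the given queue,
// nearest parent first. The queue itself is not included.
func parentChain(queues map[string]*Queue, leaf string) []string {
	var chain []string
	start, ok := queues[leaf]
	if !ok {
		return chain
	}
	// Initialization, condition, and step in one for statement: look up the
	// parent, continue while the lookup succeeds, then move one level up.
	// A lookup of the empty parent name fails, which ends the walk at the top.
	for q, found := queues[start.ParentQueue]; found; q, found = queues[q.ParentQueue] {
		chain = append(chain, q.Name)
	}
	return chain
}
```

The two-value map lookup doubles as the loop condition, so the separate `if !found { break }` and the trailing step assignment both disappear.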

}
}

// Check GPU quota

This part seems to be duplicated 3 times; there are functions in the code base to avoid that. You can loop on the resource type and the . Look at proportion/proportion.go for an example.
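
A minimal sketch of the loop-over-resource-types idea follows. The names here (`ResourceName`, `allResources`, `Quantities`, `checkLimits`) are hypothetical and are not the helpers in proportion/proportion.go; treating an unset or non-positive limit as "no limit" is likewise an assumption made for the example.

```go
package main

import "fmt"

// ResourceName is a hypothetical identifier for a resource dimension.
type ResourceName string

const (
	GPU    ResourceName = "GPU"
	CPU    ResourceName = "CPU"
	Memory ResourceName = "Memory"
)

// allResources lists every dimension to check, in reporting order.
var allResources = []ResourceName{GPU, CPU, Memory}

// Quantities maps a resource to an amount (a request or a limit).
type Quantities map[ResourceName]float64

// checkLimits reports the first resource whose request exceeds the limit.
// Assumption: a missing or non-positive limit means "no limit".
func checkLimits(queueName string, requested, limits Quantities) error {
	for _, r := range allResources {
		limit, set := limits[r]
		if !set || limit <= 0 {
			continue
		}
		if requested[r] > limit {
			return fmt.Errorf("queue %q: requested %v %s exceeds the limit of %v",
				queueName, requested[r], r, limit)
		}
	}
	return nil
}
```

With this shape, the GPU, CPU, and Memory branches collapse into one loop body and the error message is built in a single place.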

// Check GPU quota
if parentQueue.Resources.GPU.Quota > 0 && jobResources.GPUs() > float64(parentQueue.Resources.GPU.Quota) {
	errorMsg := fmt.Sprintf(
		"parent queue '%s' quota has reached the allowable limit of GPUs. "+

A queue can go over the quota, but not the limit. I would change the whole thing to address the limit.

// queue quotas, while non-preemptible jobs must strictly adhere to quotas.
func (cp *CapacityPolicy) checkParentQueueQuotas(job *podgroup_info.PodGroupInfo, ssn *framework.Session) error {
	// Skip quota checks for preemptible jobs
	if job.Priority == constants.PriorityTrainNumber {

We aim to separate job priority from preemptibility soon, and the Train priority is not the only one that is currently preemptible.
We have several scalars that define the resources of a queue: quota, limit, and over-quota weight. We should look at the limit here rather than the quota, and then the check will be correct for all jobs.

// Register early quota checks
ssn.AddPrePredicateFn(func(task *pod_info.PodInfo, job *podgroup_info.PodGroupInfo) error {
	// Only check for the first pending pod to avoid duplicate checks
	firstPending := getFirstPendingPod(job)

Isn't the fact that the job needs scheduling (and that predicates are therefore running on it) enough to know that it has pending pods?

@shchennu shchennu requested a review from enoodle May 5, 2025 13:25
@shchennu shchennu force-pushed the scheduler_early_validation branch from 8b59cf5 to 301550c Compare May 5, 2025 13:43
Author

shchennu commented May 5, 2025

> Hi, I left a few comments, mostly about style.
>
> I do have a question about the logic: how will this handle elastic jobs who doesn't have to schedule all pods to run? The job has a minimal number of pods that it needs to run, and you don't know how many will actually fit in the cluster / queue limits.

Thanks for the review! I've addressed all the comments:

  • Added support for elastic jobs with minimum resource requirements
  • Improved code style by removing redundant comments and combining loops
  • Now checking both limits and quotas, using the more restrictive value
  • Separated job priority from preemptibility
  • Added test coverage

All tests are passing. Let me know if you'd like any further improvements!

Collaborator

enoodle commented May 6, 2025

>   1. Gap with IsJobOverQueueCapacity:
>
>     • IsJobOverQueueCapacity only checks the immediate queue's capacity
>     • Our new checkParentQueueQuotas function checks ALL parent queues up the hierarchy
>     • This is important because a job might fit in its immediate queue but exceed parent queue quotas
>     • By checking in PrePredicateFn, we fail fast and avoid unnecessary scheduling attempts

@shchennu I looked at IsJobOverQueueCapacity and I don't agree with your observation - it is checking all the parent queues, so this addition is not needed.
https://github.com/NVIDIA/KAI-Scheduler/blob/main/pkg/scheduler/plugins/proportion/capacity_policy/quota_check.go#L34

@romanbaron
Collaborator

@shchennu - all hierarchy levels are validated in these two places:
resultsOverLimit - checks that workload scheduling won't cause any of the queues (from leaf to top) to go over limit.
resultsWithNonPreemptibleOverQuota - checks that non-preemptible workload scheduling won't cause any of the queues (from leaf to top) to go over quota.

IsJobOverQueueCapacity calls these two functions here:

checkFns := []capacityCheckFn{cp.resultsOverLimit, cp.resultsWithNonPreemptibleOverQuota}

Doesn't it cover it?
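
To make the two-check structure concrete, here is a simplified sketch of applying a list of check functions to every level from leaf to top. It is not the KAI Scheduler implementation: the `Queue` and `Job` types, the single GPU dimension, the field names, and the use of a negative limit for "unlimited" are all assumptions made for illustration.

```go
package main

import "fmt"

// Queue is a hypothetical, simplified queue with a single GPU dimension.
type Queue struct {
	Name                    string
	Parent                  string  // empty for a top-level queue
	Quota                   float64 // share reserved for non-preemptible work
	Limit                   float64 // hard cap; negative means unlimited (assumption)
	Allocated               float64 // GPUs allocated to all jobs
	AllocatedNonPreemptible float64 // GPUs allocated to non-preemptible jobs
}

// Job is a hypothetical, simplified job request.
type Job struct {
	Queue       string
	GPUs        float64
	Preemptible bool
}

// checkFn is one capacity check applied to a single queue level.
type checkFn func(q *Queue, job Job) error

// overLimit applies to every job: no queue may exceed its limit.
func overLimit(q *Queue, job Job) error {
	if q.Limit >= 0 && q.Allocated+job.GPUs > q.Limit {
		return fmt.Errorf("queue %q would exceed its limit of %v GPUs", q.Name, q.Limit)
	}
	return nil
}

// nonPreemptibleOverQuota applies only to non-preemptible jobs.
func nonPreemptibleOverQuota(q *Queue, job Job) error {
	if !job.Preemptible && q.AllocatedNonPreemptible+job.GPUs > q.Quota {
		return fmt.Errorf("non-preemptible job would push queue %q over its quota of %v GPUs",
			q.Name, q.Quota)
	}
	return nil
}

// checkHierarchy runs every check on the job's queue and on each ancestor.
func checkHierarchy(queues map[string]*Queue, job Job) error {
	checks := []checkFn{overLimit, nonPreemptibleOverQuota}
	for q, ok := queues[job.Queue]; ok; q, ok = queues[q.Parent] {
		for _, check := range checks {
			if err := check(q, job); err != nil {
				return err
			}
		}
	}
	return nil
}
```

In this sketch a preemptible job may run above quota at any level as long as no limit is crossed, while a non-preemptible job is rejected as soon as any ancestor would go over quota.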

@romanbaron
Collaborator

@shchennu - can you please share an update with your thoughts on the questions above?

@github-actions github-actions bot added the stale label Oct 26, 2025