
Conversation

@macsko (Member) commented Sep 1, 2025

  • One-line PR description: Add KEP-5501
  • Other comments:

@k8s-ci-robot added the cncf-cla: yes, kind/kep, and sig/scheduling labels on Sep 1, 2025
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: macsko
Once this PR has been reviewed and has the lgtm label, please assign soltysh for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@github-project-automation bot moved this to Needs Triage in SIG Scheduling on Sep 1, 2025
@k8s-ci-robot added the size/XL label on Sep 1, 2025
@macsko (Member Author) commented Sep 1, 2025

/assign @dom4ha @sanposhiho
/cc @ania-borowiec

@macsko (Member Author) commented Sep 1, 2025

/hold
To make sure it's approved by SIG Scheduling leads

@k8s-ci-robot added the do-not-merge/hold label on Sep 1, 2025
@macsko force-pushed the kep_5501_reflect_preenqueue_rejections_in_pod_status branch from e4698c8 to c6ddb75 on September 1, 2025 13:10
@macsko force-pushed the kep_5501_reflect_preenqueue_rejections_in_pod_status branch from c6ddb75 to 63fca5d on September 2, 2025 09:15
@macsko force-pushed the kep_5501_reflect_preenqueue_rejections_in_pod_status branch from 63fca5d to 4e3d0ab on September 2, 2025 09:15
@ania-borowiec left a comment

Looks good!

- Gives plugin developers the flexibility to decide if a status update is valuable for their specific logic.

- **Cons:**
- Could lead to perceived inconsistency, where some `PreEnqueue` rejections appear on the Pod status


Yes, it could be an inconsistency, but is that really any harm? If there is a need to display a human-readable message to the user, then an empty rejection message can be interpreted as some default message (e.g. saying something about possible transient errors or other potential vague reasons).

Member

Is this point about not reporting the fact that a pod waits, or just skipping adding a custom message?

Member Author

Is this point about not reporting the fact that a pod waits, or just skipping adding a custom message?

Yes. See the DRA case (kubernetes/kubernetes#129698 (comment)) and the explanation in the How should plugins provide the status message? section:

This flexibility is important for plugins that wish to avoid reporting transient rejections.
For example, the DynamicResources plugin might observe a rejection because the scheduler processes a Pod faster
than a ResourceClaim becomes visible through the watcher (see a comment).
In such cases where the condition is expected to resolve in seconds, populating the status would be inappropriate noise.


- **Alternatives Considered:**
- Actively clearing the message: The scheduler could clear the condition if,
on a subsequent check, the original rejecting plugin no longer rejects the pod.


Would simply checking the diff between states "before" and "now" be enough?

Member Author

WDYM?


You said the condition would be cleared if the original rejecting plugin no longer rejects the pod.
I am suggesting that it may be simpler than that: updates should be sent whenever there is a difference between conditions in cache vs conditions after the current PreEnqueue run.

Member Author

updates should be sent whenever there is a difference between conditions in cache vs conditions after the current PreEnqueue run

But that looks like the current proposal, doesn't it? The only case where we wouldn't do this is the transition from Plugin A (which reported a message) to Plugin B (which didn't report).


Yes, that's my point. My thinking is that it's better to display some generic default message for a failing Plugin B than to keep displaying a no-longer-correct message saying that Plugin A failed.

- Actively clearing the message: The scheduler could clear the condition if,
on a subsequent check, the original rejecting plugin no longer rejects the pod.
This would provide a more accurate real-time status but could cause confusion
if the condition appears and disappears while the Pod remains `Pending` for another, unreported reason.


Maybe displaying a default message to the user in case of an empty status (as mentioned in my comment above) would help solve this?
If our goal is to make the messaging meaningful and informative to users, then an explanatory default message could possibly help make things clearer for them?

Member Author

I'm not sure if some default message would be that informative. It won't answer clearly why the pod is on hold. I believe the stale message might be better than sending such a placeholder. And it would reduce the number of API calls sent.

@ania-borowiec commented Sep 5, 2025

As I wrote above, my personal opinion is that being not very informative is better than being misleading.
I might be wrong, but I assume that if a user looks at the status message, they are inclined to investigate the reason behind the pending state, e.g. check whether some resource is indeed not ready. And if they see that the resource is ready and the pod still keeps waiting for it, they might consider this a bug in k8s and report it, causing more bug-triaging work and a poor user experience.

Member Author

Please note that in the case of the Unschedulable reason, when the pod is rejected by filters, we don't remove this message. It is displayed in the Pod status until it's retried. I'm not sure if we can treat a similar mechanism as misleading in the case of PreEnqueue.


That is a fair point, but the Unschedulable reason reflects that the pod is in the unschedulable queue (or pending a scheduling cycle) and the scheduler does not yet know whether the pod will become schedulable.
In the case of PreEnqueue, the scheduler does know that the specific PreEnqueue plugin is no longer failing, but in the proposed scenario the scheduler doesn't surface this knowledge to the user.

Member Author

That's right, but from the user's POV, the Unschedulable status may be outdated, especially when the scheduling queue contains plenty of pods. So, I expect users to be aware that the longer the status remains unchanged, the more likely it is to be outdated.

Obviously, the NonReadyForScheduling reason could have its own semantics. I'm just still not sure what the best choice here is.


Is there any sort of "average scenario" where we could measure how often pods actually fail scheduling in PreEnqueue? Or would that look very different for various users?

Member Author

It depends on the scenario, but:

  • For SchedulingGates, PreEnqueue fails until all the gates are removed (so it could fail multiple times per pod)
  • For DefaultPreemption, PreEnqueue fails until the API calls for all of the preemption's victims are made (this shouldn't take very long, so one PreEnqueue failure per pod is probably a good assumption)
  • For DRA, PreEnqueue fails until all the ResourceClaims are available. Usually, there is one failure per pod, unless someone forgets to create the ResourceClaim 😃

The integration tests listed above should cover all the scenarios,
so implementing e2e tests is a redundant effort.

### Graduation Criteria


Actually what is the intended behavior when the feature is disabled? Using the new API for PreEnqueue, but not filling the new optional field?

Member Author

PreEnqueue would use the new interface, but the reported message would be ignored, i.e., no API call would be dispatched.
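
For illustration, a minimal sketch of what the updated interface could look like. The `PreEnqueueResult` name appears in the KEP, but the exact signature and field names below are assumptions, not the final API:

```go
// Sketch only: the signature and field names are assumptions, not the KEP's
// final API. Plugin and Status refer to the existing scheduler framework types.
package framework

import (
	"context"

	v1 "k8s.io/api/core/v1"
)

// PreEnqueueResult carries optional, user-facing information about a rejection.
type PreEnqueueResult struct {
	// Message explains why the Pod is being held back. An empty message
	// (or a nil *PreEnqueueResult) means "reject, but report nothing on
	// the Pod status".
	Message string
}

// PreEnqueuePlugin is the updated interface. With the feature gate disabled,
// the framework would ignore the returned result and dispatch no API call.
type PreEnqueuePlugin interface {
	Plugin
	PreEnqueue(ctx context.Context, p *v1.Pod) (*PreEnqueueResult, *Status)
}
```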


During the beta period, the feature gate `SchedulerPreEnqueuePodStatus` is enabled by default,
so users don't need to opt in. This is a purely in-memory feature for the kube-scheduler,
so no special actions are required outside the scheduler.


Except that out-of-tree plugin code needs to be updated to use the new API.
Unless we don't consider them "outside of scheduler"?

Member Author

Here, in the upgrade strategy, we assume that the scheduler has the new version, so its plugins use the updated PreEnqueue interface. Enabling/disabling the feature gate doesn't impact the plugins, as they are already migrated.

@dom4ha (Member) left a comment

Very well written and a very valuable feature. I wonder how Workload Aware Scheduling may change the approach if we could write a workload status instead for repeating pods. Maybe we could even write the status to hypothetical PodGroups?

Currently, when a `PreEnqueue` plugin rejects a Pod, this decision is not communicated back to the user
via the Pod's status. The Pod simply remains `Pending`, leaving the user to manually inspect scheduler logs
or other components to diagnose the issue. This lack of feedback is particularly problematic for plugins
that operate transparently to the user, as the reason for the delay is completely hidden.
Member

What does it mean exactly?

@macsko (Member Author) commented Sep 4, 2025

Do you have any specific sentence in mind? In general, this paragraph describes the current behavior, where we don't communicate the PreEnqueue rejection to the users.

Only for Scheduling Gates do we do that, through the kube-apiserver: it knows that the pod would be rejected because it infers that from the pod spec.

Maybe I should rephrase the sentence about operating transparently. I meant that other plugins, like DRA, reject the Pod based on some more indirect rules, which can't be easily inferred from a Pod.

Member

Ah, right, I highlighted "plugins that operate transparently to the user", but GitHub does not reflect that.

Maybe rephrase it?

(e.g., by returning an empty message described above).

- **Pros:**
- Reduces API server load and potential "noise" for plugins that reject pods for very short,
Member

Can't we mitigate this problem by introducing some sort of delay or rate limiting (for the number of PreEnqueue updates)? Could you document such a possibility with your assessment of whether it does or does not make sense?

Member Author

I'll take that into consideration and mention it in the KEP. We have some kind of rate limiting, as we limit the number of dispatched goroutines in the API dispatcher. I'm not sure how a delay could help here.

Some more advanced mechanisms could be a part of the Asynchronous API calls extension.

Member

A delay can help avoid reporting things that have a follow-up call clearing the message, so only prolonged states would be reflected.

Member Author

Introducing a delay shifts responsibility from the plugin to the framework. However, I'm not sure if that's really what we want. Why shouldn't the plugin, which has the most information about why it rejected the Pod, decide what to do with the message?

Even if we have such a delay mechanism, how should we set the delay so that it is correct? If it is too low, we may send irrelevant messages (as in the case of DRA or async preemption) and increase the load on the kube-apiserver. If it is too high, we will delay the status update, reducing debuggability and undermining the purpose for which we want this feature.

In addition, we must keep our current plugins in mind:

  • SchedulingGates — its status is reported by kube-apiserver. We could move it to kube-scheduler, but this would involve unnecessary API calls that can be avoided by keeping the current behavior in kube-apiserver. Also, keep in mind that this plugin has its own status reason (SchedulingGated), which would have to be differentiated somehow if we move it to kube-scheduler.
  • DefaultPreemption — I believe we should rely on the status returned by PostFilter, so it should not be overwritten with the less informative PreEnqueue message.
  • DynamicResources — as mentioned, they may want to decide what and when they want to report.

- Allows plugins to opt out of reporting by returning an empty message.

- **Cons:**
- Requires a breaking change to the `PreEnqueue` plugin interface.
Member

Can we use some standard message for beta to not introduce a breaking change? Providing info that a pod waits and which plugin blocks it should already improve the current state. Also we could prevent plugin developers from creating non-parsable messages by blocking variable messages. We could reevaluate it before moving to GA.

@macsko (Member Author) commented Sep 4, 2025

We shouldn't introduce such changes just before moving to GA, but rather do so during the beta phase. Keep in mind that v1.35 will include the final changes related to moving interfaces to staging, so plugin developers will have to change their plugin interfaces anyway.


- Provides a clear, logical progression for users to follow.

- **Cons:**
- A stale message could be displayed if a Pod is first rejected by a reporting plugin (Plugin A)
Member

Why can't the message be updated once that happens?

Member Author

The reason is in the Actively clearing the message alternative below:

This would provide a more accurate real-time status but could cause confusion
if the condition appears and disappears while the Pod remains Pending for another, unreported reason.

Moreover, we need to consider the number of API calls we would like to send. I believe that if we can omit some of them, we should do so to mitigate the performance impact of this feature.

Member

If we had a delay that addresses the noise problem, we could require setting a status message for all plugins (even a default one, "Blocked by plugin X"). Of course, clearing would not have any delay and would cancel setting a message if it's no longer relevant. Also, a new message would instantly replace the previous one, but it would be delayed as well, so we increase the likelihood that it's also canceled (no unnecessary noise).

I barely remember, but we had some argument against delaying API calls; maybe it was only for setting nominations, which are time-critical. In this case it should be rather safe.


+1 for default messages ("blocked by plugin X") and for clearing messages / updating them with another plugin's message

The Cluster Autoscaler can be updated to recognize this condition as a valid trigger for scale-up.
When it sees a Pod with this condition, it could correctly request a new node from a node pool that provides required resource.

### Risks and Mitigations
Member

Another risk is generating quite significant traffic, especially for big workloads. I'm not sure how to mitigate it, but a possible option is to rate-limit setting the same type of message. I know it does not sound simple. In Workload Aware Scheduling we could set some kind of message in the workload object instead.

Member Author

I'll add that risk to this section.

Remember that PreEnqueue mostly reports issues with the pod and its direct dependencies' states, not a filtering rejection (= no fit). If users create a large quantity of such pods, it's a problem anyway.

The difference might come with WAS or gang scheduling, which would use PreEnqueue to block the pods from being tried. However, as you mentioned, it would be good to have some workload-level message for WAS to mitigate that.

For other plugins, please remember that such traffic would be generated anyway if we rejected the pod in (Pre)Filter rather than in PreEnqueue.

Member

A delay in setting the PreEnqueue status should help (assuming the wait is short). True, unschedulable pods need to get an error status anyway.

@@ -377,7 +377,8 @@ To prevent API server flooding when a pod is rejected for the same reason repeat
- Before dispatching a status patch, the scheduler will compare the new rejection message
with the cached `lastPreEnqueueRejectionMessage`.
- The asynchronous call to the API server will only be dispatched if the status is new or different from the cached one.
This deduplication is important for performance.
This deduplication is important for performance. Moreover, when the new message is empty or the whole `PreEnqueueResult` is `nil`,
the call won't be send, as explained in the [how should outdated messages be handled](#4-how-should-outdated-messages-be-handled) section.


Suggested change
the call won't be send, as explained in the [how should outdated messages be handled](#4-how-should-outdated-messages-be-handled) section.
the call won't be sent, as explained in the [how should outdated messages be handled](#4-how-should-outdated-messages-be-handled) section.

@sanposhiho (Member) left a comment

I'm glad we can finally start this work, thanks to the async API call feature :)
I just skimmed through, and left some initial comments.

Comment on lines 175 to 177
- Use the raw status message: This is simpler as it requires no interface changes.
However, these messages are often not written well for end-users.
It also makes it difficult for a plugin to conditionally opt out of reporting a status.
Member

Hmm, actually this alternative looks better to me because that would match with messages from other extension points: we're using the messages from the statuses returned from Filter etc already.

It also makes it difficult for a plugin to conditionally opt out of reporting a status.

Do we need to allow plugins not to report the messages? That would just hide the reason the pod is stuck from users.

Member

And, even if you disagree with me and we do want to implement such opt-out behavior, I'd like to just extend the semantics of framework.Status, e.g., if framework.Status has code == Unschedulable with reasons == nil, then we regard that the plugin wants to reject this pod, but not report anything to users.

Member Author

we're using the messages from the statuses returned from Filter etc already

I agree that it would be more consistent with the actual plugin's implementation, but if we want to have an opt-out behavior, I'd consider having a new field.

Do we need to allow plugins not to report the messages? That would just hide the reason the pod is stuck from users.

From the In-Tree Plugin Changes section:

  • SchedulingGates: Will always return an empty message, as its status will continue to be set by the kube-apiserver.
  • DefaultPreemption: Will always return an empty message, as its state is transient
    and the Unschedulable status from the PostFilter stage is more descriptive.
  • DynamicResources: Will be updated to return a descriptive message when a ResourceClaim is missing.
    It can also return a nil result to avoid reporting transient delays,
    for example when waiting for a ResourceClaim to become available in the scheduler's cache.

Allowing plugins not to report the message can reduce traffic to the kube-apiserver when such an update is irrelevant or redundant.

if framework.Status has code == Unschedulable with reasons == nil, then we regard that the plugin wants to reject this pod, but not report anything to users.

I am not in favor of this approach. If we do this, it will affect logging (we would like to record why the plugin rejected the pod, but might not want to display that to the user immediately) and make it harder for plugin developers to manage: the interface would be indirect, so they would have to rely on and remember that an empty reason has its own behavior. Introducing an explicit field in PreEnqueueResult would be more direct.
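
To make the explicit opt-out concrete, here is a hedged sketch of how a plugin like DynamicResources might use such a field. The signature follows the illustrative `PreEnqueueResult` above, and `missingResourceClaims` is a hypothetical helper, not an existing function:

```go
// Sketch only: receiver type, signature, and the missingResourceClaims helper
// are illustrative; only the opt-out idea itself comes from the KEP discussion.
func (pl *DynamicResources) PreEnqueue(ctx context.Context, pod *v1.Pod) (*framework.PreEnqueueResult, *framework.Status) {
	missing, transient := pl.missingResourceClaims(pod) // hypothetical helper
	switch {
	case transient:
		// The claim may simply not be visible in the informer cache yet:
		// reject, but return a nil result so nothing is written to the
		// Pod status and no API call is made.
		return nil, framework.NewStatus(framework.UnschedulableAndUnresolvable, "waiting for ResourceClaim to appear in the cache")
	case len(missing) > 0:
		// User-actionable case: reject and report a descriptive message.
		msg := fmt.Sprintf("waiting for ResourceClaim(s) %v to be created", missing)
		return &framework.PreEnqueueResult{Message: msg},
			framework.NewStatus(framework.UnschedulableAndUnresolvable, msg)
	default:
		// All claims are present: admit the Pod to the scheduling queue.
		return nil, nil
	}
}
```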

Member

I have the impression that all the reasons not to report are about saving kube-apiserver traffic, except for SchedulingGates, which I may not understand well.

Delaying the message can reduce the traffic generated by transient issues. I'm only worried that a delayed message may overwrite some other message that was set in the meantime by someone else. But all message changes initiated by the kube-scheduler should cancel any previously pending (waiting) one and replace it with a new one or clear it.

Member

I see you updated the KEP with good arguments against the delay. Some alternatives are to:

  • make the delay optional (but I'm not sure it's worth implementing)
  • update the workload object status instead of the blocked pods' (once the generic workload object is defined; it's in the design phase now). The lack of a delay still generates noise for transient issues, although on just one object only.
  • detect that objects have been waiting on PreEnqueue for a longer period of time. It's like double-checking whether the condition still holds before applying a status. Still sounds complicated.

Member

From the In-Tree Plugin Changes section:
...
Allowing plugins not to report the message can reduce traffic to the kube-apiserver when such an update is irrelevant or redundant.

Hmm, even for the default preemption / dynamic resources, I feel like it's better to report something to indicate that the pods are intentionally not retried, since today that is not visible at all. As Dominik mentioned, a delay in the async API call mechanism can mitigate the concern about too many kube-apiserver calls. And if we implement the idea you mentioned in another thread of mine, I don't think there would be too many API calls triggered by those PreEnqueue plugins (assuming the plugins return the same simple message at every rejection).

@macsko (Member Author) commented Sep 10, 2025

Good point, I will add a larger section about the delay, and we will be able to make a decision on that.

(e.g., by returning an empty message described above).

- **Pros:**
- Reduces API server load and potential "noise" for plugins that reject pods for very short,
Member

To mitigate it, can we just skip the error reporting if it's identical to the existing one?
E.g., if the pod keeps being rejected by the preemption PreEnqueue plugin, it will get a message like "This pod is not schedulable because waiting for the preemption to be completed" the first time, but the preemption PreEnqueue plugin will (i.e., should) return the same common message, and then the scheduling framework doesn't update the pod with the message every time PreEnqueue rejects the same pod for the same reason.

Member Author

We have already explained that in the Preventing redundant API calls section of Design Details:

To prevent API server flooding when a pod is rejected for the same reason repeatedly, message caching will be introduced.

  • A new field will be added to the PodInfo struct within the queue, for example, lastPreEnqueueRejectionMessage.
  • Before dispatching a status patch, the scheduler will compare the new rejection message
    with the cached lastPreEnqueueRejectionMessage.
  • The asynchronous call to the API server will only be dispatched if the status is new or different from the cached one. (...)
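
A minimal sketch of that deduplication rule, assuming an illustrative queued-pod struct holding the `lastPreEnqueueRejectionMessage` field described above (type and function names are not the KEP's final code):

```go
// Sketch only: types and names are illustrative; the KEP specifies only the
// cached lastPreEnqueueRejectionMessage and the "dispatch only when new or
// different" rule.
package queue

type queuedPodInfo struct {
	lastPreEnqueueRejectionMessage string
}

// shouldDispatchStatusPatch reports whether a PreEnqueue rejection message
// warrants an asynchronous Pod status patch.
func shouldDispatchStatusPatch(pInfo *queuedPodInfo, newMsg string) bool {
	if newMsg == "" {
		// Empty message (or nil PreEnqueueResult upstream): the plugin
		// opted out of reporting, so no API call is sent.
		return false
	}
	if newMsg == pInfo.lastPreEnqueueRejectionMessage {
		// Same rejection reason as last time: skip the redundant patch.
		return false
	}
	pInfo.lastPreEnqueueRejectionMessage = newMsg
	return true
}
```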

- Introduces a new value to the API, which must be documented and maintained.

- **Alternatives Considered:**
- PodReasonUnschedulable (`Unschedulable`): Reusing this reason would be somewhat incorrect,
@sanposhiho (Member) commented Sep 6, 2025

I prefer not to add a new one; I prefer to use the existing condition. I feel like adding a new condition is rather confusing for users, especially those who are not very familiar with the scheduler.
Also, keeping the same condition would help other external components (e.g., CA), since they won't need to support a new condition?

as the Pod has not failed a scheduling attempt (i.e., it was never evaluated against nodes).

Today, it could happen that the scheduling cycle doesn't evaluate nodes but rejects a pod at PreFilter.

Member Author

Also, keeping the same condition would help other external components (e.g., CA), since they won't need to support a new condition

Or break them, if they rely on the fact that the pod was processed and rejected by the filters.

I think I was wrong with the CA use case and the PreEnqueue rejection shouldn't be processed by the CA, because such rejections are not related to the nodes (capacity, etc.) but to the pod itself. That can't be fixed by the CA's node provisioning. So, using the Unschedulable status would force the CA to process such impossible-to-fix pods or try to filter them out somehow.

Today, it could happen that the scheduling cycle doesn't evaluate nodes but rejects a pod at PreFilter.

And such behavior could eventually be migrated to PreEnqueue.

I feel like adding a new condition is rather confusing for users, especially those who are not very familiar with the scheduler.

The PreEnqueue rejection is different from the Unschedulable rejection as it shows that the pod is missing something, so I think having the new reason is okay.

Member

I think I was wrong with the CA use case and the PreEnqueue rejection shouldn't be processed by the CA

This is a good point, and I'm convinced.

@helayoty moved this from Needs Triage to Needs Review in SIG Scheduling on Sep 9, 2025
@macsko (Member Author) commented Sep 10, 2025

I've updated the KEP with the alternatives mentioned in the comments.

@@ -194,6 +199,14 @@ In such cases where the condition is expected to resolve in seconds, populating
- Mandatory reporting for all rejections: This would make the scheduler's behavior uniform.
However, it might create significant API churn for plugins that reject pods frequently and for short durations,
negatively impacting performance and UX.
- Mandatory reporting with a delay (cooldown period): This approach attempts to reduce API churn by waiting for a brief,
configurable delay before reporting a rejection. While better than immediate mandatory reporting, this has some flaws:


Who should configure the delay?

Member Author

That's a good question. We could have a few options:

  • Hardcoded delay per API call, e.g., delay only the PreEnqueue status patch call by a certain time
  • Configurable delay through the scheduler's config - might be hard to implement cleanly
  • Per-plugin delay, passed via PreEnqueueResult (see the sketch below) - would move the responsibility to the plugin developers, who might not know what the correct delay should be anyway
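
For the third option, a purely hypothetical sketch of what a per-plugin delay could look like; the `ReportAfter` field is illustrative and is not proposed in the KEP:

```go
// Hypothetical sketch of the per-plugin delay option only; ReportAfter is not
// part of the KEP. It extends the illustrative PreEnqueueResult shown earlier;
// time.Duration comes from the standard time package.
type PreEnqueueResult struct {
	// Message is the optional, user-facing rejection message.
	Message string
	// ReportAfter would ask the framework to hold the status patch for this
	// long and drop it if the rejection clears in the meantime.
	// Zero would mean "report immediately".
	ReportAfter time.Duration
}
```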

the scheduler could generate a generic message (e.g., "Pod is blocked on PreEnqueue by plugin: Plugin B").
This has two flaws:
- While it identifies the blocking plugin, it doesn't explain why the Pod was blocked, which is the essential information for debugging.
A generic message isn't a significant improvement over a stale (but potentially actionable) message.


In my opinion, stale != actionable 🙂 It was actionable, but it no longer is, because it's stale.

Member Author

A stale message is still actionable (someone can try to fix the issue or just wait for it to resolve), compared to a generic, only slightly informative message that is not actionable at all.

@macsko (Member Author) commented Sep 12, 2025

I added a section to Alternatives describing five possible design alternatives, including delays and an implicit/explicit interface. PTAL

It might still be lacking some points, so please tell me if I missed anything.

FYI I'll respond to the new comments next Friday.

Labels
  • cncf-cla: yes - Indicates the PR's author has signed the CNCF CLA.
  • do-not-merge/hold - Indicates that a PR should not merge because someone has issued a /hold command.
  • kind/kep - Categorizes KEP tracking issues and PRs modifying the KEP directory.
  • sig/scheduling - Categorizes an issue or PR as relevant to SIG Scheduling.
  • size/XL - Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
Status: Needs Review

5 participants