
Conversation

@mortent mortent (Member) commented Sep 5, 2025

  • One-line PR description: Update the design to simplify validation while preserving flexibility.
  • Other comments:

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory labels Sep 5, 2025
@k8s-ci-robot k8s-ci-robot added the sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. label Sep 5, 2025
@github-project-automation github-project-automation bot moved this to Needs Triage in SIG Scheduling Sep 5, 2025
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Sep 5, 2025
@mortent mortent force-pushed the UpdatePartitionableFor135 branch 4 times, most recently from 901e51e to 8443d6e Compare September 8, 2025 00:50
@@ -377,6 +388,10 @@ on the device that mirrors the node selector fields on the `ResourceSlice`.
the selector, while setting the `AllNodes` field to `true` means the
device is available on all nodes in the cluster.

1. The fields `NodeName`, `NodeSelector`, `AllNodes`, and `PerDeviceNodeSelection`
mortent (Member Author):

The consequence of this is that it is no longer possible to use a field selector on NodeName to fetch all ResourceSlices for a node. So in order to find the ResourceSlice that defines the counters for a device, a client will have to list them and then check them all.
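
A rough sketch of what that client-side lookup would look like, using simplified stand-in types rather than the real resource.k8s.io structs (field names are illustrative):

```go
package main

import "fmt"

// Simplified stand-ins for the relevant ResourceSlice fields; not the real API types.
type CounterSet struct {
	Name string
}

type ResourceSlice struct {
	Name           string
	Driver         string
	PoolName       string
	SharedCounters []CounterSet
}

// findCounterSetSlice scans all listed slices (no field selector can narrow this
// down anymore) and returns the slice in the given driver/pool that defines the
// named counter set, if any.
func findCounterSetSlice(slices []ResourceSlice, driver, pool, counterSet string) *ResourceSlice {
	for i := range slices {
		s := &slices[i]
		if s.Driver != driver || s.PoolName != pool {
			continue
		}
		for _, cs := range s.SharedCounters {
			if cs.Name == counterSet {
				return s
			}
		}
	}
	return nil
}

func main() {
	slices := []ResourceSlice{
		{Name: "gpu-counters", Driver: "gpu.example.com", PoolName: "node-1-gpus",
			SharedCounters: []CounterSet{{Name: "gpu-0-counters"}}},
		{Name: "gpu-devices", Driver: "gpu.example.com", PoolName: "node-1-gpus"},
	}
	if s := findCounterSetSlice(slices, "gpu.example.com", "node-1-gpus", "gpu-0-counters"); s != nil {
		fmt.Println("counter set defined in slice", s.Name)
	}
}
```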

Contributor:

> The consequence of this is that it is no longer possible to use a field selector on NodeName to fetch all ResourceSlices for a node.

That hasn't been guaranteed in the past either. It missed network-attached devices, for example (supported since adding ResourceSlice).

A more recent example is PerDeviceNodeSelection:
https://github.com/kubernetes/kubernetes/blob/d2c8e50db3ec4abcc3a1039f0c05242427204e98/staging/src/k8s.io/api/resource/v1/types.go#L157-L167

Speaking of that as it just occurred to me: because PerDeviceNodeSelection is mutually exclusive with NodeName, the resource slice controller cannot filter by node name, i.e. publishing such slices for a DRA driver has to happen in a control plane component. Was that considered when adding the field? We might have to document that gotcha in our staging repo docs and/or KEPs.

mortent (Member Author):

Agree. And the allocator fetches all slices in a pool anyway to check if it is valid and complete.

We did discuss the limitations of listing ResourceSlices with the nodeName in kubernetes/kubernetes#131744 (comment).

I don't think we ever called it out explicitly that ResourceSlices that use PerDeviceNodeSelection must be published by a control plane component, but that has always been the plan in our use-cases for the field. That would apply to all other options than nodeName right?

I can point that out in this KEP.

mortent (Member Author):

Allowing counter sets and devices to be spread across multiple ResourceSlices also means that the allocator needs to make sure that all ResourceSlices "needed" for a device are on the latest generation of the pool. I think the current logic should handle this, since we drop any ResourceSlices that aren't on the latest generation. But we might need to keep track of outdated ResourceSlices so we can distinguish between pointers to counter sets that don't exist at all and pointers to counter sets that only exist in an outdated ResourceSlice.
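
Roughly the bookkeeping I have in mind, sketched with simplified stand-in types (not the real allocator structures):

```go
package allocator

// Simplified stand-in for the slice data the allocator tracks per pool.
type Slice struct {
	PoolGeneration int64
	CounterSets    []string // names of counter sets defined in this slice
}

// splitByGeneration returns the slices at the pool's highest generation plus
// the counter-set names that only appear in older slices.
func splitByGeneration(slices []Slice) (current []Slice, outdatedSets map[string]bool) {
	var latest int64
	for _, s := range slices {
		if s.PoolGeneration > latest {
			latest = s.PoolGeneration
		}
	}
	outdatedSets = map[string]bool{}
	currentSets := map[string]bool{}
	for _, s := range slices {
		if s.PoolGeneration == latest {
			current = append(current, s)
			for _, name := range s.CounterSets {
				currentSets[name] = true
			}
		} else {
			for _, name := range s.CounterSets {
				outdatedSets[name] = true
			}
		}
	}
	for name := range currentSets {
		delete(outdatedSets, name) // still defined in the latest generation
	}
	return current, outdatedSets
}

// classifyReference tells whether a counter-set reference from a device is
// resolvable, stale (only in an outdated slice), or simply unknown.
func classifyReference(name string, current []Slice, outdatedSets map[string]bool) string {
	for _, s := range current {
		for _, n := range s.CounterSets {
			if n == name {
				return "ok"
			}
		}
	}
	if outdatedSets[name] {
		return "defined only in an outdated ResourceSlice"
	}
	return "no such counter set in the pool"
}
```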

@mortent mortent force-pushed the UpdatePartitionableFor135 branch from 8443d6e to b89af85 Compare September 9, 2025 00:00
@mortent mortent (Member Author) commented Sep 9, 2025

/assign @johnbelamaric @pohly @klueska

I have updated the KEP to reflect the idea that SharedCounters must be specified in a separate ResourceSlice from devices. It does help simplify the validation logic, but it doesn't fully remove the need for some cross-ResourceSlice validation.

@helayoty helayoty moved this from Needs Triage to Needs Review in SIG Scheduling Sep 9, 2025
@johnbelamaric johnbelamaric (Member) left a comment:

I like it. But would like @klueska and @pohly to provide feedback as well.

that the references can actually be resolved during admission since we are only
able to validate a `ResourceSlice` in isolation. This means that users will not
discover mistakes until a `ResourceClaim` actually tries to allocate a device
that belongs to the `ResourcePool`. This means more complexity and a less user-friendly
Member:

Two things:

  1. ResourceSlice is not a human-created resource in the typical case. So, errors here would be bugs in the driver.

  2. We can have a controller that surfaces inconsistent pools / errors in a ResourceSlice's status, rather than waiting until scheduling time.

mortent (Member Author):

Agree. I've updated the doc.

Contributor:

> We can have a controller that surfaces inconsistent pools / errors in a ResourceSlice's status

We have to be aware that such a controller can only report errors for "complete" pools (all ResourceSlices (as defined by ResourceSliceCount) with the same generation have been received). As long as at least one ResourceSlice is missing, it's possible that the missing one contains the definition that appears to be missing. Conflicting definitions can be detected sooner.

> I've updated the doc.

Where? I'll keep reading, perhaps I'll find it...

//
// The maximum number of counters in all sets is 32.
// The maximum number of counter sets is 8.
Member:

Can I have multiple slices that define additional counter sets? 8 sounds too low pool-wide, otherwise.

mortent (Member Author):

Yeah, this proposal does not limit the number of ResourceSlices in a pool that can specify counter sets, and I don't see any reason why we would. I chose this relatively low number because it is pretty easy to just spread them across multiple ResourceSlices if more are needed.
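
For illustration, a hypothetical layout (names invented, limits as proposed) where a single pool publishes 16 counter sets by splitting them across two counter-only slices next to the slice that carries the devices:

```go
package example

// Simplified stand-ins for the relevant ResourceSlice fields.
type CounterSet struct{ Name string }

type ResourceSlice struct {
	Name               string
	PoolName           string
	PoolGeneration     int64
	ResourceSliceCount int64 // total number of slices in the pool at this generation
	SharedCounters     []CounterSet
}

func counterSets(names ...string) []CounterSet {
	sets := make([]CounterSet, 0, len(names))
	for _, n := range names {
		sets = append(sets, CounterSet{Name: n})
	}
	return sets
}

// One pool, three slices at the same generation: two counter-only slices
// (each staying at the proposed limit of 8 counter sets per slice) and one
// slice carrying the devices that consume them.
var pool = []ResourceSlice{
	{Name: "gpus-counters-0", PoolName: "node-1-gpus", PoolGeneration: 1, ResourceSliceCount: 3,
		SharedCounters: counterSets("gpu-0", "gpu-1", "gpu-2", "gpu-3", "gpu-4", "gpu-5", "gpu-6", "gpu-7")},
	{Name: "gpus-counters-1", PoolName: "node-1-gpus", PoolGeneration: 1, ResourceSliceCount: 3,
		SharedCounters: counterSets("gpu-8", "gpu-9", "gpu-10", "gpu-11", "gpu-12", "gpu-13", "gpu-14", "gpu-15")},
	{Name: "gpus-devices", PoolName: "node-1-gpus", PoolGeneration: 1, ResourceSliceCount: 3}, // devices referencing the 16 sets above
}
```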

Contributor:

Now that it's a single slice we can also make it larger.

mortent (Member Author):

We can, but I was thinking that we could keep the limit relatively low and prevent getting anywhere close to the etcd size limit. We can probably increase the limit in the future (over two releases), but reducing it is harder.

The ResourceSlice-wide limits are:
* The total number of consumed counters across all devices in a ResourceSlice
must not exceed 2048.

Member:

Are there any ResourcePool limits?

mortent (Member Author):

No, I don't think there are any other limits than the max number of ResourceSlices that can be expressed through the spec.pool.resourceSliceCount field. And it is an int64, so for all practical purposes I think we can consider it as unlimited. Performance will probably suffer at some point though, but I don't think we have any testing on that.
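
For illustration, the per-slice limit boils down to a check like this (simplified stand-in types, not the real validation code); nothing equivalent is proposed at the pool level beyond resourceSliceCount:

```go
package validation

import "fmt"

// Simplified stand-ins for the relevant device fields.
type DeviceCounterConsumption struct {
	CounterSet string
	Counters   map[string]int64
}

type Device struct {
	Name             string
	ConsumesCounters []DeviceCounterConsumption
}

const maxConsumedCountersPerSlice = 2048

// checkConsumedCounters enforces the per-ResourceSlice limit on the total
// number of consumed counters across all devices in one slice.
func checkConsumedCounters(devices []Device) error {
	total := 0
	for _, d := range devices {
		for _, cc := range d.ConsumesCounters {
			total += len(cc.Counters)
		}
	}
	if total > maxConsumedCountersPerSlice {
		return fmt.Errorf("devices consume %d counters in total, limit is %d", total, maxConsumedCountersPerSlice)
	}
	return nil
}
```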

that actually exists. Similarly, we will not be able to identify ambiguous references,
where there are multiple counter sets within a single `ResourcePool` with the same name.

This additional validation will happen during allocation, meaning that issues will
Member:

I think we could do this in a controller, as I mentioned.

@mortent mortent (Member Author) Sep 11, 2025:

Agree, I am mentioning this in the validation section. It still isn't as good as admission, since it will be async. But it is better than discovering it during allocation.

@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mortent
Once this PR has been reviewed and has the lgtm label, please ask for approval from johnbelamaric. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@pohly pohly (Contributor) left a comment:

I like the approach, but we need to sort out some details.

@@ -377,6 +386,14 @@ on the device that mirrors the node selector fields on the `ResourceSlice`.
the selector, while setting the `AllNodes` field to `true` means the
device is available on all nodes in the cluster.

The `SharedCounters` field is mutually exclusive with the `Devices` field, meaning
that `SharedCounters` must always be specified in a different `ResourceSlice` than
devices consuming the counters. They must however be in the same `ResourcePool`.
Contributor:

Suggested change:
- devices consuming the counters. They must however be in the same `ResourcePool`.
+ devices consuming the counters. They must however be in the same `ResourcePool`
+ with the same `Generation`.


The `NodeName`, `NodeSelector`, `AllNodes`, and `PerDeviceNodeSelection` fields
can only be set for `ResourceSlice`s that specify devices. So they must always
be unset for `ResourceSlices` that specify shared counters.
Contributor:

This would prevent publishing them through the ResourceSlice controller on nodes:

https://github.com/kubernetes/kubernetes/blob/d2c8e50db3ec4abcc3a1039f0c05242427204e98/staging/src/k8s.io/dynamic-resource-allocation/resourceslice/resourceslicecontroller.go#L432-L442

Validation with rules like "field X may only be set if field Y is set or has value foo" makes for a complex API. I would leave the "Exactly one of NodeName, NodeSelector, AllNodes, and PerDeviceNodeSelection must be set." rule unchanged.
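
For reference, the unchanged rule amounts to roughly this check (sketched with stand-in fields, not the actual validation code):

```go
package validation

import "errors"

// Stand-ins for the node-scoping fields on ResourceSliceSpec; NodeSelector is a
// placeholder for the real selector type.
type NodeSelector struct{}

type nodeScope struct {
	NodeName               string
	NodeSelector           *NodeSelector
	AllNodes               bool
	PerDeviceNodeSelection bool
}

// validateNodeScope keeps the existing rule: exactly one of the four fields must
// be set, regardless of whether the slice carries devices or counter sets.
func validateNodeScope(s nodeScope) error {
	set := 0
	if s.NodeName != "" {
		set++
	}
	if s.NodeSelector != nil {
		set++
	}
	if s.AllNodes {
		set++
	}
	if s.PerDeviceNodeSelection {
		set++
	}
	if set != 1 {
		return errors.New("exactly one of nodeName, nodeSelector, allNodes, perDeviceNodeSelection must be set")
	}
	return nil
}
```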

Member:

So, in theory different ResourceSlices in the same pool can have different values for those 4 fields. This is OK. But it means these values are not the same across a pool - if they were, then I could see arguing they should be set the same for the counter set slices. But without that, I don't know how to interpret the meaning of those fields for a slice containing only shared counters (or mixins for that matter).

Contributor:

They have no use except for "NodeName": that determines whether a kubelet plugin can filter for its own slices.

For other fields: setting them the same as for other slices in the pool is a good guideline, but not really required.

mortent (Member Author):

I've updated the KEP to keep the node selector fields as required.

removes the possibility of ambiguous references.
* Introduce a controller that can validate that all references within a `ResourcePool`
are valid. It can then update a status on all `ResourceSlices` in the `ResourcePool`.
This will still be asynchronous, so the UX is not as good as validation during admission.
Contributor:

This is a future enhancement, to be defined in a separate KEP, right?

Contributor:

An alternative is kubectl describe resourcepools or a dedicated sub-command.

Member:

I think if the helper validates it, then we can probably get away without the controller, and add it only if necessary in some future release. Obviously we still need to error out at scheduling time that there are inconsistencies found.

mortent (Member Author):

Yeah, I was thinking of this as a possible enhancement in the future. I've clarified that in the KEP. I've also said we will include the client-side validation, which maybe makes it less likely that we will do this.

where there are multiple counter sets within a single `ResourcePool` with the same name.

This additional validation will happen during allocation, meaning that issues will
not be surfaced until a `ResourceClaim` needs to be allocated.
@pohly pohly (Contributor) Sep 11, 2025:

How will it be surfaced? This is not as trivial as it may seem.

Right now, the scheduler allocates devices from incomplete pools. That means it won't necessarily see conflicting definitions. This is also a problem for device taints in separate slices, because it would mean that the scheduler ignores such a taint when it hasn't received it yet. I'd be open to changing the behavior so that the scheduler ignores incomplete pools. We never specified that it is okay to publish incomplete pools (= ResourceSliceCount higher than the actual number of slices) - it just happened to work. The helper code never did this, so I think it's safe to say that we are not breaking any real deployment if we tighten the behavior of the scheduler.

Now that the scheduler has only complete pools, it can:

  • Ignore "broken" devices and potentially allocate another one.
  • Refuse to allocate anything for the node and potentially allocate for another node.
  • Abort the scheduling of the pod and report a fatal error.

Some middle ground between the last two options would be to check whether a device matches the selectors and only abort if a selectable device has a problem. But this is not possible for mixin references because evaluating the selectors depends on expanding the mixins first. I don't think we should try to do this.

Was that last option the one you had in mind? We have to be aware that this means that one bad driver can halt scheduling of any pod with ResourceClaims by publishing one ResourceSlice with AllNodes: true and one bad device with a missing definition. I think this would be okay, but it's worth calling out if we go for the last option.
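
A sketch of what "complete" would mean for the scheduler, using simplified stand-in types (not the real allocator code): all slices at the pool's highest generation have been observed, as advertised by ResourceSliceCount.

```go
package allocator

// Simplified stand-in for the pool-related ResourceSlice fields.
type PoolSlice struct {
	PoolName           string
	PoolGeneration     int64
	ResourceSliceCount int64
}

// poolIsComplete reports whether all slices of the pool's latest generation
// have been observed. Incomplete pools would be skipped entirely rather than
// allocated from.
func poolIsComplete(slices []PoolSlice) bool {
	if len(slices) == 0 {
		return false
	}
	var latest int64
	for _, s := range slices {
		if s.PoolGeneration > latest {
			latest = s.PoolGeneration
		}
	}
	var count, expected int64
	for _, s := range slices {
		if s.PoolGeneration == latest {
			count++
			expected = s.ResourceSliceCount
		}
	}
	return count == expected
}
```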

Member:

Agreed we should not allocate from incomplete pools. I didn't realize we allowed that today. We should require a fully consistent view. If there are lots of changes to a pool, this could get tricky, because we could never get a convergent picture of the pool. But I would consider that a badly behaved driver. We should document that frequent changes to the pool may cause such an issue - this is a reason why updating ResourceSlice is something that should not be super frequent. @aojea I know you do this in dranet - is this going to be a problem?

@pohly I don't understand your options. If the incomplete pool is ignored, a Pod that requests something will simply not have the option of getting anything from that pool. If it's a per-node pool, that would mean no assignment to that node. Is that your first bullet? I don't think we need special logic there, and I don't think we need to go further than that.

Contributor:

> I don't understand your options. If the incomplete pool is ignored, a Pod that requests something will simply not have the option of getting anything from that pool.

A pool can be complete and broken in a way that some devices are usable and others are not (missing definition). Option 1 then allows allocating the non-broken devices, options 2 and 3 don't.

The difference between option 2 and 3 is that a pod may get scheduled when using option 2 in cases where option 3 wouldn't.

mortent (Member Author):

I've updated the proposal to clarify that we will only consider complete ResourcePools and we will fail and report a fatal error if we find an invalid ResourcePool. Overall I think failing fast is the best option in this case.

This additional validation will happen during allocation, meaning that issues will
not be surfaced until a `ResourceClaim` needs to be allocated.

There are ways we may be able to improve the experience:
Contributor:

Another option:

  • In the ResourceSlice controller helper, validate the ResourceSlice specs provided by the DRA driver author. This can catch mistakes before some broken pool gets published in the cluster. This only works for DRA drivers using the helper code.
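
Roughly the shape such a helper-side check could take, sketched over stand-in types (the real helper gets richer input from the driver; this only shows the reference resolution):

```go
package resourceslicehelper

import "fmt"

// Simplified stand-ins for the slices a driver hands to the helper for one pool.
type Slice struct {
	CounterSetNames []string // counter sets defined in counter-only slices
	CounterRefs     []string // counter sets referenced by devices via consumesCounters
}

// validatePoolReferences checks, across all slices the driver wants to publish
// for one pool, that counter-set names are unique and every reference resolves.
func validatePoolReferences(slices []Slice) error {
	defined := map[string]int{}
	for _, s := range slices {
		for _, name := range s.CounterSetNames {
			defined[name]++
		}
	}
	for name, n := range defined {
		if n > 1 {
			return fmt.Errorf("counter set %q is defined %d times in the pool", name, n)
		}
	}
	for _, s := range slices {
		for _, ref := range s.CounterRefs {
			if defined[ref] == 0 {
				return fmt.Errorf("device references counter set %q which is not defined in the pool", ref)
			}
		}
	}
	return nil
}
```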

mortent (Member Author):

Yeah, this is a good idea. I've updated the KEP to include it.

@mortent mortent force-pushed the UpdatePartitionableFor135 branch from b83d5be to 53a2197 Compare September 13, 2025 21:18
@pohly pohly (Contributor) left a comment:

Some nits, but overall LGTM.

PerDeviceNodeSelection bool

// SharedCounters defines a list of counter sets, each of which
// has a name and a list of counters available.
//
// The names of the SharedCounters must be unique in the ResourceSlice.
// The names of the Counter Sets must be unique in the ResourcePool.
Contributor:

nit: either use "CounterSets" (= Go type, plural form) or "counter sets" (English prose). I prefer the latter.

@@ -432,7 +460,7 @@ type ResourceSliceSpec struct {

// CounterSet defines a named set of counters
// that are available to be used by devices defined in the
-// ResourceSlice.
+// ResourcePool.
Contributor:

Same nit: "resource pool" is a concept, not a Go type.


The allocator will perform additional validation when it tries to use the ResourceSlices for
allocation. It will:
* Only consider devices from complete `ResourcePools`.
Contributor:

Suggested change:
- * Only consider devices from complete `ResourcePools`.
+ * Only consider devices from complete resource pools.

The allocator will perform additional validation when it tries to use the ResourceSlices for
allocation. It will:
* Only consider devices from complete `ResourcePools`.
* Abort scheduling of the pod and report a fatal error if any of the complete `ResourcePools`
Contributor:

Suggested change:
- * Abort scheduling of the pod and report a fatal error if any of the complete `ResourcePools`
+ * Abort scheduling of the pod and report a fatal error if any of the complete resource pools

fail validation.

This makes sure any errors are discovered as soon as possible and we avoid situations where
some devices from a `ResourcePool` might be eligible while others are not. This could lead to
Contributor:

Suggested change:
- some devices from a `ResourcePool` might be eligible while others are not. This could lead to
+ some devices from a resource pool might be eligible while others are not. This could lead to


To try to prevent these situations from happening, we will add client-side validation in the
ResourceSlice controller helper, so that any errors in the ResourceSlices will be caught before
they even are applied to the APIServer. This will only work for controllers that use the helper
Contributor:

Suggested change:
- they even are applied to the APIServer. This will only work for controllers that use the helper
+ they even are applied to the APIServer. This will only work for drivers that use the helper


#### Future options

We can further improve the experience here by introducing a controller that can validate that all references within a `ResourcePool`
Contributor:

Suggested change:
- We can further improve the experience here by introducing a controller that can validate that all references within a `ResourcePool`
+ We can further improve the experience here by introducing a controller that can validate that all references within a resource pool

I'll stop calling this out now... please do a search/replace 😅
