
Conversation

@mortent mortent (Member) commented Sep 5, 2025

  • One-line PR description: Update the design to simplify validation while preserving flexibility.
  • Other comments:

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory labels Sep 5, 2025
@k8s-ci-robot k8s-ci-robot added the sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. label Sep 5, 2025
@github-project-automation github-project-automation bot moved this to Needs Triage in SIG Scheduling Sep 5, 2025
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Sep 5, 2025
@mortent mortent force-pushed the UpdatePartitionableFor135 branch 4 times, most recently from 901e51e to 8443d6e Compare September 8, 2025 00:50
@@ -377,6 +388,10 @@ on the device that mirrors the node selector fields on the `ResourceSlice`.
the selector, while setting the `AllNodes` field to `true` means the
device is available on all nodes in the cluster.

1. The fields `NodeName`, `NodeSelector`, `AllNodes`, and `PerDeviceNodeSelection`
mortent (Member Author):

The consequence of this is that it is no longer possible to use a field selector on NodeName to fetch all ResourceSlices for a node. So in order to find the ResourceSlice that defines the counters for a device, a client will have to list them and then check them all.
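
A rough sketch of what that client-side lookup would look like, using simplified stand-in types rather than the real resource.k8s.io structs (field names are illustrative):

```go
package main

import "fmt"

// Simplified stand-ins for the relevant ResourceSlice fields; not the real API types.
type CounterSet struct {
	Name string
}

type ResourceSlice struct {
	Name           string
	Driver         string
	PoolName       string
	SharedCounters []CounterSet
}

// findCounterSetSlice scans all listed slices (no field selector can narrow this
// down anymore) and returns the slice in the given driver/pool that defines the
// named counter set, if any.
func findCounterSetSlice(slices []ResourceSlice, driver, pool, counterSet string) *ResourceSlice {
	for i := range slices {
		s := &slices[i]
		if s.Driver != driver || s.PoolName != pool {
			continue
		}
		for _, cs := range s.SharedCounters {
			if cs.Name == counterSet {
				return s
			}
		}
	}
	return nil
}

func main() {
	slices := []ResourceSlice{
		{Name: "gpu-counters", Driver: "gpu.example.com", PoolName: "node-1-gpus",
			SharedCounters: []CounterSet{{Name: "gpu-0-counters"}}},
		{Name: "gpu-devices", Driver: "gpu.example.com", PoolName: "node-1-gpus"},
	}
	if s := findCounterSetSlice(slices, "gpu.example.com", "node-1-gpus", "gpu-0-counters"); s != nil {
		fmt.Println("counter set defined in slice", s.Name)
	}
}
```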

Contributor:

> The consequence of this is that it is no longer possible to use a field selector on NodeName to fetch all ResourceSlices for a node.

That hasn't been guaranteed in the past either. It missed network-attached devices, for example (supported since adding ResourceSlice).

A more recent example is PerDeviceNodeSelection:
https://github.com/kubernetes/kubernetes/blob/d2c8e50db3ec4abcc3a1039f0c05242427204e98/staging/src/k8s.io/api/resource/v1/types.go#L157-L167

Speaking of that as it just occurred to me: because PerDeviceNodeSelection is mutually exclusive with NodeName, the resource slice controller cannot filter by node name, i.e. publishing such slices for a DRA driver has to happen in a control plane component. Was that considered when adding the field? We might have to document that gotcha in our staging repo docs and/or KEPs.

mortent (Member Author):

Agree. And the allocator fetches all slices in a pool anyway to check if it is valid and complete.

We did discuss the limitations of listing ResourceSlices with the nodeName in kubernetes/kubernetes#131744 (comment).

I don't think we ever called it out explicitly that ResourceSlices that use PerDeviceNodeSelection must be published by a control plane component, but that has always been the plan in our use-cases for the field. That would apply to all other options than nodeName right?

I can point that out in this KEP.

mortent (Member Author):

Allowing counter sets and devices to be spread across multiple ResourceSlices also means that the allocator needs to make sure that all ResourceSlices "needed" for a device are on the latest generation of the pool. I think the current logic should handle this, since we drop any ResourceSlices that aren't on the latest generation. But we might need to keep track of outdated ResourceSlices so we can distinguish between pointers to counter sets that don't exist at all and pointers to counter sets that only exist in an outdated ResourceSlice.
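
Roughly the bookkeeping I have in mind, sketched with simplified stand-in types (not the real allocator structures):

```go
package allocator

// Simplified stand-in for the slice data the allocator tracks per pool.
type Slice struct {
	PoolGeneration int64
	CounterSets    []string // names of counter sets defined in this slice
}

// splitByGeneration returns the slices at the pool's highest generation plus
// the counter-set names that only appear in older slices.
func splitByGeneration(slices []Slice) (current []Slice, outdatedSets map[string]bool) {
	var latest int64
	for _, s := range slices {
		if s.PoolGeneration > latest {
			latest = s.PoolGeneration
		}
	}
	outdatedSets = map[string]bool{}
	currentSets := map[string]bool{}
	for _, s := range slices {
		if s.PoolGeneration == latest {
			current = append(current, s)
			for _, name := range s.CounterSets {
				currentSets[name] = true
			}
		} else {
			for _, name := range s.CounterSets {
				outdatedSets[name] = true
			}
		}
	}
	for name := range currentSets {
		delete(outdatedSets, name) // still defined in the latest generation
	}
	return current, outdatedSets
}

// classifyReference tells whether a counter-set reference from a device is
// resolvable, stale (only in an outdated slice), or simply unknown.
func classifyReference(name string, current []Slice, outdatedSets map[string]bool) string {
	for _, s := range current {
		for _, n := range s.CounterSets {
			if n == name {
				return "ok"
			}
		}
	}
	if outdatedSets[name] {
		return "defined only in an outdated ResourceSlice"
	}
	return "no such counter set in the pool"
}
```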

@mortent mortent force-pushed the UpdatePartitionableFor135 branch from 8443d6e to b89af85 Compare September 9, 2025 00:00
@mortent mortent (Member Author) commented Sep 9, 2025

/assign @johnbelamaric @pohly @klueska

I have updated the KEP to reflect the idea that SharedCounters must be specified in a separate ResourceSlice from devices. It does help simplify the validation logic, but it doesn't fully remove the need for some cross-ResourceSlice validation.

@helayoty helayoty moved this from Needs Triage to Needs Review in SIG Scheduling Sep 9, 2025
@johnbelamaric johnbelamaric (Member) left a comment:

I like it. But would like @klueska and @pohly to provide feedback as well.

that the references can actually be resolved during admission since we are only
able to validate a `ResourceSlice` in isolation. This means that users will not
discover mistakes until a `ResourceClaim` actually tries to allocate a device
that belongs to the `ResourcePool`. This means more complexity and a less user-friendly
Member:

Two things:

  1. ResourceSlice is not a human-created resource in the typical case. So, errors here would be bugs in the driver.

  2. We can have a controller that surfaces inconsistent pools / errors in a ResourceSlice's status, rather than waiting until scheduling time.

mortent (Member Author):

Agree. I've updated the doc.

Contributor:

> We can have a controller that surfaces inconsistent pools / errors in a ResourceSlice's status

We have to be aware that such a controller can only report errors for "complete" pools (all ResourceSlices (as defined by ResourceSliceCount) with the same generation have been received). As long as at least one ResourceSlice is missing, it's possible that the missing one contains the definition that appears to be missing. Conflicting definitions can be detected sooner.

> I've updated the doc.

Where? I'll keep reading, perhaps I'll find it...

//
// The maximum number of counters in all sets is 32.
// The maximum number of counter sets is 8.
Member:

Can I have multiple slices that define additional counter sets? 8 sounds too low pool-wide, otherwise.

mortent (Member Author):

Yeah, this proposal does not limit the number of ResourceSlices in a pool that can specify counter sets, and I don't see any reason why we would. I chose this relatively low number because it is pretty easy to just spread them across multiple ResourceSlices if more are needed.
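
For illustration, a hypothetical layout (names invented, limits as proposed) where a single pool publishes 16 counter sets by splitting them across two counter-only slices next to the slice that carries the devices:

```go
package example

// Simplified stand-ins for the relevant ResourceSlice fields.
type CounterSet struct{ Name string }

type ResourceSlice struct {
	Name               string
	PoolName           string
	PoolGeneration     int64
	ResourceSliceCount int64 // total number of slices in the pool at this generation
	SharedCounters     []CounterSet
}

func counterSets(names ...string) []CounterSet {
	sets := make([]CounterSet, 0, len(names))
	for _, n := range names {
		sets = append(sets, CounterSet{Name: n})
	}
	return sets
}

// One pool, three slices at the same generation: two counter-only slices
// (each staying at the proposed limit of 8 counter sets per slice) and one
// slice carrying the devices that consume them.
var pool = []ResourceSlice{
	{Name: "gpus-counters-0", PoolName: "node-1-gpus", PoolGeneration: 1, ResourceSliceCount: 3,
		SharedCounters: counterSets("gpu-0", "gpu-1", "gpu-2", "gpu-3", "gpu-4", "gpu-5", "gpu-6", "gpu-7")},
	{Name: "gpus-counters-1", PoolName: "node-1-gpus", PoolGeneration: 1, ResourceSliceCount: 3,
		SharedCounters: counterSets("gpu-8", "gpu-9", "gpu-10", "gpu-11", "gpu-12", "gpu-13", "gpu-14", "gpu-15")},
	{Name: "gpus-devices", PoolName: "node-1-gpus", PoolGeneration: 1, ResourceSliceCount: 3}, // devices referencing the 16 sets above
}
```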

Contributor:

Now that it's a single slice we can also make it larger.

mortent (Member Author):

We can, but I was thinking that we could keep the limit relatively low and prevent getting anywhere close to the etcd size limit. We can probably increase the limit in the future (over two releases), but reducing it is harder.

The ResourceSlice-wide limits are:
* The total number of consumed counters across all devices in a ResourceSlice
must not exceed 2048.

Member:

Are there any ResourcePool limits?

mortent (Member Author):

No, I don't think there are any other limits than the max number of ResourceSlices that can be expressed through the spec.pool.resourceSliceCount field. And it is an int64, so for all practical purposes I think we can consider it as unlimited. Performance will probably suffer at some point though, but I don't think we have any testing on that.
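
For illustration, the per-slice limit boils down to a check like this (simplified stand-in types, not the real validation code); nothing equivalent is proposed at the pool level beyond resourceSliceCount:

```go
package validation

import "fmt"

// Simplified stand-ins for the relevant device fields.
type DeviceCounterConsumption struct {
	CounterSet string
	Counters   map[string]int64
}

type Device struct {
	Name             string
	ConsumesCounters []DeviceCounterConsumption
}

const maxConsumedCountersPerSlice = 2048

// checkConsumedCounters enforces the per-ResourceSlice limit on the total
// number of consumed counters across all devices in one slice.
func checkConsumedCounters(devices []Device) error {
	total := 0
	for _, d := range devices {
		for _, cc := range d.ConsumesCounters {
			total += len(cc.Counters)
		}
	}
	if total > maxConsumedCountersPerSlice {
		return fmt.Errorf("devices consume %d counters in total, limit is %d", total, maxConsumedCountersPerSlice)
	}
	return nil
}
```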

that actually exists. Similarly, we will not be able to identify ambiguous references,
where there are multiple counter sets within a single `ResourcePool` with the same name.

This additional validation will happen during allocation, meaning that issues will
Member:

I think we could do this in a controller, as I mentioned.

@mortent mortent (Member Author) Sep 11, 2025:

Agree, I am mentioning this in the validation section. It still isn't as good as admission, since it will be async. But it is better than discovering it during allocation.

@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mortent
Once this PR has been reviewed and has the lgtm label, please ask for approval from johnbelamaric. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@pohly pohly (Contributor) left a comment:

I like the approach, but we need to sort out some details.

@@ -377,6 +386,14 @@ on the device that mirrors the node selector fields on the `ResourceSlice`.
the selector, while setting the `AllNodes` field to `true` means the
device is available on all nodes in the cluster.

The `SharedCounters` field is mutually exclusive with the `Devices` field, meaning
that `SharedCounters` must always be specified in a different `ResourceSlice` than
devices consuming the counters. They must however be in the same `ResourcePool`.
Contributor:

Suggested change:
- devices consuming the counters. They must however be in the same `ResourcePool`.
+ devices consuming the counters. They must however be in the same `ResourcePool`
+ with the same `Generation`.


The `NodeName`, `NodeSelector`, `AllNodes`, and `PerDeviceNodeSelection` fields
can only be set for `ResourceSlice`s that specify devices. So they must always
be unset for `ResourceSlices` that specify shared counters.
Contributor:

This would prevent publishing them through the ResourceSlice controller on nodes:

https://github.com/kubernetes/kubernetes/blob/d2c8e50db3ec4abcc3a1039f0c05242427204e98/staging/src/k8s.io/dynamic-resource-allocation/resourceslice/resourceslicecontroller.go#L432-L442

Validation with rules like "field X may only be set if field Y is set or has value foo" makes for a complex API. I would leave the "Exactly one of NodeName, NodeSelector, AllNodes, and PerDeviceNodeSelection must be set." rule unchanged.
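
For reference, the unchanged rule amounts to roughly this check (sketched with stand-in fields, not the actual validation code):

```go
package validation

import "errors"

// Stand-ins for the node-scoping fields on ResourceSliceSpec; NodeSelector is a
// placeholder for the real selector type.
type NodeSelector struct{}

type nodeScope struct {
	NodeName               string
	NodeSelector           *NodeSelector
	AllNodes               bool
	PerDeviceNodeSelection bool
}

// validateNodeScope keeps the existing rule: exactly one of the four fields must
// be set, regardless of whether the slice carries devices or counter sets.
func validateNodeScope(s nodeScope) error {
	set := 0
	if s.NodeName != "" {
		set++
	}
	if s.NodeSelector != nil {
		set++
	}
	if s.AllNodes {
		set++
	}
	if s.PerDeviceNodeSelection {
		set++
	}
	if set != 1 {
		return errors.New("exactly one of nodeName, nodeSelector, allNodes, perDeviceNodeSelection must be set")
	}
	return nil
}
```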

Member:

So, in theory different ResourceSlices in the same pool can have different values for those 4 fields. This is OK. But it means these values are not the same across a pool - if they were, then I could see arguing they should be set the same for the counter set slices. But without that, I don't know how to interpret the meaning of those fields for a slice containing only shared counters (or mixins for that matter).

Contributor:

They have no use except for "NodeName": that determines whether a kubelet plugin can filter for its own slices.

For other fields: setting them the same as for other slices in the pool is a good guideline, but not really required.

mortent (Member Author):

I've updated the KEP to keep the node selector fields as required.

removes the possibility of ambiguous references.
* Introduce a controller that can validate that all references within a `ResourcePool`
are valid. It can then update a status on all `ResourceSlices` in the `ResourcePool`.
This will still be asynchronous, so the UX is not as good as validation during admission.
Contributor:

This is a future enhancement, to be defined in a separate KEP, right?

Contributor:

An alternative is kubectl describe resourcepools or a dedicated sub-command.

Member:

I think if the helper validates it, then we can probably get away without the controller, and add it only if necessary in some future release. Obviously we still need to error out at scheduling time that there are inconsistencies found.

mortent (Member Author):

Yeah, I was thinking of this as a possible enhancement in the future. I've clarified that in the KEP. I've also said we will include the client-side validation, which maybe makes it less likely that we will do this.

where there are multiple counter sets within a single `ResourcePool` with the same name.

This additional validation will happen during allocation, meaning that issues will
not be surfaced until a `ResourceClaim` needs to be allocated.
@pohly pohly (Contributor) Sep 11, 2025:

How will it be surfaced? This is not as trivial as it may seem.

Right now, the scheduler allocates devices from incomplete pools. That means it won't necessarily see conflicting definitions. This is also a problem for device taints in separate slices, because it would mean that the scheduler ignores such a taint when it hasn't received it yet. I'd be open to changing the behavior so that the scheduler ignores incomplete pools. We never specified that it is okay to publish incomplete pools (= ResourceSliceCount higher than the actual number of slices) - it just happened to work. The helper code never did this, so I think it's safe to say that we are not breaking any real deployment if we tighten the behavior of the scheduler.

Now that the scheduler has only complete pools, it can:

  • Ignore "broken" devices and potentially allocate another one.
  • Refuse to allocate anything for the node and potentially allocate for another node.
  • Abort the scheduling of the pod and report a fatal error.

Some middle ground between the last two options would be to check whether a device matches the selectors and only abort if a selectable device has a problem. But this is not possible for mixin references because evaluating the selectors depends on expanding the mixins first. I don't think we should try to do this.

Was that last option the one you had in mind? We have to be aware that this means that one bad driver can halt scheduling of any pod with ResourceClaims by publishing one ResourceSlice with AllNodes: true and one bad device with a missing definition. I think this would be okay, but it's worth calling out if we go for the last option.
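
A sketch of what "complete" would mean for the scheduler, using simplified stand-in types (not the real allocator code): all slices at the pool's highest generation have been observed, as advertised by ResourceSliceCount.

```go
package allocator

// Simplified stand-in for the pool-related ResourceSlice fields.
type PoolSlice struct {
	PoolName           string
	PoolGeneration     int64
	ResourceSliceCount int64
}

// poolIsComplete reports whether all slices of the pool's latest generation
// have been observed. Incomplete pools would be skipped entirely rather than
// allocated from.
func poolIsComplete(slices []PoolSlice) bool {
	if len(slices) == 0 {
		return false
	}
	var latest int64
	for _, s := range slices {
		if s.PoolGeneration > latest {
			latest = s.PoolGeneration
		}
	}
	var count, expected int64
	for _, s := range slices {
		if s.PoolGeneration == latest {
			count++
			expected = s.ResourceSliceCount
		}
	}
	return count == expected
}
```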

Member:

Agreed we should not allocate from incomplete pools. I didn't realize we allowed that today. We should require a fully consistent view. If there are lots of changes to a pool, this could get tricky, because we could never get a convergent picture of the pool. But I would consider that a badly behaved driver. We should document that frequent changes to the pool may cause such an issue - this is a reason why updating ResourceSlice is something that should not be super frequent. @aojea I know you do this in dranet - is this going to be a problem?

@pohly I don't understand your options. If the incomplete pool is ignored, a Pod that requests something will simply not have the option of getting anything from that pool. If it's a per-node pool, that would mean no assignment to that node. Is that your first bullet? I don't think we need special logic there, and I don't think we need to go further than that.

Contributor:

> I don't understand your options. If the incomplete pool is ignored, a Pod that requests something will simply not have the option of getting anything from that pool.

A pool can be complete and broken in a way that some devices are usable and others are not (missing definition). Option 1 then allows allocating the non-broken devices, options 2 and 3 don't.

The difference between option 2 and 3 is that a pod may get scheduled when using option 2 in cases where option 3 wouldn't.

mortent (Member Author):

I've updated the proposal to clarify that we will only consider complete ResourcePools and we will fail and report a fatal error if we find an invalid ResourcePool. Overall I think failing fast is the best option in this case.

This additional validation will happen during allocation, meaning that issues will
not be surfaced until a `ResourceClaim` needs to be allocated.

There are ways we may be able to improve the experience:
Contributor:

Another option:

  • In the ResourceSlice controller helper, validate the ResourceSlice specs provided by the DRA driver author. This can catch mistakes before some broken pool gets published in the cluster. This only works for DRA drivers using the helper code.
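
Roughly the shape such a helper-side check could take, sketched over stand-in types (the real helper gets richer input from the driver; this only shows the reference resolution):

```go
package resourceslicehelper

import "fmt"

// Simplified stand-ins for the slices a driver hands to the helper for one pool.
type Slice struct {
	CounterSetNames []string // counter sets defined in counter-only slices
	CounterRefs     []string // counter sets referenced by devices via consumesCounters
}

// validatePoolReferences checks, across all slices the driver wants to publish
// for one pool, that counter-set names are unique and every reference resolves.
func validatePoolReferences(slices []Slice) error {
	defined := map[string]int{}
	for _, s := range slices {
		for _, name := range s.CounterSetNames {
			defined[name]++
		}
	}
	for name, n := range defined {
		if n > 1 {
			return fmt.Errorf("counter set %q is defined %d times in the pool", name, n)
		}
	}
	for _, s := range slices {
		for _, ref := range s.CounterRefs {
			if defined[ref] == 0 {
				return fmt.Errorf("device references counter set %q which is not defined in the pool", ref)
			}
		}
	}
	return nil
}
```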

mortent (Member Author):

Yeah, this is a good idea. I've updated the KEP to include it.

@mortent mortent force-pushed the UpdatePartitionableFor135 branch from b83d5be to 53a2197 Compare September 13, 2025 21:18
@pohly pohly (Contributor) left a comment:

Some nits, but overall LGTM.

PerDeviceNodeSelection bool

// SharedCounters defines a list of counter sets, each of which
// has a name and a list of counters available.
//
// The names of the SharedCounters must be unique in the ResourceSlice.
// The names of the Counter Sets must be unique in the ResourcePool.
Contributor:

nit: either use "CounterSets" (= Go type, plural form) or "counter sets" (English prose). I prefer the latter.

@@ -432,7 +460,7 @@ type ResourceSliceSpec struct {

// CounterSet defines a named set of counters
// that are available to be used by devices defined in the
-// ResourceSlice.
+// ResourcePool.
Contributor:

Same nit: "resource pool" is a concept, not a Go type.


The allocator will perform additional validation when it tries to use the ResourceSlices for
allocation. It will:
* Only consider devices from complete `ResourcePools`.
Contributor:

Suggested change:
- * Only consider devices from complete `ResourcePools`.
+ * Only consider devices from complete resource pools.

The allocator will perform additional validation when it tries to use the ResourceSlices for
allocation. It will:
* Only consider devices from complete `ResourcePools`.
* Abort scheduling of the pod and report a fatal error if any of the complete `ResourcePools`
Contributor:

Suggested change:
- * Abort scheduling of the pod and report a fatal error if any of the complete `ResourcePools`
+ * Abort scheduling of the pod and report a fatal error if any of the complete resource pools

fail validation.

This makes sure any errors are discovered as soon as possible and we avoid situations where
some devices from a `ResourcePool` might be eligible while others are not. This could lead to
Contributor:

Suggested change:
- some devices from a `ResourcePool` might be eligible while others are not. This could lead to
+ some devices from a resource pool might be eligible while others are not. This could lead to


To try to prevent these situations from happening, we will add client-side validation in the
ResourceSlice controller helper, so that any errors in the ResourceSlices will be caught before
they even are applied to the APIServer. This will only work for controllers that use the helper
Contributor:

Suggested change:
- they even are applied to the APIServer. This will only work for controllers that use the helper
+ they even are applied to the APIServer. This will only work for drivers that use the helper


#### Future options

We can further improve the experience here by introducing a controller that can validate that all references within a `ResourcePool`
Contributor:

Suggested change:
- We can further improve the experience here by introducing a controller that can validate that all references within a `ResourcePool`
+ We can further improve the experience here by introducing a controller that can validate that all references within a resource pool

I'll stop calling this out now... please do a search/replace 😅
