[KEP-4815] Update KEP for 1.35 #5515
Conversation
mortent commented on Sep 5, 2025
- One-line PR description: Update the design to simplify validation while preserving flexibility.
- Issue link: DRA: Partitionable Devices #4815
- Other comments:
Force-pushed from 901e51e to 8443d6e.
@@ -377,6 +388,10 @@ on the device that mirrors the node selector fields on the `ResourceSlice`.
the selector, while setting the `AllNodes` field to `true` means the
device is available on all nodes in the cluster.

1. The fields `NodeName`, `NodeSelector`, `AllNodes`, and `PerDeviceNodeSelection`
The consequence of this is that it is no longer possible to use a field selector on NodeName to fetch all ResourceSlices for a node. So in order to find the ResourceSlice that defines the counters for a device, a client will have to list them and then check them all.
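For context, the kind of node-scoped listing being discussed looks roughly like the sketch below. `spec.nodeName` is the supported field selector; the function name, client variable, and use of the v1 API group are assumptions for illustration, not code from the KEP:

```go
import (
	"context"

	resourceapi "k8s.io/api/resource/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// listNodeSlices returns only the ResourceSlices published with a matching
// spec.nodeName. Once counter definitions may live in slices without a
// NodeName, this call alone is no longer enough to resolve a device's
// counter references.
func listNodeSlices(ctx context.Context, clientset kubernetes.Interface, nodeName string) (*resourceapi.ResourceSliceList, error) {
	return clientset.ResourceV1().ResourceSlices().List(ctx, metav1.ListOptions{
		FieldSelector: "spec.nodeName=" + nodeName,
	})
}
```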
The consequence of this is that it is no longer possible to use a field selector on NodeName to fetch all ResourceSlices for a node.
That hasn't been guaranteed in the past either. It missed network-attached devices, for example (supported since adding ResourceSlice).
A more recent example is PerDeviceNodeSelection:
https://github.com/kubernetes/kubernetes/blob/d2c8e50db3ec4abcc3a1039f0c05242427204e98/staging/src/k8s.io/api/resource/v1/types.go#L157-L167
Speaking of that, as it just occurred to me: because PerDeviceNodeSelection is mutually exclusive with NodeName, the resource slice controller cannot filter by node name, i.e. publishing such slices for a DRA driver has to happen in a control plane component. Was that considered when adding the field? We might have to document that gotcha in our staging repo docs and/or KEPs.
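For readers following along, here is an abbreviated sketch of the node-scoping fields in question; the linked types.go is the authoritative definition, and the types and comments below are simplified:

```go
import corev1 "k8s.io/api/core/v1"

// Abbreviated sketch of the node-scoping fields on ResourceSliceSpec in
// k8s.io/api/resource/v1; other fields (Driver, Pool, Devices, SharedCounters, ...)
// are omitted. Exactly one of these must be set.
type ResourceSliceSpec struct {
	NodeName               *string              // slice is node-local; kubelet plugins and the helper can filter on this
	NodeSelector           *corev1.NodeSelector // slice applies to nodes matching the selector
	AllNodes               *bool                // slice applies to all nodes in the cluster
	PerDeviceNodeSelection *bool                // each device carries its own node scoping instead
}
```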
Agree. And the allocator fetches all slices in a pool anyway to check if it is valid and complete.
We did discuss the limitations of listing ResourceSlices with the nodeName in kubernetes/kubernetes#131744 (comment).
I don't think we ever called it out explicitly that ResourceSlices that use PerDeviceNodeSelection must be published by a control plane component, but that has always been the plan in our use-cases for the field. That would apply to all other options than nodeName, right?
I can point that out in this KEP.
Allowing counter sets and devices to be spread across multiple ResourceSlices also means that the allocator needs to make sure that all ResourceSlices "needed" for a device are on the latest generation of the pool. I think the current logic should handle this, since we drop any ResourceSlices that aren't on the latest generation. But we might need to keep track of outdated ResourceSlices so we can distinguish between pointers to ResourceSlices that don't exist and pointers to counter sets in an outdated ResourceSlice.
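A rough sketch of the generation filtering described here (illustrative only, not the actual allocator code; only the `Spec.Pool.Generation` field is taken from the API):

```go
import resourceapi "k8s.io/api/resource/v1"

// latestGenerationSlices keeps only the ResourceSlices that belong to the most
// recent generation of a pool. Anything older is dropped, which is why a
// reference into an outdated slice can look identical to a reference to a
// slice that does not exist unless the dropped slices are tracked separately.
func latestGenerationSlices(slices []*resourceapi.ResourceSlice) []*resourceapi.ResourceSlice {
	var latest int64
	for _, s := range slices {
		if s.Spec.Pool.Generation > latest {
			latest = s.Spec.Pool.Generation
		}
	}
	var current []*resourceapi.ResourceSlice
	for _, s := range slices {
		if s.Spec.Pool.Generation == latest {
			current = append(current, s)
		}
	}
	return current
}
```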
Force-pushed from 8443d6e to b89af85.
/assign @johnbelamaric @pohly @klueska I have updated the KEP to reflect the idea that
that the references can actually be resolved during admission since we are only
able to validate a `ResourceSlice` in isolation. This means that users will not
discover mistakes until a `ResourceClaim` actually tries to allocate a device
that belongs to the `ResourcePool`. This means more complexity and a less user-friendly
Two things:
- ResourceSlice is not a human-created resource in the typical case. So, errors here would be bugs in the driver.
- We can have a controller that surfaces inconsistent pools / errors in a ResourceSlice's status, rather than waiting until scheduling time.
Agree. I've updated the doc.
We can have a controller that surfaces inconsistent pools / errors in a ResourceSlice's status
We have to be aware that such a controller can only report errors for "complete" pools (all ResourceSlices (as defined by ResourceSliceCount) with the same generation have been received). As long as at least one ResourceSlice is missing, it's possible that the missing one contains the missing definition. Conflicting definitions can be detected sooner.
I've updated the doc.
Where? I'll keep reading, perhaps I'll find it...
//
// The maximum number of counters in all sets is 32.
// The maximum number of counter sets is 8.
Can I have multiple slices that define additional counter sets? 8 sounds too low pool-wide, otherwise.
Yeah, this proposal does not limit the number of ResourceSlices in a pool that can specify counter sets, and I don't see any reason why we would. I chose this relatively low number because it is pretty easy to just spread them across multiple ResourceSlices if more are needed.
Now that it's a single slice we can also make it larger.
We can, but I was thinking that we could keep the limit relatively low and prevent getting anywhere close to the etcd size limit. We can probably increase the limit in the future (over two releases), but reducing it is harder.
The ResourceSlice-wide limits are:
* The total number of consumed counters across all devices in a ResourceSlice
must not exceed 2048.
Are there any ResourcePool limits?
No, I don't think there are any other limits than the max number of ResourceSlices that can be expressed through the spec.pool.resourceSliceCount field. And it is an int64, so for all practical purposes I think we can consider it as unlimited. Performance will probably suffer at some point though, but I don't think we have any testing on that.
that actually exists. Similarly, we will not be able to identify ambiguous references,
where there are multiple counter sets within a single `ResourcePool` with the same name.

This additional validation will happen during allocation, meaning that issues will
I think we could do this in a controller, as I mentioned.
Agree, I am mentioning this in the validation section. It still isn't as good as admission, since it will be async. But it is better than discovering it during allocation.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: mortent
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing `/approve` in a comment.
I like the approach, but we need to sort out some details.
@@ -377,6 +386,14 @@ on the device that mirrors the node selector fields on the `ResourceSlice`.
the selector, while setting the `AllNodes` field to `true` means the
device is available on all nodes in the cluster.

The `SharedCounters` field is mutually exclusive with the `Devices` field, meaning
that `SharedCounters` must always be specified in a different `ResourceSlice` than
devices consuming the counters. They must however be in the same `ResourcePool`.
- devices consuming the counters. They must however be in the same `ResourcePool`.
+ devices consuming the counters. They must however be in the same `ResourcePool`
+ with the same `Generation`.
The `NodeName`, `NodeSelector`, `AllNodes`, and `PerDeviceNodeSelection` fields
can only be set for `ResourceSlice`s that specify devices. So they must always
be unset for `ResourceSlices` that specify shared counters.
This would prevent publishing them through the ResourceSlice controller on nodes.
Validation with rules like "field X may only be set if field Y is set or has value foo" makes for a complex API. I would leave the "Exactly one of NodeName, NodeSelector, AllNodes, and PerDeviceNodeSelection must be set." rule unchanged.
So, in theory different ResourceSlices in the same pool can have different values for those 4 fields. This is OK. But it means these values are not the same across a pool - if they were, then I could see arguing it should be set the same for the counter set slices. But without that, I don't know how to interpret the meaning of those fields for a slice containing only shared counters (or mixins for that matter).
They have no use except for "NodeName": that determines whether a kubelet plugin can filter for its own slices.
For other fields: setting them the same as for other slices in the pool is a good guideline, but not really required.
I've updated the KEP to keep the node selector fields as required.
removes the possibility of ambiguous references.
* Introduce a controller that can validate that all references within a `ResourcePool`
are valid. It can then update a status on all `ResourceSlices` in the `ResourcePool`.
This will still be asynchronous, so the UX is not as good as validation during admission.
This is a future enhancement, to be defined in a separate KEP, right?
An alternative is kubectl describe resourcepools or a dedicated sub-command.
I think if the helper validates it, then we can probably get away without the controller, and add it only if necessary in some future release. Obviously we still need to error out at scheduling time if inconsistencies are found.
Yeah, I was thinking of this as a possible enhancement in the future. I've clarified that in the KEP. I've also said we will include the client-side validation, which maybe makes it less likely that we will do this.
where there are multiple counter sets within a single `ResourcePool` with the same name.

This additional validation will happen during allocation, meaning that issues will
not be surfaced until a `ResourceClaim` needs to be allocated.
How will it be surfaced? This is not as trivial as it may seem.
Right now, the scheduler allocates devices from incomplete pools. That means it won't necessarily see conflicting definitions. This is also a problem for device taints in separate slices, because it would mean that the scheduler ignores such a taint when it hasn't received it yet. I'd be open to changing the behavior so that the scheduler ignores incomplete pools. We never specified that it is okay to publish incomplete pools (= ResourceSliceCount higher than the actual number of slices) - it just happened to work. The helper code never did this, so I think it's safe to say that we are not breaking any real deployment if we tighten the behavior of the scheduler.
Now that the scheduler has only complete pools, it can:
- Ignore "broken" devices and potentially allocate another one.
- Refuse to allocate anything for the node and potentially allocate for another node.
- Abort the scheduling of the pod and report a fatal error.
Some middle ground between the last two options would be to check whether a device matches the selectors and only abort if a selectable device has a problem. But this is not possible for mixin references because evaluating the selectors depends on expanding the mixins first. I don't think we should try to do this.
Was that last option the one you had in mind? We have to be aware that this means that one bad driver can halt scheduling of any pod with ResourceClaims by publishing one ResourceSlice with AllNodes: true and one bad device with a missing definition. I think this would be okay, but it's worth calling out if we go for the last option.
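To make the "complete pool" notion concrete, here is a hedged sketch of how completeness could be checked, assuming (as today) that `Spec.Pool.ResourceSliceCount` states how many slices the latest generation should have; the function name is illustrative, not existing scheduler code:

```go
import resourceapi "k8s.io/api/resource/v1"

// poolIsComplete reports whether all ResourceSlices of the latest generation
// of a single pool have been observed. Slices from older generations are
// ignored, matching the generation handling discussed earlier.
func poolIsComplete(slices []*resourceapi.ResourceSlice) bool {
	var latest int64
	for _, s := range slices {
		if s.Spec.Pool.Generation > latest {
			latest = s.Spec.Pool.Generation
		}
	}
	var observed, expected int64
	for _, s := range slices {
		if s.Spec.Pool.Generation == latest {
			observed++
			expected = s.Spec.Pool.ResourceSliceCount
		}
	}
	return expected > 0 && observed == expected
}
```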
Agreed we should not allocate from incomplete pools. I didn't realize we allowed that today. We should require a fully consistent view. If there are lots of changes to a pool, this could get tricky, because we could never get a convergent picture of the pool. But I would consider that a badly behaved driver. We should document that frequent changes to the pool may cause such an issue - this is a reason why updating ResourceSlice is something that should not be super frequent. @aojea I know you do this in dranet - is this going to be a problem?
@pohly I don't understand your options. If the incomplete pool is ignored, a Pod that requests something will simply not have the option of getting anything from that pool. If it's a per-node pool, that would mean no assignment to that node. Is that your first bullet? I don't think we need special logic there, and I don't think we need to go further than that.
I don't understand your options. If the incomplete pool is ignored, a Pod that requests something will simply not have the option of getting anything from that pool.
A pool can be complete and broken in a way that some devices are usable and others are not (missing definition). Option 1 then allows allocating the non-broken devices, options 2 and 3 don't.
The difference between option 2 and 3 is that a pod may get scheduled when using option 2 in cases where option 3 wouldn't.
I've updated the proposal to clarify that we will only consider complete ResourcePools and we will fail and report a fatal error if we find an invalid ResourcePool. Overall I think failing fast is the best option in this case.
This additional validation will happen during allocation, meaning that issues will
not be surfaced until a `ResourceClaim` needs to be allocated.

There are ways we may be able to improve the experience:
Another option:
- In the ResourceSlice controller helper, validate the ResourceSlice specs provided by the DRA driver author. This can catch mistakes before some broken pool gets published in the cluster. This only works for DRA drivers using the helper code.
Yeah, this is a good idea. I've updated the KEP to include it.
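A minimal sketch of the kind of client-side check the helper could perform before publishing, assuming the driver hands over all slices of a pool at once (the function name is hypothetical, not the helper's actual API):

```go
import (
	"fmt"

	resourceapi "k8s.io/api/resource/v1"
)

// validateCounterRefs checks that every counter set referenced by a device is
// defined in the provided slices and that no counter set name is defined twice.
// It only sees what the driver provides, so it cannot catch problems introduced
// by other writers or by partially published pools.
func validateCounterRefs(slices []*resourceapi.ResourceSlice) error {
	defined := map[string]bool{}
	for _, s := range slices {
		for _, cs := range s.Spec.SharedCounters {
			if defined[cs.Name] {
				return fmt.Errorf("counter set %q defined more than once", cs.Name)
			}
			defined[cs.Name] = true
		}
	}
	for _, s := range slices {
		for _, d := range s.Spec.Devices {
			for _, cc := range d.ConsumesCounters {
				if !defined[cc.CounterSet] {
					return fmt.Errorf("device %q references unknown counter set %q", d.Name, cc.CounterSet)
				}
			}
		}
	}
	return nil
}
```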
Force-pushed from b83d5be to 53a2197.
Some nits, but overall LGTM.
PerDeviceNodeSelection bool

// SharedCounters defines a list of counter sets, each of which
// has a name and a list of counters available.
//
// The names of the SharedCounters must be unique in the ResourceSlice.
// The names of the Counter Sets must be unique in the ResourcePool.
nit: either use "CounterSets" (= Go type, plural form) or "counter sets" (English prose). I prefer the latter.
@@ -432,7 +460,7 @@ type ResourceSliceSpec struct {

// CounterSet defines a named set of counters
// that are available to be used by devices defined in the
-// ResourceSlice.
+// ResourcePool.
Same nit: "resource pool" is a concept, not a Go type.
The allocator will perform additional validation when it tries to use the ResourceSlices for
allocation. It will:
* Only consider devices from complete `ResourcePools`.
- * Only consider devices from complete `ResourcePools`.
+ * Only consider devices from complete resource pools.
The allocator will perform additional validation when it tries to use the ResourceSlices for
allocation. It will:
* Only consider devices from complete `ResourcePools`.
* Abort scheduling of the pod and report a fatal error if any of the complete `ResourcePools`
- * Abort scheduling of the pod and report a fatal error if any of the complete `ResourcePools`
+ * Abort scheduling of the pod and report a fatal error if any of the complete resource pools
fail validation.

This makes sure any errors are discovered as soon as possible and we avoid situations where
some devices from a `ResourcePool` might be eligible while others are not. This could lead to
- some devices from a `ResourcePool` might be eligible while others are not. This could lead to
+ some devices from a resource pool might be eligible while others are not. This could lead to
To try to prevent these situations from happening, we will add client-side validation in the
ResourceSlice controller helper, so that any errors in the ResourceSlices will be caught before
they even are applied to the APIServer. This will only work for controllers that use the helper
- they even are applied to the APIServer. This will only work for controllers that use the helper
+ they even are applied to the APIServer. This will only work for drivers that use the helper
#### Future options

We can further improve the experience here by introducing a controller that can validate that all references within a `ResourcePool`
- We can further improve the experience here by introducing a controller that can validate that all references within a `ResourcePool`
+ We can further improve the experience here by introducing a controller that can validate that all references within a resource pool
I'll stop calling this out now... please do a search/replace 😅