docs: small kep for DRA #853
Signed-off-by: Alex Pana <[email protected]>
awesome, thanks
/ok-to-test
> ### Design Details
> #### API Changes
I like the example but I’d also like to see the API you are proposing.
( ack, will add a section here once we close or get closer on some of the testing questions below )
> so that the platform can manage resources for complex multi-node, multi-pod jobs without manual intervention.
>
> ### Notes/Constraints/Caveats
A few things I'll ask you to answer here:

- How does your proposal handle that DRA has various API levels between 1.30-1.32?
- How do you plan to test this functionality?
  a) Is there any way to test this cross-node functionality with dra-example-driver?
If we eventually get this work into batch/Job in upstream, what does that mean about this feature? Can you see a world where this exists and Job ResourceClaimTemplates exist together?
@johnbelamaric please keep me honest/on track here, but here are my thoughts:

> How does your proposal handle that DRA has various API levels between 1.30-1.32?

One advantage of us adding the RCT as a top-level JobSet change is that we can "decouple" from the core API changes. That said, we should encourage people to use the beta/latest DRA API.

> How do you plan to test this functionality?
> a) Is there any way to test this cross-node functionality with dra-example-driver?

AFAICT, the dra-example-driver doesn't have cross-node support yet; do you have any suggestions beyond manual testing?

> Can you see a world where this exists and Job ResourceClaimTemplates exist together?

I'll defer to folks that have more insight into the future of DRA/JobSet, but my 2c are that having a semantic for making per-JobSet device claims will continue to be useful.
> One advantage of us adding the RCT as a top-level JobSet change is that we can "decouple" from the core API changes. That said, we should encourage people to use the beta/latest DRA API.

At some point, I imagine the RCT will create ResourceClaims via DRA, right? So in this case, we will eventually create DRA objects whose API level differs across Kubernetes versions.

I'm trying to understand how you will use this feature for 1.30, 1.31 and 1.32 (where DRA is alpha, alpha and beta). JobSet runs on the latest three releases of Kubernetes.

> AFAICT, the dra-example-driver doesn't have cross-node support yet; do you have any suggestions beyond manual testing?

If that is what we have, I guess that is fine. Do you have an environment where you can test this feature and confirm it works before merge?
@bg-chun is working on beefing up the example driver to demonstrate various device modeling features (e.g., partitionable). We might, though, want a separate test driver?
The version is an interesting point. In 1.33 we have v1beta2, which we expect will be the shape of the API that we GA. I don't think we should support the alpha versions, frankly. Is it a problem for this particular feature to only work on 1.32+? Since that's when it went to beta? If we do that, then we can shape JobSet's embedded template based on the v1beta2 RCT, but actually generate v1beta1 RCTs (they can roundtrip, so this is do-able). That way, it works on 1.32 and 1.33. Eventually (when 1.34 is the oldest release), we change the RCTs we create to v1 not v1beta1.
Alternatively, we reference a separate RCT by name in the JobSet spec, and so the shape of that API no longer includes the template for the RCT, but pretty much just the "containers" bit that we have to layer in.
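To make the round-trip idea concrete, here is a sketch of the same claim template in both beta versions. Field names follow my reading of the upstream DRA types (v1beta2 wraps the request in `exactly`, v1beta1 keeps it flat); double-check against the published API before relying on this.

```yaml
# v1beta2 shape: what JobSet's embedded template would be modeled on.
apiVersion: resource.k8s.io/v1beta2
kind: ResourceClaimTemplate
metadata:
  name: shared-data
spec:
  spec:
    devices:
      requests:
      - name: data
        exactly:
          deviceClassName: shared-data-resource
---
# Equivalent v1beta1 object that JobSet would actually create on a
# 1.32 cluster; simple requests like this round-trip losslessly.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: shared-data
spec:
  spec:
    devices:
      requests:
      - name: data
        deviceClassName: shared-data-resource
```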
I’m currently working on adding a reference implementation for the Partitionable Device API to the example DRA driver. This API covers two scenarios: NVIDIA MIG and Multi-Host TPU Scheduling. However, since the current example DRA driver mimics NVIDIA GPUs, I’ll start by implementing a simple version of the MIG feature.
After adding this MIG-like implementation of the Partitionable Device API, I think we should have a discussion about the overall direction of the example device driver project. I believe the example DRA driver should ultimately demonstrate various device modeling scenarios.
Thank you @johnbelamaric!

Those are the main items I am thinking through on this:

- How do we handle different versions of DRA? I am okay with supporting 1.32+ but we should make it clear.
- What happens with 1.30 and 1.31 if this API is used? Will there be a feature gate for JobSet?
- Can we discover what kind of API is available in the APIServer and convert?
(Just chiming in here, I reckon the thinking is to go with:

> I am okay with supporting 1.32+ but we should make it clear.

& feature gate it. We can error out if the gate is on but we don't meet the minimum API server requirement?)
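A minimal sketch of that fail-fast check. In a real controller the served group-versions would come from the client-go discovery client; the function name and plumbing here are hypothetical, and the assumption is that serving `resource.k8s.io/v1beta1` is the proxy for "1.32+ with beta DRA".

```go
package main

import (
	"fmt"
	"slices"
)

// hasDRAAPI reports whether the cluster serves the DRA API at the
// minimum level this feature requires. servedGroupVersions would be
// populated from API server discovery in a real controller.
func hasDRAAPI(servedGroupVersions []string) bool {
	return slices.Contains(servedGroupVersions, "resource.k8s.io/v1beta1")
}

func main() {
	// A 1.32+ cluster serves the beta DRA API: gate check passes.
	fmt.Println(hasDRAAPI([]string{"batch/v1", "resource.k8s.io/v1beta1"}))
	// A 1.31 cluster with only alpha DRA: the feature errors out early.
	fmt.Println(hasDRAAPI([]string{"batch/v1", "resource.k8s.io/v1alpha3"}))
}
```

With the feature gate on, the controller could run this once at startup and refuse to reconcile JobSets that use the new field when the check fails.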
Signed-off-by: Alex Pana <[email protected]>
Thank you for this @johnbelamaric @acpana!
Maybe we could also speak about it during one of the following Batch WG calls.
> ```yaml
>         image: helper-image:latest
>         command: ["sleep", "infinity"]
>       restartPolicy: OnFailure
>   resourceClaimTemplates: # NEW: JobSet-level ResourceClaimTemplates (added inside JobSpec)
> ```
Why is this API under `replicatedJob[].template.spec.resourceClaimTemplates`?
Ah good catch! This should be one level higher 😛
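For illustration, the corrected placement could look something like this. This is only a sketch: it assumes "one level higher" means the top-level JobSet spec (so claims can be shared across replicated jobs), and the exact field name and embedded template shape are still open in this KEP.

```yaml
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: example
spec:
  resourceClaimTemplates:   # sibling of replicatedJobs, not under template.spec
  - metadata:
      name: shared-data
    spec:
      devices:
        requests:
        - name: data
          deviceClassName: shared-data-resource
  replicatedJobs:
  - name: workers
    replicas: 2
    template:
      spec: {} # JobSpec as before, without resourceClaimTemplates
```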
> ### Modifying CoreJob API
>
> We could add a similar feature to the core Kubernetes Job API. However, this has a much higher barrier to entry, wider impact, and longer development cycle compared to enhancing an out-of-tree controller like JobSet.
xref: kubernetes-sigs/lws#444
xref: #762 (comment)
We did meet and discuss this feature at wg-device-management a few months back. I think there is a lot of interest in this, and it is not clear whether we should put all of this in the Job controller. If we want each replicated job to have a different resourceClaim, then I think it would make sense to have this just in the Job API.

If we want multiple replicated jobs to share the same resourceClaim, it may make sense to have it at the JobSet level.

For LWS, @ahg-g mentions that even if there was support for this in StatefulSet, it still may be necessary to have this in LWS to help tune the behavior LWS wants.

I also am open to experimenting more in out-of-tree controllers.

There are downsides to putting this in an out-of-tree controller though. We don't test/verify DRA in any capacity in our CI, so we would need some kind of testing to verify that this works with DRA across different releases. We also inherit a potentially unstable API, and if upstream adds support for this feature maybe we have to deprecate.
For LWS, we want both the Leader and Worker StatefulSet to share a resource claim. Support at the StatefulSet level only would mean all the workers sharing one claim, with the leader having a separate one.
So my understanding from this discussion is:
- We think that DRA eventually will be added to the in-tree workload controllers.
- We think there are scenarios where it will not be sufficient to only rely on support in the in-tree controllers (like the LWS example), meaning that there needs to be some level of support for DRA out-of-tree.
What is not clear to me is whether the implementation in JobSet would look different if we had support in-tree, i.e. are we creating long-term challenges or inconsistencies if this is implemented in JobSet and we later also have support in Job/StatefulSet? Maybe it would be worth thinking about what that would look like?
As for the question around CI and unstable APIs, we are planning to get DRA to GA in the next release. That should make it pretty straightforward to test with DRA, although we need to handle the versioning for a few releases. And with the v1beta2 API that was added in 1.33, we don't expect any significant changes between v1beta2 and v1.
> ```yaml
>           deviceClassName: shared-data-resource
> ```
>
> #### Behavior Changes
This is similar to PVC Template for StatefulSet, right?
https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#volume-claim-templates
I think so but I am not super familiar with PVCs at this point
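For comparison, the StatefulSet analogue looks like this (standard upstream `apps/v1` API, abbreviated):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  serviceName: web
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: nginx
        volumeMounts:
        - name: data
          mountPath: /var/www
  volumeClaimTemplates:   # controller stamps out one PVC per pod
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 1Gi
```

One difference worth noting: `volumeClaimTemplates` creates a PVC per pod, whereas the discussion above also contemplates claims shared across a whole JobSet.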
> * Pod-level `resourceClaim`s with the same name as a JobSet-templated claim will override the JobSet-level claim for that container, and a warning will be emitted.
Can we solve this with webhook validation rather than a warning?
We could! But I'd prefer not to rely on webhooks given the complexity they introduce 🤔 . I can add a small blurb about that.
JobSet already maintains a validating webhook, so we can just add an additional check for resourceClaim names.
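A sketch of what that extra webhook check could look like. The function and parameter names are illustrative, not the actual JobSet types; the real check would walk the pod template's `resourceClaims` and the JobSet-level template names.

```go
package main

import "fmt"

// validateClaimNames rejects pod-level resourceClaims whose names
// collide with JobSet-level resourceClaimTemplates, instead of
// silently overriding them and emitting a warning.
func validateClaimNames(jobSetClaims, podClaims []string) error {
	seen := make(map[string]bool, len(jobSetClaims))
	for _, n := range jobSetClaims {
		seen[n] = true
	}
	for _, n := range podClaims {
		if seen[n] {
			return fmt.Errorf("pod-level resourceClaim %q shadows a JobSet-level resourceClaimTemplate", n)
		}
	}
	return nil
}

func main() {
	// Collision: the webhook would reject this JobSet on admission.
	fmt.Println(validateClaimNames([]string{"shared-data"}, []string{"shared-data"}))
	// No collision: admission proceeds.
	fmt.Println(validateClaimNames([]string{"shared-data"}, []string{"scratch"}))
}
```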
Co-authored-by: Andrey Velichkevich <[email protected]>
What type of PR is this?
/kind feature
/kind api-change
What this PR does / why we need it:
Small KEP for #762
Which issue(s) this PR fixes:
Addresses #762
Special notes for your reviewer:
Does this PR introduce a user-facing change?
n/a