
docs: small kep for DRA #853


Draft · wants to merge 3 commits into main

Conversation


@acpana acpana commented Mar 24, 2025

What type of PR is this?

/kind feature
/kind api-change

What this PR does / why we need it:

Small KEP for #762

Which issue(s) this PR fixes:

Addresses #762

Special notes for your reviewer:

Does this PR introduce a user-facing change?

n/a

Signed-off-by: Alex Pana <[email protected]>
@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/feature Categorizes issue or PR as related to a new feature. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API labels Mar 24, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: acpana
Once this PR has been reviewed and has the lgtm label, please assign kannon92 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Mar 24, 2025
@k8s-ci-robot
Contributor

Welcome @acpana!

It looks like this is your first PR to kubernetes-sigs/jobset 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/jobset has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot
Contributor

Hi @acpana. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 24, 2025

netlify bot commented Mar 24, 2025

Deploy Preview for kubernetes-sigs-jobset ready!

Name Link
🔨 Latest commit e1dfad8
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-jobset/deploys/67fe76e3f63c70000880a0d8
😎 Deploy Preview https://deploy-preview-853--kubernetes-sigs-jobset.netlify.app

@acpana
Author

acpana commented Mar 24, 2025

cc @johnbelamaric

@johnbelamaric left a comment

awesome, thanks

@kannon92
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Mar 25, 2025

### Design Details

#### API Changes
Contributor

I like the example but I’d also like to see the API you are proposing.

Author

( ack, will add a section here once we close or get closer on some of the testing questions below )
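As a placeholder for that section, here is a hypothetical sketch of what a JobSet-level `resourceClaimTemplates` field could look like. Field placement and names are illustrative, not the proposed final API; `shared-data-resource` and `helper-image:latest` are taken from the KEP's existing example, while `training-jobset`, `workers`, and `shared-data` are made-up names:

```yaml
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: training-jobset
spec:
  # Hypothetical JobSet-level field; exact placement is still under discussion.
  resourceClaimTemplates:
  - metadata:
      name: shared-data
    spec:
      devices:
        requests:
        - name: shared
          deviceClassName: shared-data-resource
  replicatedJobs:
  - name: workers
    replicas: 2
    template:
      spec:
        template:
          spec:
            containers:
            - name: worker
              image: helper-image:latest
              resources:
                claims:
                - name: shared-data  # references the JobSet-level template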

@k8s-ci-robot
Contributor

@acpana: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Required | Rerun command |
| --- | --- | --- | --- |
| pull-jobset-test-e2e-main-1-30 | 47fd035 | true | /test pull-jobset-test-e2e-main-1-30 |
| pull-jobset-test-e2e-main-1-32 | 47fd035 | true | /test pull-jobset-test-e2e-main-1-32 |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.


so that the platform can manage resources for complex multi-node, multi-pod jobs without manual intervention.


### Notes/Constraints/Caveats
Contributor

A few things I'll ask you to answer here.

  1. How does your proposal handle that DRA has various api levels between 1.30-1.32?

  2. How do you plan to test this functionality?
    a) Is there any way to test this cross node functionality with dra-example-driver?

Contributor

If we eventually get this work into batch/Job in upstream, what does that mean about this feature? Can you see a world where this exists and Job ResourceClaimTemplates exist together?

Author

@johnbelamaric please keep me honest/on track here, but here are my thoughts:

How does your proposal handle that DRA has various api levels between 1.30-1.32?

One advantage of adding the RCT as a top-level JobSet change is that we can "decouple" from the core API changes. That said, we should encourage people to use the beta/latest DRA API.

How do you plan to test this functionality?
a) Is there any way to test this cross node functionality with dra-example-driver?

AFAICT, the dra-example-driver doesn't have cross-node support yet; do you have any suggestions beyond manual testing?

Can you see a world where this exists and Job ResourceClaimTemplates exist together?

I'll defer to folks who have more insight into the future of DRA/JobSet, but my 2c are that having a semantic for making per-JobSet device claims will continue to be useful.

Contributor

One advantage of adding the RCT as a top-level JobSet change is that we can "decouple" from the core API changes. That said, we should encourage people to use the beta/latest DRA API.

At some point, I imagine the RCT will create ResourceClaims via DRA, right? So in this case, we will eventually create DRA objects whose API level differs across Kubernetes versions.

I'm trying to understand why you will use this feature for 1.30, 1.31 and 1.32 (where DRA is alpha, alpha and beta). JobSet runs on the latest three releases of Kubernetes.

AFAICT, the dra-example-driver doesn't have cross-node support yet; do you have any suggestions beyond manual testing?

If that is what we have, I guess that is fine. Do you have an environment where you can test this feature and confirm it works before merge?


@bg-chun is working on beefing up the example driver to demonstrate various device modeling features (e.g., partitionable). We might, though, want a separate test driver?

The version is an interesting point. In 1.33 we have v1beta2, which we expect will be the shape of the API that we GA. I don't think we should support the alpha versions, frankly. Is it a problem for this particular feature to only work on 1.32+? Since that's when it went to beta? If we do that, then we can shape JobSet's embedded template based on the v1beta2 RCT, but actually generate v1beta1 RCTs (they can roundtrip, so this is do-able). That way, it works on 1.32 and 1.33. Eventually (when 1.34 is the oldest release), we change the RCTs we create to v1 not v1beta1.
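To illustrate the round-trip idea, here is a sketch of the same claim template expressed against both API versions. This is my reading of the 1.32/1.33 resource.k8s.io APIs and should be double-checked against the published types; the names `shared-data` and `shared` are illustrative:

```yaml
# What JobSet would generate on a 1.32 cluster (resource.k8s.io/v1beta1):
# per-request fields such as deviceClassName sit directly on the request.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: shared-data
spec:
  spec:
    devices:
      requests:
      - name: shared
        deviceClassName: shared-data-resource
---
# The v1beta2 shape (1.33+) that JobSet's embedded template would be modeled
# on; in v1beta2 the per-request fields moved under "exactly", which is what
# makes the two versions round-trippable.
apiVersion: resource.k8s.io/v1beta2
kind: ResourceClaimTemplate
metadata:
  name: shared-data
spec:
  spec:
    devices:
      requests:
      - name: shared
        exactly:
          deviceClassName: shared-data-resource
```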

Alternatively, we reference a separate RCT by name in the JobSet spec, and so the shape of that API no longer includes the template for the RCT, but pretty much just the "containers" bit that we have to layer in.


I’m currently working on adding a reference implementation for the Partitionable Device API to the example DRA driver. This API covers two scenarios: NVIDIA MIG and Multi-Host TPU Scheduling. However, since the current example DRA driver mimics NVIDIA GPUs, I’ll start by implementing a simple version of the MIG feature.

After adding this MIG-like implementation of the Partitionable Device API, I think we should have a discussion about the overall direction of the example device driver project. I believe the example DRA driver should ultimately demonstrate various device modeling scenarios.

Contributor

Thank you @johnbelamaric!

Those are the main items I am thinking through on this.

How do we handle different versions of DRA?

I am okay with supporting 1.32+ but we should make it clear.

What happens with 1.30 and 1.31 if this API is used? Will there be a feature gate for JobSet?

Can we discover what kind of API is available in the APIServer and convert?

Author

( just chiming in here, I reckon the thinking is to go with:

I am okay with supporting 1.32+ but we should make it clear.

& feature gate it. We can error out if the gate is on but we don't meet the minimum API server requirement?)
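For example (the gate name here is hypothetical), the gate could be a controller flag, with a startup check that fails fast when the API server does not serve a new-enough resource.k8s.io version:

```yaml
# Hypothetical: enabling the gate on the jobset-controller-manager Deployment.
# On startup, the controller could probe the discovery API for
# resource.k8s.io/v1beta1 (served on 1.32+) and exit with an error if the
# gate is enabled but that group version is absent.
spec:
  containers:
  - name: manager
    args:
    - --feature-gates=JobSetResourceClaimTemplates=true  # hypothetical gate name
```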

Signed-off-by: Alex Pana <[email protected]>
@acpana acpana requested review from johnbelamaric and kannon92 April 7, 2025 18:36
Member

@andreyvelich left a comment

Thank you for this @johnbelamaric @acpana!
Maybe we could also speak about it during one of the following Batch WG calls.

```yaml
image: helper-image:latest
command: ["sleep", "infinity"]
restartPolicy: OnFailure
resourceClaimTemplates: # NEW: JobSet-level ResourceClaimTemplates (added inside JobSpec)
```
Member

Why is this API under `replicatedJob[].template.spec.resourceClaimTemplates`?

Author

Ah good catch! This should be one level higher 😛

Comment on lines +382 to +387

### Modifying CoreJob API

We could add a similar feature to the core Kubernetes Job API. However, this has a
much higher barrier to entry, wider impact, and longer development cycle compared to
enhancing an out-of-tree controller like JobSet.
Member

What is our strategy here @ahg-g @kannon92 @acpana @tenzen-y @johnbelamaric ?
We have similar features that we want to implement on the Job layer first: #572.

I am not sure if it is right for the long-term to push more features to the JobSet controller rather than Job controller.

Contributor

xref: #762 (comment)

We did meet and discuss this feature at wg-device-management a few months back. I think there is a lot of interest in this, and it is not clear whether we should put all of it in the Job controller. If we want each replicated job to have a different resourceClaim, then I think it would make sense to have this just in the Job API.

If we want multiple replicated jobs to share the same resourceClaim, it may make sense to have it at the JobSet level.

For LWS, @ahg-g mentions that even if there was support for this in StatefulSet it still may be necessary to have this for LWS to help tune the behavior LWS wants.

I also am open to experimenting more in out-of-tree controllers.

There are downsides to putting this in an out-of-tree controller, though. We don't test/verify DRA in any capacity in our CI, so we would need some kind of testing to verify that this works with DRA across different releases. We also inherit a potentially unstable API, and if upstream adds support for this feature, maybe we have to deprecate.


For LWS, we want both the Leader and Worker StatefulSet to share a resource claim. Support at the StatefulSet level only would mean all the workers sharing one, and the leader having a separate resource claim.

Member

So my understanding from this discussion is:

  • We think that DRA eventually will be added to the in-tree workload controllers.
  • We think there are scenarios where it will not be sufficient to only rely on support in the in-tree controllers (like the LWS example), meaning that there needs to be some level of support for DRA out-of-tree.

What is not clear to me is whether the implementation in JobSet would look different if we had support in-tree, i.e. are we creating long-term challenges or inconsistencies if this is implemented in JobSet and we later also have support in Job/StatefulSet? Maybe it would be worth thinking about what that would look like.

As for the question around CI and unstable APIs, we are planning to get DRA to GA in the next release. That should make it pretty straightforward to test with DRA, although we need to handle the versioning for a few releases. And with the v1beta2 API that was added in 1.33, we don't expect any significant changes between v1beta2 and v1.

```yaml
deviceClassName: shared-data-resource
```

#### Behavior Changes
Member

Author

I think so but I am not super familiar with PVCs at this point

Comment on lines +256 to +257
* Pod-level `resourceClaim`s with the same name as a JobSet-templated
claim will override the JobSet-level claim for that container, and a warning will be emitted.
Member

Can we solve it with webhook validation rather than a warning?

Author

We could! But I'd prefer not to rely on webhooks given the complexity they introduce 🤔. I can add a small blurb about that.

Member

JobSet already maintains a validation webhook, so we can just add an additional check for resourceClaim.
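For concreteness, the collision being validated might look like this (claim and template names are illustrative, and the JobSet-level template `shared-data` is hypothetical):

```yaml
# Pod template inside a replicated job. The claim name collides with a
# hypothetical JobSet-level template named "shared-data", so per the proposed
# semantics the pod-level entry wins for this pod, and JobSet would either
# emit a warning or reject the spec in its validation webhook.
spec:
  resourceClaims:
  - name: shared-data
    resourceClaimTemplateName: my-own-template  # pod-level override
  containers:
  - name: worker
    resources:
      claims:
      - name: shared-data
```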


8 participants