
Conversation

@archlitchi

Currently, KAI-scheduler does not enforce resource isolation when using the gpu-sharing feature (related issues: issue#49, issue#45). This document introduces an approach to implement it.

It is only a design document for now, but we plan to implement it if it passes evaluation.

Signed-off-by: limengxuan <[email protected]>
@archlitchi
Author

@davidLif @omer-dayan could you help me review this?

@romanbaron
Collaborator

Thank you for proposing this feature for KAI Scheduler and for your patience while we reviewed your suggestion.

Based on the PR details, I understand the key components of the proposed solution to be:

  1. A Hami Core daemonset that deploys libraries to cluster nodes
  2. A HostPath mount in the workload pod to access Hami Core libraries from the GPU node
  3. An environment variable specifying the amount of memory allocated to the running container

The technical aspects seem clear to me. However, I do have a couple of questions:

  1. What would the deployment model look like for such integration? Does Hami have a standalone k8s deployment / helm chart that can be installed alongside KAI Scheduler?
  2. I couldn’t find CUDA_DEVICE_MEMORY_LIMIT in the official CUDA documentation, so I assume it’s a custom variable name. Would it be possible to use a more neutral name? A generic naming approach might make it easier for other runtime support solutions to adopt it and help avoid confusion by not referencing CUDA.

@archlitchi
Author

archlitchi commented Apr 17, 2025

Thank you for proposing this feature for KAI Scheduler and for your patience while we reviewed your suggestion.

Based on the PR details, I understand the key components of the proposed solution to be:

  1. A Hami Core daemonset that deploys libraries to cluster nodes
  2. A HostPath mount in the workload pod to access Hami Core libraries from the GPU node
  3. An environment variable specifying the amount of memory allocated to the running container

The technical aspects seem clear to me. However, I do have a couple of questions:

  1. What would the deployment model look like for such integration? Does Hami have a standalone k8s deployment / helm chart that can be installed alongside KAI Scheduler?
  2. I couldn’t find CUDA_DEVICE_MEMORY_LIMIT in the official CUDA documentation, so I assume it’s a custom variable name. Would it be possible to use a more neutral name? A generic naming approach might make it easier for other runtime support solutions to adopt it and help avoid confusion by not referencing CUDA.

Thanks for the review, I'll do my best to answer:

  1. HAMi does have a helm chart, but it includes a self-modified hami-nvidia-device-plugin (which contains HAMi-core) and a hami-scheduler. It can't be used with KAI scheduler because the scheduling parts conflict, and hami-nvidia-device-plugin can't work properly without hami-scheduler, since it relies on information that hami-scheduler patches into pod annotations.
  2. So I had to design another approach for KAI-scheduler and try to minimize the impact on KAI-scheduler while supporting this feature. The deployment model here is simple; only two steps are needed (a sketch of the distribution daemonset follows below):
    -> install KAI-scheduler
    -> install the hami-core distribution daemonset, or manually copy hami-core to every GPU node
  3. As for changing the env 'CUDA_DEVICE_MEMORY_LIMIT' to a more neutral name: yes, definitely, we can sort this out.
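
For concreteness, such a distribution DaemonSet could look roughly like the sketch below. This is only an illustration: the image name, the libvgpu.so file name, the host directory, and the GPU node label are placeholders rather than an agreed layout.

```yaml
# Illustrative only: copy the HAMi-core library onto every GPU node via a hostPath.
# The image, paths, and node selector label below are assumptions, not a final spec.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: hami-core-distributor
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: hami-core-distributor
  template:
    metadata:
      labels:
        app: hami-core-distributor
    spec:
      nodeSelector:
        nvidia.com/gpu.present: "true"        # assumption: GPU nodes carry this label
      containers:
        - name: copy-hami-core
          image: example.com/hami-core:latest # placeholder image that ships libvgpu.so
          command: ["/bin/sh", "-c"]
          args:
            - cp /hami-core/libvgpu.so /host/hami-core/ && sleep infinity
          volumeMounts:
            - name: host-dir
              mountPath: /host/hami-core
      volumes:
        - name: host-dir
          hostPath:
            path: /usr/local/hami-core        # placeholder host directory
            type: DirectoryOrCreate
```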

@wy0824

wy0824 commented Apr 20, 2025

Hello @archlitchi, I am looking for a way to get hard GPU memory isolation, and I think this approach would work for us. So what we need to change is the kai-scheduler mutating webhook? And maybe the pod annotation design; currently the annotation is pod-scoped, not container-scoped.

@archlitchi
Author

Hello @archlitchi, I am looking for a way to get hard GPU memory isolation, and I think this approach would work for us. So what we need to change is the kai-scheduler mutating webhook? And maybe the pod annotation design; currently the annotation is pod-scoped, not container-scoped.

Yes, we need to inject the environment variable into each container that uses a shared GPU, according to the pod annotations, in the kai-scheduler mutating webhook.

@romanbaron
Collaborator

Thank you for proposing this feature for KAI Scheduler and for your patience while we reviewed your suggestion.
Based on the PR details, I understand the key components of the proposed solution to be:

  1. A Hami Core daemonset that deploys libraries to cluster nodes
  2. A HostPath mount in the workload pod to access Hami Core libraries from the GPU node
  3. An environment variable specifying the amount of memory allocated to the running container

The technical aspects seem clear to me. However, I do have a couple of questions:

  1. What would the deployment model look like for such integration? Does Hami have a standalone k8s deployment / helm chart that can be installed alongside KAI Scheduler?
  2. I couldn’t find CUDA_DEVICE_MEMORY_LIMIT in the official CUDA documentation, so I assume it’s a custom variable name. Would it be possible to use a more neutral name? A generic naming approach might make it easier for other runtime support solutions to adopt it and help avoid confusion by not referencing CUDA.

Thanks for the review, I'll do my best to answer:

  1. HAMi does have a helm chart, but it includes a self-modified hami-nvidia-device-plugin (which contains HAMi-core) and a hami-scheduler. It can't be used with KAI scheduler because the scheduling parts conflict, and hami-nvidia-device-plugin can't work properly without hami-scheduler, since it relies on information that hami-scheduler patches into pod annotations.
  2. So I had to design another approach for KAI-scheduler and try to minimize the impact on KAI-scheduler while supporting this feature. The deployment model here is simple; only two steps are needed:
    -> install KAI-scheduler
    -> install the hami-core distribution daemonset, or manually copy hami-core to every GPU node
  3. As for changing the env 'CUDA_DEVICE_MEMORY_LIMIT' to a more neutral name: yes, definitely, we can sort this out.

Great to hear! On our side, it's crucial that the integration remains open, as we believe our users will greatly benefit from the flexibility to onboard additional frameworks at this level in the future. With that in mind, we're happy to align on the API, specifically the environment variable name and value format. I propose using GPU_MEMORY_LIMIT as the variable name, with its value expressed in bytes.

I suggest that the DaemonSet and the mutating webhook configuration be part of a separate, optional deployment. This allows users to opt-in by deploying it within their clusters. Once deployed, they’ll benefit from KAI Scheduler capabilities and memory isolation provided by Hami Core. I believe the Hami Core project is the right place to host the corresponding YAML files or deployment instructions. Within your development repository it will remain up to date and properly maintained in the future.

Here’s an outline of how I envision the process:

  1. Hami Core memory isolation components are deployed to the cluster based on your deployment/instructions.
  2. KAI Scheduler is deployed to the cluster using KAI Scheduler Helm Chart.
  3. Pod requesting GPU sharing (via fractional or memory-based request) is submitted:
    a. Hami Core mutating webhook injects a volume mount for the Hami Core library.
    b. KAI Scheduler injects the GPU_MEMORY_LIMIT environment variable.
  4. KAI Scheduler determines the appropriate node for the pod.
  5. Based on the scheduling decision, KAI Scheduler sets the value of the environment variable accordingly.
  6. The container is created, the Hami Core libraries are mounted, and memory isolation is enforced based on the GPU_MEMORY_LIMIT env var (a rough sketch of the resulting container spec follows below).
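
To make the end state concrete, the mutated workload pod might end up looking roughly like this. Everything here is illustrative: the host directory, volume name, and the LD_PRELOAD mechanism for loading HAMi-core are assumptions, not the final API.

```yaml
# Rough sketch of a workload container after both mutations; paths and names are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: shared-gpu-workload
spec:
  containers:
    - name: app
      image: nvidia/cuda:12.4.1-base-ubuntu22.04
      env:
        - name: GPU_MEMORY_LIMIT
          value: "4294967296"               # 4 GiB expressed in bytes
        - name: LD_PRELOAD                  # assumption: HAMi-core is loaded via LD_PRELOAD
          value: /hami-core/libvgpu.so
      volumeMounts:
        - name: hami-core
          mountPath: /hami-core
          readOnly: true
  volumes:
    - name: hami-core
      hostPath:
        path: /usr/local/hami-core          # placeholder: wherever the DaemonSet put the library
        type: Directory
```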

[Diagram: Hami-KAI integration flow]

@wy0824

wy0824 commented Apr 22, 2025

@romanbaron I agree with deploying hami-core and the mutating webhook separately so as not to interfere with kai-scheduler's code. Will you provide the mutating webhook?

@archlitchi
Author

Thanks @romanbaron, @wy0824, that's a nice approach. I'll set out the TODO list here based on the updates above; please help me review it, and I'll start working on it once the review has passed.

  • Update the design document in the docs/developer folder to match the approach above
  • Create a repository called 'KAI-resource-isolator' in the HAMi-core org, containing the mutating webhook and the hami-core DaemonSet
  • Add a document about using this feature in the docs/gpu-sharing folder
  • Inject the GPU_MEMORY_LIMIT env in KAI-scheduler (I'm not sure whether I can inject GPU_MEMORY_LIMIT in the binding phase; if not, I'll put that part in the pod mutator)

@enoodle
Collaborator

enoodle commented Apr 22, 2025

What decisions are made in the scheduler that affect the setting of this GPU_MEMORY_LIMIT environment variable?
To my understanding it comes from the user, set as the value of the "gpu-memory" label, and we can use the downward API to direct it into an environment variable. (https://kubernetes.io/docs/concepts/workloads/pods/downward-api/)
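
For example, a downward API projection along these lines would do it; this is only a sketch, assuming the request lands on a pod label named "gpu-memory" (as suggested above) and that its value is already in the agreed unit:

```yaml
# Minimal downward API sketch; the "gpu-memory" label name and its unit are assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: downward-api-example
  labels:
    gpu-memory: "4294967296"
spec:
  containers:
    - name: app
      image: nvidia/cuda:12.4.1-base-ubuntu22.04
      env:
        - name: GPU_MEMORY_LIMIT
          valueFrom:
            fieldRef:
              fieldPath: metadata.labels['gpu-memory']
```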

@archlitchi
Author

What decisions are made in the scheduler that affect the setting of this GPU_MEMORY_LIMIT environment variable? To my understanding it comes from the user, set as the value of the "gpu-memory" label, and we can use the downward API to direct it into an environment variable. (https://kubernetes.io/docs/concepts/workloads/pods/downward-api/)

Yes, the value of 'GPU_MEMORY_LIMIT' is not affected by scheduling decisions; that's why we can directly patch the environment variable in the pod mutator.

@romanbaron
Collaborator

What decisions are made in the scheduler that affect the setting of this GPU_MEMORY_LIMIT environment variable? To my understanding it comes from the user, set as the value of the "gpu-memory" label, and we can use the downward API to direct it into an environment variable. (https://kubernetes.io/docs/concepts/workloads/pods/downward-api/)

Yes, the value of 'GPU_MEMORY_LIMIT' is not affected by scheduling decisions; that's why we can directly patch the environment variable in the pod mutator.

We support two types of GPU Sharing requests:

  1. GPU memory - explicit request for a specific GPU memory amount - known at pod creation phase.
  2. GPU fraction - the pod can request a portion of a GPU (e.g. 0.5 of a GPU) - in a heterogeneous cluster, this value can only be translated into GPU memory after the node is selected, since nodes with different GPU devices result in different GPU memory limits (see the example below).
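
To illustrate the difference with made-up numbers (the annotation keys and units here are assumptions based on the two request types described above, not a verified API, and other required fields such as queue labels are omitted):

```yaml
# Explicit memory request: the limit is known at pod creation time.
apiVersion: v1
kind: Pod
metadata:
  name: memory-request-example
  annotations:
    gpu-memory: "4096"        # e.g. 4096 MiB -> GPU_MEMORY_LIMIT can be set right away
spec:
  containers:
    - name: app
      image: nvidia/cuda:12.4.1-base-ubuntu22.04
---
# Fractional request: the limit depends on the selected node's device,
# e.g. 0.5 of a 24 GiB GPU -> 12 GiB, while 0.5 of an 80 GiB GPU -> 40 GiB,
# so GPU_MEMORY_LIMIT can only be finalized after node selection.
apiVersion: v1
kind: Pod
metadata:
  name: fraction-request-example
  annotations:
    gpu-fraction: "0.5"
spec:
  containers:
    - name: app
      image: nvidia/cuda:12.4.1-base-ubuntu22.04
```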

@romanbaron
Collaborator

Thanks @romanbaron, @wy0824, that's a nice approach. I'll set out the TODO list here based on the updates above; please help me review it, and I'll start working on it once the review has passed.

  • Update the design document in the docs/developer folder to match the approach above
  • Create a repository called 'KAI-resource-isolator' in the HAMi-core org, containing the mutating webhook and the hami-core DaemonSet
  • Add a document about using this feature in the docs/gpu-sharing folder
  • Inject the GPU_MEMORY_LIMIT env in KAI-scheduler (I'm not sure whether I can inject GPU_MEMORY_LIMIT in the binding phase; if not, I'll put that part in the pod mutator)

I can take the KAI Scheduler part: I can add a document that describes the architecture around the GPU_MEMORY_LIMIT env var, covering both types of GPU sharing requests, and then also implement it in KAI Scheduler.

Will you cover the KAI-resource-isolator repo part?

If we agree, this PR is probably no longer needed.

@archlitchi
Author

Thanks @romanbaron, @wy0824, that's a nice approach. I'll set out the TODO list here based on the updates above; please help me review it, and I'll start working on it once the review has passed.

  • Update the design document in the docs/developer folder to match the approach above
  • Create a repository called 'KAI-resource-isolator' in the HAMi-core org, containing the mutating webhook and the hami-core DaemonSet
  • Add a document about using this feature in the docs/gpu-sharing folder
  • Inject the GPU_MEMORY_LIMIT env in KAI-scheduler (I'm not sure whether I can inject GPU_MEMORY_LIMIT in the binding phase; if not, I'll put that part in the pod mutator)

I can take the KAI Scheduler part: I can add a document that describes the architecture around the GPU_MEMORY_LIMIT env var, covering both types of GPU sharing requests, and then also implement it in KAI Scheduler.

Will you cover the KAI-resource-isolator repo part?

If we agree, this PR is probably no longer needed.

Of course, I can manage 'KAI-resource-isolator'.
Can I add a user guide in the docs/gpu-sharing folder, or at least a link to 'KAI-resource-isolator' in your documentation? Otherwise, users may not know where to find the 'KAI-resource-isolator' repository.
Also, do you have a Slack or Discord channel so we can stay in touch once this PR is closed?

@wy0824

wy0824 commented Apr 23, 2025

@romanbaron will kai-scheduler support GPU core sharing? There is CUDA_DEVICE_SM_LIMIT in hami-core.

@romanbaron
Collaborator

@romanbaron will kai-scheduler support GPU core sharing? There is CUDA_DEVICE_SM_LIMIT in hami-core.

Maybe at some point, but it is not on our 2025 roadmap.
I just opened a new discussions category (https://github.com/NVIDIA/KAI-Scheduler/discussions/categories/roadmap) where you can open a discussion about it with our product manager.

@romanbaron
Collaborator

Thanks @romanbaron, @wy0824, that's a nice approach. I'll set out the TODO list here based on the updates above; please help me review it, and I'll start working on it once the review has passed.

  • Update the design document in the docs/developer folder to match the approach above
  • Create a repository called 'KAI-resource-isolator' in the HAMi-core org, containing the mutating webhook and the hami-core DaemonSet
  • Add a document about using this feature in the docs/gpu-sharing folder
  • Inject the GPU_MEMORY_LIMIT env in KAI-scheduler (I'm not sure whether I can inject GPU_MEMORY_LIMIT in the binding phase; if not, I'll put that part in the pod mutator)

I can take the KAI Scheduler part: I can add a document that describes the architecture around the GPU_MEMORY_LIMIT env var, covering both types of GPU sharing requests, and then also implement it in KAI Scheduler.
Will you cover the KAI-resource-isolator repo part?
If we agree, this PR is probably no longer needed.

Of course, I can manage 'KAI-resource-isolator'. Can I add a user guide in the docs/gpu-sharing folder, or at least a link to 'KAI-resource-isolator' in your documentation? Otherwise, users may not know where to find the 'KAI-resource-isolator' repository. Also, do you have a Slack or Discord channel so we can stay in touch once this PR is closed?

Thank you for the contribution and for suggesting the addition! At this time, in order to remain neutral with regard to third-party integrations, we're not adding references to specific integrations in our documentation.

On the topic of staying connected, we currently don't have a dedicated Slack channel, but we are working on creating one. We will keep you updated once it is set!

@archlitchi
Author

Thank you for the contribution and for suggesting the addition! At this time, in order to remain neutral with regard to third-party integrations, we're not adding references to specific integrations in our documentation.

On the topic of staying connected, we currently don't have a dedicated Slack channel, but we are working on creating one. We will keep you updated once it is set!

OK, no problem. I'm working on KAI-resource-isolator now; in the meantime, I'll leave this PR open for sync and discussion.

@harche

harche commented Apr 28, 2025

Is there any reason nvidia MPS is not considered for resource isolation?

/cc @mrunalp @EkinKarabulut

@EkinKarabulut
Collaborator

Is there any reason nvidia MPS is not considered for resource isolation?

/cc @mrunalp @EkinKarabulut

Our goal is to remain agnostic to all tools and solutions (including resource isolation solutions), letting users choose whatever fits their needs best. That being said, MPS is compatible with the KAI scheduler as long as a couple of MPS-related conditions are met in your setup:

  • The MPS server must be running on the node(s) where you want to schedule your workloads
  • The hostPath volume containing the socket file (used for communicating with the MPS server) needs to be mounted into your workloads

With these in place, KAI will be able to schedule MPS workloads without issues (a rough example follows below). If you run into any unexpected behavior, please feel free to open an issue - we are more than happy to investigate and help out!
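
As a rough illustration of those two conditions, a workload pod might mount the MPS pipe directory like this. The default /tmp/nvidia-mps location is assumed here, and depending on how MPS is configured on the node you may also need additional pieces (for example a shared IPC namespace):

```yaml
# Sketch only: mount the MPS pipe directory (default location assumed) into the workload.
apiVersion: v1
kind: Pod
metadata:
  name: mps-workload-example
spec:
  containers:
    - name: app
      image: nvidia/cuda:12.4.1-base-ubuntu22.04
      env:
        - name: CUDA_MPS_PIPE_DIRECTORY    # must match the directory used by the MPS control daemon
          value: /tmp/nvidia-mps
      volumeMounts:
        - name: mps-pipe
          mountPath: /tmp/nvidia-mps
  volumes:
    - name: mps-pipe
      hostPath:
        path: /tmp/nvidia-mps
        type: Directory
```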

cc: @romanbaron @omer-dayan

@anencore94

Are there any updates on the timeline for this feature (hardware-level isolation)?

@jj1kim mentioned this pull request Aug 14, 2025
@testinfected

Are there any updates on the timeline for this feature (hardware-level isolation)?

Same question. Is this still in the plans?
