
Conversation

@archlitchi

Currently, KAI-scheduler does not enforce resource isolation when using the gpu-sharing feature (related issues: issue#49, issue#45). This document introduces an approach to implement it.

It is only a design document for now, but we plan to implement it if it passes evaluation.

Signed-off-by: limengxuan <[email protected]>
@archlitchi
Author

@davidLif @omer-dayan could you help me review this?

@romanbaron
Collaborator

Thank you for proposing this feature for KAI Scheduler and for your patience while we reviewed your suggestion.

Based on the PR details, I understand the key components of the proposed solution to be:

  1. A Hami Core daemonset that deploys libraries to cluster nodes
  2. A HostPath mount in the workload pod to access Hami Core libraries from the GPU node
  3. An environment variable specifying the amount of memory allocated to the running container

The technical aspects seem clear to me. However, I do have a couple of questions:

  1. What would the deployment model look like for such integration? Does Hami have a standalone k8s deployment / helm chart that can be installed alongside KAI Scheduler?
  2. I couldn’t find CUDA_DEVICE_MEMORY_LIMIT in the official CUDA documentation, so I assume it’s a custom variable name. Would it be possible to use a more neutral name? A generic naming approach might make it easier for other runtime support solutions to adopt it and help avoid confusion by not referencing CUDA.

@archlitchi
Author

archlitchi commented Apr 17, 2025

Thank you for proposing this feature for KAI Scheduler and for your patience while we reviewed your suggestion.

Based on the PR details, I understand the key components of the proposed solution to be:

  1. A Hami Core daemonset that deploys libraries to cluster nodes
  2. A HostPath mount in the workload pod to access Hami Core libraries from the GPU node
  3. An environment variable specifying the amount of memory allocated to the running container

The technical aspects seem clear to me. However, I do have a couple of questions:

  1. What would the deployment model look like for such integration? Does Hami have a standalone k8s deployment / helm chart that can be installed alongside KAI Scheduler?
  2. I couldn’t find CUDA_DEVICE_MEMORY_LIMIT in the official CUDA documentation, so I assume it’s a custom variable name. Would it be possible to use a more neutral name? A generic naming approach might make it easier for other runtime support solutions to adopt it and help avoid confusion by not referencing CUDA.

Thanks for the review, I'll do my best to answer:

  1. HAMi does have a helm chart, but it includes a self-modified hami-nvidia-device-plugin (which contains HAMi-core) and a hami-scheduler. It can't be used with KAI scheduler because the scheduling parts conflict, and hami-nvidia-device-plugin can't work properly without hami-scheduler, since it relies on information that hami-scheduler patches into pod annotations.
  2. So I had to design another approach for KAI-scheduler and try to minimize the impact on KAI-scheduler while supporting this feature. The deployment model here is simple; only two steps are needed (a sketch of the distribution daemonset follows below):
    -> install KAI-scheduler
    -> install the hami-core distribution daemonset, or manually copy hami-core to every GPU node
  3. As for changing the env 'CUDA_DEVICE_MEMORY_LIMIT' to a more neutral name: yes, definitely, we can sort this out.
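
For concreteness, such a distribution DaemonSet could look roughly like the sketch below. This is only an illustration: the image name, the libvgpu.so file name, the host directory, and the GPU node label are placeholders rather than an agreed layout.

```yaml
# Illustrative only: copy the HAMi-core library onto every GPU node via a hostPath.
# The image, paths, and node selector label below are assumptions, not a final spec.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: hami-core-distributor
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: hami-core-distributor
  template:
    metadata:
      labels:
        app: hami-core-distributor
    spec:
      nodeSelector:
        nvidia.com/gpu.present: "true"        # assumption: GPU nodes carry this label
      containers:
        - name: copy-hami-core
          image: example.com/hami-core:latest # placeholder image that ships libvgpu.so
          command: ["/bin/sh", "-c"]
          args:
            - cp /hami-core/libvgpu.so /host/hami-core/ && sleep infinity
          volumeMounts:
            - name: host-dir
              mountPath: /host/hami-core
      volumes:
        - name: host-dir
          hostPath:
            path: /usr/local/hami-core        # placeholder host directory
            type: DirectoryOrCreate
```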

@wy0824

wy0824 commented Apr 20, 2025

Hello @archlitchi, I am looking for a way to get hard GPU memory isolation, and I think this approach would work for us. So what we need to change is the kai-scheduler mutating webhook? And maybe the pod annotation design; currently the annotation is pod-scoped, not container-scoped.

@archlitchi
Author

Hello @archlitchi, I am looking for a way to get hard GPU memory isolation, and I think this approach would work for us. So what we need to change is the kai-scheduler mutating webhook? And maybe the pod annotation design; currently the annotation is pod-scoped, not container-scoped.

Yes, we need to inject the environment variable into each container that uses a shared GPU, according to the pod annotations, in the kai-scheduler mutating webhook.

@romanbaron
Collaborator

Thank you for proposing this feature for KAI Scheduler and for your patience while we reviewed your suggestion.
Based on the PR details, I understand the key components of the proposed solution to be:

  1. A Hami Core daemonset that deploys libraries to cluster nodes
  2. A HostPath mount in the workload pod to access Hami Core libraries from the GPU node
  3. An environment variable specifying the amount of memory allocated to the running container

The technical aspects seem clear to me. However, I do have a couple of questions:

  1. What would the deployment model look like for such integration? Does Hami have a standalone k8s deployment / helm chart that can be installed alongside KAI Scheduler?
  2. I couldn’t find CUDA_DEVICE_MEMORY_LIMIT in the official CUDA documentation, so I assume it’s a custom variable name. Would it be possible to use a more neutral name? A generic naming approach might make it easier for other runtime support solutions to adopt it and help avoid confusion by not referencing CUDA.

Thanks for the review, I'll do my best to answer:

  1. HAMi does have a helm chart, but it includes a self-modified hami-nvidia-device-plugin (which contains HAMi-core) and a hami-scheduler. It can't be used with KAI scheduler because the scheduling parts conflict, and hami-nvidia-device-plugin can't work properly without hami-scheduler, since it relies on information that hami-scheduler patches into pod annotations.
  2. So I had to design another approach for KAI-scheduler and try to minimize the impact on KAI-scheduler while supporting this feature. The deployment model here is simple; only two steps are needed:
    -> install KAI-scheduler
    -> install the hami-core distribution daemonset, or manually copy hami-core to every GPU node
  3. As for changing the env 'CUDA_DEVICE_MEMORY_LIMIT' to a more neutral name: yes, definitely, we can sort this out.

Great to hear! On our side, it's crucial that the integration remains open, as we believe our users will greatly benefit from the flexibility to onboard additional frameworks at this level in the future. With that in mind, we're happy to align on the API, specifically the environment variable name and value format. I propose using GPU_MEMORY_LIMIT as the variable name, with its value expressed in bytes.

I suggest that the DaemonSet and the mutating webhook configuration be part of a separate, optional deployment. This allows users to opt-in by deploying it within their clusters. Once deployed, they’ll benefit from KAI Scheduler capabilities and memory isolation provided by Hami Core. I believe the Hami Core project is the right place to host the corresponding YAML files or deployment instructions. Within your development repository it will remain up to date and properly maintained in the future.

Here’s an outline of how I envision the process:

  1. Hami Core memory isolation components are deployed to the cluster based on your deployment/instructions.
  2. KAI Scheduler is deployed to the cluster using KAI Scheduler Helm Chart.
  3. Pod requesting GPU sharing (via fractional or memory-based request) is submitted:
    a. Hami Core mutating webhook injects a volume mount for the Hami Core library.
    b. KAI Scheduler injects the GPU_MEMORY_LIMIT environment variable.
  4. KAI Scheduler determines the appropriate node for the pod.
  5. Based on the scheduling decision, KAI Scheduler sets the value of the environment variable accordingly.
  6. The container is created, the Hami Core libraries are mounted, and memory isolation is enforced based on the GPU_MEMORY_LIMIT env var (a rough sketch of the resulting container spec follows below).
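
To make the end state concrete, the mutated workload pod might end up looking roughly like this. Everything here is illustrative: the host directory, volume name, and the LD_PRELOAD mechanism for loading HAMi-core are assumptions, not the final API.

```yaml
# Rough sketch of a workload container after both mutations; paths and names are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: shared-gpu-workload
spec:
  containers:
    - name: app
      image: nvidia/cuda:12.4.1-base-ubuntu22.04
      env:
        - name: GPU_MEMORY_LIMIT
          value: "4294967296"               # 4 GiB expressed in bytes
        - name: LD_PRELOAD                  # assumption: HAMi-core is loaded via LD_PRELOAD
          value: /hami-core/libvgpu.so
      volumeMounts:
        - name: hami-core
          mountPath: /hami-core
          readOnly: true
  volumes:
    - name: hami-core
      hostPath:
        path: /usr/local/hami-core          # placeholder: wherever the DaemonSet put the library
        type: Directory
```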

[Diagram: Hami-KAI integration flow]

@wy0824

wy0824 commented Apr 22, 2025

@romanbaron I agree with deploying hami-core and the mutating webhook separately so as not to interfere with kai-scheduler's code. Will you provide the mutating webhook?

@archlitchi
Author

Thanks @romanbaron, @wy0824, that's a nice approach. I'll set out the TODO list here based on the updates above; please help me review it, and I'll start working on it once the review has passed.

  • Update the design document in the docs/developer folder to match the approach above
  • Create a repository called 'KAI-resource-isolator' in the HAMi-core org, containing the mutating webhook and the hami-core DaemonSet
  • Add a document about using this feature in the docs/gpu-sharing folder
  • Inject the GPU_MEMORY_LIMIT env in KAI-scheduler (I'm not sure whether I can inject GPU_MEMORY_LIMIT in the binding phase; if not, I'll put that part in the pod mutator)

@enoodle
Collaborator

enoodle commented Apr 22, 2025

What decisions are made in the scheduler that affect the setting of this GPU_MEMORY_LIMIT environment variable?
To my understanding it comes from the user, set as the value of the "gpu-memory" label, and we can use the downward API to direct it into an environment variable. (https://kubernetes.io/docs/concepts/workloads/pods/downward-api/)
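
For example, a downward API projection along these lines would do it; this is only a sketch, assuming the request lands on a pod label named "gpu-memory" (as suggested above) and that its value is already in the agreed unit:

```yaml
# Minimal downward API sketch; the "gpu-memory" label name and its unit are assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: downward-api-example
  labels:
    gpu-memory: "4294967296"
spec:
  containers:
    - name: app
      image: nvidia/cuda:12.4.1-base-ubuntu22.04
      env:
        - name: GPU_MEMORY_LIMIT
          valueFrom:
            fieldRef:
              fieldPath: metadata.labels['gpu-memory']
```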

@archlitchi
Author

What decisions are made in the scheduler that affect the setting of this GPU_MEMORY_LIMIT environment variable? To my understanding it comes from the user, set as the value of the "gpu-memory" label, and we can use the downward API to direct it into an environment variable. (https://kubernetes.io/docs/concepts/workloads/pods/downward-api/)

Yes, the value of 'GPU_MEMORY_LIMIT' is not affected by scheduling decisions; that's why we can directly patch the environment variable in the pod mutator.

@romanbaron
Collaborator

What decisions are made in the scheduler that affect the setting of this GPU_MEMORY_LIMIT environment variable? To my understanding it comes from the user, set as the value of the "gpu-memory" label, and we can use the downward API to direct it into an environment variable. (https://kubernetes.io/docs/concepts/workloads/pods/downward-api/)

Yes, the value of 'GPU_MEMORY_LIMIT' is not affected by scheduling decisions; that's why we can directly patch the environment variable in the pod mutator.

We support two types of GPU Sharing requests:

  1. GPU memory - explicit request for a specific GPU memory amount - known at pod creation phase.
  2. GPU fraction - the pod can request a portion of a GPU (e.g. 0.5 of a GPU) - in a heterogeneous cluster, this value can only be translated into GPU memory after the node is selected, since nodes with different GPU devices result in different GPU memory limits (see the example below).
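
To illustrate the difference with made-up numbers (the annotation keys and units here are assumptions based on the two request types described above, not a verified API, and other required fields such as queue labels are omitted):

```yaml
# Explicit memory request: the limit is known at pod creation time.
apiVersion: v1
kind: Pod
metadata:
  name: memory-request-example
  annotations:
    gpu-memory: "4096"        # e.g. 4096 MiB -> GPU_MEMORY_LIMIT can be set right away
spec:
  containers:
    - name: app
      image: nvidia/cuda:12.4.1-base-ubuntu22.04
---
# Fractional request: the limit depends on the selected node's device,
# e.g. 0.5 of a 24 GiB GPU -> 12 GiB, while 0.5 of an 80 GiB GPU -> 40 GiB,
# so GPU_MEMORY_LIMIT can only be finalized after node selection.
apiVersion: v1
kind: Pod
metadata:
  name: fraction-request-example
  annotations:
    gpu-fraction: "0.5"
spec:
  containers:
    - name: app
      image: nvidia/cuda:12.4.1-base-ubuntu22.04
```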

@romanbaron
Collaborator

Thanks @romanbaron, @wy0824, that's a nice approach. I'll set out the TODO list here based on the updates above; please help me review it, and I'll start working on it once the review has passed.

  • Update the design document in the docs/developer folder to match the approach above
  • Create a repository called 'KAI-resource-isolator' in the HAMi-core org, containing the mutating webhook and the hami-core DaemonSet
  • Add a document about using this feature in the docs/gpu-sharing folder
  • Inject the GPU_MEMORY_LIMIT env in KAI-scheduler (I'm not sure whether I can inject GPU_MEMORY_LIMIT in the binding phase; if not, I'll put that part in the pod mutator)

I can take the KAI Scheduler part: I can add a document that describes the architecture around the GPU_MEMORY_LIMIT env var, covering both types of GPU sharing requests, and then also implement it in KAI Scheduler.

Will you cover the KAI-resource-isolator repo part?

If we agree, this PR is probably no longer needed.

@archlitchi
Author

Thanks @romanbaron, @wy0824, that's a nice approach. I'll set out the TODO list here based on the updates above; please help me review it, and I'll start working on it once the review has passed.

  • Update the design document in the docs/developer folder to match the approach above
  • Create a repository called 'KAI-resource-isolator' in the HAMi-core org, containing the mutating webhook and the hami-core DaemonSet
  • Add a document about using this feature in the docs/gpu-sharing folder
  • Inject the GPU_MEMORY_LIMIT env in KAI-scheduler (I'm not sure whether I can inject GPU_MEMORY_LIMIT in the binding phase; if not, I'll put that part in the pod mutator)

I can take the KAI Scheduler part: I can add a document that describes the architecture around the GPU_MEMORY_LIMIT env var, covering both types of GPU sharing requests, and then also implement it in KAI Scheduler.

Will you cover the KAI-resource-isolator repo part?

If we agree, this PR is probably no longer needed.

Of course, I can manage 'KAI-resource-isolator'.
Can I add a user guide in the docs/gpu-sharing folder, or at least a link to 'KAI-resource-isolator' in your documentation? Otherwise, users may not know where to find the 'KAI-resource-isolator' repository.
Also, do you have a Slack or Discord channel so we can stay in touch once this PR is closed?

@wy0824

wy0824 commented Apr 23, 2025

@romanbaron will kai-scheduler support GPU core sharing? There is CUDA_DEVICE_SM_LIMIT in hami-core.

@romanbaron
Collaborator

@romanbaron will kai-scheduler support GPU core sharing? There is CUDA_DEVICE_SM_LIMIT in hami-core.

Maybe at some point, but it is not on our 2025 roadmap.
I just opened a new discussions category (https://github.com/NVIDIA/KAI-Scheduler/discussions/categories/roadmap) where you can open a discussion about it with our product manager.

@romanbaron
Collaborator

Thanks @romanbaron, @wy0824, that's a nice approach. I'll set out the TODO list here based on the updates above; please help me review it, and I'll start working on it once the review has passed.

  • Update the design document in the docs/developer folder to match the approach above
  • Create a repository called 'KAI-resource-isolator' in the HAMi-core org, containing the mutating webhook and the hami-core DaemonSet
  • Add a document about using this feature in the docs/gpu-sharing folder
  • Inject the GPU_MEMORY_LIMIT env in KAI-scheduler (I'm not sure whether I can inject GPU_MEMORY_LIMIT in the binding phase; if not, I'll put that part in the pod mutator)

I can take the KAI Scheduler part: I can add a document that describes the architecture around the GPU_MEMORY_LIMIT env var, covering both types of GPU sharing requests, and then also implement it in KAI Scheduler.
Will you cover the KAI-resource-isolator repo part?
If we agree, this PR is probably no longer needed.

Of course, I can manage 'KAI-resource-isolator'. Can I add a user guide in the docs/gpu-sharing folder, or at least a link to 'KAI-resource-isolator' in your documentation? Otherwise, users may not know where to find the 'KAI-resource-isolator' repository. Also, do you have a Slack or Discord channel so we can stay in touch once this PR is closed?

Thank you for the contribution and for suggesting the addition! At this time, in order to remain neutral with regard to third-party integrations, we're not adding references to specific integrations in our documentation.

On the topic of staying connected, we currently don't have a dedicated Slack channel, but we are working on creating one. We will keep you updated once it is set!

@archlitchi
Author

Thank you for the contribution and for suggesting the addition! At this time, in order to remain neutral with regard to third-party integrations, we're not adding references to specific integrations in our documentation.

On the topic of staying connected, we currently don't have a dedicated Slack channel, but we are working on creating one. We will keep you updated once it is set!

OK, no problem. I'm working on KAI-resource-isolator now; in the meantime, I'll leave this PR open for sync and discussion.

@harche

harche commented Apr 28, 2025

Is there any reason nvidia MPS is not considered for resource isolation?

/cc @mrunalp @EkinKarabulut

@EkinKarabulut
Collaborator

Is there any reason nvidia MPS is not considered for resource isolation?

/cc @mrunalp @EkinKarabulut

Our goal is to remain agnostic to all tools and solutions (including resource isolation solutions), letting users choose whatever fits their needs best. That being said, MPS is compatible with the KAI scheduler as long as a couple of MPS-related conditions are met in your setup:

  • The MPS server must be running on the node(s) where you want to schedule your workloads
  • The hostPath volume containing the socket file (used for communicating with the MPS server) needs to be mounted into your workloads

With these in place, KAI will be able to schedule MPS workloads without issues (a rough example follows below). If you run into any unexpected behavior, please feel free to open an issue - we are more than happy to investigate and help out!
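
As a rough illustration of those two conditions, a workload pod might mount the MPS pipe directory like this. The default /tmp/nvidia-mps location is assumed here, and depending on how MPS is configured on the node you may also need additional pieces (for example a shared IPC namespace):

```yaml
# Sketch only: mount the MPS pipe directory (default location assumed) into the workload.
apiVersion: v1
kind: Pod
metadata:
  name: mps-workload-example
spec:
  containers:
    - name: app
      image: nvidia/cuda:12.4.1-base-ubuntu22.04
      env:
        - name: CUDA_MPS_PIPE_DIRECTORY    # must match the directory used by the MPS control daemon
          value: /tmp/nvidia-mps
      volumeMounts:
        - name: mps-pipe
          mountPath: /tmp/nvidia-mps
  volumes:
    - name: mps-pipe
      hostPath:
        path: /tmp/nvidia-mps
        type: Directory
```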

cc: @romanbaron @omer-dayan

@anencore94

Are there any updates on the timeline for this feature (hardware-level isolation)?

@jj1kim mentioned this pull request Aug 14, 2025
@testinfected

Are there any updates on the timeline for this feature (hardware-level isolation)?

Same question. Is this still in the plans?
