[KEP-5440]: Mutable container resources on PodTemplates for suspended jobs #5441
kannon92 commented Jun 28, 2025
- One-line PR description: Initial KEP draft for KEP-5440
- Issue link: Mutable Container Resources when Job is suspended #5440
- Other comments: Need to gain consensus among sig-apps first.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: kannon92
The full list of commands accepted by this bot can be found here.
2c9abe0 to f6e6006
/hold
f6e6006 to b76ca88
Introduce a new KEP proposal to allow updating container resource specifications (CPU, memory, GPU, extended resources) for suspended jobs.

Key features:
- Enable dynamic resource allocation for suspended jobs only
- Support CPU, memory, and GPU resource mutations
- Include extended resources (nvidia.com/gpu, amd.com/gpu, tpu-v4, etc.)
- Allow queue controllers to optimize resource allocation based on cluster conditions
- Feature gate: MutableJobPodResourcesForSuspendedJobs
- Focus on batch workload optimization scenarios

This proposal enables better cluster utilization and cost optimization by allowing queue controllers to adjust job resource requirements before execution based on real-time cluster capacity and resource availability. This is particularly valuable for expensive GPU and specialized hardware resources.
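
To make the intended rule concrete, here is a minimal Python sketch of the check described above: container resources may change only while the Job is suspended and the feature gate is on. This is an illustration, not the KEP's implementation. The `Container` and `Job` classes, `resource_view`, and `validate_resource_update` are hypothetical names, and checking the suspend state of the old object is an assumption about how the rule would be applied.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Container:
    # Hypothetical stand-in for a container in a Job's pod template.
    name: str
    requests: Dict[str, str] = field(default_factory=dict)
    limits: Dict[str, str] = field(default_factory=dict)


@dataclass
class Job:
    # Hypothetical stand-in for a Job: only the fields the rule needs.
    suspend: bool
    containers: List[Container] = field(default_factory=list)


def resource_view(job: Job) -> Dict[str, tuple]:
    """Map each container name to its (requests, limits) pair."""
    return {c.name: (c.requests, c.limits) for c in job.containers}


def validate_resource_update(old: Job, new: Job, gate_enabled: bool) -> List[str]:
    """Return validation errors for a resource change; empty means allowed.

    Assumption: the Job must already be suspended (old.suspend) and the
    MutableJobPodResourcesForSuspendedJobs gate must be enabled.
    """
    if resource_view(old) == resource_view(new):
        return []  # No resource change, nothing to validate here.
    if not gate_enabled:
        return ["container resources are immutable (feature gate disabled)"]
    if not old.suspend:
        return ["container resources can only be changed while the Job is suspended"]
    return []
```

For example, a queue controller raising `nvidia.com/gpu` from 1 to 2 on a suspended Job would pass this check, while the same change on a running Job would be rejected.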
fe077e5 to dd37c7e