samos123
Collaborator

Jobs that request TPU resources may also request CPU and memory. However, when Pathways is enabled, Kueue cannot admit such jobs because there is no CPU or memory quota.

This fix adds a very high CPU and memory quota alongside the TPU/GPU resources and merges the Pathways resource group with the accelerator resource group.

This also allows us to run AXLearn jobs without having to make changes manually.
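
A rough sketch of what a merged resource group could look like, written as the Python dict that would be serialized into a Kueue `ClusterQueue` spec. The field names (`coveredResources`, `flavors`, `resources`, `nominalQuota`) follow the Kueue `ClusterQueue` API. The flavor names, the quota values, the flavor ordering, and the helper functions are assumptions for illustration only and are not taken from this PR.

```python
# Hypothetical values: the PR does not state the actual quota numbers.
VERY_HIGH_CPU = 1_000_000        # assumed "very high" CPU quota, in cores
VERY_HIGH_MEMORY = "1000000Gi"   # assumed "very high" memory quota


def merged_resource_group(accelerator_resource, accelerator_quota,
                          accelerator_flavor="accelerator-flavor",
                          cpu_flavor="cpu-flavor"):
    """Build a single resource group covering the accelerator, CPU, and memory.

    Flavor names are hypothetical. Each flavor lists the same resources in the
    same order as coveredResources.
    """
    def flavor(name, accel_quota):
        return {
            "name": name,
            "resources": [
                {"name": accelerator_resource, "nominalQuota": accel_quota},
                {"name": "cpu", "nominalQuota": VERY_HIGH_CPU},
                {"name": "memory", "nominalQuota": VERY_HIGH_MEMORY},
            ],
        }

    return {
        "coveredResources": [accelerator_resource, "cpu", "memory"],
        # Assumption: the CPU-only flavor is listed first so that CPU-only pods
        # try it before the accelerator flavor.
        "flavors": [
            flavor(cpu_flavor, 0),
            flavor(accelerator_flavor, accelerator_quota),
        ],
    }


def covers(group, requested_resources):
    """Return True if every requested resource name is covered by the group."""
    return set(requested_resources) <= set(group["coveredResources"])


group = merged_resource_group("google.com/tpu", accelerator_quota=8)
# A job requesting TPU, CPU, and memory is fully covered by the one group.
job_is_covered = covers(group, ["google.com/tpu", "cpu", "memory"])
```

The `covers` check is a simplification of Kueue's admission logic; it only illustrates that a job requesting TPU, CPU, and memory together finds all three resources in one group.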

Follow-up to #574, this time from a branch within the xpk repo.

Without this change, the Pathways head pod will not be assigned to the CPU-only resource flavors first.

@samos123
Collaborator Author

It seems @lukebaumann encountered an issue when not using create-pathways. A potential fix is to remove the create-pathways command altogether, since it doesn't seem to be needed. We may also be able to get rid of the CPU resource flavor, which would unblock AXLearn jobs.

@samos123
Collaborator Author

It seems NAP without Pathways is also affected. I think we need a different fix. See #603.

@samos123
Collaborator Author

This PR should also fix NAP and AXLearn support. I would prefer to get this merged, and I will check with Luke on why it wasn't working for him.
