-
Notifications
You must be signed in to change notification settings - Fork 48
feat: Pathways use single resource group #574
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Jobs requesting TPU resources may also have requests for CPU and memory. However when pathways is enabled, Kueue will not be able to admit such jobs since there is no cpu and memory quota. This fix adds a very high number of CPU and memory for TPU/GPU resources and merges the pathways resource group with the accelerator resource group. This also allows us to run AXLearn jobs without having to make changes manually.
@SujeethJinesh @Obliviour i didn't have time to test yet but this is basically what's needed for axlearn. |
This looks good to me, we just want to make sure that MaxText will work properly if it only requests for TPU and not TPU + CPU + Memory. |
pathways_resources = '' | ||
if enable_pathways: | ||
pathways_resources = """ | ||
- name: cpu-user |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think you can have two resource flavors with the same covered resources, IIUC. Please test before submitting this change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tested, this works fine.
@RoshaniN this is working fine in internal clusters.
|
Instructions on running maxtext
I believe this should work:
Just see that it starts and is placed on the right nodes should be enough |
Tried creating an xpk pathways workload by running this
Pods admitted and running, however pathways head node is ending up on a GKE TPU VM :(
I will need to work on this some more. |
Otherwise the pathways head pod will not first get assigned to CPU only resource flavors.
Verified that it works correctly now after fixing the order of the resource queues:
@SujeethJinesh can you please give it another review? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Sam!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Sam!
@@ -480,30 +479,50 @@ def install_kueue_crs( | |||
|
|||
|
|||
def get_kueue_covered_resources_config( | |||
cluster_hardware_name, resource_type, total_chips | |||
cluster_hardware_name, resource_type, total_chips, enable_pathways |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would retain pathways specific helpers / logic in pathways specific files such as src/xpk/core/pathways.py , if that's possible and avoid changing function headers.
LGTM to the change though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did consider that originally but I personally feel it's cleaner to keep all Kueue config inside kueue.py. So when I want to change the kueue config, I just have to look at kueue.py.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The way the code is now, I can easily understand what the covered resource config is going to look like without having to go between different functions and files.
So could we keep the PR as-is please?
Closing this PR because it was made using a fork. This is the new PR: #600 |
Jobs requesting TPU resources may also have requests for CPU and memory. However when pathways is enabled, Kueue will not be able to admit such jobs since there is no cpu and memory quota.
This fix adds a very high number of CPU and memory for TPU/GPU resources and merges the pathways resource group with the accelerator resource group.
This also allows us to run AXLearn jobs without having to make changes manually.
Testing / Documentation
Testing details.