@samos123 (Collaborator) commented Aug 3, 2025

Jobs requesting TPU resources may also request CPU and memory. However, when Pathways is enabled, Kueue cannot admit such jobs because the accelerator resource flavor has no CPU or memory quota.

This fix adds an effectively unlimited CPU and memory quota to the TPU/GPU resource flavors and merges the Pathways resource group into the accelerator resource group.

This also allows us to run AXLearn jobs without having to make changes manually.
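Concretely, the merged resource group ends up shaped roughly like this (a sketch; flavor names and quota values mirror the real ClusterQueue shown later in this thread):

resourceGroups:
- coveredResources:
  - cpu
  - memory
  - google.com/tpu
  flavors:
  - name: cpu-user          # CPU-only flavor for pathways head / CPU pods
    resources:
    - name: cpu
      nominalQuota: 480
    - name: memory
      nominalQuota: 2000G
    - name: google.com/tpu
      nominalQuota: 0
  - name: 4xv6e-256         # accelerator flavor, effectively unlimited cpu/memory
    resources:
    - name: cpu
      nominalQuota: 99999999999
    - name: memory
      nominalQuota: 9999999Ti
    - name: google.com/tpu
      nominalQuota: 1024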

Testing / Documentation

Testing details.

  • [ y/n ] Tests pass
  • [ y/n ] Appropriate changes to documentation are included in the PR

@samos123 (Collaborator, Author) commented Aug 3, 2025

@SujeethJinesh @Obliviour I didn't have time to test yet, but this is basically what's needed for AXLearn.

@SujeethJinesh (Collaborator) commented:

This looks good to me; we just want to make sure that MaxText will work properly if it only requests TPU and not TPU + CPU + memory.
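
For reference, the concern is a MaxText container that requests only chips and leaves cpu/memory implicit, along these lines (illustrative values; the TPU request uses the same google.com/tpu resource name as the ClusterQueue below):

resources:
  limits:
    google.com/tpu: 4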

pathways_resources = ''
if enable_pathways:
  pathways_resources = """
  - name: cpu-user
A collaborator commented:

I don't think you can have two resource flavors with the same covered resources, IIUC. Please test before submitting this change.

@samos123 (Author) replied:

Tested, this works fine.

@samos123 (Collaborator, Author) commented Aug 7, 2025

@RoshaniN this is working fine in internal clusters.

apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  annotations:
  creationTimestamp: "2025-07-09T06:25:18Z"
  finalizers:
  - kueue.x-k8s.io/resource-in-use
  generation: 5
  name: cluster-queue
  resourceVersion: "1754542048065007021"
  uid: c1c2ecda-25af-4203-9b39-be78795c1dff
spec:
  flavorFungibility:
    whenCanBorrow: Borrow
    whenCanPreempt: TryNextFlavor
  namespaceSelector: {}
  preemption:
    borrowWithinCohort:
      policy: Never
    reclaimWithinCohort: Never
    withinClusterQueue: LowerPriority
  queueingStrategy: BestEffortFIFO
  resourceGroups:
  - coveredResources:
    - cpu
    - memory
    - google.com/tpu
    flavors:
    - name: cpu-user
      resources:
      - name: cpu
        nominalQuota: 480
      - name: memory
        nominalQuota: 2000G
      - name: google.com/tpu
        nominalQuota: 0
    - name: 4xv6e-256
      resources:
      - name: cpu
        nominalQuota: 99999999999
      - name: memory
        nominalQuota: 9999999Ti
      - name: google.com/tpu
        nominalQuota: 1024
  stopPolicy: None
status:
  admittedWorkloads: 4
  conditions:
  - lastTransitionTime: "2025-07-15T17:49:38Z"
    message: Can admit new workloads
    observedGeneration: 5
    reason: Ready
    status: "True"
    type: Active
  flavorsReservation:
  - name: cpu-user
    resources:
    - borrowed: "0"
      name: cpu
      total: "0"
    - borrowed: "0"
      name: google.com/tpu
      total: "0"
    - borrowed: "0"
      name: memory
      total: "0"
  - name: 4xv6e-256
    resources:
    - borrowed: "0"
      name: cpu
      total: "0"
    - borrowed: "0"
      name: google.com/tpu
      total: "1024"
    - borrowed: "0"
      name: memory
      total: "0"
  flavorsUsage:
  - name: cpu-user
    resources:
    - borrowed: "0"
      name: cpu
      total: "0"
    - borrowed: "0"
      name: google.com/tpu
      total: "0"
    - borrowed: "0"
      name: memory
      total: "0"
  - name: 4xv6e-256
    resources:
    - borrowed: "0"
      name: cpu
      total: "0"
    - borrowed: "0"
      name: google.com/tpu
      total: "1024"
    - borrowed: "0"
      name: memory
      total: "0"
  pendingWorkloads: 14
  reservingWorkloads: 4

@samos123 marked this pull request as ready for review August 18, 2025 03:20
@samos123 requested a review from RoshaniN August 18, 2025 03:21
@SujeethJinesh (Collaborator) commented:

Instructions on running MaxText

  1. You can use some existing images like the ones here: https://pantheon.corp.google.com/artifacts/docker/cloud-tpu-images/us/jax-ai-image/tpu?e=13802955&mods=allow_workbench_image_override

     I believe this should work:

     --docker-image=us-docker.pkg.dev/cloud-tpu-images/jax-ai-image/tpu:jax0.7.0-rev1

  2. Follow the workload create instructions here: https://github.com/AI-Hypercomputer/xpk?tab=readme-ov-file#workload-create

Just seeing that it starts and is placed on the right nodes should be enough.
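
For example, something along these lines (workload name, cluster, zone, and training command are placeholders; the flags are the ones used in the xpk README and elsewhere in this thread):

xpk workload create --workload xpk-maxtext-test --cluster <cluster-name> --zone <zone> --tpu-type=v6e-16 --num-slices=1 --docker-image=us-docker.pkg.dev/cloud-tpu-images/jax-ai-image/tpu:jax0.7.0-rev1 --command "<training command>"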

@samos123 (Collaborator, Author) commented:

Tried creating an xpk Pathways workload by running:

xpk workload create-pathways --headless --workload xpk-pw-headless --num-slices=1 --tpu-type=v6e-16 --cluster=stoelinga-axlearn --zone us-east5-b

Pods were admitted and are running; however, the Pathways head pod is ending up on a GKE TPU node :(

NAME                                      READY   STATUS    RESTARTS   AGE   IP           NODE                    NOMINATED NODE   READINESS GATES
xpk-pw-headless-pathways-head-0-0-2n987   2/2     Running   0          90s   10.11.0.17   gke-tpu-70310290-d91m   <none>           <none>
xpk-pw-headless-worker-0-0-fdhhm          1/1     Running   0          72s   10.11.0.10   gke-tpu-97282632-x3pc   <none>           <none>
xpk-pw-headless-worker-0-1-x9cg2          1/1     Running   0          71s   10.11.0.12   gke-tpu-97282632-lzbp   <none>           <none>
xpk-pw-headless-worker-0-2-7wk2h          1/1     Running   0          71s   10.11.0.13   gke-tpu-97282632-lkhw   <none>           <none>
xpk-pw-headless-worker-0-3-hxf5r          1/1     Running   0          71s   10.11.0.14   gke-tpu-97282632-5d9b   <none>           <none>

I will need to work on this some more.

Otherwise the pathways head pod will not first get assigned to the CPU-only resource flavor.
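
For context: within a resource group, Kueue tries flavors in the order they are listed, so the CPU-only flavor has to come first for the head pod (which requests only cpu and memory) to be assigned to it. A sketch with the flavor names from the dump above (details elided):

flavors:
- name: cpu-user     # tried first; fits the pathways head pod, 0 TPU quota
  ...
- name: 4xv6e-256    # tried next; the TPU workers land here
  ...
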
@samos123 (Collaborator, Author) commented:

Verified that it works correctly now after fixing the order of the resource flavors:

NAME                                      READY   STATUS    RESTARTS   AGE   IP            NODE                                         NOMINATED NODE   READINESS GATES
xpk-pw-headless-pathways-head-0-0-kzzxq   2/2     Running   0          14m   10.11.0.126   gke-stoelinga-axlearn-cpu-np-490fb457-dj9q   <none>           <none>
xpk-pw-headless-worker-0-0-4cgvj          1/1     Running   0          13m   10.11.0.13    gke-tpu-97282632-lkhw                        <none>           <none>
xpk-pw-headless-worker-0-1-vgv4r          1/1     Running   0          13m   10.11.0.12    gke-tpu-97282632-lzbp                        <none>           <none>
xpk-pw-headless-worker-0-2-hbphl          1/1     Running   0          13m   10.11.0.10    gke-tpu-97282632-x3pc                        <none>           <none>
xpk-pw-headless-worker-0-3-2lwj2          1/1     Running   0          13m   10.11.0.14    gke-tpu-97282632-5d9b                        <none>           <none>

@SujeethJinesh can you please give it another review?

@SujeethJinesh (Collaborator) left a review:

Thanks Sam!

@RoshaniN (Collaborator) left a review:

Thanks Sam!

@@ -480,30 +479,50 @@ def install_kueue_crs(


 def get_kueue_covered_resources_config(
-    cluster_hardware_name, resource_type, total_chips
+    cluster_hardware_name, resource_type, total_chips, enable_pathways
@RoshaniN (Collaborator) commented Aug 19, 2025:

I would retain pathways-specific helpers/logic in pathways-specific files such as src/xpk/core/pathways.py, if that's possible, and avoid changing function headers.

LGTM to the change though.

@samos123 (Author) replied:

I did consider that originally, but I personally feel it's cleaner to keep all Kueue config inside kueue.py. Then when I want to change the Kueue config, I only have to look at kueue.py.

@samos123 (Author) added:

The way the code is now, I can easily understand what the covered-resources config is going to look like without having to jump between different functions and files.

So could we keep the PR as-is, please?
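
For context, a minimal sketch of the shape this helper might take (not the actual diff; the quota values are copied from the cluster dump above, and the string layout is assumed):

def get_kueue_covered_resources_config(
    cluster_hardware_name, resource_type, total_chips, enable_pathways
):
  """Builds the coveredResources section of the ClusterQueue spec (sketch)."""
  config = f"""
- coveredResources:
  - cpu
  - memory
  - {resource_type}
  flavors:"""
  if enable_pathways:
    # The CPU-only flavor must come first so the pathways head pod
    # (which requests only cpu/memory) is assigned to it before the
    # accelerator flavor is tried.
    config += f"""
  - name: cpu-user
    resources:
    - name: cpu
      nominalQuota: 480
    - name: memory
      nominalQuota: 2000G
    - name: {resource_type}
      nominalQuota: 0"""
  # Accelerator flavor: real chip quota, effectively unlimited cpu/memory
  # so Kueue can admit jobs that request cpu/memory alongside chips.
  config += f"""
  - name: {cluster_hardware_name}
    resources:
    - name: cpu
      nominalQuota: 99999999999
    - name: memory
      nominalQuota: 9999999Ti
    - name: {resource_type}
      nominalQuota: {total_chips}"""
  return config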

@samos123 (Collaborator, Author) commented:

Closing this PR because it was made using a fork. This is the new PR: #600

@samos123 closed this Aug 20, 2025