@samos123 (Collaborator) commented Aug 3, 2025

Jobs requesting TPU resources may also request CPU and memory. However, when Pathways is enabled, Kueue cannot admit such jobs because the accelerator resource flavor has no CPU or memory quota.

This fix adds an effectively unlimited CPU and memory quota to the TPU/GPU resource flavors and merges the Pathways resource group into the accelerator resource group.

This also allows us to run AXLearn jobs without having to make changes manually.
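Concretely, the merged resource group ends up shaped roughly like this (a sketch; flavor names and quota values mirror the real ClusterQueue shown later in this thread):

resourceGroups:
- coveredResources:
  - cpu
  - memory
  - google.com/tpu
  flavors:
  - name: cpu-user          # CPU-only flavor for pathways head / CPU pods
    resources:
    - name: cpu
      nominalQuota: 480
    - name: memory
      nominalQuota: 2000G
    - name: google.com/tpu
      nominalQuota: 0
  - name: 4xv6e-256         # accelerator flavor, effectively unlimited cpu/memory
    resources:
    - name: cpu
      nominalQuota: 99999999999
    - name: memory
      nominalQuota: 9999999Ti
    - name: google.com/tpu
      nominalQuota: 1024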

Testing / Documentation

Testing details.

  • [ y/n ] Tests pass
  • [ y/n ] Appropriate changes to documentation are included in the PR

@samos123 (Collaborator, Author) commented Aug 3, 2025

@SujeethJinesh @Obliviour I didn't have time to test yet, but this is basically what's needed for AXLearn.

@SujeethJinesh (Collaborator) commented:

This looks good to me; we just want to make sure that MaxText will work properly if it only requests TPU and not TPU + CPU + memory.
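
For reference, the concern is a MaxText container that requests only chips and leaves cpu/memory implicit, along these lines (illustrative values; the TPU request uses the same google.com/tpu resource name as the ClusterQueue below):

resources:
  limits:
    google.com/tpu: 4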

pathways_resources = ''
if enable_pathways:
  pathways_resources = """
  - name: cpu-user
A collaborator commented:

I don't think you can have two resource flavors with the same covered resources, IIUC. Please test before submitting this change.

@samos123 (Author) replied:

Tested, this works fine.

@samos123 (Collaborator, Author) commented Aug 7, 2025

@RoshaniN this is working fine in internal clusters.

apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  annotations:
  creationTimestamp: "2025-07-09T06:25:18Z"
  finalizers:
  - kueue.x-k8s.io/resource-in-use
  generation: 5
  name: cluster-queue
  resourceVersion: "1754542048065007021"
  uid: c1c2ecda-25af-4203-9b39-be78795c1dff
spec:
  flavorFungibility:
    whenCanBorrow: Borrow
    whenCanPreempt: TryNextFlavor
  namespaceSelector: {}
  preemption:
    borrowWithinCohort:
      policy: Never
    reclaimWithinCohort: Never
    withinClusterQueue: LowerPriority
  queueingStrategy: BestEffortFIFO
  resourceGroups:
  - coveredResources:
    - cpu
    - memory
    - google.com/tpu
    flavors:
    - name: cpu-user
      resources:
      - name: cpu
        nominalQuota: 480
      - name: memory
        nominalQuota: 2000G
      - name: google.com/tpu
        nominalQuota: 0
    - name: 4xv6e-256
      resources:
      - name: cpu
        nominalQuota: 99999999999
      - name: memory
        nominalQuota: 9999999Ti
      - name: google.com/tpu
        nominalQuota: 1024
  stopPolicy: None
status:
  admittedWorkloads: 4
  conditions:
  - lastTransitionTime: "2025-07-15T17:49:38Z"
    message: Can admit new workloads
    observedGeneration: 5
    reason: Ready
    status: "True"
    type: Active
  flavorsReservation:
  - name: cpu-user
    resources:
    - borrowed: "0"
      name: cpu
      total: "0"
    - borrowed: "0"
      name: google.com/tpu
      total: "0"
    - borrowed: "0"
      name: memory
      total: "0"
  - name: 4xv6e-256
    resources:
    - borrowed: "0"
      name: cpu
      total: "0"
    - borrowed: "0"
      name: google.com/tpu
      total: "1024"
    - borrowed: "0"
      name: memory
      total: "0"
  flavorsUsage:
  - name: cpu-user
    resources:
    - borrowed: "0"
      name: cpu
      total: "0"
    - borrowed: "0"
      name: google.com/tpu
      total: "0"
    - borrowed: "0"
      name: memory
      total: "0"
  - name: 4xv6e-256
    resources:
    - borrowed: "0"
      name: cpu
      total: "0"
    - borrowed: "0"
      name: google.com/tpu
      total: "1024"
    - borrowed: "0"
      name: memory
      total: "0"
  pendingWorkloads: 14
  reservingWorkloads: 4

@samos123 marked this pull request as ready for review August 18, 2025 03:20
@samos123 requested a review from RoshaniN August 18, 2025 03:21
@SujeethJinesh (Collaborator) commented:

Instructions on running MaxText

  1. You can use some existing images like the ones here: https://pantheon.corp.google.com/artifacts/docker/cloud-tpu-images/us/jax-ai-image/tpu?e=13802955&mods=allow_workbench_image_override

     I believe this should work:

     --docker-image=us-docker.pkg.dev/cloud-tpu-images/jax-ai-image/tpu:jax0.7.0-rev1

  2. Follow the workload create instructions here: https://github.com/AI-Hypercomputer/xpk?tab=readme-ov-file#workload-create

Just seeing that it starts and is placed on the right nodes should be enough.
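
For example, something along these lines (workload name, cluster, zone, and training command are placeholders; the flags are the ones used in the xpk README and elsewhere in this thread):

xpk workload create --workload xpk-maxtext-test --cluster <cluster-name> --zone <zone> --tpu-type=v6e-16 --num-slices=1 --docker-image=us-docker.pkg.dev/cloud-tpu-images/jax-ai-image/tpu:jax0.7.0-rev1 --command "<training command>"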

@samos123 (Collaborator, Author) commented:

Tried creating an xpk Pathways workload by running:

xpk workload create-pathways --headless --workload xpk-pw-headless --num-slices=1 --tpu-type=v6e-16 --cluster=stoelinga-axlearn --zone us-east5-b

Pods were admitted and are running; however, the Pathways head pod is ending up on a GKE TPU node :(

NAME                                      READY   STATUS    RESTARTS   AGE   IP           NODE                    NOMINATED NODE   READINESS GATES
xpk-pw-headless-pathways-head-0-0-2n987   2/2     Running   0          90s   10.11.0.17   gke-tpu-70310290-d91m   <none>           <none>
xpk-pw-headless-worker-0-0-fdhhm          1/1     Running   0          72s   10.11.0.10   gke-tpu-97282632-x3pc   <none>           <none>
xpk-pw-headless-worker-0-1-x9cg2          1/1     Running   0          71s   10.11.0.12   gke-tpu-97282632-lzbp   <none>           <none>
xpk-pw-headless-worker-0-2-7wk2h          1/1     Running   0          71s   10.11.0.13   gke-tpu-97282632-lkhw   <none>           <none>
xpk-pw-headless-worker-0-3-hxf5r          1/1     Running   0          71s   10.11.0.14   gke-tpu-97282632-5d9b   <none>           <none>

I will need to work on this some more.

Otherwise the pathways head pod will not first get assigned to the CPU-only resource flavor.
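
For context: within a resource group, Kueue tries flavors in the order they are listed, so the CPU-only flavor has to come first for the head pod (which requests only cpu and memory) to be assigned to it. A sketch with the flavor names from the dump above (details elided):

flavors:
- name: cpu-user     # tried first; fits the pathways head pod, 0 TPU quota
  ...
- name: 4xv6e-256    # tried next; the TPU workers land here
  ...
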
@samos123 (Collaborator, Author) commented:

Verified that it works correctly now after fixing the order of the resource flavors:

NAME                                      READY   STATUS    RESTARTS   AGE   IP            NODE                                         NOMINATED NODE   READINESS GATES
xpk-pw-headless-pathways-head-0-0-kzzxq   2/2     Running   0          14m   10.11.0.126   gke-stoelinga-axlearn-cpu-np-490fb457-dj9q   <none>           <none>
xpk-pw-headless-worker-0-0-4cgvj          1/1     Running   0          13m   10.11.0.13    gke-tpu-97282632-lkhw                        <none>           <none>
xpk-pw-headless-worker-0-1-vgv4r          1/1     Running   0          13m   10.11.0.12    gke-tpu-97282632-lzbp                        <none>           <none>
xpk-pw-headless-worker-0-2-hbphl          1/1     Running   0          13m   10.11.0.10    gke-tpu-97282632-x3pc                        <none>           <none>
xpk-pw-headless-worker-0-3-2lwj2          1/1     Running   0          13m   10.11.0.14    gke-tpu-97282632-5d9b                        <none>           <none>

@SujeethJinesh can you please give it another review?

@SujeethJinesh (Collaborator) left a review:

Thanks Sam!

@RoshaniN (Collaborator) left a review:

Thanks Sam!

@@ -480,30 +479,50 @@ def install_kueue_crs(


 def get_kueue_covered_resources_config(
-    cluster_hardware_name, resource_type, total_chips
+    cluster_hardware_name, resource_type, total_chips, enable_pathways
@RoshaniN (Collaborator) commented Aug 19, 2025:

I would retain pathways-specific helpers/logic in pathways-specific files such as src/xpk/core/pathways.py, if that's possible, and avoid changing function headers.

LGTM to the change though.

@samos123 (Author) replied:

I did consider that originally, but I personally feel it's cleaner to keep all Kueue config inside kueue.py. Then when I want to change the Kueue config, I only have to look at kueue.py.

@samos123 (Author) added:

The way the code is now, I can easily understand what the covered-resources config is going to look like without having to jump between different functions and files.

So could we keep the PR as-is, please?
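
For context, a minimal sketch of the shape this helper might take (not the actual diff; the quota values are copied from the cluster dump above, and the string layout is assumed):

def get_kueue_covered_resources_config(
    cluster_hardware_name, resource_type, total_chips, enable_pathways
):
  """Builds the coveredResources section of the ClusterQueue spec (sketch)."""
  config = f"""
- coveredResources:
  - cpu
  - memory
  - {resource_type}
  flavors:"""
  if enable_pathways:
    # The CPU-only flavor must come first so the pathways head pod
    # (which requests only cpu/memory) is assigned to it before the
    # accelerator flavor is tried.
    config += f"""
  - name: cpu-user
    resources:
    - name: cpu
      nominalQuota: 480
    - name: memory
      nominalQuota: 2000G
    - name: {resource_type}
      nominalQuota: 0"""
  # Accelerator flavor: real chip quota, effectively unlimited cpu/memory
  # so Kueue can admit jobs that request cpu/memory alongside chips.
  config += f"""
  - name: {cluster_hardware_name}
    resources:
    - name: cpu
      nominalQuota: 99999999999
    - name: memory
      nominalQuota: 9999999Ti
    - name: {resource_type}
      nominalQuota: {total_chips}"""
  return config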

@samos123 (Collaborator, Author) commented:

Closing this PR because it was made using a fork. This is the new PR: #600

@samos123 closed this Aug 20, 2025