Skip to content

[Core] Add default Ray Node labels at Node init #53360

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 39 commits into from
Jul 29, 2025

Conversation

ryanaoleary
Copy link
Contributor

Why are these changes needed?

This PR adds support for populating several default Ray node labels (described here) in the Ray runtime environment when a node is initialized. This change will help support autoscaling with the Label Selector API. This PR is related to ray-project/kuberay#3699 which passes several environment variables from the K8s stack which are used to set ray.io/ labels. I'll leave a comment on this PR with manual tests showing the ray.io/accelerator-type and ray.io/availability-zone labels getting set.

Related issue number

#51564

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@ryanaoleary
Copy link
Contributor Author

cc: @MengjinYan

Copy link
Collaborator

@MengjinYan MengjinYan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic change looks good to me! Just one comment about testing.

@MengjinYan MengjinYan added the go add ONLY when ready to merge, run all tests label May 29, 2025
@MengjinYan MengjinYan self-assigned this May 29, 2025
@MengjinYan
Copy link
Collaborator

@edoakes @jjyao Wondering if you can help take a look and merge the change to get default set the default labels for ray nodes? Thanks!

@jjyao jjyao changed the title Add default Ray Node labels at Node init [Core] Add default Ray Node labels at Node init Jun 5, 2025
@ryanaoleary ryanaoleary requested a review from edoakes June 12, 2025 01:30
@ryanaoleary ryanaoleary requested a review from a team as a code owner June 12, 2025 08:44
@edoakes
Copy link
Collaborator

edoakes commented Jun 12, 2025

CI tests failures @ryanaoleary

@ryanaoleary
Copy link
Contributor Author

CI tests failures @ryanaoleary

@edoakes Think I fixed the CI failure with bd51036. test_exit_logging was failing because the calls to AcceleratorManager in Node init would add the Traceback to the driver log, when none was expected. I updated the test to patch calls to _get_default_ray_node_labels so that these logs could be ignored for the test. An alternative would be to update calls like this in each AcceleratorManager definition to log but not add to the traceback. I also added a try: except block when we make calls to AcceleratorManager in the new code to ensure Node init doesn't fail for whatever reason when adding the default labels.

@edoakes
Copy link
Collaborator

edoakes commented Jun 13, 2025

The solution looks worrisome/hacky. What causes the accelerator manager calls to fail on startup? We might be breaking some kind of assumption there.

@jjyao PTAL at the usage of accelerator manager here

@ryanaoleary
Copy link
Contributor Author

The solution looks worrisome/hacky. What causes the accelerator manager calls to fail on startup? We might be breaking some kind of assumption there.

@jjyao PTAL at the usage of accelerator manager here

It'd raise an exception here due to missing dependencies such as ModuleNotFoundError: No module named 'acl' for NPU. Detecting the accelerator type would fail, but node startup would still succeed as expected. I added the try: except just to be safe since the code is on a critical path, but maybe it's not needed, since it's not used in other places AcceleratorManager is used. Maybe a better solution is just to logger.debug when we see a ModuleNotFoundError rather than raise an exception.

@jjyao jjyao self-assigned this Jun 18, 2025
@ryanaoleary ryanaoleary requested a review from jjyao June 25, 2025 00:24
ryanaoleary and others added 25 commits July 29, 2025 17:42
Signed-off-by: Ryan O'Leary <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]>
Co-authored-by: Jiajun Yao <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]>
Co-authored-by: Jiajun Yao <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]>
Co-authored-by: Jiajun Yao <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]>
Co-authored-by: Jiajun Yao <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]>
Co-authored-by: Jiajun Yao <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]>
Co-authored-by: Mengjin Yan <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]>
Co-authored-by: Mengjin Yan <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]>
Co-authored-by: Jiajun Yao <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]>
Co-authored-by: Jiajun Yao <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]>
…nd move record hardware usage to node.py

Signed-off-by: Ryan O'Leary <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]>
@ryanaoleary
Copy link
Contributor Author

@MengjinYan rebased and re-pushed since there were some changes to resource_spec.py in master which gets deleted in this PR. I made sure the same change is included in this PR (changed import for from ray._common.usage import usage_lib in node.py to match).

@jjyao jjyao merged commit 8e0d494 into ray-project:master Jul 29, 2025
5 checks passed
@github-project-automation github-project-automation bot moved this from In Progress to Done in K8s and Ray (go/k8s-ray-oss) Jul 29, 2025
krishnakalyan3 pushed a commit to krishnakalyan3/ray that referenced this pull request Jul 30, 2025
Signed-off-by: Ryan O'Leary <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]>
Co-authored-by: Jiajun Yao <[email protected]>
Co-authored-by: Mengjin Yan <[email protected]>
Signed-off-by: Krishna Kalyan <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
community-contribution Contributed by the community core Issues that should be addressed in Ray Core go add ONLY when ready to merge, run all tests k8s-proj K8s and Ray OSS
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

5 participants