@rayg1234 rayg1234 commented Sep 29, 2025

  • This adds a new mode to launch Ray jobs with fairchem: use_ray and RayClusterConfig
  • Refactors and simplifies the _cli so that the main launching logic is separated and moved to the launchers module

Previously we could only launch Slurm jobs and local jobs with torch elastic. With Ray added, we have the following configuration options:

image

The Ray cluster is adapted from the fairray script. To launch a job, we start a head job and a worker job:

  • The head node runs the Ray GCS server, the dashboard, etc. It also runs the driver program.
  • The worker nodes just supply the resources for Ray.
    If the job requests N nodes, then 1 node is allocated to the head job and N-1 nodes are allocated to the worker job.
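
The head/worker split described above can be sketched as follows. This is only an illustration of the arithmetic, not the code in the launchers module: the function name split_nodes is hypothetical, and treating a 1-node request as "head only, zero workers" is an assumption that the description does not confirm.

```python
def split_nodes(num_nodes: int) -> tuple[int, int]:
    """Split a request for num_nodes into (head_nodes, worker_nodes).

    Hypothetical helper: one node goes to the head job (GCS server,
    dashboard, driver) and the remaining N-1 go to the worker job.
    """
    if num_nodes < 1:
        raise ValueError("a Ray job needs at least one node for the head")
    head_nodes = 1
    # Assumption: a 1-node request yields zero worker nodes.
    worker_nodes = num_nodes - head_nodes
    return head_nodes, worker_nodes


head, workers = split_nodes(2)
print(f"head nodes: {head}, worker nodes: {workers}")
```

With job.scheduler.num_nodes=2, as in the example commands in this PR, this gives one head node and one worker node.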

We can launch jobs the same way:
fairchem -c ray_job.yaml

Added an example config (uma_ray_demo) to train uma using Ray:

fairchem -c configs/uma/training_release/uma_ray_demo.yaml cluster=h100 dataset=uma_debug job.scheduler.slurm.qos=h100_ocp_high job.scheduler.num_nodes=2 max_atoms=100

This is equivalent to running it on Slurm via:

fairchem -c configs/uma/training_release/uma_sm_direct_pretrain.yaml cluster=h100 dataset=uma_debug job.scheduler.slurm.qos=h100_ocp_high job.scheduler.num_nodes=2 max_atoms=100
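
As a quick sanity check on the two commands above, the following sketch parses both and compares them. It uses only the Python standard library and the command strings from this description; the helper split_command is hypothetical and is not part of fairchem.

```python
import shlex

RAY_CMD = (
    "fairchem -c configs/uma/training_release/uma_ray_demo.yaml "
    "cluster=h100 dataset=uma_debug job.scheduler.slurm.qos=h100_ocp_high "
    "job.scheduler.num_nodes=2 max_atoms=100"
)
SLURM_CMD = (
    "fairchem -c configs/uma/training_release/uma_sm_direct_pretrain.yaml "
    "cluster=h100 dataset=uma_debug job.scheduler.slurm.qos=h100_ocp_high "
    "job.scheduler.num_nodes=2 max_atoms=100"
)


def split_command(command: str) -> tuple[str, dict[str, str]]:
    """Return (config path, key=value overrides) for a fairchem command line."""
    tokens = shlex.split(command)
    config_path = tokens[tokens.index("-c") + 1]
    overrides = dict(tok.split("=", 1) for tok in tokens if "=" in tok)
    return config_path, overrides


ray_config, ray_overrides = split_command(RAY_CMD)
slurm_config, slurm_overrides = split_command(SLURM_CMD)

# The overrides are identical; only the config file differs.
print(ray_overrides == slurm_overrides)
print(ray_config, "vs", slurm_config)
```

The two invocations share every command-line override, so any difference between the Ray and Slurm launch paths has to come from the config files themselves.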

This should not affect any previous jobs or configs.

@meta-cla meta-cla bot added the cla signed label Sep 29, 2025
@rayg1234 rayg1234 added the enhancement (New feature or request) and minor (Minor version release) labels Sep 30, 2025
@rayg1234 rayg1234 requested a review from lbluque September 30, 2025 19:10
@rayg1234 rayg1234 changed the title from "[WIP] refactor launchers to add ray cluster mode" to "Refactor launchers to add ray cluster mode" Sep 30, 2025
@rayg1234 rayg1234 requested a review from misko October 1, 2025 16:40
@misko misko self-requested a review October 8, 2025 16:05
misko previously approved these changes Oct 8, 2025
@misko misko left a comment

LGTM!

@lbluque lbluque left a comment

lgtm - minor comments and nits.

      max_restarts=0,
  )
- elastic_launch(launch_config, _runner_wrapper)(cfg)
+ elastic_launch(launch_config, slurm_launch.runner_wrapper)(cfg)

Nit: for local and Slurm (use_ray=False), we use Submitit, either calling it directly or using an executor for Slurm. When reading the code this is a bit unclear; it took me a while to piece it back together. It would be great to create an explicit local_launch module too (it would be pretty empty, with just the runner_wrapper in it).

lbluque previously approved these changes Oct 8, 2025
@rayg1234 rayg1234 enabled auto-merge October 8, 2025 19:06
@rayg1234 rayg1234 added this pull request to the merge queue Oct 8, 2025
Merged via the queue into main with commit 04ba469 Oct 8, 2025
10 checks passed
@rayg1234 rayg1234 deleted the rgao_ray_launcher branch October 8, 2025 23:11