Refactor launchers to add ray cluster mode #1531
Conversation
LGTM!
lgtm - minor comments and nits.
src/fairchem/core/_cli.py (outdated diff)
      max_restarts=0,
  )
- elastic_launch(launch_config, _runner_wrapper)(cfg)
+ elastic_launch(launch_config, slurm_launch.runner_wrapper)(cfg)
nit: for local and slurm (use_ray=False) we use Submitit, either calling it directly or going through an executor for slurm. But when reading the code this is a bit unclear; it took me a while to piece it back together. It would be great to create an explicit `local_launch` module too (which would be pretty empty, just holding the `runner_wrapper`).
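For illustration only, a minimal sketch of what such a `local_launch` module could look like, assuming `runner_wrapper` just needs to instantiate the runner described by the config and call it; the `runner` config node and its `run()` method are assumptions here, not the actual fairchem API.

```python
# local_launch.py (sketch, not the actual fairchem module)
from hydra.utils import instantiate
from omegaconf import DictConfig


def runner_wrapper(cfg: DictConfig) -> None:
    """Entry point handed to torch elastic for local runs:
    build the runner described by the config and execute it.
    The `cfg.runner` node and `run()` method are assumptions."""
    runner = instantiate(cfg.runner)
    runner.run()
```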
Previously we could only launch slurm jobs and local jobs with torch elastic; with Ray added, we have the following new configuration options: `use_ray` and `RayClusterConfig`.
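As a rough illustration of how these options might be shaped (the field names and defaults below are assumptions for the sketch, not the actual fairchem config):

```python
# Hypothetical sketch of the new options; real field names may differ.
from dataclasses import dataclass, field


@dataclass
class RayClusterConfig:
    """Illustrative settings for the Ray head/worker jobs."""

    num_nodes: int = 2             # total nodes requested for the job
    head_cpus: int = 8             # resources reserved for the head job
    worker_gpus_per_node: int = 8


@dataclass
class SchedulerConfig:
    """Existing scheduler settings plus the new Ray toggle (sketch only)."""

    use_ray: bool = False          # False keeps the local / slurm torch-elastic path
    ray_cluster: RayClusterConfig = field(default_factory=RayClusterConfig)
```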
The ray cluster is adapted from the fairray script. To launch a job, we start a head job and a worker job. If the job requests N nodes, then 1 node is allocated to the head and N-1 nodes are allocated to the worker job.
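For illustration only, a minimal Submitit-based sketch of that head/worker split; the entry-point functions, timeout, log directory, and head-address handling are placeholders, not the actual launcher code.

```python
import submitit


def start_ray_head() -> None:
    # Placeholder body: the real job would run `ray start --head ...`
    # on its allocated node and publish the head address.
    pass


def start_ray_workers(head_address: str) -> None:
    # Placeholder body: the real job would run `ray start --address=...`
    # on each of its nodes to join the head.
    pass


def launch_ray_cluster(num_nodes: int, log_dir: str = "ray_logs"):
    """Allocate 1 node to the head job and num_nodes - 1 to the worker job."""
    executor = submitit.AutoExecutor(folder=log_dir)

    # Head job: a single node.
    executor.update_parameters(nodes=1, timeout_min=60)
    head_job = executor.submit(start_ray_head)

    # Worker job: the remaining N - 1 nodes.
    executor.update_parameters(nodes=num_nodes - 1, timeout_min=60)
    worker_job = executor.submit(start_ray_workers, "head-address-placeholder")

    return head_job, worker_job
```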
We can launch jobs the same way:
fairchem -c ray_job.yaml
Added an example config (`uma_ray_demo`) to train UMA using Ray; this is equivalent to running it on slurm via:
This should not affect any previous jobs or configs.