Multinode launch script #723

@linnanwang

Description

Is your feature request related to a problem? Please describe.
The current README uses a customized script to set everything up, but clusters vary. The provided script may therefore not work in some cases, and users then have to dig into the script to understand how launching works before they can fix it.

Here is what we have:

```yaml
# Your existing model, dataset, training config...
step_scheduler:
  grad_acc_steps: 4
  num_epochs: 1

model:
  _target_: nemo_automodel.NeMoAutoModelForCausalLM.from_pretrained
  pretrained_model_name_or_path: meta-llama/Llama-3.2-1B

# Add Slurm configuration
slurm:
  job_name: llm-finetune
  nodes: 1
  ntasks_per_node: 8
  time: 00:30:00
  account: your_account
  partition: gpu
  container_image: nvcr.io/nvidia/nemo:25.07
  gpus_per_node: 8  # This adds "#SBATCH --gpus-per-node=8" to the script
  # Optional: add extra mount points if needed
  extra_mounts:
    - /lustre:/lustre
```
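
For reference, the `slurm:` section above maps onto an sbatch header along these lines. This is a sketch inferred from the keys shown in the config (only the `gpus_per_node` mapping is stated explicitly); the script the tool actually generates may differ:

```bash
#!/bin/bash
# Approximate header implied by the slurm: config above (illustrative only).
#SBATCH --job-name=llm-finetune
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --time=00:30:00
#SBATCH --account=your_account
#SBATCH --partition=gpu
#SBATCH --gpus-per-node=8   # from gpus_per_node: 8, per the comment above

# container_image and extra_mounts would typically be handed to the container
# runtime (e.g. pyxis-style --container-image / --container-mounts flags on
# srun), depending on how the cluster is set up.
```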


Describe the solution you'd like

A standard PyTorch script for launching the examples across multiple nodes; so far they are mostly limited to a single node. For example, run the following on each node, adjusting `NODE_RANK` per node:

```bash
export MASTER_ADDR=node0.hostname   # master node's host/IP
export MASTER_PORT=29500
export NODE_RANK=0                  # node0 -> 0, node1 -> 1
torchrun --nnodes=2 --nproc_per_node=8 --node_rank=${NODE_RANK} \
  --rdzv_backend=c10d --rdzv_endpoint=${MASTER_ADDR}:${MASTER_PORT} \
  train.py --batch_size 32
```
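
Under Slurm, the same launch can be driven from a single sbatch script instead of exporting `MASTER_ADDR`/`NODE_RANK` by hand on every node. A minimal sketch, assuming a Slurm cluster and one torchrun launcher task per node; `train.py` and `--batch_size 32` are carried over from the snippet above:

```bash
#!/bin/bash
#SBATCH --job-name=llm-finetune
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1   # one torchrun launcher per node
#SBATCH --gpus-per-node=8
#SBATCH --time=00:30:00

# Use the first host in the allocation as the rendezvous endpoint.
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=29500

# srun starts one torchrun per node. With the c10d rendezvous backend,
# torchrun negotiates node ranks itself, so NODE_RANK does not need to be
# set per node; each torchrun then spawns the 8 per-GPU worker processes.
srun torchrun \
  --nnodes="$SLURM_NNODES" \
  --nproc_per_node=8 \
  --rdzv_id="$SLURM_JOB_ID" \
  --rdzv_backend=c10d \
  --rdzv_endpoint="${MASTER_ADDR}:${MASTER_PORT}" \
  train.py --batch_size 32
```

Because the rendezvous assigns ranks and `--nnodes` comes from `$SLURM_NNODES`, the same script works unchanged for any node count; only `#SBATCH --nodes` needs to change.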

Labels: enhancement (New feature or request)