Multinode launch script #723

@linnanwang

Description

Is your feature request related to a problem? Please describe.
The current README uses a customized script to set everything up, but clusters vary. The provided script may therefore not work in some cases, and users then have to dig into the script to understand how launching works before they can fix it.

Here is what we have:

```yaml
# Your existing model, dataset, training config...
step_scheduler:
  grad_acc_steps: 4
  num_epochs: 1

model:
  _target_: nemo_automodel.NeMoAutoModelForCausalLM.from_pretrained
  pretrained_model_name_or_path: meta-llama/Llama-3.2-1B

# Add Slurm configuration
slurm:
  job_name: llm-finetune
  nodes: 1
  ntasks_per_node: 8
  time: 00:30:00
  account: your_account
  partition: gpu
  container_image: nvcr.io/nvidia/nemo:25.07
  gpus_per_node: 8  # This adds "#SBATCH --gpus-per-node=8" to the script
  # Optional: add extra mount points if needed
  extra_mounts:
    - /lustre:/lustre
```
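
For reference, the `slurm:` section above maps onto an sbatch header along these lines. This is a sketch inferred from the keys shown in the config (only the `gpus_per_node` mapping is stated explicitly); the script the tool actually generates may differ:

```bash
#!/bin/bash
# Approximate header implied by the slurm: config above (illustrative only).
#SBATCH --job-name=llm-finetune
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --time=00:30:00
#SBATCH --account=your_account
#SBATCH --partition=gpu
#SBATCH --gpus-per-node=8   # from gpus_per_node: 8, per the comment above

# container_image and extra_mounts would typically be handed to the container
# runtime (e.g. pyxis-style --container-image / --container-mounts flags on
# srun), depending on how the cluster is set up.
```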


Describe the solution you'd like

A standard PyTorch script for launching the examples across multiple nodes; so far they are mostly limited to a single node. For example, run the following on each node, adjusting `NODE_RANK` per node:

```bash
export MASTER_ADDR=node0.hostname   # master node's host/IP
export MASTER_PORT=29500
export NODE_RANK=0                  # node0 -> 0, node1 -> 1
torchrun --nnodes=2 --nproc_per_node=8 --node_rank=${NODE_RANK} \
  --rdzv_backend=c10d --rdzv_endpoint=${MASTER_ADDR}:${MASTER_PORT} \
  train.py --batch_size 32
```
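
Under Slurm, the same launch can be driven from a single sbatch script instead of exporting `MASTER_ADDR`/`NODE_RANK` by hand on every node. A minimal sketch, assuming a Slurm cluster and one torchrun launcher task per node; `train.py` and `--batch_size 32` are carried over from the snippet above:

```bash
#!/bin/bash
#SBATCH --job-name=llm-finetune
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1   # one torchrun launcher per node
#SBATCH --gpus-per-node=8
#SBATCH --time=00:30:00

# Use the first host in the allocation as the rendezvous endpoint.
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=29500

# srun starts one torchrun per node. With the c10d rendezvous backend,
# torchrun negotiates node ranks itself, so NODE_RANK does not need to be
# set per node; each torchrun then spawns the 8 per-GPU worker processes.
srun torchrun \
  --nnodes="$SLURM_NNODES" \
  --nproc_per_node=8 \
  --rdzv_id="$SLURM_JOB_ID" \
  --rdzv_backend=c10d \
  --rdzv_endpoint="${MASTER_ADDR}:${MASTER_PORT}" \
  train.py --batch_size 32
```

Because the rendezvous assigns ranks and `--nnodes` comes from `$SLURM_NNODES`, the same script works unchanged for any node count; only `#SBATCH --nodes` needs to change.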

Labels: enhancement (New feature or request)