Description
Is your feature request related to a problem? Please describe.
The current README uses a customized script to set everything up, but clusters vary, so the provided script may not work in some cases, and users then have to dig into the script to understand how launching works before they can fix it.
Here is what we have:
```yaml
# Your existing model, dataset, training config...
step_scheduler:
  grad_acc_steps: 4
  num_epochs: 1

model:
  _target_: nemo_automodel.NeMoAutoModelForCausalLM.from_pretrained
  pretrained_model_name_or_path: meta-llama/Llama-3.2-1B

# Add Slurm configuration
slurm:
  job_name: llm-finetune
  nodes: 1
  ntasks_per_node: 8
  time: 00:30:00
  account: your_account
  partition: gpu
  container_image: nvcr.io/nvidia/nemo:25.07
  gpus_per_node: 8  # This adds "#SBATCH --gpus-per-node=8" to the script

  # Optional: Add extra mount points if needed
  extra_mounts:
    - /lustre:/lustre
```
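For context, a minimal sketch of the sbatch script a config like this might render to; the template, launch line, and use of the pyxis container plugin are assumptions, not the tool's actual output:

```bash
#!/bin/bash
#SBATCH --job-name=llm-finetune
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --time=00:30:00
#SBATCH --account=your_account
#SBATCH --partition=gpu
#SBATCH --gpus-per-node=8

# Hypothetical launch line; assumes the cluster runs containers via pyxis.
srun --container-image=nvcr.io/nvidia/nemo:25.07 \
     --container-mounts=/lustre:/lustre \
     python train.py
```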
Describe the solution you'd like
A standard PyTorch script for launching the examples across multiple nodes; so far they are mostly limited to a single node. For example:
```bash
export MASTER_ADDR=node0.hostname  # master node's host/IP
export MASTER_PORT=29500
export NODE_RANK=0                 # node0 -> 0, node1 -> 1

torchrun --nnodes=2 --nproc_per_node=8 --node_rank=${NODE_RANK} \
    --rdzv_backend=c10d --rdzv_endpoint=${MASTER_ADDR}:${MASTER_PORT} \
    train.py --batch_size 32
```
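On a Slurm cluster, the rendezvous variables above can be derived from the allocation instead of hard-coded on each node. A minimal sketch, assuming a 2-node allocation and the same train.py entry point:

```bash
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=8

# Use the first node of the allocation as the rendezvous host.
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=29500

# One srun task per node; torchrun spawns the 8 per-node workers.
# Single quotes defer expansion so each task reads its own SLURM_NODEID.
srun bash -c 'torchrun --nnodes=$SLURM_NNODES --nproc_per_node=8 \
    --node_rank=$SLURM_NODEID \
    --rdzv_backend=c10d --rdzv_endpoint=${MASTER_ADDR}:${MASTER_PORT} \
    train.py --batch_size 32'
```

With the c10d rendezvous backend, the same script works unchanged for any node count in the allocation, which is the point of the request: one standard launcher rather than a per-cluster custom script.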