Description
I'm trying to train the ZipVoice model on LibriTTS from scratch. The only differences from the script in your tutorial are that I have only 2 GPUs and I reduced the per-GPU batch size (i.e., max duration) from 250 to 200, so my effective batch size is 400 versus your 2000. I therefore scaled the initial learning rate down proportionally, from 0.02 to 0.004 (see the sketch below). However, after training for about 1200 batches, the loss became NaN. I then reduced the learning rate further, to 1e-4, but the loss still became NaN during training. What could be the reason? Do you see anything wrong with my experiment?
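For reference, here is the learning-rate scaling arithmetic I applied. This is just a minimal sketch with variable names of my own (not from the training script), and it assumes the tutorial's effective batch size of 2000 comes from 8 GPUs × max duration 250:

```python
# Minimal sketch of the linear LR scaling I used; variable names are mine,
# not from the ZipVoice training scripts.
ref_gpus, ref_max_duration = 8, 250   # assumed tutorial setup: 8 * 250 = 2000
my_gpus, my_max_duration = 2, 200     # my setup: 2 * 200 = 400

ref_batch = ref_gpus * ref_max_duration   # effective batch size: 2000
my_batch = my_gpus * my_max_duration      # effective batch size: 400

ref_lr = 0.02
my_lr = ref_lr * my_batch / ref_batch     # 0.02 * 400 / 2000 = 0.004
print(my_lr)                              # 0.004
```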
Thanks a lot~ 😊
Here is part of my log:
2025-07-16 15:27:26,157 WARNING [optim.py:600] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.485e-02 3.150e-02 3.570e-02 4.058e-02 1.745e+00, threshold=7.139e-02, percent-clipped=10.0
2025-07-16 15:27:38,359 WARNING [optim.py:616] (0/2) Scaling gradients by 0.07326850295066833, model_norm_threshold=0.0713903084397316
2025-07-16 15:27:53,121 WARNING [optim.py:616] (0/2) Scaling gradients by 0.03666429966688156, model_norm_threshold=0.0713903084397316
2025-07-16 15:27:56,001 WARNING [optim.py:616] (0/2) Scaling gradients by 0.02603791654109955, model_norm_threshold=0.0713903084397316
2025-07-16 15:28:00,257 WARNING [optim.py:616] (0/2) Scaling gradients by 0.1383640170097351, model_norm_threshold=0.0713903084397316
2025-07-16 15:28:19,181 WARNING [optim.py:616] (0/2) Scaling gradients by 0.10927493125200272, model_norm_threshold=0.0713903084397316
2025-07-16 15:28:32,751 INFO [train_zipvoice.py:683] (0/2) Epoch 1, batch 1049, global_batch_idx: 1050, batch size: 13, loss[loss=0.3753, over 17044.00 frames. ], tot_loss[loss=0.496, over 3481509.62 frames. ], cur_lr: 3.98e-03,
2025-07-16 15:28:50,219 WARNING [optim.py:616] (0/2) Scaling gradients by 0.38779377937316895, model_norm_threshold=0.0713903084397316
2025-07-16 15:28:51,739 WARNING [optim.py:616] (0/2) Scaling gradients by 0.023601435124874115, model_norm_threshold=0.0713903084397316
2025-07-16 15:28:58,573 WARNING [optim.py:616] (0/2) Scaling gradients by 0.48962411284446716, model_norm_threshold=0.0713903084397316
2025-07-16 15:29:13,301 WARNING [optim.py:616] (0/2) Scaling gradients by 0.23101931810379028, model_norm_threshold=0.0713903084397316
2025-07-16 15:29:28,513 WARNING [optim.py:616] (0/2) Scaling gradients by 0.17312481999397278, model_norm_threshold=0.0713903084397316
2025-07-16 15:29:31,375 WARNING [optim.py:616] (0/2) Scaling gradients by 0.1966768056154251, model_norm_threshold=0.0713903084397316
2025-07-16 15:29:40,906 INFO [train_zipvoice.py:683] (0/2) Epoch 1, batch 1099, global_batch_idx: 1100, batch size: 21, loss[loss=0.2665, over 18265.00 frames. ], tot_loss[loss=0.4521, over 3489922.90 frames. ], cur_lr: 3.98e-03,
2025-07-16 15:29:42,202 WARNING [optim.py:600] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.925e-02 3.598e-02 4.158e-02 5.199e-02 3.025e+00, threshold=8.317e-02, percent-clipped=16.0
2025-07-16 15:30:47,501 INFO [train_zipvoice.py:683] (0/2) Epoch 1, batch 1149, global_batch_idx: 1150, batch size: 30, loss[loss=0.2047, over 17674.00 frames. ], tot_loss[loss=0.4049, over 3491031.33 frames. ], cur_lr: 3.98e-03,
2025-07-16 15:31:08,554 WARNING [optim.py:616] (0/2) Scaling gradients by 0.09235912561416626, model_norm_threshold=0.0831688940525055
2025-07-16 15:31:17,813 WARNING [optim.py:616] (0/2) Scaling gradients by 0.009989183396100998, model_norm_threshold=0.0831688940525055
2025-07-16 15:32:00,742 INFO [train_zipvoice.py:683] (0/2) Epoch 1, batch 1199, global_batch_idx: 1200, batch size: 109, loss[loss=nan, over 14125.00 frames. ], tot_loss[loss=nan, over 3487009.45 frames. ], cur_lr: 3.97e-03,