
Training LibriTTS from scratch #49

@wangyuxuan11

Description

I'm trying to train the ZipVoice model on LibriTTS from scratch. The only difference from the script you provided in the tutorial is that I have only 2 GPUs, and I changed the per-GPU batch size (i.e., max duration) from 250 to 200. So my effective batch size (max duration summed over GPUs) is 400, while yours is 2000, and I reduced the initial learning rate from 0.02 to 0.004 in the same proportion, as shown in the sketch below. However, after training for about 1200 batches, the loss became NaN. I then reduced the learning rate further, to 1e-4, but the loss still became NaN during training. Could you tell me what the cause might be? Do you think there is anything wrong with my experiment?
Thanks a lot~ 😊
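
For reference, this is the linear-scaling arithmetic I used to get 0.004 (just a small sketch in plain Python; the numbers are the effective batch sizes described above):

```python
base_lr = 0.02         # initial learning rate from the tutorial recipe
base_batch = 2000      # tutorial: total max-duration summed over GPUs
my_batch = 2 * 200     # my setup: 2 GPUs x max-duration 200 = 400

# Scale the learning rate linearly with the effective batch size.
scaled_lr = base_lr * my_batch / base_batch
print(scaled_lr)       # 0.004
```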
Here is part of my log:
2025-07-16 15:27:26,157 WARNING [optim.py:600] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.485e-02 3.150e-02 3.570e-02 4.058e-02 1.745e+00, threshold=7.139e-02, percent-clipped=10.0
2025-07-16 15:27:38,359 WARNING [optim.py:616] (0/2) Scaling gradients by 0.07326850295066833, model_norm_threshold=0.0713903084397316
2025-07-16 15:27:53,121 WARNING [optim.py:616] (0/2) Scaling gradients by 0.03666429966688156, model_norm_threshold=0.0713903084397316
2025-07-16 15:27:56,001 WARNING [optim.py:616] (0/2) Scaling gradients by 0.02603791654109955, model_norm_threshold=0.0713903084397316
2025-07-16 15:28:00,257 WARNING [optim.py:616] (0/2) Scaling gradients by 0.1383640170097351, model_norm_threshold=0.0713903084397316
2025-07-16 15:28:19,181 WARNING [optim.py:616] (0/2) Scaling gradients by 0.10927493125200272, model_norm_threshold=0.0713903084397316
2025-07-16 15:28:32,751 INFO [train_zipvoice.py:683] (0/2) Epoch 1, batch 1049, global_batch_idx: 1050, batch size: 13, loss[loss=0.3753, over 17044.00 frames. ], tot_loss[loss=0.496, over 3481509.62 frames. ], cur_lr: 3.98e-03,
2025-07-16 15:28:50,219 WARNING [optim.py:616] (0/2) Scaling gradients by 0.38779377937316895, model_norm_threshold=0.0713903084397316
2025-07-16 15:28:51,739 WARNING [optim.py:616] (0/2) Scaling gradients by 0.023601435124874115, model_norm_threshold=0.0713903084397316
2025-07-16 15:28:58,573 WARNING [optim.py:616] (0/2) Scaling gradients by 0.48962411284446716, model_norm_threshold=0.0713903084397316
2025-07-16 15:29:13,301 WARNING [optim.py:616] (0/2) Scaling gradients by 0.23101931810379028, model_norm_threshold=0.0713903084397316
2025-07-16 15:29:28,513 WARNING [optim.py:616] (0/2) Scaling gradients by 0.17312481999397278, model_norm_threshold=0.0713903084397316
2025-07-16 15:29:31,375 WARNING [optim.py:616] (0/2) Scaling gradients by 0.1966768056154251, model_norm_threshold=0.0713903084397316
2025-07-16 15:29:40,906 INFO [train_zipvoice.py:683] (0/2) Epoch 1, batch 1099, global_batch_idx: 1100, batch size: 21, loss[loss=0.2665, over 18265.00 frames. ], tot_loss[loss=0.4521, over 3489922.90 frames. ], cur_lr: 3.98e-03,
2025-07-16 15:29:42,202 WARNING [optim.py:600] (0/2) Clipping_scale=2.0, grad-norm quartiles 2.925e-02 3.598e-02 4.158e-02 5.199e-02 3.025e+00, threshold=8.317e-02, percent-clipped=16.0
2025-07-16 15:30:47,501 INFO [train_zipvoice.py:683] (0/2) Epoch 1, batch 1149, global_batch_idx: 1150, batch size: 30, loss[loss=0.2047, over 17674.00 frames. ], tot_loss[loss=0.4049, over 3491031.33 frames. ], cur_lr: 3.98e-03,
2025-07-16 15:31:08,554 WARNING [optim.py:616] (0/2) Scaling gradients by 0.09235912561416626, model_norm_threshold=0.0831688940525055
2025-07-16 15:31:17,813 WARNING [optim.py:616] (0/2) Scaling gradients by 0.009989183396100998, model_norm_threshold=0.0831688940525055
2025-07-16 15:32:00,742 INFO [train_zipvoice.py:683] (0/2) Epoch 1, batch 1199, global_batch_idx: 1200, batch size: 109, loss[loss=nan, over 14125.00 frames. ], tot_loss[loss=nan, over 3487009.45 frames. ], cur_lr: 3.97e-03,
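
To localize the failure, I am considering a guard like the following: a minimal sketch for a generic PyTorch training loop (not part of the actual train_zipvoice.py code; the function and its arguments are placeholders) that fails fast at the first non-finite loss instead of letting NaN propagate into the running tot_loss:

```python
import torch

def assert_finite(loss: torch.Tensor, global_batch_idx: int) -> None:
    """Raise on the first NaN/Inf loss so the offending batch can be
    dumped and inspected, rather than averaging NaN into tot_loss."""
    if not torch.isfinite(loss):
        raise RuntimeError(
            f"non-finite loss {loss.item()} at global batch {global_batch_idx}"
        )
```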
