- Version
  - Python 3.10.15
  - Torch 2.5.1
  - GPU: 4× A6000 (48 GB each)
  - CUDA Version 12.2
  - DeepSpeed 0.15.4
  - Accelerate 1.1.1
- Description
While running train_control.py, training completes all 6928 steps, but the run fails during the final checkpoint save (checkpoint-6928). The logs show an NCCL ALLREDUCE operation that timed out after roughly 30 minutes (Timeout(ms)=1800000), after which the process-group watchdog takes the whole process down with a c10::DistBackendError.
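
For context, the 1800000 ms in the error matches the default 30-minute timeout that Accelerate passes to `init_process_group`. Below is a minimal sketch of how that limit could be raised via `InitProcessGroupKwargs`, assuming the `Accelerator` in train_control.py can be constructed with extra `kwargs_handlers` (the two-hour value is only illustrative, and I have not verified this as a fix):

```python
from datetime import timedelta

from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs

# Raise the collective timeout beyond the 30-minute default so a slow
# DeepSpeed checkpoint save does not trip the NCCL watchdog.
process_group_kwargs = InitProcessGroupKwargs(timeout=timedelta(hours=2))

accelerator = Accelerator(kwargs_handlers=[process_group_kwargs])
```

A longer timeout would only help if the save is merely slow; if one of the four ranks is genuinely stuck in the checkpoint collective, the hang itself is the underlying problem.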
- Error Logs
```
Steps: 100%|████████████████████████████████████████████| 6928/6928 [37:28<00:00, 11.45s/it, lr=2e-5, step_loss=0.125]
12/02/2024 16:03:31 - INFO - accelerate.accelerator - Saving current state to output_dir/checkpoint-6928
12/02/2024 16:03:31 - INFO - accelerate.accelerator - Saving DeepSpeed Model and Optimizer
[rank0]:[E1202 16:33:31.475880828 ProcessGroupNCCL.cpp:616] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=140, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800093 milliseconds before timing out.
[rank0]:[E1202 16:33:31.828915902 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 0] Exception (either an error or timeout) detected by watchdog at work: 140, last enqueued NCCL work: 140, last completed NCCL work: 139.
[2024-12-02 16:33:32,111] [INFO] [logging.py:128:log_dist] [Rank 0] [Torch] Checkpoint pytorch_model is about to be saved!
[rank0]:[E1202 16:33:32.374203239 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 0] Timeout at NCCL work: 140, last enqueued NCCL work: 140, last completed NCCL work: 139.
[rank0]:[E1202 16:33:32.374226255 ProcessGroupNCCL.cpp:630] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E1202 16:33:32.374231830 ProcessGroupNCCL.cpp:636] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E1202 16:33:32.388427390 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=140, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800093 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f6e7d65e446 in /data2/akko/anaconda3/envs/cogvideox/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7f6e7e971772 in /data2/akko/anaconda3/envs/cogvideox/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f6e7e978bb3 in /data2/akko/anaconda3/envs/cogvideox/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f6e7e97a61d in /data2/akko/anaconda3/envs/cogvideox/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7f6ec73075c0 in /data2/akko/anaconda3/envs/cogvideox/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x8609 (0x7f6eca106609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f6ec9ed1353 in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG ID 0 PG GUID 0(default_pg) Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=140, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800093 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f6e7d65e446 in /data2/akko/anaconda3/envs/cogvideox/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7f6e7e971772 in /data2/akko/anaconda3/envs/cogvideox/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f6e7e978bb3 in /data2/akko/anaconda3/envs/cogvideox/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f6e7e97a61d in /data2/akko/anaconda3/envs/cogvideox/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7f6ec73075c0 in /data2/akko/anaconda3/envs/cogvideox/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x8609 (0x7f6eca106609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f6ec9ed1353 in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f6e7d65e446 in /data2/akko/anaconda3/envs/cogvideox/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe4271b (0x7f6e7e5e771b in /data2/akko/anaconda3/envs/cogvideox/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0x145c0 (0x7f6ec73075c0 in /data2/akko/anaconda3/envs/cogvideox/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #3: <unknown function> + 0x8609 (0x7f6eca106609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: clone + 0x43 (0x7f6ec9ed1353 in /lib/x86_64-linux-gnu/libc.so.6)
```