Hello, I am trying to fine-tune a T5 model with PyTorch Lightning on inputs with a source length of 4000 and a target length of 1000, but training fails with a CUDA out-of-memory error. I am using a SageMaker ml.g4dn.4xlarge instance with a single GPU and 64 GB of RAM. I have already tried gradient accumulation and gradient checkpointing, but they do not help, and mixed precision makes the training loss go to NaN. Can you please guide me on what I should do?

Error message on the SageMaker instance:

OutOfMemoryError: CUDA out of memory. Tried to allocate 136.00 MiB (GPU 0; 14.62 GiB total capacity; 13.11 GiB already allocated; 4.00 MiB free; 13.76 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
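For context on the knobs involved: a minimal sketch of the usual memory-saving setup for this situation, assuming the model is a Hugging Face `T5ForConditionalGeneration` trained through a Lightning `Trainer` (the checkpoint name `t5-base` and the accumulation factor are illustrative, not from the thread). It combines activation checkpointing, the Adafactor optimizer (much smaller state than Adam), and a micro-batch of 1 with gradient accumulation. Note the g4dn's T4 GPU has no native bf16 support, which is why fp16 autocast is the usual mixed-precision option there, and fp16 is known to produce NaN losses with T5; keeping fp32 and saving memory elsewhere is a common workaround.

```python
import pytorch_lightning as pl
from transformers import T5ForConditionalGeneration
from transformers.optimization import Adafactor

model = T5ForConditionalGeneration.from_pretrained("t5-base")

# 1) Activation (gradient) checkpointing: recomputes activations in the
#    backward pass instead of storing them, trading compute for memory.
model.gradient_checkpointing_enable()
model.config.use_cache = False  # the decoder cache is incompatible with checkpointing

# 2) Adafactor keeps optimizer state far smaller than Adam/AdamW;
#    with an explicit lr, relative_step must be disabled.
optimizer = Adafactor(
    model.parameters(),
    lr=1e-3,
    scale_parameter=False,
    relative_step=False,
)

# 3) Micro-batch of 1 (set in your DataLoader) plus gradient accumulation
#    to reach an effective batch size of 16 without holding 16 samples'
#    activations at once.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=1,
    accumulate_grad_batches=16,
    precision=32,          # T4 lacks native bf16, and fp16 made the loss NaN
    gradient_clip_val=1.0,
)
```

Separately, the error message itself suggests setting `PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128` (value illustrative) in the environment before launching, which can reduce fragmentation. If memory is still exhausted, the quadratic attention cost of a 4000-token source is likely the bottleneck, and truncating inputs or switching to a long-input variant such as LongT5 may be necessary.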