OOM during training #2907
Unanswered
yangjianxin1 asked this question in Community | Q&A
Replies: 1 comment
This problem may be caused by memory fragmentation on CUDA. This is expected behavior, since PyTorch uses a simple caching allocator. We may alleviate this problem in the future.
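The error message itself points at two general PyTorch-level mitigations for fragmentation: setting max_split_size_mb via PYTORCH_CUDA_ALLOC_CONF, and periodically returning cached blocks to the driver. The sketch below illustrates both; the concrete values (max_split_size_mb:128, an interval of 50 steps) and the helper name maybe_release_cached_memory are assumptions for illustration, not ColossalAI recommendations.

```python
# Minimal sketch of two fragmentation mitigations hinted at by the OOM message.
# The specific values here (128 MiB split size, 50-step interval) are assumptions.
import gc
import os

# 1) Discourage the caching allocator from splitting large blocks into
#    fragments it cannot reuse. Must be set before CUDA is initialized.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

import torch

def maybe_release_cached_memory(step: int, interval: int = 50) -> None:
    """Periodically return cached, unused CUDA blocks to the driver.

    This does not free tensors that are still referenced; it only releases
    memory the allocator is caching, which can reduce fragmentation pressure.
    """
    if step % interval == 0:
        gc.collect()                # drop unreachable Python objects holding tensors
        torch.cuda.empty_cache()    # release cached blocks back to the driver
```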
I am training a GPT2 model with the ZeRO + Gemini setup. At the beginning, training runs normally and there is still spare GPU memory, but after a number of steps it fails with OOM. Is there any way to mitigate this, for example by triggering garbage collection automatically?
CUDA out of memory. Tried to allocate 506.00 MiB (GPU 0; 31.75 GiB total capacity; 28.92 GiB already allocated; 10.00 MiB free; 30.43 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
temp = optim_chunk.cpu_shard.to(get_current_device())
RuntimeError: CUDA out of memory. Tried to allocate 506.00 MiB (GPU 1; 31.75 GiB total capacity; 28.92 GiB already allocated; 14.00 MiB free; 30.43 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
0%| | 35/125002 [02:28<147:40:56, 4.25s/it]
0%| | 35/125002 [02:29<147:49:53, 4.26s/it]
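One way to check whether the failure comes from fragmentation rather than a genuine leak is to log allocated versus reserved memory each step: if reserved keeps growing while allocated stays flat, the allocator is holding fragmented blocks, which matches the "reserved >> allocated" hint in the message above. A minimal sketch, where the function name and logging format are assumptions:

```python
import torch

def log_cuda_memory(step: int, device: int = 0) -> None:
    """Print allocated vs. reserved CUDA memory to spot fragmentation."""
    allocated = torch.cuda.memory_allocated(device) / 2**30  # GiB in live tensors
    reserved = torch.cuda.memory_reserved(device) / 2**30    # GiB held by the allocator
    print(f"step {step}: allocated={allocated:.2f} GiB, reserved={reserved:.2f} GiB")
```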