OOM during training #2907
Unanswered
yangjianxin1 asked this question in Community | Q&A
Replies: 1 comment
This problem may be caused by memory fragmentation on CUDA. This is expected behavior, since PyTorch uses a simple caching allocator. We may alleviate this problem in the future.
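The error message itself points at two general PyTorch-level mitigations for fragmentation: setting max_split_size_mb via PYTORCH_CUDA_ALLOC_CONF, and periodically returning cached blocks to the driver. The sketch below illustrates both; the concrete values (max_split_size_mb:128, an interval of 50 steps) and the helper name maybe_release_cached_memory are assumptions for illustration, not ColossalAI recommendations.

```python
# Minimal sketch of two fragmentation mitigations hinted at by the OOM message.
# The specific values here (128 MiB split size, 50-step interval) are assumptions.
import gc
import os

# 1) Discourage the caching allocator from splitting large blocks into
#    fragments it cannot reuse. Must be set before CUDA is initialized.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

import torch

def maybe_release_cached_memory(step: int, interval: int = 50) -> None:
    """Periodically return cached, unused CUDA blocks to the driver.

    This does not free tensors that are still referenced; it only releases
    memory the allocator is caching, which can reduce fragmentation pressure.
    """
    if step % interval == 0:
        gc.collect()                # drop unreachable Python objects holding tensors
        torch.cuda.empty_cache()    # release cached blocks back to the driver
```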
I am training a GPT2 model with the ZeRO + Gemini setup. At the beginning, training runs normally and there is still spare GPU memory, but after a number of steps it fails with OOM. Is there any way to mitigate this, for example by triggering garbage collection automatically?
CUDA out of memory. Tried to allocate 506.00 MiB (GPU 0; 31.75 GiB total capacity; 28.92 GiB already allocated; 10.00 MiB free; 30.43 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
temp = optim_chunk.cpu_shard.to(get_current_device())
RuntimeError: CUDA out of memory. Tried to allocate 506.00 MiB (GPU 1; 31.75 GiB total capacity; 28.92 GiB already allocated; 14.00 MiB free; 30.43 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
0%| | 35/125002 [02:28<147:40:56, 4.25s/it]
0%| | 35/125002 [02:29<147:49:53, 4.26s/it]
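One way to check whether the failure comes from fragmentation rather than a genuine leak is to log allocated versus reserved memory each step: if reserved keeps growing while allocated stays flat, the allocator is holding fragmented blocks, which matches the "reserved >> allocated" hint in the message above. A minimal sketch, where the function name and logging format are assumptions:

```python
import torch

def log_cuda_memory(step: int, device: int = 0) -> None:
    """Print allocated vs. reserved CUDA memory to spot fragmentation."""
    allocated = torch.cuda.memory_allocated(device) / 2**30  # GiB in live tensors
    reserved = torch.cuda.memory_reserved(device) / 2**30    # GiB held by the allocator
    print(f"step {step}: allocated={allocated:.2f} GiB, reserved={reserved:.2f} GiB")
```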