Can Gemini's ZeRO scheme initialize a GPT-3-scale model? #2479
Unanswered
yhcc asked this question in Community | Q&A
Replies: 3 comments, 4 replies
-
How shard init works: ColoInitContext initializes one parameter at a time. It first allocates the global tensor on every process, then splits it into N shards, with each process keeping only 1/N of the data. Once all parameters are initialized, each process therefore holds only 1/N of the total memory.
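The shard-init idea described above can be sketched as follows. This is an illustrative simplification, not the actual ColoInitContext implementation: materialize the full ("global") tensor, keep only this rank's 1/N slice, and free the rest.

```python
import torch

def shard_init_param(shape, rank, world_size, dtype=torch.float16):
    """Sketch of shard init (hypothetical helper, not the ColossalAI API):
    allocate the global tensor, then retain only this rank's 1/N shard."""
    full = torch.empty(shape, dtype=dtype)
    torch.nn.init.normal_(full, std=0.02)   # initialize the global tensor
    flat = full.flatten()
    # Pad so the tensor splits evenly into world_size chunks.
    pad = (-flat.numel()) % world_size
    if pad:
        flat = torch.cat([flat, flat.new_zeros(pad)])
    shard = flat.chunk(world_size)[rank].clone()  # keep 1/N of the data
    del full, flat                                # release the global tensor
    return shard

# Single-process illustration: 4-way sharding of a 10x10 parameter.
shard = shard_init_param((10, 10), rank=0, world_size=4)
print(shard.numel())  # 100 elements / 4 ranks = 25 per rank
```

In the real implementation this happens parameter by parameter, so the transient peak is one global tensor at a time rather than the whole model.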
-
Hi @yhcc @taishiciR, for shard init you can refer to here.
-
The example at https://github.com/hpcaitech/ColossalAI/blob/main/examples/language/gpt/gemini/train_gpt_demo.py provides a Gemini-based ZeRO scheme. I tried adapting it to directly support a GPT-3-scale model, but this causes OOM errors. Looking at the GeminiDDP source code, there doesn't seem to be any logic for partitioning parameters across different GPUs (unless I've misunderstood). Is there a recommended way to train a GPT-3-scale model with this Gemini setup?
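A back-of-the-envelope estimate (assumed, widely cited figures, not measurements from the demo) shows why initialization OOMs if every rank materializes the full parameter set:

```python
# GPT-3 is commonly cited at ~175B parameters; figures below are estimates.
params = 175e9
param_mem_gb = params * 2 / 1024**3        # fp16 parameters, 2 bytes each
# Mixed-precision Adam: fp32 master weights + two fp32 optimizer states.
optim_mem_gb = params * 4 * 3 / 1024**3

print(f"fp16 params:      {param_mem_gb:.0f} GiB")
print(f"optimizer states: {optim_mem_gb:.0f} GiB")

# If parameters and states are sharded across N ranks, each rank holds ~1/N.
for n in (8, 64, 512):
    print(f"{n:4d} ranks -> {(param_mem_gb + optim_mem_gb) / n:.1f} GiB/rank")
```

The fp16 parameters alone are far beyond a single GPU's memory, so without sharding at initialization time (as in shard init), allocating the global tensors on every process fails before training even starts.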