How to use pipeline parallelism to serve a BLOOM model? #3013
Unanswered
gaoxt1983
asked this question in Community | Q&A
Replies: 2 comments 3 replies
-
Hi @gaoxt1983, in fact, as the BLOOM example demonstrates, we recommend using TP (tensor parallelism), because PP (pipeline parallelism) is inefficient for generation tasks due to the pipeline bubble.
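The bubble argument can be made concrete with the standard GPipe-style idle-time estimate. This is a sketch, not EnergonAI code; `bubble_fraction` is a hypothetical helper written here for illustration:

```python
def bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Idle-time fraction of a GPipe-style pipeline schedule.

    With p stages and m microbatches, each stage is busy for m slots
    out of m + p - 1 total slots, so p - 1 slots are idle ("bubble"),
    giving an idle fraction of (p - 1) / (m + p - 1).
    """
    p, m = num_stages, num_microbatches
    return (p - 1) / (m + p - 1)

# Autoregressive decoding produces one token at a time, so each decode
# step behaves like a single microbatch (m = 1): most stages sit idle.
print(bubble_fraction(num_stages=4, num_microbatches=1))   # 0.75
print(bubble_fraction(num_stages=4, num_microbatches=32))  # ~0.086
```

With 4 pipeline stages and one microbatch, 75% of the pipeline is idle at every decode step, which is why TP is preferred for generation even though PP amortizes well for large-batch training.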
3 replies
-
Is it normal that generating one token takes 100 ms–120 ms on a node with 8 A100s?
0 replies
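For a rough sanity check on that latency: single-token decoding is typically memory-bandwidth bound, so reading all the weights once per token sets a lower bound. The figures below (fp16 weights, ~2 TB/s HBM bandwidth per A100 80GB) are assumptions for the estimate, not measurements from this thread:

```python
# Rough lower bound on per-token decode latency for a 175B-parameter
# model, assuming decoding is bound by reading the weights from HBM.
PARAMS = 175e9
BYTES_PER_PARAM = 2        # fp16 (assumed)
NUM_GPUS = 8
HBM_BW = 2.0e12            # ~2 TB/s per A100 80GB (assumed)

weights_bytes = PARAMS * BYTES_PER_PARAM        # 350 GB of weights
latency_s = weights_bytes / (NUM_GPUS * HBM_BW)
print(f"{latency_s * 1e3:.1f} ms")              # ~21.9 ms ideal floor
```

By this estimate the ideal floor is around 22 ms/token, so 100–120 ms is a few times the theoretical minimum, which is plausible once communication, kernel launch, and scheduling overheads are included.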
-
I have a BLOOM-175B pretrained model. I want to serve this model with EnergonAI on a single-node machine with 4 A100 GPUs, so I modified example/bloom/run.sh:
What I observed afterwards was that the 4 GPUs were mostly idle; the processes I monitored looked like this:
So what have I done wrong, and what should I do to achieve pipeline parallelism?