-
You could try running it with Nsight Systems to see where it is spending all this time. Does this also happen with …
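For reference, a minimal Nsight Systems invocation might look like the following (the binary and model path are illustrative, not from the original post):

    nsys profile --stats=true -o llama-report ./server -m models/model.gguf
    nsys stats llama-report.nsys-rep

The stats summary breaks the run down into CUDA kernel time versus host-side time, which should show where the slowdown lives.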
-
llm_load_tensors: offloading 0 repeating layers to GPU
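That line points at the likely cause: with 0 layers offloaded, inference runs entirely on the CPU even in a cuBLAS build, since llama.cpp defaults to offloading no layers. If that is what is happening here, passing --n-gpu-layers (-ngl) at launch should fix it. A sketch, with a hypothetical model path:

    ./server -m models/model.gguf -ngl 99

Requesting more layers than the model has simply offloads them all, and an A100 with 80 GB has room for every layer of most models.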
-
Thanks for the help. Is CLIP encoding accelerated on Metal systems?
-
Hi,
I've built llama.cpp with make LLAMA_CUBLAS=1. I'm using server and seeing incredibly slow performance that makes me suspect something is amiss. I'm running on an A100 with 80 GB of RAM (Runpod.io). The logs seem to indicate that the GPU is being utilized, yet prompt eval time is 0.02 tokens/second.
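A plausible reconstruction of that setup (the model path and network options are hypothetical):

    make LLAMA_CUBLAS=1
    ./server -m models/model.gguf --host 0.0.0.0 --port 8080

Note that this launch passes no --n-gpu-layers flag, which matches the log line quoted in the replies above.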
Any help debugging this or understanding the timing breakdown and why this system isn't performing would be very helpful. On my M1 MacBook w/ 32GB RAM, it absolutely screams, but it's using an entirely different backend (Metal).
Thank you,
Bart