Description
What happened?
I was running Llama-3 on an RTX 3090 and encountered the same performance problem as in #1376.
When using grammar files, sample time becomes very long and GPU utilization drops from over 70% (without grammar) to about 10%.
I tried two different fine-tuned versions of Llama-3 and the problem persists.
With Llama-2 there is no such problem, so I believe it is due to some kind of bug in llama.cpp.
I offloaded all layers to the GPU and believe I have llama.cpp properly configured.
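A plausible factor (a rough illustration, not llama.cpp's actual implementation): grammar-constrained sampling has to test every candidate token against the grammar at each step, so per-token sample cost grows with vocabulary size, and Llama-3's tokenizer (~128k tokens) is roughly four times larger than Llama-2's (~32k). The `accepts` predicate below is a hypothetical stand-in for the real grammar check.

```python
def grammar_filter(vocab, accepts):
    """Return tokens allowed by the grammar; cost is O(len(vocab))."""
    checks = 0
    allowed = []
    for tok in vocab:
        checks += 1          # one grammar check per candidate token
        if accepts(tok):
            allowed.append(tok)
    return allowed, checks

# Approximate vocabulary sizes: Llama-2 ~32k tokens, Llama-3 ~128k tokens.
llama2_vocab = range(32_000)
llama3_vocab = range(128_256)

# Dummy predicate standing in for a real grammar-acceptance test.
_, c2 = grammar_filter(llama2_vocab, lambda t: t % 7 == 0)
_, c3 = grammar_filter(llama3_vocab, lambda t: t % 7 == 0)
```

Under this model, the per-step filtering work roughly quadruples for Llama-3, which alone does not explain the slowdowns in the logs below, so a grammar-code inefficiency in llama.cpp still seems likely.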
Name and Version
version: 2998 (9588f19)
built with cc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0 for x86_64-linux-gnu
What operating system are you seeing the problem on?
Linux
Relevant log output
Llama-3-8B-Instruct with grammar:
llama_print_timings: load time = 195.81 ms
llama_print_timings: sample time = 7656.05 ms / 90 runs ( 85.07 ms per token, 11.76 tokens per second)
llama_print_timings: prompt eval time = 192.27 ms / 410 tokens ( 0.47 ms per token, 2132.44 tokens per second)
llama_print_timings: eval time = 944.78 ms / 89 runs ( 10.62 ms per token, 94.20 tokens per second)
llama_print_timings: total time = 9298.97 ms / 499 tokens
Llama-3-8B-Instruct without grammar:
llama_print_timings: load time = 193.30 ms
llama_print_timings: sample time = 387.66 ms / 233 runs ( 1.66 ms per token, 601.04 tokens per second)
llama_print_timings: prompt eval time = 192.93 ms / 410 tokens ( 0.47 ms per token, 2125.09 tokens per second)
llama_print_timings: eval time = 2355.86 ms / 232 runs ( 10.15 ms per token, 98.48 tokens per second)
llama_print_timings: total time = 3277.20 ms / 642 tokens
Llama-2-8B with grammar:
llama_print_timings: load time = 210.30 ms
llama_print_timings: sample time = 354.68 ms / 54 runs ( 6.57 ms per token, 152.25 tokens per second)
llama_print_timings: prompt eval time = 209.69 ms / 464 tokens ( 0.45 ms per token, 2212.84 tokens per second)
llama_print_timings: eval time = 492.42 ms / 53 runs ( 9.29 ms per token, 107.63 tokens per second)
llama_print_timings: total time = 1128.22 ms / 517 tokens
Llama-2-8B without grammar:
llama_print_timings: load time = 194.85 ms
llama_print_timings: sample time = 153.25 ms / 367 runs ( 0.42 ms per token, 2394.76 tokens per second)
llama_print_timings: prompt eval time = 194.44 ms / 464 tokens ( 0.42 ms per token, 2386.38 tokens per second)
llama_print_timings: eval time = 3512.26 ms / 366 runs ( 9.60 ms per token, 104.21 tokens per second)
llama_print_timings: total time = 4094.80 ms / 830 tokens
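A quick back-of-the-envelope from the per-token sample times in the logs above (values copied directly from the timings):

```python
# Per-token sample times (ms/token) taken from the llama_print_timings logs.
sample_ms = {
    ("Llama-3-8B", "grammar"):    85.07,
    ("Llama-3-8B", "no-grammar"):  1.66,
    ("Llama-2-8B", "grammar"):     6.57,
    ("Llama-2-8B", "no-grammar"):  0.42,
}

for model in ("Llama-3-8B", "Llama-2-8B"):
    slowdown = sample_ms[(model, "grammar")] / sample_ms[(model, "no-grammar")]
    print(f"{model}: grammar makes sampling ~{slowdown:.0f}x slower")
```

So enabling the grammar slows sampling by roughly 51x on Llama-3 versus roughly 16x on Llama-2, a much larger penalty than the ~4x difference in vocabulary size alone would predict.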