Something wrong when I try to use speculative decoding in llama.cpp #9228

bulaikexiansheng · 2024-08-29T01:44:34Z


I am trying to use the speculative decoding example. The command is shown below:

./llama-speculative \
-m /home/liwenyuan/models/llama-2-7b-instruct/ggml-model-f16.gguf \
-md /home/liwenyuan/models/llama-2-7b-instruct/ggml-model-Q4_K_M.gguf \
-p "// Quick-sort implementation in C (4 spaces indentation + detailed comments) and sample usage:\n\n#include" \
-e -ngl 32 -t 4 -n 256 -c 4096 -s 8 --top_k 1 --draft 16

The model weights appear to have been offloaded to the GPU, but the GPU is not being utilized.

Is there something wrong?

bulaikexiansheng · 2024-08-29T01:45:28Z


I checked with the top command and found that CPU usage is high.

