Replies: 1 comment
-
That's the line I comment out to get it working: I comment it out and rebuild with RPC and CUDA support.
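For reference, the line in question is the GGML_ASSERT at `ggml/src/ggml-rpc.cpp:467` shown in the backtrace below (the exact line number may differ across revisions). A minimal sketch of the workaround, with no claim that dropping the guard is safe in every case:

```cpp
// ggml/src/ggml-rpc.cpp, around line 467 in this build (location varies by revision).
// The RPC backend rejects quantized tensors whose first dimension is not a
// multiple of 512. Commenting the guard out lets the q8_0 KV cache through,
// at the risk of hitting whatever corner case the assert was protecting against.

// GGML_ASSERT(tensor->ne[0] % 512 == 0 && "unsupported quantized tensor");
```

After editing, rebuild the RPC + CUDA binaries the same way the `build-rpc-cuda` tree was built originally.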
-
I'm trying to run a model over RPC with `--flash-attn --cache-type-k q8_0 --cache-type-v q8_0`. On the RPC machine I run:

```
build-rpc-cuda/bin/rpc-server --host 0.0.0.0 --port 50052
```

and on the main server I run:

```
build-rpc-cuda/bin/llama-server --port 8989 -m ../llama_cpp/Qwen2.5-32B.Q8_0.gguf -p "Hello, you are coder assistant" -ngl 48 --n-predict -1 --ctx-size 8192 --threads 4 --no-mmap --temp 0.3 --rpc 192.168.3.2:50052 --tensor-split 4,24,20 --flash-attn --cache-type-k q8_0 --cache-type-v q8_0
```

and I get:
```
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
/media/yesh/wd/AI/llama.cpp/ggml/src/ggml-rpc.cpp:467: GGML_ASSERT(tensor->ne[0] % 512 == 0 && "unsupported quantized tensor") failed
[New LWP 4401]
[New LWP 4408]
[New LWP 4409]
[New LWP 4410]
[New LWP 4411]
[New LWP 4412]
[New LWP 4413]
[New LWP 4414]
[New LWP 4415]
[New LWP 4416]
[New LWP 4417]
[New LWP 4418]
[New LWP 4419]
[New LWP 4420]
[New LWP 4421]
[New LWP 4422]
[New LWP 4423]
[New LWP 4424]
[New LWP 4425]
[New LWP 4426]
[New LWP 4427]
[New LWP 4428]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f362b8f2b57 in __GI___wait4 (pid=4435, stat_loc=0x7ffd6592c004, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30 ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory.
#0 0x00007f362b8f2b57 in __GI___wait4 (pid=4435, stat_loc=0x7ffd6592c004, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30 in ../sysdeps/unix/sysv/linux/wait4.c
#1 0x00007f362ba3de28 in ggml_abort () from /media/yesh/wd/AI/llama.cpp/build-rpc-cuda/ggml/src/libggml.so
#2 0x00007f362bcb801a in ggml_backend_rpc_buffer_init_tensor(ggml_backend_buffer*, ggml_tensor*) () from /media/yesh/wd/AI/llama.cpp/build-rpc-cuda/ggml/src/libggml.so
#3 0x00007f362ba81eb8 in ggml_gallocr_alloc_graph () from /media/yesh/wd/AI/llama.cpp/build-rpc-cuda/ggml/src/libggml.so
#4 0x00007f362ba875eb in ggml_backend_sched_alloc_graph () from /media/yesh/wd/AI/llama.cpp/build-rpc-cuda/ggml/src/libggml.so
#5 0x00007f364153afc8 in llama_decode_internal(llama_context&, llama_batch) () from /media/yesh/wd/AI/llama.cpp/build-rpc-cuda/src/libllama.so
#6 0x00007f364153cf67 in llama_decode () from /media/yesh/wd/AI/llama.cpp/build-rpc-cuda/src/libllama.so
#7 0x0000561a93bf5127 in common_init_from_params(common_params&) ()
#8 0x0000561a93b8cc7d in server_context::load_model(common_params const&) ()
#9 0x0000561a93b3e70e in main ()
[Inferior 1 (process 4400) detached]
```
Can you help me run the model properly with `--cache-type-k q8_0 --cache-type-v q8_0` over RPC?
Without `--cache-type-k q8_0 --cache-type-v q8_0` the setup doesn't make sense for me: the remote machine only has 4 GB of VRAM, and with the full-size K and V caches it runs faster locally anyway.
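For a sense of scale, here is my rough arithmetic (a sketch only: the Qwen2.5-32B shape numbers, 64 layers, 8 KV heads, head dim 128, are assumptions on my part; check them against the gguf metadata llama-server prints at startup):

```cpp
// Rough KV-cache size estimate for --ctx-size 8192, f16 vs q8_0.
// The model dimensions are assumptions, not read from the gguf.
#include <cstdio>

int main() {
    const long long n_layer   = 64;    // assumed for Qwen2.5-32B
    const long long n_kv_head = 8;     // assumed (GQA)
    const long long head_dim  = 128;   // assumed
    const long long n_ctx     = 8192;  // from --ctx-size

    // K and V elements stored per token, summed over all layers
    const long long elems_per_token = 2 * n_layer * n_kv_head * head_dim;

    const double f16_bytes_per_elem  = 2.0;          // f16: 2 bytes per element
    const double q8_0_bytes_per_elem = 34.0 / 32.0;  // q8_0: 34-byte block holding 32 elements

    const double gib = 1024.0 * 1024.0 * 1024.0;
    std::printf("f16  KV cache: %.2f GiB\n", n_ctx * elems_per_token * f16_bytes_per_elem  / gib);
    std::printf("q8_0 KV cache: %.2f GiB\n", n_ctx * elems_per_token * q8_0_bytes_per_elem / gib);
    return 0;
}
```

With those assumptions the full-context cache is about 2 GiB at f16 versus roughly 1 GiB at q8_0, which matters a lot when the remote card only has 4 GB.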