Replies: 1 comment
-
That's the line I comment out to get it working: I comment it out and rebuild with RPC and CUDA support.
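For reference, the line in question is the GGML_ASSERT at `ggml/src/ggml-rpc.cpp:467` shown in the backtrace below (the exact line number may differ across revisions). A minimal sketch of the workaround, with no claim that dropping the guard is safe in every case:

```cpp
// ggml/src/ggml-rpc.cpp, around line 467 in this build (location varies by revision).
// The RPC backend rejects quantized tensors whose first dimension is not a
// multiple of 512. Commenting the guard out lets the q8_0 KV cache through,
// at the risk of hitting whatever corner case the assert was protecting against.

// GGML_ASSERT(tensor->ne[0] % 512 == 0 && "unsupported quantized tensor");
```

After editing, rebuild the RPC + CUDA binaries the same way the `build-rpc-cuda` tree was built originally.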
-
I'm trying to run a model over RPC with `--flash-attn --cache-type-k q8_0 --cache-type-v q8_0`. On the RPC machine I run:

```
build-rpc-cuda/bin/rpc-server --host 0.0.0.0 --port 50052
```

and on the main server I run:

```
build-rpc-cuda/bin/llama-server --port 8989 -m ../llama_cpp/Qwen2.5-32B.Q8_0.gguf -p "Hello, you are coder assistant" -ngl 48 --n-predict -1 --ctx-size 8192 --threads 4 --no-mmap --temp 0.3 --rpc 192.168.3.2:50052 --tensor-split 4,24,20 --flash-attn --cache-type-k q8_0 --cache-type-v q8_0
```

and I get:
```
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
/media/yesh/wd/AI/llama.cpp/ggml/src/ggml-rpc.cpp:467: GGML_ASSERT(tensor->ne[0] % 512 == 0 && "unsupported quantized tensor") failed
[New LWP 4401]
[New LWP 4408]
[New LWP 4409]
[New LWP 4410]
[New LWP 4411]
[New LWP 4412]
[New LWP 4413]
[New LWP 4414]
[New LWP 4415]
[New LWP 4416]
[New LWP 4417]
[New LWP 4418]
[New LWP 4419]
[New LWP 4420]
[New LWP 4421]
[New LWP 4422]
[New LWP 4423]
[New LWP 4424]
[New LWP 4425]
[New LWP 4426]
[New LWP 4427]
[New LWP 4428]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f362b8f2b57 in __GI___wait4 (pid=4435, stat_loc=0x7ffd6592c004, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30 ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory.
#0 0x00007f362b8f2b57 in __GI___wait4 (pid=4435, stat_loc=0x7ffd6592c004, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30 in ../sysdeps/unix/sysv/linux/wait4.c
#1 0x00007f362ba3de28 in ggml_abort () from /media/yesh/wd/AI/llama.cpp/build-rpc-cuda/ggml/src/libggml.so
#2 0x00007f362bcb801a in ggml_backend_rpc_buffer_init_tensor(ggml_backend_buffer*, ggml_tensor*) () from /media/yesh/wd/AI/llama.cpp/build-rpc-cuda/ggml/src/libggml.so
#3 0x00007f362ba81eb8 in ggml_gallocr_alloc_graph () from /media/yesh/wd/AI/llama.cpp/build-rpc-cuda/ggml/src/libggml.so
#4 0x00007f362ba875eb in ggml_backend_sched_alloc_graph () from /media/yesh/wd/AI/llama.cpp/build-rpc-cuda/ggml/src/libggml.so
#5 0x00007f364153afc8 in llama_decode_internal(llama_context&, llama_batch) () from /media/yesh/wd/AI/llama.cpp/build-rpc-cuda/src/libllama.so
#6 0x00007f364153cf67 in llama_decode () from /media/yesh/wd/AI/llama.cpp/build-rpc-cuda/src/libllama.so
#7 0x0000561a93bf5127 in common_init_from_params(common_params&) ()
#8 0x0000561a93b8cc7d in server_context::load_model(common_params const&) ()
#9 0x0000561a93b3e70e in main ()
[Inferior 1 (process 4400) detached]
```
Can you help me run the model properly with `--cache-type-k q8_0 --cache-type-v q8_0` over RPC?
Without `--cache-type-k q8_0 --cache-type-v q8_0` the setup doesn't make sense for me: the remote machine only has 4 GB of VRAM, and with the full-size K and V caches it runs faster locally anyway.
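For a sense of scale, here is my rough arithmetic (a sketch only: the Qwen2.5-32B shape numbers, 64 layers, 8 KV heads, head dim 128, are assumptions on my part; check them against the gguf metadata llama-server prints at startup):

```cpp
// Rough KV-cache size estimate for --ctx-size 8192, f16 vs q8_0.
// The model dimensions are assumptions, not read from the gguf.
#include <cstdio>

int main() {
    const long long n_layer   = 64;    // assumed for Qwen2.5-32B
    const long long n_kv_head = 8;     // assumed (GQA)
    const long long head_dim  = 128;   // assumed
    const long long n_ctx     = 8192;  // from --ctx-size

    // K and V elements stored per token, summed over all layers
    const long long elems_per_token = 2 * n_layer * n_kv_head * head_dim;

    const double f16_bytes_per_elem  = 2.0;          // f16: 2 bytes per element
    const double q8_0_bytes_per_elem = 34.0 / 32.0;  // q8_0: 34-byte block holding 32 elements

    const double gib = 1024.0 * 1024.0 * 1024.0;
    std::printf("f16  KV cache: %.2f GiB\n", n_ctx * elems_per_token * f16_bytes_per_elem  / gib);
    std::printf("q8_0 KV cache: %.2f GiB\n", n_ctx * elems_per_token * q8_0_bytes_per_elem / gib);
    return 0;
}
```

With those assumptions the full-context cache is about 2 GiB at f16 versus roughly 1 GiB at q8_0, which matters a lot when the remote card only has 4 GB.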