I am using a single A100-PCIE-40GB GPU, but I am unable to assign model layers to the GPU. #8331
Replies: 3 comments 15 replies
Use
@chenbingweb try this in the terminal
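The exact command from this reply was not preserved in the thread. As a hedged sketch of what such a terminal command typically looks like for llama.cpp GPU offload (the model path and port below are illustrative, not from the thread), `-ngl`/`--n-gpu-layers` is the flag that controls how many model layers are placed on the GPU:

```shell
# Illustrative sketch only -- not the original command from this reply.
# -ngl N offloads N layers to the GPU; a large value such as 99 is a
# common shorthand for "offload every layer". On a 40 GB A100, a Q2_K
# 7B model fits entirely on the GPU.
./llama-server \
  -m ./qwen2-7b-instruct-q2_k.gguf \
  -ngl 99 \
  --host 127.0.0.1 --port 8080
```

If layers still land on the CPU with a flag like this, the usual cause is a build compiled without CUDA support, which the `ldd` check later in this thread is meant to diagnose.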
@chenbingweb Here is my test using the model from https://huggingface.co/Qwen/Qwen2-7B-Instruct-GGUF/blob/main/qwen2-7b-instruct-q2_k.gguf, run with llama.cui from https://github.com/dspasyuk/llama.cui. It uses the same settings that I mentioned above and the most recent release of llama.cpp. Screencast.from.2024-07-10.10.04.06.AM.webm
@chenbingweb what is the output of `ldd /root/llama.cpp/build/bin/llama-server`? Also, why are you doing this in the /root folder?
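The `ldd` question above is a way to check whether the `llama-server` binary was actually linked against the CUDA libraries; if it was not, no layers can be offloaded regardless of the flags passed. A sketch of that check (the binary path is taken from the reply above; the grep pattern is an assumption about typical CUDA library names):

```shell
# List the shared libraries the binary links against and filter for CUDA.
# If nothing matches, the build has no GPU backend and inference will
# silently run on the CPU.
ldd /root/llama.cpp/build/bin/llama-server | grep -Ei 'cuda|cublas'
```

If the check comes up empty, the usual fix is to rebuild llama.cpp with the CUDA backend enabled in CMake before retrying the server.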