What happened?
I spent days trying to figure out why running a Llama 3 instruct model was going super slow (about 3 tokens per second at fp16 and 5.6 at 8-bit) on an AMD MI50 32GB using rocBLAS for ROCm 6.1.2: the GPU sat at 0% while the CPU was at 100%, even though some VRAM was in use. I'm currently using release b3246.
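For anyone reproducing this, GPU utilization can be watched with ROCm's rocm-smi tool while llama-cli is generating (assuming it's installed alongside ROCm), for example:

# refresh every second to watch GPU utilization and VRAM use during generation
watch -n 1 rocm-smi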
Finally I noticed that (for the 8-bit model) the log said:
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/33 layers to GPU
llm_load_tensors: CPU buffer size = 8137.64 MiB
Adding something like "--n-gpu-layers 100" to the command line changed it to:
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: ROCm0 buffer size = 7605.33 MiB
llm_load_tensors: CPU buffer size = 532.31 MiB
and the timings jumped from:
llama_print_timings: load time = 1955.34 ms
llama_print_timings: sample time = 15.28 ms / 128 runs ( 0.12 ms per token, 8378.61 tokens per second)
llama_print_timings: prompt eval time = 0.00 ms / 0 tokens ( -nan ms per token, -nan tokens per second)
llama_print_timings: eval time = 23640.83 ms / 128 runs ( 184.69 ms per token, 5.41 tokens per second)
llama_print_timings: total time = 23754.79 ms / 128 tokens
to
llama_print_timings: load time = 2824.15 ms
llama_print_timings: sample time = 12.57 ms / 128 runs ( 0.10 ms per token, 10182.98 tokens per second)
llama_print_timings: prompt eval time = 0.00 ms / 0 tokens ( -nan ms per token, -nan tokens per second)
llama_print_timings: eval time = 2469.71 ms / 128 runs ( 19.29 ms per token, 51.83 tokens per second)
llama_print_timings: total time = 2566.10 ms / 128 tokens
CPU usage dropped from maxing out a bunch of cores to using only one, and GPU usage went from 0% up to 99%.
The 16-bit model didn't improve as much; it went from:
llama_print_timings: load time = 2497.93 ms
llama_print_timings: sample time = 15.75 ms / 128 runs ( 0.12 ms per token, 8126.47 tokens per second)
llama_print_timings: prompt eval time = 0.00 ms / 0 tokens ( -nan ms per token, -nan tokens per second)
llama_print_timings: eval time = 43228.83 ms / 128 runs ( 337.73 ms per token, 2.96 tokens per second)
llama_print_timings: total time = 43344.71 ms / 128 tokens
to
llama_print_timings: load time = 3937.74 ms
llama_print_timings: sample time = 13.31 ms / 128 runs ( 0.10 ms per token, 9616.11 tokens per second)
llama_print_timings: prompt eval time = 0.00 ms / 0 tokens ( -nan ms per token, -nan tokens per second)
llama_print_timings: eval time = 8066.54 ms / 128 runs ( 63.02 ms per token, 15.87 tokens per second)
llama_print_timings: total time = 8166.01 ms / 128 tokens
This is the command line I'm testing with:
build/bin/llama-cli -m ./models/llama3-8b-instruct/ggml-model-q8_0.gguf -n 128 --threads 12 --n-gpu-layers 100
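The short flag -ngl should be equivalent to --n-gpu-layers, so I believe this is the same run:

# same test using the short form of the offload flag
build/bin/llama-cli -m ./models/llama3-8b-instruct/ggml-model-q8_0.gguf -n 128 --threads 12 -ngl 100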
I built it with:
HIPCXX="/opt/rocm-6.1.2/llvm/bin/clang" HIP_PATH="/opt/rocm-6.1.2" cmake -S . -B build -DGGML_HIPBLAS=ON -DAMDGPU_TARGETS=gfx906 -DCMAKE_BUILD_TYPE=Release && cmake --build build --config Release -- -j 12
Yes, I put the paths in manually; that was just part of how I was poking around to see what was wrong.
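A slightly cleaner variant, assuming hipconfig from ROCm reports the install location via --rocmpath, would be to derive the paths instead of hardcoding them (just a sketch):

# derive the ROCm path from hipconfig rather than hardcoding /opt/rocm-6.1.2
ROCM_PATH="$(hipconfig --rocmpath)"
HIPCXX="${ROCM_PATH}/llvm/bin/clang" HIP_PATH="${ROCM_PATH}" \
  cmake -S . -B build -DGGML_HIPBLAS=ON -DAMDGPU_TARGETS=gfx906 -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -- -j 12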
Name and Version
version: 0 (unknown)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
The version string itself seems to be another bug, but I'm definitely using the source from release b3246.
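My guess is that "version: 0 (unknown)" comes from building a source archive that has no .git directory, so the build scripts can't read the commit/tag; if so, building from a git checkout of the same tag should fix the string (a sketch, assuming the release tag is named b3246 like the release itself):

# build from a git checkout so the build can embed the tag/commit as the version
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
git checkout b3246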
What operating system are you seeing the problem on?
Linux
Relevant log output
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/33 layers to GPU
llm_load_tensors: CPU buffer size = 8137.64 MiB