Bug: on AMD GPU, it offloads all the work to the CPU unless you specify --n-gpu-layers on the llama-cli command line #8164

@differentprogramming

Description

What happened?

I spent days trying to figure out why running a Llama 3 instruct model was going super slow (about 3 tokens per second at fp16 and 5.6 at 8-bit) on an AMD MI50 32GB with rocBLAS for ROCm 6.1.2, with 0% GPU and 100% CPU usage even though some VRAM was in use. I'm currently using release b3246.

Finally I noticed that (for 8-bit) it said:

llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/33 layers to GPU
llm_load_tensors:        CPU buffer size =  8137.64 MiB

Adding something like "--n-gpu-layers 100" to the command line changed it to:

llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:      ROCm0 buffer size =  7605.33 MiB
llm_load_tensors:        CPU buffer size =   532.31 MiB

and it jumped from

llama_print_timings:        load time =    1955.34 ms
llama_print_timings:      sample time =      15.28 ms /   128 runs   (    0.12 ms per token,  8378.61 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =   23640.83 ms /   128 runs   (  184.69 ms per token,     5.41 tokens per second)
llama_print_timings:       total time =   23754.79 ms /   128 tokens

to

llama_print_timings:        load time =    2824.15 ms
llama_print_timings:      sample time =      12.57 ms /   128 runs   (    0.10 ms per token, 10182.98 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =    2469.71 ms /   128 runs   (   19.29 ms per token,    51.83 tokens per second)
llama_print_timings:       total time =    2566.10 ms /   128 tokens
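
(That's roughly a 9.6x improvement in eval speed: 184.69 ms per token down to 19.29 ms.)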

The CPU usage dropped from maxing out a bunch of cores to using only one, and GPU usage went from 0% up to 99%.
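
For anyone reproducing this, a simple way to watch where the work lands is to poll rocm-smi (shipped with ROCm; the exact columns vary by version) while llama-cli is generating:

watch -n 1 rocm-smi

The GPU use column should sit near 99% once the layers are offloaded; if it stays near 0% while the CPU maxes out, that reproduces the bug.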

16-bit didn't improve as much; it went from

llama_print_timings:        load time =    2497.93 ms
llama_print_timings:      sample time =      15.75 ms /   128 runs   (    0.12 ms per token,  8126.47 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =   43228.83 ms /   128 runs   (  337.73 ms per token,     2.96 tokens per second)
llama_print_timings:       total time =   43344.71 ms /   128 tokens

to

llama_print_timings:        load time =    3937.74 ms
llama_print_timings:      sample time =      13.31 ms /   128 runs   (    0.10 ms per token,  9616.11 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =    8066.54 ms /   128 runs   (   63.02 ms per token,    15.87 tokens per second)
llama_print_timings:       total time =    8166.01 ms /   128 tokens
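
(Eval time here improved about 5.4x: 337.73 ms per token down to 63.02 ms, or 2.96 up to 15.87 tokens per second.)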

This is the command line I'm testing with:

build/bin/llama-cli -m ./models/llama3-8b-instruct/ggml-model-q8_0.gguf -n 128 --threads 12 --n-gpu-layers 100
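
As a side note, a quick way to compare CPU-only and fully offloaded throughput in a single run (assuming llama-bench was built alongside llama-cli, which this cmake setup does by default) is:

build/bin/llama-bench -m ./models/llama3-8b-instruct/ggml-model-q8_0.gguf -ngl 0,100

-ngl is the short form of --n-gpu-layers, and llama-bench takes a comma-separated list so it benchmarks both configurations back to back.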

after building with:

HIPCXX="/opt/rocm-6.1.2/llvm/bin/clang" HIP_PATH="/opt/rocm-6.1.2" cmake -S . -B build -DGGML_HIPBLAS=ON -DAMDGPU_TARGETS=gfx906 -DCMAKE_BUILD_TYPE=Release && cmake --build build --config Release -- -j 12

Yes, I put the paths in manually; that was just part of how I was poking around to figure out what was wrong.

Name and Version

version: 0 (unknown)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu

The version string seems to be another bug; I'm actually building from the source for release b3246.
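
My guess (unverified) is that the release source archive has no git metadata for the build scripts to read, which would leave the version at "0 (unknown)". Building from a clone of the tag should populate it:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
git checkout b3246

followed by the same HIPCXX/cmake invocation as above.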

What operating system are you seeing the problem on?

Linux

Relevant log output

llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/33 layers to GPU
llm_load_tensors:        CPU buffer size =  8137.64 MiB

Labels

bug-unconfirmed, medium severity (used to report medium severity bugs in llama.cpp, e.g. malfunctioning features but still usable)
