What happened?
I spent days trying to figure out why running a Llama 3 instruct model was going super slow (about 3 tokens per second at fp16 and 5.6 at 8-bit) on an AMD MI50 32GB using rocBLAS for ROCm 6.1.2: the GPU sat at 0% while the CPU was at 100%, even though some VRAM was in use. I'm currently using release b3246.
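For anyone reproducing this, GPU utilization can be watched with ROCm's rocm-smi tool while llama-cli is generating (assuming it's installed alongside ROCm), for example:

# refresh every second to watch GPU utilization and VRAM use during generation
watch -n 1 rocm-smi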
Finally I noticed that (for the 8-bit model) the log said:
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/33 layers to GPU
llm_load_tensors: CPU buffer size = 8137.64 MiB
Adding something like "--n-gpu-layers 100" to the command line changed it to:
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: ROCm0 buffer size = 7605.33 MiB
llm_load_tensors: CPU buffer size = 532.31 MiB
and the timings jumped from:
llama_print_timings: load time = 1955.34 ms
llama_print_timings: sample time = 15.28 ms / 128 runs ( 0.12 ms per token, 8378.61 tokens per second)
llama_print_timings: prompt eval time = 0.00 ms / 0 tokens ( -nan ms per token, -nan tokens per second)
llama_print_timings: eval time = 23640.83 ms / 128 runs ( 184.69 ms per token, 5.41 tokens per second)
llama_print_timings: total time = 23754.79 ms / 128 tokens
to
llama_print_timings: load time = 2824.15 ms
llama_print_timings: sample time = 12.57 ms / 128 runs ( 0.10 ms per token, 10182.98 tokens per second)
llama_print_timings: prompt eval time = 0.00 ms / 0 tokens ( -nan ms per token, -nan tokens per second)
llama_print_timings: eval time = 2469.71 ms / 128 runs ( 19.29 ms per token, 51.83 tokens per second)
llama_print_timings: total time = 2566.10 ms / 128 tokens
CPU usage dropped from maxing out a bunch of cores to using only one, and GPU usage went from 0% up to 99%.
The 16-bit model didn't improve as much; it went from:
llama_print_timings: load time = 2497.93 ms
llama_print_timings: sample time = 15.75 ms / 128 runs ( 0.12 ms per token, 8126.47 tokens per second)
llama_print_timings: prompt eval time = 0.00 ms / 0 tokens ( -nan ms per token, -nan tokens per second)
llama_print_timings: eval time = 43228.83 ms / 128 runs ( 337.73 ms per token, 2.96 tokens per second)
llama_print_timings: total time = 43344.71 ms / 128 tokens
to
llama_print_timings: load time = 3937.74 ms
llama_print_timings: sample time = 13.31 ms / 128 runs ( 0.10 ms per token, 9616.11 tokens per second)
llama_print_timings: prompt eval time = 0.00 ms / 0 tokens ( -nan ms per token, -nan tokens per second)
llama_print_timings: eval time = 8066.54 ms / 128 runs ( 63.02 ms per token, 15.87 tokens per second)
llama_print_timings: total time = 8166.01 ms / 128 tokens
This is the command line I'm testing with:
build/bin/llama-cli -m ./models/llama3-8b-instruct/ggml-model-q8_0.gguf -n 128 --threads 12 --n-gpu-layers 100
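The short flag -ngl should be equivalent to --n-gpu-layers, so I believe this is the same run:

# same test using the short form of the offload flag
build/bin/llama-cli -m ./models/llama3-8b-instruct/ggml-model-q8_0.gguf -n 128 --threads 12 -ngl 100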
I built it with:
HIPCXX="/opt/rocm-6.1.2/llvm/bin/clang" HIP_PATH="/opt/rocm-6.1.2" cmake -S . -B build -DGGML_HIPBLAS=ON -DAMDGPU_TARGETS=gfx906 -DCMAKE_BUILD_TYPE=Release && cmake --build build --config Release -- -j 12
Yes, I put the paths in manually; that was just part of how I was poking around to see what was wrong.
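A slightly cleaner variant, assuming hipconfig from ROCm reports the install location via --rocmpath, would be to derive the paths instead of hardcoding them (just a sketch):

# derive the ROCm path from hipconfig rather than hardcoding /opt/rocm-6.1.2
ROCM_PATH="$(hipconfig --rocmpath)"
HIPCXX="${ROCM_PATH}/llvm/bin/clang" HIP_PATH="${ROCM_PATH}" \
  cmake -S . -B build -DGGML_HIPBLAS=ON -DAMDGPU_TARGETS=gfx906 -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -- -j 12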
Name and Version
version: 0 (unknown)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
The version string itself seems to be another bug, but I'm definitely using the source from release b3246.
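My guess is that "version: 0 (unknown)" comes from building a source archive that has no .git directory, so the build scripts can't read the commit/tag; if so, building from a git checkout of the same tag should fix the string (a sketch, assuming the release tag is named b3246 like the release itself):

# build from a git checkout so the build can embed the tag/commit as the version
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
git checkout b3246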
What operating system are you seeing the problem on?
Linux
Relevant log output
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/33 layers to GPU
llm_load_tensors: CPU buffer size = 8137.64 MiB