gguf-tensor-overrider

Install

To install, run the following command:

curl https://raw.githubusercontent.com/k-koehler/gguf-tensor-overrider/refs/heads/main/install.sh | sudo /bin/bash
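
If you'd rather not pipe a script straight into sudo, you can download and inspect it first (assuming install.sh behaves the same when run from a local copy):

curl -fsSLO https://raw.githubusercontent.com/k-koehler/gguf-tensor-overrider/refs/heads/main/install.sh
less install.sh
sudo bash install.sh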

Example Command

gguf-tensor-overrider \
  -g https://huggingface.co/unsloth/Qwen3-235B-A22B-GGUF/resolve/main/UD-Q4_K_XL/Qwen3-235B-A22B-UD-Q4_K_XL-00001-of-00003.gguf \
  -c 32000 \
  --no-check \
  --verbose

Purpose

Tired of fucking around with --override-tensor regexes in llama.cpp? This tool aims to allocate tensors optimally across your GPUs and CPU automatically.

How it works

gguf-tensor-overrider does the following:

  1. Downloads and extracts metadata for a model from Hugging Face, including the complete list of tensors
  2. Iterates over the tensors, assigning each to a GPU when one has free capacity and falling back to the CPU otherwise
  3. Makes multiple passes so that the most critical tensors land on GPU before less critical ones; in a MoE model, for example, expert tensors are assigned last
  4. Generates output of the form -ot "tensor_name_1=<device>" -ot "tensor_name_2=<device>" (a sample is shown after the priority list below)

gguf-tensor-overrider uses the following priority in tensor allocation:

  1. Attention tensors. The tool estimates the KV cache size so that the device holding these tensors also has room for the cache
  2. FFN tensors
  3. Gate tensors
  4. Norm tensors
  5. Expert tensors and other tensors
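
The exact flags depend on the model and your hardware. As a purely illustrative sketch (the tensor and device names below are made up, not taken from a real run), the generated output might look like:

-ot "blk.0.attn_q.weight=CUDA0" -ot "blk.0.ffn_down.weight=CUDA0" -ot "blk.47.ffn_gate_exps.weight=CPU"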

Options

  1. -g, --gguf-url: The Hugging Face URL for the GGUF. In the case of multipart GGUFs, gguf-tensor-overrider automatically parses them if you provide the first file.
  2. -c, --context-length: The context length you're passing to llama.cpp. Used to estimate the KV cache of the model to safely allocate attention tensors
  3. --context-quantization-size: The quantization type of the KV cache. Currently assumes both K and V are quantized to the same type
  4. --check, --no-check: Check if your system can handle the allocation without using swap
  5. --gpu-percentage (default 0.9): The fraction of each GPU's memory to use for allocation. Useful if the script didn't estimate the cache accurately
  6. --granular-gpu-percentage: A per-GPU percentage. Useful if you don't want to use a certain GPU, or the llama.cpp compute buffer is making you sad
  7. --verbose: Logs detailed information. Useful to see where things are being allocated
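
As an illustrative combination of these options (the URL is the Qwen3-32B file used in the piping example below, and 0.8 is an arbitrary value, not a recommendation):

gguf-tensor-overrider \
  -g https://huggingface.co/Qwen/Qwen3-32B-GGUF/resolve/main/Qwen3-32B-Q8_0.gguf \
  -c 32000 \
  --gpu-percentage 0.8 \
  --no-check \
  --verbose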

How to pipe this into my llama command?

Here's an example of how you can pipe the arguments into your llama command:

#!/bin/bash

# Generate tensor overrides
TENSOR_OVERRIDES=$(gguf-tensor-overrider -g https://huggingface.co/Qwen/Qwen3-32B-GGUF/resolve/main/Qwen3-32B-Q8_0.gguf -c 32000)

# Build command with tensor overrides
CMD="/home/user/llama.cpp/build/bin/llama-cli \
  -m qwen3_32/qwen3_32b.gguf \
  -c 32000 \
  -fa \
  -sm row \
  $TENSOR_OVERRIDES"

# Execute the command (eval re-parses the quoted -ot arguments)
eval "$CMD"

Gotchas (for now)

  • Only supports NVIDIA GPUs
  • Only supports llama.cpp
  • Only supports GGUF files from Hugging Face
  • Doesn't support all architectures, but seems to support most

Can I use this code for xyz?

Go wild. The code in this repository is free, open source, modifiable, distributable, whatever-the-fuck-you-wantable.
