To install, run the following command:

```bash
curl https://raw.githubusercontent.com/k-koehler/gguf-tensor-overrider/refs/heads/main/install.sh | sudo /bin/bash
```
Example usage:

```bash
gguf-tensor-overrider \
  -g https://huggingface.co/unsloth/Qwen3-235B-A22B-GGUF/resolve/main/UD-Q4_K_XL/Qwen3-235B-A22B-UD-Q4_K_XL-00001-of-00003.gguf \
  -c 32000 \
  --no-check \
  --verbose
```
Tired of fucking around with `--override-tensor` regexes in llama.cpp? This tool aims to automatically allocate tensors optimally across your GPUs and CPU.
`gguf-tensor-overrider` does the following:
- Downloads and extracts metadata for a model from Hugging Face, including the complete list of tensors
- Iterates over each tensor and assigns it to a GPU with free capacity, falling back to CPU
- Uses multiple passes so the most critical tensors reach the GPU before less critical ones; in a MoE model, for example, expert tensors are assigned last (a toy sketch of this loop follows the list)
- Generates output like `-ot "tensor_name_1=<device>" -ot "tensor_name_2=<device>"`
`gguf-tensor-overrider` uses the following priority order when allocating tensors:
- Attention tensors. The tool estimates the KV cache size so the device holding these tensors also has room left for the cache (a back-of-envelope version of this estimate follows the list)
- FFN tensors
- Gate tensors
- Norm tensors
- Expert tensors and any remaining tensors
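For intuition, here's the usual back-of-envelope KV cache arithmetic. This is a sketch assuming an fp16 cache; the layer, head, and dimension numbers below are illustrative, and the real tool reads them from the GGUF metadata instead.

```bash
# KV cache ≈ 2 (K and V) * layers * context * kv_heads * head_dim * bytes/elem
N_LAYERS=64; N_KV_HEADS=8; HEAD_DIM=128; CTX=32000; BYTES_PER_ELEM=2  # fp16
KV_BYTES=$(( 2 * N_LAYERS * CTX * N_KV_HEADS * HEAD_DIM * BYTES_PER_ELEM ))
echo "$KV_BYTES bytes (~$(( KV_BYTES / 1024 / 1024 / 1024 )) GiB)"  # ~7.8 GiB; prints 7 via integer division
```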
Options:

- `-g`, `--gguf-url`: The Hugging Face URL for the GGUF. For multipart GGUFs, `gguf-tensor-overrider` parses the remaining parts automatically if you provide the first file.
- `-c`, `--context-length`: The context length you're passing to llama.cpp. Used to estimate the model's KV cache so attention tensors can be allocated safely.
- `--context-quantization-size`: The quantization type of the KV cache. Currently assumes both K and V are quantized to the same type.
- `--check`, `--no-check`: Check whether your system can handle the allocation without using swap.
- `--gpu-percentage` (default 0.9): How much of the GPU(s) to use for allocation. Useful if the script didn't allocate the cache accurately.
- `--granular-gpu-percentage`: A separate percentage for each GPU in your system. Useful if you don't want to use a certain GPU, or if the llama.cpp compute buffer is making you sad.
- `--verbose`: Logs detailed information. Useful to see where things are being allocated.
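Putting a few of those options together, a hypothetical invocation might look like the following. The `q8_0` value for `--context-quantization-size` is an assumption based on llama.cpp's cache type names, so check `--help` on your install.

```bash
gguf-tensor-overrider \
  -g https://huggingface.co/Qwen/Qwen3-32B-GGUF/resolve/main/Qwen3-32B-Q8_0.gguf \
  -c 32000 \
  --context-quantization-size q8_0 \
  --gpu-percentage 0.8 \
  --verbose
```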
Here's an example of how you can pipe the arguments into your llama.cpp command:

```bash
#!/bin/bash

# Generate tensor overrides
TENSOR_OVERRIDES=$(gguf-tensor-overrider -g https://huggingface.co/Qwen/Qwen3-32B-GGUF/resolve/main/Qwen3-32B-Q8_0.gguf -c 32000)

# Build the command with tensor overrides
CMD="/home/user/llama.cpp/build/bin/llama-cli \
  -m qwen3_32/qwen3_32b.gguf \
  -c 32000 \
  -fa \
  -sm row \
  $TENSOR_OVERRIDES"

# Execute the command directly
eval "$CMD"
```
Limitations:

- Only supports NVIDIA GPUs
- Only supports llama.cpp
- Only supports GGUF files from Hugging Face
- Doesn't support all architectures, but seems to support most
Go wild. The code in this repository is free, open source, modifiable, distributable, whatever-the-fuck-you-wantable.