Closed
Labels
feature request (New feature or request)
Description
Motivation
AWQ is nice, but if you want more control over the bit depth (and thus VRAM usage), GGUF may be a better option. A wide range of models is available from TheBloke at various bit depths, so everyone can use the largest one that fits into their GPU.
I cannot find a high-throughput batch inference engine that can load GGUF; maybe there is none. (vLLM cannot load it either.)
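For context on why GGUF makes per-file bit-depth choices easy: every GGUF file begins with a small self-describing header (magic bytes, format version, tensor count, metadata key/value count), so a loader can inspect quantization metadata before mapping any weights. Below is a minimal sketch of reading that fixed header in Python; the field layout follows the GGUF specification, and the synthetic `sample` bytes are purely illustrative, not taken from a real model file:

```python
import struct

GGUF_MAGIC = b"GGUF"  # little-endian magic at the start of every GGUF file

def read_gguf_header(data: bytes) -> dict:
    """Parse the fixed-size GGUF header that precedes the metadata KV section."""
    if data[:4] != GGUF_MAGIC:
        raise ValueError("not a GGUF file")
    # uint32 version, uint64 tensor_count, uint64 metadata_kv_count (little-endian)
    version, tensor_count, kv_count = struct.unpack_from("<IQQ", data, 4)
    return {"version": version, "tensor_count": tensor_count, "kv_count": kv_count}

# Synthetic header for illustration: version 3, 2 tensors, 5 metadata entries.
sample = GGUF_MAGIC + struct.pack("<IQQ", 3, 2, 5)
print(read_gguf_header(sample))
```

The actual quantization type of each tensor lives in the per-tensor info records that follow the metadata section, which is what an engine would need to parse to support mixed bit depths.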
Related resources