git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
pip install -r requirements.txt
cmake -B build -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
cmake --build build --config Release
cd ..
pip install -r requirements.txt
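If the build succeeded, the files the script expects by default should now be in place. A quick, optional check (assuming the default directory layout used above):

```bash
# Optional sanity check, assuming the default layout used above:
# the script looks for llama.cpp in ./llama.cpp, the conversion script at
# convert_hf_to_gguf.py, and the quantization binary at build/bin/llama-quantize
# (see the --lcpp-* arguments below).
ls llama.cpp/convert_hf_to_gguf.py
ls llama.cpp/build/bin/llama-quantize
```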
python quantize_llm.py [ARGUMENTS]
Argument | Description | Type | Default value |
---|---|---|---|
--repo=REPOSITORY | Set the model repository. Required | str | None |
--outtype=QUANT | Set the outtype of the GGUF file. Do not include this quant in the --quants argument. Required | str | None |
--gguf=FILE | Set an existing GGUF file to quantize. | str | None |
--quants="QUANT-1 QUANT-2 ..." | Set the list of quants to quantize the model with. The quant names must be valid and separated by spaces. If not set, the model will only be converted to GGUF. | list (str separated by spaces) | "" |
--output-dir=DIRECTORY | Override the default output directory. | str | "output" |
--cache-dir=DIRECTORY | Override the default cache directory. | str | "cache" |
--lcpp-dir=DIRECTORY | Override the default llama.cpp directory. | str | "llama.cpp" |
--lcpp-pre-gguf=COMMAND | Override the default command to execute when converting to GGUF. | str | "python" |
--lcpp-gguf=FILE | Override the default script file to execute when converting to GGUF. | str | "convert_hf_to_gguf.py" |
--lcpp-pre-quant=COMMAND | Override the default command to execute when quantizing. | str | "" |
--lcpp-quant=FILE | Override the default script file to execute when quantizing. | str | "build/bin/llama-quantize" |
--model-card-template=TEMPLATE | Override the default model card template. | str | Check the script. |
--repo-name-template=TEMPLATE | Override the default repository name template. | str | Check the script. |
--repo-public | Make the created repository public. | - | False |
--test | Test the script to make sure it works without executing commands. | - | False |
--as-dir | Upload the entire model directory in a single commit instead of uploading files one by one. | - | False |
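As an illustration, a dry-run invocation using the required arguments could look like the sketch below; `your-username/your-model` is a placeholder repository name, and `--test` means no commands are actually executed:

```bash
# Hypothetical dry run: convert the model to an F16 GGUF and prepare
# Q4_K_M and Q5_K_M quants, without executing the underlying commands (--test).
# F16 is the --outtype, so it is not repeated in --quants.
python quantize_llm.py \
  --repo=your-username/your-model \
  --outtype=F16 \
  --quants="Q4_K_M Q5_K_M" \
  --test
```

Drop `--test` once the output looks right to actually run the conversion and quantization.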
- Q2_K: Normal Q2_K quantization. Most weights are in Q2_K. Not recommended for most LLMs due to its low precision.
- Q2_K_L: Uses Q8_0 for the embedding and output weights, and mostly Q2_K for everything else. Has more precision, but is still not recommended because it is mostly Q2_K.
- Q2_K_XL: Uses F16 for the embedding and output weights, and mostly Q2_K for everything else. Has even more precision, but is still not recommended because it is mostly Q2_K.
- Q3_K_S: Normal Q3_K_S quantization. Most weights are in Q3_K. Not recommended for most use cases due to its low precision.
- Q3_K_M: Normal Q3_K_M quantization. Some weights are in other quantizations like Q5_K, but most are Q3_K. Not recommended for most use cases due to its low precision.
- Q3_K_L: Normal Q3_K_L quantization. Even more weights are in other quantizations, but most are Q3_K. If possible, prefer Q3_K_XL, although this can give decent results in some use cases. Only recommended if your CPU, GPU, or RAM capacity is very limited.
- Q3_K_XL: Uses Q8_0 for the embedding and output weights, and Q3_K_L for everything else. This might have decent results in some use cases.
- Q3_K_XXL: Uses F16 for the embedding and output weights, and Q3_K_L for everything else. Choose this only if you want more precision for the embedding or output weights. For most models the size of this quant is similar to Q4_K_S or Q4_K_M; prefer those if that is the case.
- Q4_K_S: Normal Q4_K_S quantization. Most weights are in Q4_K. Gives decent results for most use cases. Slightly lower quality than Q4_K_M and requires less CPU, GPU, and RAM.
- Q4_K_M: Normal Q4_K_M quantization. Some weights are in other quantizations like Q5_K, but most are Q4_K. Gives decent results for most use cases. Good quality.
- Q4_K_L: Uses Q8_0 for the embedding and output weights, and Q4_K_M for everything else. More precision than Q4_K_M.
- Q4_K_XL: Uses F16 for the embedding and output weights, and Q4_K_M for everything else. Prefer Q5_K_S or Q5_K_M if the size is similar.
- Q5_K_S: Normal Q5_K_S quantization. Most weights are in Q5_K. High quality and very good results. Very similar to Q5_K_M, but saves a bit more memory.
- Q5_K_M: Normal Q5_K_M quantization. Some weights are in other quantizations like Q6_K, but most are Q5_K. High quality and very good results.
- Q5_K_L: Uses Q8_0 for the embedding and output weights, and Q5_K_M for everything else. High quality and very good results.
- Q5_K_XL: Uses F16 for the embedding and output weights, and Q5_K_M for everything else. Prefer Q6_K if the size is similar.
- Q6_K: Normal Q6_K quantization. Most weights are in Q6_K. Very high quality. Results similar to Q8_0.
- Q6_K_L: Uses Q8_0 for the embedding and output weights, and mostly Q6_K for everything else. Very high quality. Results even closer to Q8_0 or F16.
- Q6_K_XL: Uses F16 for the embedding and output weights, and mostly Q6_K for everything else. Prefer Q6_K_L.
- Q8_0: Normal Q8_0 quantization. Most weights are in Q8_0. Quality almost identical to F16 while using around half the memory.
- Q8_K_XL: Uses F16 for the embedding and output weights, and mostly Q8_0 for everything else. Prefer Q8_0.
- F16: Normal F16 quantization. Most weights are in F16. If the model has been trained in BF16 or F32, prefer BF16. Not recommended because Q8_0 has almost the same quality. Only use this if the model has been trained in F16 and you really need full precision.
- BF16: Normal BF16 quantization. Most weights are in F16 or BF16. Not recommended because Q8_0 has almost the same quality. Only use this if the model has been trained in BF16 or F32 and you really need full precision.
- F32: Normal F32 quantization. Most, if not all, weights are in F32. Not recommended because F16, BF16, and Q8_0 have almost the same quality. Most LLMs are more than fine with F16, BF16, or Q8_0. Only use this if the model has been trained in F32 and you really need full precision.
Note
In most cases, Q8_0 for the embedding and output weights is enough; F16 doesn't make much difference.
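Putting the list and the note together, a typical selection for a release run might look like the following sketch (placeholder repository name; the exact set of quants is a matter of taste):

```bash
# Hypothetical example: Q4_K_M / Q5_K_M for general use, Q6_K and Q8_0 for
# higher quality, plus the _L variants to keep the embedding and output
# weights in Q8_0, which, per the note above, is usually enough.
python quantize_llm.py \
  --repo=your-username/your-model \
  --outtype=F16 \
  --quants="Q4_K_S Q4_K_M Q4_K_L Q5_K_M Q5_K_L Q6_K Q8_0" \
  --repo-public
```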