Challenges in Quantizing llama.cpp Models on Windows #10730

jasonsu123 · 2024-12-09T03:30:54Z

jasonsu123
Dec 9, 2024

Hello everyone,
Previously, I asked how to convert the safetensors model from the Hugging Face website into a GGUF file. Later, someone provided instructional resources, and I'm currently able to convert it to a GGUF file using the convert_hf_to_gguf.py script from llama.cpp.
The process is as follows:

Enter the following commands in the CMD:

git clone https://github.com/ggerganov/llama.cpp.git
pip install -r requirements.txt

Then, I use the downloaded llama.cpp code in the llama.cpp folder to execute the following command:
python convert_hf_to_gguf.py D:\Ollama\TAIDE-LX-8B-Chat-Alpha1 --outfile D:\Ollama\TAIDE-LX-8B-Chat-Alpha1-q8_0.gguf --outtype q8_0

However, I'm unable to proceed further with the quantization.
For example, when I try to quantize to q4
I encounter an issue:
Copyerror: argument --outtype: invalid choice: 'q4_0' (choose from 'f32', 'f16', 'bf16', 'q8_0', 'tq1_0', 'tq2_0', 'auto')

It seems that I need to use the ./quantize or ./llama-quantize command, such as the examples in the tutorials:

./quantize ./models/Bailong-instruct-7B-f16.gguf ./models/Bailong-instruct-7B-v0.1-Q5_K_M.gguf q5_k_m
./llama-quantize ./models/mymodel/ggml-model-f16.gguf ./models/mymodel/ggml-model-Q4_K_M.gguf Q4_K_M

However, I'm using Windows 10, so how can I modify these commands to work in my terminal?
It seems that the quantization process can only be done in a Linux environment, but I'm a programming newbie and don't know how to compile the quantize tool and then use it to quantize the GGUF model.
Could someone please provide a simple tutorial on how to do this?

I would really appreciate it.
Thank you.

ggerganov · 2024-12-09T08:16:07Z

ggerganov
Dec 9, 2024
Maintainer

🤖: Sure, here's a concise guide to help you through the process on Windows 10:

First, quantize your model to f16:

python convert_hf_to_gguf.py D:\Ollama\TAIDE-LX-8B-Chat-Alpha1 --outfile D:\Ollama\TAIDE-LX-8B-Chat-Alpha1-f16.gguf --outtype f16

Build the llama.cpp tools on Windows:

Open Command Prompt and navigate to the llama.cpp directory:
```
cd path\to\llama.cpp
mkdir build
cd build
cmake ..
cmake --build . --config Release
```

Quantize the f16 model to q4_0:

Use the llama-quantize tool from the build directory:

.\Release\llama-quantize.exe D:\Ollama\TAIDE-LX-8B-Chat-Alpha1-f16.gguf D:\Ollama\TAIDE-LX-8B-Chat-Alpha1-q4_0.gguf q4_0

This should help you quantize your model to q4_0 on Windows 10. Let me know if you encounter any issues!

👨: Btw, if step 2 fails, you can download the pre-build executables from https://github.com/ggerganov/llama.cpp/releases

3 replies

jasonsu123 Dec 9, 2024
Author

Thank you very much for your detailed response, the steps are very clear.

Here are the steps I followed so far:

shell
cd path\to\llama.cpp
mkdir build
cd build
Up to this point, everything works fine, and a build folder is created at D:\Ollama\llama.cpp\build.

However, when I tried running the next commands, I encountered an error:
cmake ..
cmake --build . --config Release

The error says:
'cmake' is not recognized as an internal or external command, operable program, or batch file.

Could you help me resolve this issue?
May I ask what software or pip packages I need to install first?
Also, regarding the pre-built executables you provided at https://github.com/ggerganov/llama.cpp/releases:
Which one should I download?
I’m using a laptop with Windows 10, 64-bit, and an NVIDIA 2060 GPU.

Previously, I downloaded cudart-llama-bin-win-cu11.7-x64.zip, but it only contained .dll files,
and I’m not sure how to proceed with step 2 to prepare for quantization.

Thank you!

ggerganov Dec 9, 2024
Maintainer

Try to download llama-b4293-bin-win-cuda-cu11.7-x64.zip - it should contain the executables. If they don't run, you maybe need to add the DLLs from cudart-llama-bin-win-cu11.7-x64.zip in the same folder as the executables. I don't use Windows, so I am not very sure. Hopefully somebody else will be able to help if this does not work.

jasonsu123 Dec 9, 2024
Author

Thank you for your reply. Could you clarify what the "executables" you mentioned are? (which file name in D:\Ollama\llama.cpp)

I downloaded the cudart-llama-bin-win-cu11.7-x64.zip file, but it only contains these three DLL files:
cublas64_11.dll
cublasLt64_11.dll
cudart64_110.dll

You mentioned that these files should be extracted and placed in the D:\Ollama\llama.cpp directory.
However, when I run cmake .., it still doesn’t work, and I get the following error:
'cmake' is not recognized as an internal or external command, operable program, or batch file.

Thank you!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Challenges in Quantizing llama.cpp Models on Windows #10730

Uh oh!

{{title}}

Uh oh!

Replies: 1 comment 3 replies

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Select a reply

Uh oh!

Challenges in Quantizing llama.cpp Models on Windows #10730

Uh oh!

jasonsu123 Dec 9, 2024

Replies: 1 comment · 3 replies

Uh oh!

ggerganov Dec 9, 2024 Maintainer

Uh oh!

jasonsu123 Dec 9, 2024 Author

Uh oh!

ggerganov Dec 9, 2024 Maintainer

Uh oh!

Uh oh!

jasonsu123 Dec 9, 2024 Author

jasonsu123
Dec 9, 2024

Replies: 1 comment 3 replies

ggerganov
Dec 9, 2024
Maintainer

jasonsu123 Dec 9, 2024
Author

ggerganov Dec 9, 2024
Maintainer

jasonsu123 Dec 9, 2024
Author