-
I quantized the 3.1-8B Instruct model and the size is as expected at ~5 GB (used Q4_K_M), but when I run it with llama.cpp it tries to reserve 16 GB on the server! I used Ollama's container and it ran perfectly fine (and quite accurately) on the same 8 GB of RAM. Can someone please let me know what's going on here?
-
@amygbAI Since you use --ctx_size 0, llama.cpp will try to set the context window to 128k. You need to set the limit to something like 8000 tokens and enable flash attention with -fa.
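For a rough sense of where the 16 GB comes from, here is a back-of-the-envelope estimate (assuming the default fp16 KV cache and Llama-3.1-8B's architecture of 32 layers, 8 KV heads and a head dimension of 128; none of these numbers are stated in this thread):

KV cache per token ≈ 2 (K and V) * 32 layers * 8 KV heads * 128 dims * 2 bytes ≈ 128 KiB
at the default 128k context: ~131072 tokens * 128 KiB ≈ 16 GiB, on top of the ~5 GB of weights
at --ctx_size 8000: 8000 tokens * 128 KiB ≈ 1 GiB, which fits alongside the weights in 8 GB of RAM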
-
@amygbAI I would suggest dialing the temperature down to less than 0.3 and --top_k down to 10 or below. You will also need to use the proper prompt template, or use --in-prefix and --in-suffix to set the chat-template keywords, and I would tell the model explicitly what you want it to be. That should result in a proper answer and no garbage; a sketched command combining these settings is shown after the example at the end of the thread.
So, putting the ctx_size suggestion into a full command:
./llama-cli -m ../Meta-Llama-3.1-8B-Instruct/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf --ctx_size 8000 -fa -p "how many medals has simone biles won so far" -n 128
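To also fold in the sampling advice from the second comment, the same command could be extended along these lines (a sketch only, not taken from the thread: the 0.2 and 10 values simply follow the "below 0.3 / 10" suggestion above, and --in-prefix/--in-suffix with the Llama-3 header keywords would go on top of this for interactive chat use):

./llama-cli -m ../Meta-Llama-3.1-8B-Instruct/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf --ctx_size 8000 -fa --temp 0.2 --top-k 10 -p "how many medals has simone biles won so far" -n 128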