-
I quantized the 3.1-8B Instruct model and the size is as expected at ~5 GB (used Q4_K_M), but when I run it with llama.cpp it tries to reserve 16 GB on the server! I used Ollama's container and it ran perfectly fine (and quite accurately) on the same 8 GB of RAM. Can someone please let me know what's going on here?
-
@amygbAI Since you use --ctx_size 0, llama.cpp will try to set the context window to 128k. You need to set the limit to something like 8000 tokens and enable flash attention with -fa.
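For a rough sense of where the 16 GB comes from, here is a back-of-the-envelope estimate (assuming the default fp16 KV cache and Llama-3.1-8B's architecture of 32 layers, 8 KV heads and a head dimension of 128; none of these numbers are stated in this thread):

KV cache per token ≈ 2 (K and V) * 32 layers * 8 KV heads * 128 dims * 2 bytes ≈ 128 KiB
at the default 128k context: ~131072 tokens * 128 KiB ≈ 16 GiB, on top of the ~5 GB of weights
at --ctx_size 8000: 8000 tokens * 128 KiB ≈ 1 GiB, which fits alongside the weights in 8 GB of RAM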
-
@amygbAI I would suggest dialing the temperature down to less than 0.3 and --top_k down to 10 or below. You will also need to use the proper prompt template, or use --in-prefix and --in-suffix to set the chat-template keywords, and I would tell the model explicitly what you want it to be. That should result in a proper answer and no garbage; a sketched command combining these settings is shown after the example at the end of the thread.
So, putting the ctx_size suggestion into a full command:
./llama-cli -m ../Meta-Llama-3.1-8B-Instruct/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf --ctx_size 8000 -fa -p "how many medals has simone biles won so far" -n 128
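To also fold in the sampling advice from the second comment, the same command could be extended along these lines (a sketch only, not taken from the thread: the 0.2 and 10 values simply follow the "below 0.3 / 10" suggestion above, and --in-prefix/--in-suffix with the Llama-3 header keywords would go on top of this for interactive chat use):

./llama-cli -m ../Meta-Llama-3.1-8B-Instruct/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf --ctx_size 8000 -fa --temp 0.2 --top-k 10 -p "how many medals has simone biles won so far" -n 128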