
difference in memory requirement for ollama 3.1-8B and same model quantized using Q4_K_M #8793

Answered by dspasyuk
amygbAI asked this question in Q&A

@amygbAI Since you used --ctx-size 0, llama.cpp will try to set the context window to the full 128k tokens. You need to cap it at something like 8000 tokens and enable flash attention with -fa, so:

./llama-cli -m ../Meta-Llama-3.1-8B-Instruct/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf --ctx-size 8000 -fa -p "how many medals has simone biles won so far" -n 128
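The memory difference here is mostly the KV cache, which grows linearly with the context length. A rough estimate (not part of the original answer, and assuming Llama 3.1 8B's published architecture of 32 layers, 8 KV heads with GQA, a head dimension of 128, and an fp16 cache):

# Back-of-envelope KV-cache size for Llama 3.1 8B (assumed values: 32 layers,
# 8 KV heads via GQA, head dim 128, fp16 cache = 2 bytes per element).
def kv_cache_bytes(ctx_tokens, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    # 2x for the separate K and V tensors in every layer
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * ctx_tokens

for ctx in (8000, 131072):  # 8k cap vs. the 128k default
    print(f"ctx={ctx:>6}: {kv_cache_bytes(ctx) / 2**30:.2f} GiB")
# ctx=  8000: 0.98 GiB
# ctx=131072: 16.00 GiB

Under those assumptions the 128k default would need about 16 GiB of KV cache on top of the roughly 5 GB of Q4_K_M weights, while an 8000-token window needs about 1 GiB; -fa additionally shrinks the attention compute buffers.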

Answer selected by amygbAI