Replies: 1 comment
-
You do not have enough memory for the KV cache: command-r does not use GQA, so storing the full 131k context at fp16 would take over 160 GB. You need to lower the context size with the `--ctx-size` argument (llama.cpp defaults to the model's maximum context size). Llama 3 70B uses GQA and defaults to an 8k context, so its KV cache memory usage is much lower (about 2.5 GB).
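To make those numbers concrete, here is a rough back-of-the-envelope estimate of the fp16 KV-cache size for both models. The layer counts and KV widths are assumptions taken from the published model configs (command-r v01: 40 layers with a full 8192-wide K/V per layer since there is no GQA; Llama 3 70B: 80 layers with 8 KV heads × 128 = 1024-wide K/V), not values stated in this thread:

```sh
# KV cache bytes ≈ 2 (K and V) * n_layers * n_ctx * kv_width * 2 bytes (fp16)
echo $(( 2 * 40 * 131072 * 8192 * 2 ))   # command-r v01 (no GQA), 131k ctx -> ~172 GB (~160 GiB)
echo $(( 2 * 80 * 8192 * 1024 * 2 ))     # llama 3 70b (GQA), 8k ctx       -> ~2.7 GB (~2.5 GiB)
```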
-
The GGUF file comes from https://huggingface.co/lmstudio-community/c4ai-command-r-v01-GGUF.
$ docker run --rm -v /home/wencan/Projects/models:/models -p 8000:8000 ghcr.io/ggerganov/llama.cpp:server -m /models/lmstudio-community/c4ai-command-r-v01-GGUF/c4ai-command-r-v01-Q4_K_M.gguf --port 8000 --host 0.0.0.0 -n 512
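Following the suggestion in the reply above, a possible variant of the same command caps the context size explicitly; the value 8192 below is just an illustrative choice, not something from this thread:

```sh
docker run --rm -v /home/wencan/Projects/models:/models -p 8000:8000 \
  ghcr.io/ggerganov/llama.cpp:server \
  -m /models/lmstudio-community/c4ai-command-r-v01-GGUF/c4ai-command-r-v01-Q4_K_M.gguf \
  --port 8000 --host 0.0.0.0 -n 512 --ctx-size 8192
```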