Config of main and compilation for a TESLA P40 card #7467
Unanswered
gandolfi974 asked this question in Q&A

Hi, I have a Tesla P40 card, and it is slow with ollama and Mixtral 8x7b.
Someone advised me to try compiling llama.cpp with "-DLLAMA_CUDA=ON -DLLAMA_CLBLAST=ON -DLLAMA_CUDA_FORCE_MMQ=ON" in order to use FP32 and get acceleration on this old CUDA card.
It is faster than ollama, but I can't use it for conversation: I end up talking alone and then it closes. I use this command:

main.exe -ngl 29 -m D:\GITHUB\llamamodel.huggingface\mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "[INST] {prompt} [/INST]"

Do you have any ideas?
Thanks.
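For reference, the build step being suggested would look roughly like this with the standard llama.cpp CMake workflow on Windows (the build directory name and output path are assumptions; also note that CUDA and CLBlast are normally alternative GPU backends, so -DLLAMA_CUDA=ON on its own may be all a P40 needs):

cmake -B build -DLLAMA_CUDA=ON -DLLAMA_CLBLAST=ON -DLLAMA_CUDA_FORCE_MMQ=ON
cmake --build build --config Release

The resulting main.exe usually ends up under build\bin\ (or build\bin\Release with the Visual Studio generator). As for the program closing after one answer: a single -p prompt is generated once and then main.exe exits, so a back-and-forth conversation needs interactive mode. A sketch of such an invocation, with the Mixtral instruct tags as in-prefix/in-suffix (an assumption, not something tested on this exact setup):

main.exe -ngl 29 -m D:\GITHUB\llamamodel.huggingface\mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -i --interactive-first --in-prefix "[INST] " --in-suffix " [/INST]"

With -i and --interactive-first the program waits for your input, wraps each turn in the [INST] ... [/INST] tags, and hands control back after each reply instead of exiting.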
Replies: 2 comments
-
That model is too big to load fully into VRAM. You need a smaller quant so that all layers and the KV cache fit on the card; otherwise part of the work falls back to the CPU, which is what makes it slow. Find a quant that fits in 24 GB and leaves enough room for context.
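To put rough numbers on that (sizes are approximate and vary between GGUF uploads): a Q4_K_M Mixtral 8x7B file is around 26 GB, already more than the P40's 24 GB before the KV cache and compute buffers are counted, which is why some layers end up on the CPU. At -c 4096 the fp16 KV cache for Mixtral is only about 0.5 GB (2 × 32 layers × 8 KV heads × 128 head dim × 2 bytes × 4096 tokens), so the weights are the real constraint; a Q3_K_M (~20 GB) or Q2_K (~16 GB) file should fit entirely on the card with every layer offloaded (-ngl 33 for Mixtral) and still leave headroom for context.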
-
Thanks. Aren't smaller quants like Q3 or Q2 too bad? I think maybe the Mixtral model just can't be used, even though I have a 24 GB VRAM card.