Config of main and compilation for a TESLA P40 card #7467
Unanswered
gandolfi974 asked this question in Q&A

Hi, I have a Tesla P40 card, and it is slow with ollama and Mixtral 8x7b.
Someone advised me to try compiling llama.cpp with "-DLLAMA_CUDA=ON -DLLAMA_CLBLAST=ON -DLLAMA_CUDA_FORCE_MMQ=ON" in order to use FP32 and get acceleration on this old CUDA card.
It is faster than ollama, but I can't use it for conversation: I end up talking alone and then it closes. I use this command:

main.exe -ngl 29 -m D:\GITHUB\llamamodel.huggingface\mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "[INST] {prompt} [/INST]"

Do you have any ideas?
Thanks.
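For reference, the build step being suggested would look roughly like this with the standard llama.cpp CMake workflow on Windows (the build directory name and output path are assumptions; also note that CUDA and CLBlast are normally alternative GPU backends, so -DLLAMA_CUDA=ON on its own may be all a P40 needs):

cmake -B build -DLLAMA_CUDA=ON -DLLAMA_CLBLAST=ON -DLLAMA_CUDA_FORCE_MMQ=ON
cmake --build build --config Release

The resulting main.exe usually ends up under build\bin\ (or build\bin\Release with the Visual Studio generator). As for the program closing after one answer: a single -p prompt is generated once and then main.exe exits, so a back-and-forth conversation needs interactive mode. A sketch of such an invocation, with the Mixtral instruct tags as in-prefix/in-suffix (an assumption, not something tested on this exact setup):

main.exe -ngl 29 -m D:\GITHUB\llamamodel.huggingface\mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -i --interactive-first --in-prefix "[INST] " --in-suffix " [/INST]"

With -i and --interactive-first the program waits for your input, wraps each turn in the [INST] ... [/INST] tags, and hands control back after each reply instead of exiting.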
Replies: 2 comments
-
That model is too big to load fully into VRAM. You need a smaller quant so that all layers and the KV cache fit on the card; otherwise part of the work falls back to the CPU, which is what makes it slow. Find a quant that fits in 24 GB and leaves enough room for context.
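To put rough numbers on that (sizes are approximate and vary between GGUF uploads): a Q4_K_M Mixtral 8x7B file is around 26 GB, already more than the P40's 24 GB before the KV cache and compute buffers are counted, which is why some layers end up on the CPU. At -c 4096 the fp16 KV cache for Mixtral is only about 0.5 GB (2 × 32 layers × 8 KV heads × 128 head dim × 2 bytes × 4096 tokens), so the weights are the real constraint; a Q3_K_M (~20 GB) or Q2_K (~16 GB) file should fit entirely on the card with every layer offloaded (-ngl 33 for Mixtral) and still leave headroom for context.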
-
Thanks. Aren't smaller quants like Q3 or Q2 too bad? I think maybe the Mixtral model just can't be used, even though I have a 24 GB VRAM card.