Replies: 2 comments
-
Are you sure RTX 4070 supports Flash Attention?
-
I think you can change the
-
I'm using an RTX 4070 (12 GB VRAM) with 32 GB of DDR5 RAM. This is the command I use:

```
.\build\bin\llama-server.exe -m D:\llama.cpp\models\Qwen3-30B-A3B-UD-Q3_K_XL.gguf -c 32768 --port 9999 -ngl 99 --no-webui --device CUDA0 -fa -ot ".ffn_.*_exps.=CPU"
```

For long prompts it takes over a minute to process. Is there any way to increase prompt-processing speed? The server only uses about 5 GB of VRAM, so I suppose there's room for improvement.
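Since the `-ot ".ffn_.*_exps.=CPU"` override keeps all MoE expert tensors on the CPU, prompt processing is largely CPU-bound. A commonly suggested tweak (my assumption, not something confirmed in this thread) is to raise the physical batch sizes so prompt tokens are evaluated in larger chunks, and optionally to use the spare VRAM by keeping some layers' experts on the GPU with a more selective `-ot` regex. A sketch, assuming your llama.cpp build accepts the `-b`/`-ub` batch-size flags; the values shown are illustrative guesses, not tuned settings:

```shell
# Sketch only: same command plus larger batch/ubatch sizes. Larger chunks
# usually help prompt processing when expert tensors run on CPU.
.\build\bin\llama-server.exe -m D:\llama.cpp\models\Qwen3-30B-A3B-UD-Q3_K_XL.gguf -c 32768 --port 9999 -ngl 99 --no-webui --device CUDA0 -fa -b 2048 -ub 2048 -ot ".ffn_.*_exps.=CPU"

# Variant (hypothetical regex, adjust to your model's layer count): offload
# only layers 20-47's experts, keeping earlier layers' experts in spare VRAM.
.\build\bin\llama-server.exe -m D:\llama.cpp\models\Qwen3-30B-A3B-UD-Q3_K_XL.gguf -c 32768 --port 9999 -ngl 99 --no-webui --device CUDA0 -fa -ot "blk\.(2[0-9]|3[0-9]|4[0-7])\.ffn_.*_exps.=CPU"
```

If the second variant runs out of VRAM, widen the regex to offload more layers; watch actual VRAM use and back off until it fits.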