llama-cpp-python 0.3.8 with CUDA #2010
Comments
You can try to compile the new code I maintain here: https://github.com/JamePeng/llama-cpp-python, but I have only pre-compiled the Windows and Linux versions based on the recent code.
I probably misunderstand your question, but I am using 0.3.9 with CUDA via pip. I can use Qwen3-related models as well (using I-quants, for example, or flash attention; is that what you are referring to?). Sorry if my comment is useless. I built using...
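For reference, a minimal sketch of the source-build route the two comments above describe, assuming the CUDA toolkit, a matching host compiler, and CMake are already on PATH. The `-DGGML_CUDA=on` flag is the one documented for recent llama-cpp-python releases (older ones used `-DLLAMA_CUBLAS=on`), and the fork is assumed to keep the upstream build layout:

```
# Build the upstream package from source with CUDA enabled.
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --upgrade --force-reinstall --no-cache-dir

# Or build the fork mentioned above from a local checkout
# (--recursive pulls in the vendored llama.cpp submodule).
git clone --recursive https://github.com/JamePeng/llama-cpp-python
CMAKE_ARGS="-DGGML_CUDA=on" pip install ./llama-cpp-python
```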
Hi, I think it might come from the fact that the wheels have not been updated to a recent version of llama-cpp-python. You can see that at this link: https://abetlen.github.io/llama-cpp-python/whl/cu122/llama-cpp-python/. The last version available is 0.3.4. Could you add newer wheels, @abetlen?
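For anyone trying that index, a hedged example of installing from the prebuilt CUDA 12.2 wheels. Pinning the version matters here, because newer releases exist on PyPI only as a source distribution, which pip would otherwise prefer and build without CUDA:

```
pip install llama-cpp-python==0.3.4 \
  --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu122
```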
@m-from-space I have tried your solution and I got this error:
Man... I've tried for hours to get similar commands to work and never succeeded; I used JamePeng's wheel to make up for it.
When will a recent version of llama-cpp-python that works with CUDA be available via pip?
It's a real nightmare to make it work any other way.
0.3.4 works with CUDA, but it doesn't handle models like quantized Qwen 3.
With kind regards
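A quick sanity check after any of the install routes above (a hedged sketch: `llama_supports_gpu_offload` comes from the low-level bindings in recent releases; if your version predates it, the version string alone still tells you whether the 0.3.4 wheel was picked):

```
# Confirm which version pip actually installed and whether the build can offload to GPU.
python -c "import llama_cpp; print(llama_cpp.__version__)"
python -c "import llama_cpp; print(llama_cpp.llama_supports_gpu_offload())"
```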