Python - Pre-compiled CFFI module for CPU and CUDA #8633
Conversation
Considering that this does not expose the llama.cpp API at all, and only provides a very light interface to the main example, I wouldn't call this bindings. Normally we have been very lax on the quality requirements of the linked projects, but putting this on the same level as the excellent Python bindings provided by @abetlen does not seem right to me.
I believe the beauty and elegance of llama-cpp-cffi lie in its simplicity. My goal is not to cover every function call from llama.cpp.
Another problem is that requiring build toolchains just to install a Python library is already a huge issue. I believe that most people stick to (or at least begin with) the CPU-only version because setting up a build environment is just not easy. I enjoy doing that, but I am not sure about others.
My next goals are to support more pre-compiled backends for more operating systems and platforms. Beyond that, the goal is not to increase the complexity of the library by wrapping more function calls and then manually replicating the already great llama-cpp-python. I hope these arguments are enough for llama-cpp-cffi to be considered.
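To make the "very light interface to the main example" idea concrete, here is a minimal sketch of how a CFFI-based wrapper around a pre-compiled llama.cpp binary could look. This is not the actual llama-cpp-cffi implementation; the shared-library name, the exported entry point, and the flags are assumptions for illustration only.
import cffi
# Hypothetical sketch: the library name and the "main"-style symbol are assumptions.
ffi = cffi.FFI()
ffi.cdef("int main(int argc, char **argv);")
lib = ffi.dlopen("./libllama_cli.so")  # pre-compiled shared library shipped inside the wheel
def run(args):
    # Build a C argv array from Python strings and invoke the entry point,
    # much like calling the main example from the command line.
    c_args = [ffi.new("char[]", a.encode("utf-8")) for a in ["llama-cli", *args]]
    argv = ffi.new("char *[]", c_args)
    return lib.main(len(c_args), argv)
run(["-m", "./models/7B/llama-model.gguf", "-p", "Hello", "-n", "32"])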
Please @mtasic85, listen to the collaborators and the project owner instead of spamming with new PRs. This PR and #8339 (and also tangledgroup/llama-cpp-wasm) are hacky solutions and they should not exist at all. We never want
I don't get the point. Then, what's wrong with this?
from llama_cpp import Llama
llm = Llama(
model_path="./models/7B/llama-model.gguf",
)
output = llm(
"Q: Name the planets in the solar system? A: ", # Prompt
max_tokens=32, # Generate up to 32 tokens, set to None to generate up to the end of the context window
stop=["Q:", "\n"], # Stop generating just before the model would generate a new question
echo=True # Echo the prompt back in the output
) # Generate a completion, can also call create_completion
print(output)
FYI, if you browse through some issues, most people use GPU, not CPU-only as you assume.
If you want to make good contributions on behalf of Tangled Group, please consider looking at llama.cpp's roadmap. There are many more important things worth spending time on.
Hey @ngxson, we are just exploring ways to use llama.cpp; we do not need to agree on everything. I respect different opinions and suggestions. I don't agree that I spammed anyone. We are exchanging arguments, that is all.
I agree that it was a hacky solution, but then you got inspired by that project and created the even better https://github.com/ngxson/wllama .
Nothing is wrong; this is just a different approach. llama-cpp-python is a well-established project, but we do not use it in production. What we have used in production for about a year is compiled llama.cpp (main/llama-cli) running as a subprocess controlled by a Python main process. For our scenarios it has been great. After any issue with llama.cpp or a model, we just kill the subprocess, release the memory, restart the process, and keep going.
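For illustration, here is a minimal sketch of that subprocess pattern, assuming a locally built llama-cli binary and a placeholder model path:
import subprocess
MODEL = "./models/7B/llama-model.gguf"  # placeholder path
def complete(prompt, timeout_s=120.0):
    # Run the pre-built llama-cli binary; -m, -p and -n are its standard flags.
    cmd = ["./llama-cli", "-m", MODEL, "-p", prompt, "-n", "128"]
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
        return proc.stdout
    except subprocess.TimeoutExpired:
        # On any problem the child process is killed and its memory released;
        # the controlling Python process simply retries.
        return ""
print(complete("Q: Name the planets in the solar system? A: "))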
GPUs are favored, but from a convenience perspective, I see far more CPU usage in my circles. Anyway, these would be valuable statistics to have.
I check it often, but I really need to test llama.cpp in production environments. I like the project and its evolution very much, so I can only commit to testing it.
I think there are just too many red flags here for me not to call this spam. Let me walk through them one by one:
Then, to address your arguments about production-ready stuff:
Firstly, why did you agree that this is a hacky solution but then proceed to use it in production? Your usage of
Again:
Why? You can literally compile the project yourself or try using llama-cpp-python.
Pre-compiled Python bindings for llama.cpp using cffi. Supports CPU and CUDA 12.5 execution. A binary distribution that does not require complex compilation steps.
Installation is as simple and fast as:
pip install llama-cpp-cffi
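For context, here is a hypothetical end-user sketch of what generating text with such a pre-compiled package could look like; the module name, function name, and parameter names below are illustrative assumptions, not the documented llama-cpp-cffi API:
# Hypothetical usage sketch; all names here are illustrative assumptions.
from llama_cpp_cffi import generate  # assumed high-level helper
for chunk in generate(
    model_path="./models/7B/llama-model.gguf",  # placeholder model path
    prompt="Q: Name the planets in the solar system? A: ",
    n_predict=32,
):
    print(chunk, end="", flush=True)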