Python - Pre-compiled CFFI module for CPU and CUDA #8633
Conversation
Considering that this does not expose the llama.cpp API at all, and only provides a very light interface to the main example, I wouldn't call this bindings. Normally we have been very lax on the quality requirements of the linked projects, but putting this on the same level as the excellent Python bindings provided by @abetlen does not seem right to me.
I believe the beauty and elegance of llama-cpp-cffi lie in its simplicity. My goal is not to cover every function call from llama.cpp.
Another problem is that requiring build toolchains just to install a Python library is already a huge issue. I believe that most people stick to (or at least begin with) the CPU-only version because setting up a build environment is just not easy. I enjoy doing that, but I am not sure about others.
My next goals are to support more pre-compiled backends for more operating systems and platforms. Beyond that, the goal is not to increase the complexity of the library by wrapping more function calls and then manually replicating the already great llama-cpp-python. I hope these arguments are enough for llama-cpp-cffi to be considered.
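To make the "very light interface to the main example" idea concrete, here is a minimal sketch of how a CFFI-based wrapper around a pre-compiled llama.cpp binary could look. This is not the actual llama-cpp-cffi implementation; the shared-library name, the exported entry point, and the flags are assumptions for illustration only.
import cffi
# Hypothetical sketch: the library name and the "main"-style symbol are assumptions.
ffi = cffi.FFI()
ffi.cdef("int main(int argc, char **argv);")
lib = ffi.dlopen("./libllama_cli.so")  # pre-compiled shared library shipped inside the wheel
def run(args):
    # Build a C argv array from Python strings and invoke the entry point,
    # much like calling the main example from the command line.
    c_args = [ffi.new("char[]", a.encode("utf-8")) for a in ["llama-cli", *args]]
    argv = ffi.new("char *[]", c_args)
    return lib.main(len(c_args), argv)
run(["-m", "./models/7B/llama-model.gguf", "-p", "Hello", "-n", "32"])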
Please @mtasic85, listen to the collaborators and the project owner instead of spamming with new PRs. This PR and #8339 (and also tangledgroup/llama-cpp-wasm) are hacky solutions and they should not exist at all. We never want
I don't get the point. Then, what's wrong with this?
from llama_cpp import Llama
llm = Llama(
model_path="./models/7B/llama-model.gguf",
)
output = llm(
"Q: Name the planets in the solar system? A: ", # Prompt
max_tokens=32, # Generate up to 32 tokens, set to None to generate up to the end of the context window
stop=["Q:", "\n"], # Stop generating just before the model would generate a new question
echo=True # Echo the prompt back in the output
) # Generate a completion, can also call create_completion
print(output)
FYI, if you browse through some issues, most people use GPU, not CPU-only as you assume.
If you want to make good contributions on behalf of Tangled Group, please consider looking at llama.cpp's roadmap. There are many more important things worth spending time on.
Hey @ngxson, we are just exploring ways to use llama.cpp; we do not need to agree on everything. I respect different opinions and suggestions. I don't agree that I spammed anyone. We are exchanging arguments, that is all.
I agree that it was a hacky solution, but then you got inspired by that project and created the even better https://github.com/ngxson/wllama .
Nothing is wrong; this is just a different approach. llama-cpp-python is a well-established project, but we do not use it in production. What we have used in production for about a year is compiled llama.cpp (main/llama-cli) running as a subprocess controlled by a Python main process. For our scenarios it has been great. After any issue with llama.cpp or a model, we just kill the subprocess, release the memory, restart the process, and keep going.
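For illustration, here is a minimal sketch of that subprocess pattern, assuming a locally built llama-cli binary and a placeholder model path:
import subprocess
MODEL = "./models/7B/llama-model.gguf"  # placeholder path
def complete(prompt, timeout_s=120.0):
    # Run the pre-built llama-cli binary; -m, -p and -n are its standard flags.
    cmd = ["./llama-cli", "-m", MODEL, "-p", prompt, "-n", "128"]
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
        return proc.stdout
    except subprocess.TimeoutExpired:
        # On any problem the child process is killed and its memory released;
        # the controlling Python process simply retries.
        return ""
print(complete("Q: Name the planets in the solar system? A: "))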
GPUs are favored, but from a convenience perspective, I see far more CPU usage in my circles. Anyway, these would be valuable statistics to have.
I check it often, but I really need to test llama.cpp in production environments. I like the project and its evolution very much, so I can only commit to testing it.
I think there are just too many red flags here for me not to call this spam. Let me walk through them one by one:
Then, to address your arguments about production-ready stuff:
Firstly, why did you agree that this is a hacky solution but then proceed to use it in production? Your usage of
Again:
Why? You can literally compile the project yourself or try using llama-cpp-python.
Pre-compiled Python bindings for llama.cpp using cffi. Supports CPU and CUDA 12.5 execution. A binary distribution that does not require complex compilation steps.
Installation is as simple and fast as:
pip install llama-cpp-cffi
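For context, here is a hypothetical end-user sketch of what generating text with such a pre-compiled package could look like; the module name, function name, and parameter names below are illustrative assumptions, not the documented llama-cpp-cffi API:
# Hypothetical usage sketch; all names here are illustrative assumptions.
from llama_cpp_cffi import generate  # assumed high-level helper
for chunk in generate(
    model_path="./models/7B/llama-model.gguf",  # placeholder model path
    prompt="Q: Name the planets in the solar system? A: ",
    n_predict=32,
):
    print(chunk, end="", flush=True)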