Hi,
I'm using `llm-ls` in NeoVim to get Copilot-style completion from a model (see https://github.com/huggingface/llm-ls). I tried many ways to use the `server` provided by llama.cpp, but nothing works as expected. The only server that works is the Python binding.
So, using this is OK (note that `-1` is handy to load all the layers on the GPU; there are 31 layers to load for this model):
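For reference, a minimal sketch of how the Python binding's server is usually launched, assuming llama-cpp-python's bundled server module (model path illustrative):

```sh
# llama-cpp-python's OpenAI-compatible server;
# --n_gpu_layers -1 offloads all 31 layers to the GPU
python3 -m llama_cpp.server --model ./models/my-model.gguf --n_gpu_layers -1
```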
But using llama.cpp like this fails:
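Something along these lines, with flags taken from the server's `--help` (paths and values illustrative):

```sh
# llama.cpp's built-in HTTP server; -ngl sets how many layers go to the GPU
./server -m ./models/my-model.gguf -ngl 31 --host 127.0.0.1 --port 8080
```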
With this command, the plugin calls the `/v1/completions` endpoint and it takes a long time. After a moment, I sometimes get an error in NeoVim saying that the response is not correct (a JSON problem). I have logs in the terminal where the server was launched, but nothing relevant.
It never completes or proposes anything in the editor, while it works with the Python server.
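To take the editor out of the loop, the same endpoint can be queried directly; the body below follows the OpenAI completions shape (prompt and parameters are illustrative):

```sh
# Inspect the raw response of the endpoint the plugin calls
curl -s http://127.0.0.1:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"prompt": "def fibonacci(n):", "max_tokens": 32}'
```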
I tried many options, `--cont-batching` or `-n 1024`, but I really can't find the right way to start it correctly; some of the variants I tried are sketched below. It's not a big issue, as the Python binding is OK, but I'm curious to understand what fails with the server provided in the base repository.
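The variants looked roughly like this (model path illustrative; flags as mentioned above):

```sh
# Variants tried, without success
./server -m ./models/my-model.gguf -ngl 31 --cont-batching
./server -m ./models/my-model.gguf -ngl 31 -n 1024
```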
Thanks for your help!

Reply:

Do you see your GPU processing? Have you tried adding
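On the GPU question: one way to watch utilisation while a completion request runs (NVIDIA tooling assumed):

```sh
# Refresh GPU utilisation and memory once per second during a request
watch -n 1 nvidia-smi
```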