Hi,
I'm using `llm-ls` in NeoVim to get Copilot-style completion from a model (see https://github.com/huggingface/llm-ls). I tried many ways to use the `server` provided by llama.cpp, but nothing works as expected. The only server that works is the Python binding.
So, using this is OK (note that `-1` is handy to load all the layers on the GPU; there are 31 layers to load for this model):
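For reference, a minimal sketch of how the Python binding's server is usually launched, assuming llama-cpp-python's bundled server module (model path illustrative):

```sh
# llama-cpp-python's OpenAI-compatible server;
# --n_gpu_layers -1 offloads all 31 layers to the GPU
python3 -m llama_cpp.server --model ./models/my-model.gguf --n_gpu_layers -1
```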
But using llama.cpp like this fails:
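Something along these lines, with flags taken from the server's `--help` (paths and values illustrative):

```sh
# llama.cpp's built-in HTTP server; -ngl sets how many layers go to the GPU
./server -m ./models/my-model.gguf -ngl 31 --host 127.0.0.1 --port 8080
```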
With this command, the plugin calls the `/v1/completions` endpoint and it takes a long time. After a moment, I sometimes get an error in NeoVim saying that the response is not correct (a JSON problem). I have logs in the terminal where the server was launched, but nothing relevant.
It never completes or proposes anything in the editor, while it works with the Python server.
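To take the editor out of the loop, the same endpoint can be queried directly; the body below follows the OpenAI completions shape (prompt and parameters are illustrative):

```sh
# Inspect the raw response of the endpoint the plugin calls
curl -s http://127.0.0.1:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"prompt": "def fibonacci(n):", "max_tokens": 32}'
```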
I tried many options, `--cont-batching` or `-n 1024`, but I really can't find the right way to start it correctly; some of the variants I tried are sketched below. It's not a big issue, as the Python binding is OK, but I'm curious to understand what fails with the server provided in the base repository.
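The variants looked roughly like this (model path illustrative; flags as mentioned above):

```sh
# Variants tried, without success
./server -m ./models/my-model.gguf -ngl 31 --cont-batching
./server -m ./models/my-model.gguf -ngl 31 -n 1024
```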
Thanks for your help!

Reply:

Do you see your GPU processing? Have you tried adding
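On the GPU question: one way to watch utilisation while a completion request runs (NVIDIA tooling assumed):

```sh
# Refresh GPU utilisation and memory once per second during a request
watch -n 1 nvidia-smi
```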