I came across the neat RPC distribution feature of llama.cpp and wanted to give it a shot: https://github.com/ggerganov/llama.cpp/blob/master/examples/rpc/README.md
Can somebody confirm that this should already work when offloading parts of the model to a GPU and other parts to a CPU (RAM-based) instance?
I tried the following with the model "mistral-7b-instruct-v0.1.Q4_K_M.gguf" (which was already present in the models folder of my llama.cpp checkout):
On a smartphone (4 GB RAM) I installed llama.cpp via Termux, built the CPU-only variant (I hope my assumption is correct that "cmake .. -DGGML_RPC=ON" is the right configuration on the phone), and started it via "bin/rpc-server -p 50052". I also tried the same setup on a laptop of mine that has no CUDA support either.
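For context, this is roughly what I did on the CPU-only devices; I am writing the flags down from memory, and the -H binding is my assumption of how to make the server reachable from the desktop, so please treat it as a sketch:

```sh
# inside llama.cpp/build on the phone (Termux) and on the laptop:
# build the CPU-only backend with RPC support
cmake .. -DGGML_RPC=ON
cmake --build . --config Release

# expose this machine as an RPC backend on port 50052
# (binding to 0.0.0.0 so the desktop can reach it is my assumption)
bin/rpc-server -H 0.0.0.0 -p 50052
```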
On my desktop machine (RTX 3060, 12 GB VRAM) I ran another instance of llama.cpp built with CUDA support.
Then I started the client and pointed it at both of them (one on localhost, one on a remote host, so to say, by specifying their IP addresses).
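On the desktop it was roughly the following; the IP addresses are placeholders for the phone and the laptop, and I am going from memory on the exact CUDA flag name and the binary name (older checkouts call it "main"), so again a sketch rather than verified commands:

```sh
# inside llama.cpp/build on the desktop: build with CUDA and RPC support
cmake .. -DGGML_CUDA=ON -DGGML_RPC=ON
cmake --build . --config Release

# run the model against both RPC backends (IPs are placeholders);
# -ngl 99 asks for as many layers as possible to be offloaded
bin/llama-cli -m models/mistral-7b-instruct-v0.1.Q4_K_M.gguf \
    -p "Hello" -n 64 -ngl 99 \
    --rpc 192.168.1.20:50052,192.168.1.21:50052
```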
I was able to offload and run the Mistral model this way (very slowly, as expected).
When trying other models that I had downloaded separately into the models folder, e.g. Mixtral 8x7B, also in the Q4_K_M variant, the offloading failed.
The program got stuck in the "loading tensors" stage, and at that point the connections to the two rpc-server instances were terminated as well.
I would appreciate it if somebody could tell me what to look out for when running RPC distribution, beyond what is explained in the README.md linked above.
Does the feature only work with VRAM, i.e. can it not be mixed with RAM, or not run on RAM alone?
Does one have to be careful when selecting which models to run?
Apart from the Mistral model, the models I pulled from Ollama all had issues and likewise failed to run.
Thanks in advance for any feedback. Highly appreciated.
I am not 100% sure whether this question has already been answered somewhere else; a quick search through the issues and the discussions, however, didn't fully clear it up for me.