How to use a second system to run a draft model, and use it with the main model in the primary system? #12928

pramjana · 2025-04-13T10:36:27Z

pramjana
Apr 13, 2025

I've got two mac mins - one with 16GB RAM (M2 Pro), and the other with 8GB RAM (M2). Now, I was wondering if I can leverage the power of speculative decoding to speed up inference perforamance of a main model (like a Qwen2.5-Coder-14B 4bits quantized GGUF) on the M2 Pro mac, while having the draft model (like a Qwen2.5-Coder-0.5B 8bits quantized GGUF) running via the M2 mac. Is this feasible, perhaps using rpc-server? Can someone who's done something like this help me out please? Also, if this is possible, is it scalable even further (I have an old desktop with an RTX 2060).

A related discussion, which I couldn't quite understand happened earlier.

pramjana · 2025-04-14T03:59:59Z

pramjana
Apr 14, 2025
Author

I have a command that works well for speculative decoding on my system - llama-server --port 12394 -ngl 99 -c 4096 -fa -ctk q8_0 -ctv q8_0 --host 0.0.0.0 -md ./qwen2.5-coder-0.5b-instruct-q8_0.gguf --draft-max 24 --draft-min 1 --draft-p-min 0.8 --temp 0.1 -ngld 99 --parallel 2 -m ./qwen2.5-coder-7b-instruct-Q4_k_m.gguf.

Now, the question is, how can I offload the draft model to my other mac mini (M2)? I have doubts if this would end up benefitting me (I guess the draft model needs to speak with the main model quite frequently, and latency should be important; I'm not sure we get it with Ethernet or Thunderbolt 4). But, as in the case of any experiment, trying it out, and seeing how bad/good it actually is, would be worth it right?

I don't understand rpc-server much to be able to do this. Could someone kindly be able to provide me some commands to utilize rpc-server? The documentation on llama.cpp about rpc-server, and its use in combination with llama-cli and llama-server is quite insufficient, I think.

2 replies

ggerganov Apr 14, 2025
Maintainer

To do that, we have to add an extra CLI argument that controls the devices for the draft model, similar to the existing --rpc argument. It's not difficult to add and it's interesting to experiment with.

pramjana Apr 14, 2025
Author

It sounds great that it is possible!
Would really appreciate your help in implementing this, or helping us with an end to end example.

A big fan of your work on llama.cpp - thank you so much!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

How to use a second system to run a draft model, and use it with the main model in the primary system? #12928

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Replies: 1 comment 2 replies

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

How to use a second system to run a draft model, and use it with the main model in the primary system? #12928

Uh oh!

Uh oh!

pramjana Apr 13, 2025

Replies: 1 comment · 2 replies

Uh oh!

pramjana Apr 14, 2025 Author

Uh oh!

ggerganov Apr 14, 2025 Maintainer

Uh oh!

pramjana Apr 14, 2025 Author

pramjana
Apr 13, 2025

Replies: 1 comment 2 replies

pramjana
Apr 14, 2025
Author

ggerganov Apr 14, 2025
Maintainer

pramjana Apr 14, 2025
Author