How to (very slowly) run a high-precision R1 quant on M2 Ultra with most of the model on SSD? #11680

Unanswered

okuvshynov asked this question in Q&A

okuvshynov
Feb 5, 2025

Setup:

192GB M2 Ultra
higher-precision R1 quant from https://huggingface.co/unsloth/DeepSeek-R1-GGUF

What is the right way to configure llama.cpp (either the CLI or the server) to do roughly the following:

always keep KV cache in memory
keep model on SSD
ideally keep the routing layers (which would run on every token) in memory
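
To make the "keep model on SSD" item concrete: the general OS mechanism for this is memory mapping, where a file is mapped into the address space and only the pages actually touched are read from disk. Below is a minimal Python sketch of that idea. It is not llama.cpp code or configuration; the helper names (`make_dummy_weights`, `read_slice_from_disk`) and the 1 MiB dummy file are hypothetical stand-ins for a real GGUF file.

```python
import mmap
import os
import tempfile


def make_dummy_weights(size_bytes):
    """Write a hypothetical stand-in 'weights' file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".bin")
    with os.fdopen(fd, "wb") as f:
        # Repeating 0..255 byte pattern so slices are easy to check.
        f.write(bytes(i % 256 for i in range(size_bytes)))
    return path


def read_slice_from_disk(path, offset, length):
    """Return `length` bytes at `offset` through a read-only memory map.

    Only the pages backing the requested slice need to be resident;
    the rest of the file can stay on disk.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            return bytes(mapped[offset:offset + length])


path = make_dummy_weights(1 << 20)  # 1 MiB stand-in, not a real model
chunk = read_slice_from_disk(path, offset=4096, length=16)
os.remove(path)
```

With a model larger than RAM, the same mechanism means the OS page cache decides which parts stay resident, which is why the question also asks about pinning the KV cache and the always-used layers in memory.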

It's fine if it is slow; I'm OK with 1-2 tokens per minute.

Thank you!

Replies: 0 comments
