Performance on CPUs with fewer than two cores #464
-
I recently tried to run the Gemma-3-3B-QAT (GGUF) model, as well as much smaller ones such as Qwen-2.5-0.5B (GGUF), on cloud servers (GCP and DO), but I only got reasonable performance when the machine had two or more CPU cores (4 vCPUs).
-
You need at least one CPU core that isn't used for inference (i.e., not performing the matrix multiplications and other heavy work) so the OS can use it to control the inference process and handle other OS-level tasks. It depends on the specifics of your CPU, but limiting the number of threads used for inference can reduce context switching and thus improve performance. If you use a GPU for inference and offload all the layers to it, a single CPU core will suffice, since it won't be doing the intensive calculations.
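A minimal sketch of capping the inference threads on a CPU-only machine (based on node-llama-cpp's v3 API; the threads option on createContext() is how I understand evaluation threads are limited, so treat the exact option name as an assumption):

```typescript
import {getLlama, LlamaChatSession} from "node-llama-cpp";

// Leave one core free for the OS and for driving the inference loop:
// on a 2-vCPU VM that means a single inference thread.
const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: "models/hf_Qwen_Qwen3-0.6B.Q8_0.gguf" // model from the log below
});
const context = await model.createContext({
    threads: 1 // threads used for evaluation; assumed option name
});
const session = new LlamaChatSession({
    contextSequence: context.getSequence()
});

console.log(await session.prompt("What is your name?"));
```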
Good morning @giladgd.
I hope you are well.
The goal of my project is to run on VMs using only CPUs (no GPU).
Yes, I tested reducing the number of threads (set via getLlama()), and if the number of threads is greater than the number of vCPUs, inference time increases considerably. Another observation: by default, the number of threads is set to 4, so when I tested on a VM with 2 vCPUs this caused the delay, since the ideal number there is 2 (see the sketch after the log below for deriving it from the vCPU count).
2 vCPUs / 2 GB RAM:
2025-06-23T13:48:43.509Z Loading model: /var/projects/api-ai/models/hf_Qwen_Qwen3-0.6B.Q8_0.gguf
Number of threads used: 1
2025-06-23T13:48:46.160Z User: What is your name?
2025-06-23T13:48:57.658Z AI: I do…
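A sketch of deriving the thread count from the vCPU count instead of hardcoding it (hypothetical helper; os.availableParallelism() requires Node 18.14+, so older versions fall back to os.cpus().length):

```typescript
import os from "node:os";

// Use all vCPUs minus one (kept free for the OS), but never fewer than 1.
const vCpus = os.availableParallelism?.() ?? os.cpus().length;
const inferenceThreads = Math.max(1, vCpus - 1);

console.log(`vCPUs: ${vCpus}, inference threads: ${inferenceThreads}`);
// Pass inferenceThreads as the threads value when creating the context.
```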