Performance on CPUs with less than two cores #464

areumtecnologia · 2025-05-19T13:02:02Z

areumtecnologia
May 19, 2025

I recently tried to implement the Gemma-3-3B-QAT (GGUF) model and other much smaller ones such as Qwen-2.5-0.5B (GGUF) on cloud servers (GCP and DO), but I only had reasonable performance when the CPU had two or more cores or 4vCPUs.
Why is the inference performance on CPUs with less than 2 cores (4vCPUs) so low and, if there is one, what is the solution for this?

Answered by areumtecnologia

Jun 23, 2025

Good morning @giladgd.
I hope you are well.

The goal of my project is to use VMs using only CPUs (no GPU).
Yes, I tested reducing the number of threads (defined in getLlama() ) and if the number of threads is greater than the number of vCPUs, the inference processing time increases considerably. Another observation made is that, by default, the number of threads is set to 4, so when I tested it on a VM with 2vCPUs this caused this delay, since the ideal number should be 2.

2vCPUs / 2GB

2025-06-23T13:48:43.509Z Loading model: /var/projects/api-ai/models/hf_Qwen_Qwen3-0.6B.Q8_0.gguf
Number of threads used: 1
2025-06-23T13:48:46.160Z User: What is your name?
2025-06-23T13:48:57.658Z AI: I do…

View full answer

giladgd · 2025-05-19T21:08:21Z

giladgd
May 19, 2025
Maintainer

You need to have at least one CPU core not used for inference (that doesn't perform all the matrix multiplications needed for inference, etc.) in order for the OS to be able to use it to control the inference process (and other OS-related tasks).
Trying to do both on the same CPU core will cause it to do frequent context switching which is the main cause for the slowdown you experienced.
It may be more efficient for you to run a single machine/pod with 3 CPU cores to do 2 inference jobs in parallel than having 3 machines/pods with a single CPU core each doing a single inference job.

It depends on the specifics of the CPU you use, but limiting the number of threads used for inference might decrease the context switching done by the CPU and thus improve the performance.
Personally, I haven't found this to solve the issue on the machines that I've tested this approach with, so let me know if limiting the number of threads helped you so I can add it to the documentation.

If you use a GPU for inference and offload all the layers to it, then a single CPU core will suffice, as it won't be used for the intensive calculations.

1 reply

areumtecnologia Jun 23, 2025
Author

Good morning @giladgd.
I hope you are well.

The goal of my project is to use VMs using only CPUs (no GPU).
Yes, I tested reducing the number of threads (defined in getLlama() ) and if the number of threads is greater than the number of vCPUs, the inference processing time increases considerably. Another observation made is that, by default, the number of threads is set to 4, so when I tested it on a VM with 2vCPUs this caused this delay, since the ideal number should be 2.

2vCPUs / 2GB

2025-06-23T13:48:43.509Z Loading model: /var/projects/api-ai/models/hf_Qwen_Qwen3-0.6B.Q8_0.gguf
Number of threads used: 1
2025-06-23T13:48:46.160Z User: What is your name?
2025-06-23T13:48:57.658Z AI: I don't have a name, but I can help you with that! If you have any questions, I can connect with you. 11.498 sec elapseds (with defaultContextFlashAttention)
21.028 sec elapseds (without defaultContextFlashAttention)

2025-06-23T13:49:17.629Z Loading model: /var/projects/api-ai/models/hf_Qwen_Qwen3-0.6B.Q8_0.gguf
Number of threads used: 2
2025-06-23T13:49:19.709Z User: What's your name?
2025-06-23T13:49:28.545Z AI: I don't have a name, but I can help you with that! If you have any questions, I can connect with you.
8.836 seconds (with defaultContextFlashAttention)
13.341 seconds (without defaultContextFlashAttention)

2025-06-23T13:49:46.881Z Loading model: /var/projects/api-ai/models/hf_Qwen_Qwen3-0.6B.Q8_0.gguf
Number of threads used: 3
2025-06-23T13:49:48.958Z User: What is your name?
// More than 10 minutes and did not complete.

Thanks for your answer

Answer selected by areumtecnologia

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Performance on CPUs with less than two cores #464

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Replies: 1 comment 1 reply

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Select a reply

Uh oh!

Uh oh!

Performance on CPUs with less than two cores #464

Uh oh!

Uh oh!

areumtecnologia May 19, 2025

Replies: 1 comment · 1 reply

Uh oh!

giladgd May 19, 2025 Maintainer

Uh oh!

Uh oh!

areumtecnologia Jun 23, 2025 Author

areumtecnologia
May 19, 2025

Replies: 1 comment 1 reply

giladgd
May 19, 2025
Maintainer

areumtecnologia Jun 23, 2025
Author