-
Improving performance on NUMA systems is something I would be interested in looking into, but I don't have a dual-socket system available (with enough memory bandwidth to make it interesting), and I'm just a lonely guy hacking here for fun, without the resources to go and rent/buy such a system.
-
There is actually a good discussion on mainline: ggml-org/llama.cpp#12088. They did test ik_llama.cpp (though only with a single NUMA node on a single CPU at Q8_0), where it still outperformed mainline for CPU-only inference. Also, you can look at zts9989's comment here, where he talks about NUMA and what llama.cpp could improve on after he found that "approximately 50% of CPU usage is spent on thread synchronization" when running DeepSeek R1 with multiple NUMA nodes.
-
Thanks for alerting me to this thread. They have tested the lowest-performing configuration in ggml-org/llama.cpp#12088 (but this is also to be expected, as I don't have any documentation on the new features, so one needs to go through the PRs to discover them). For instance, here is a table for DeepSeek-Lite
TG is a very different story. There, performance is clearly dominated by memory access patterns and thread synchronization, and I cannot look into optimizing this aspect without having access to such a system. As it stands, the achieved performance is nowhere near the maximum theoretical performance. The tested 6980P has a theoretical bandwidth of 512? GiB/s, so 8X my Ryzen-7950X. I get
Very interesting results, thank you for posting and including my little LLM inference playground in the results. I have seen a higher than usual number of stars added to my repository in the last few days; I guess this must be due to your post. I'm curious which
Playing with some of the more advanced options that mainline
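As a rough sanity check on the bandwidth figure mentioned above: token generation cannot run faster than the active weights can be streamed from RAM, so an upper bound is simply bandwidth divided by bytes read per token. The numbers in this sketch are placeholders (the 512 GiB/s figure is itself marked as uncertain above, and the per-token footprint depends on the model and quant):

```cpp
// Back-of-the-envelope TG upper bound: t/s <= memory bandwidth / bytes-per-token.
// Both inputs are illustrative placeholders, not measurements.
#include <cstdio>

int main() {
    const double bw_gib_s         = 512.0;  // theoretical bandwidth quoted above (GiB/s)
    const double active_gib_token = 20.0;   // hypothetical active-weight footprint per token (GiB)
    std::printf("TG upper bound: ~%.1f t/s\n", bw_gib_s / active_gib_token);
    return 0;
}
```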
-
@ubergarm (thought you might also be interested in this).
Someone has shared code that can duplicate the model for NUMA benefits on llama.cpp:
The downside of duplicating the model is pretty heavy, but this approach obviously avoids any non-local memory access, and it shows the upper bound on performance that could be gained from other solutions that reduce or remove non-local memory access. Looking at the codebase, I think it currently only works for dual-socket nodes. I would have been interested in testing it, but none of my machines (even the very unstable quad-socket 1 TB memory node that I haven't turned on in a long time) would have enough RAM to replicate my preferred quant of R1; I'd have to use one under 192 GB (I do still have my IQ1_S_R4 V2, which is 129 GB).
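For illustration only, here is a minimal sketch of the general per-node replication technique using libnuma; it is an assumption about the approach, not the actual code being linked, and every name and size in it is hypothetical:

```cpp
// Sketch of "replicate the weights on every NUMA node": each node gets its own
// copy of the tensor data, and worker threads pinned to a node read only their
// local copy. The obvious cost is RAM: nbytes * number_of_nodes.
// Requires libnuma (link with -lnuma).
#include <numa.h>
#include <cstring>
#include <vector>

struct NodeReplica {
    int    node;
    void * data;
    size_t nbytes;
};

static std::vector<NodeReplica> replicate_weights(const void * src, size_t nbytes) {
    std::vector<NodeReplica> replicas;
    if (numa_available() < 0) return replicas;          // no NUMA support
    const int n_nodes = numa_num_configured_nodes();
    for (int node = 0; node < n_nodes; ++node) {
        void * copy = numa_alloc_onnode(nbytes, node);  // pages bound to this node
        if (!copy) continue;
        std::memcpy(copy, src, nbytes);
        replicas.push_back({node, copy, nbytes});
    }
    return replicas;
}

// At inference time, a thread pinned to node k (e.g. via numa_run_on_node(k))
// would use replicas[k].data, so all weight reads stay node-local.
```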
-
Why?
-
Sure, that would be if you wanted to squeeze out the last bit of performance. But we are not at that stage; instead, we are a factor of 2 or more away from what should be possible. Having 2 big NUMA nodes would make the distribution of weights much easier: simply change the weight loading to use two threads, each pinned to a specific NUMA node and each loading half of the tensor data. During inference, pin half the threads to run on the 1st NUMA node and the other half to the 2nd NUMA node. My thinking is that this should give a significant boost in performance without replicating the model on both NUMA nodes. It is of course possible to do this sort of thing with several NUMA nodes, but it makes things way more complicated, so I'm thinking that the 1st step should be to get better performance with 2 NUMA nodes. But if you are telling me that this is very far from ideal, and that the only way to get better performance is to enable and utilize all NUMA nodes, then it is a waste of time to implement the simple approach described above.
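A minimal sketch of that two-big-NUMA-nodes idea, assuming libnuma, a hypothetical weights file, and a hypothetical tensor size; this is just the shape of the approach, not ik_llama.cpp code:

```cpp
// Load the two halves of a tensor from two threads, each pinned to one NUMA
// node, so each half ends up in that node's local memory. At inference time,
// threads pinned to node 0 work on the first half and threads pinned to node 1
// on the second. Requires libnuma (link with -lnuma).
#include <numa.h>
#include <cstdio>
#include <thread>

static void load_half(const char * path, size_t offset, size_t nbytes, void * dst, int node) {
    numa_run_on_node(node);               // pin this thread to the target node
    std::FILE * f = std::fopen(path, "rb");
    if (!f) return;
    std::fseek(f, (long) offset, SEEK_SET);
    std::fread(dst, 1, nbytes, f);        // pages of dst were bound to `node` above
    std::fclose(f);
}

int main() {
    if (numa_available() < 0) { std::fprintf(stderr, "no NUMA support\n"); return 1; }

    const char * path   = "weights.bin";  // hypothetical tensor data file
    const size_t nbytes = 1ull << 30;     // hypothetical tensor size (1 GiB)
    const size_t half   = nbytes / 2;

    // Allocate each half explicitly on its node.
    void * lo = numa_alloc_onnode(half, 0);
    void * hi = numa_alloc_onnode(half, 1);

    std::thread t0(load_half, path, (size_t) 0, half, lo, 0);
    std::thread t1(load_half, path, half,       half, hi, 1);
    t0.join(); t1.join();

    // ... run inference with half the threads pinned to each node ...

    numa_free(lo, half);
    numa_free(hi, half);
    return 0;
}
```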
-
Oh, I see a benchmark in the wild attempting to compare that vproxy-tools/llama.cpp NUMA data-parallel code against the ik fork: ggml-org/llama.cpp#12289 (comment)
Not sure of the details of how they are running it, though.
-
I currently settle for running my DeepSeek V3 model on just one NUMA node / socket of my dual-socket system. However, while investigating the draft-model situation, it occurred to me that it should be relatively easy to specify cores for the main model (on one socket) and other cores (in my case, on the other socket/NUMA node) for the draft model, as communication between the two should be minimal.
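A minimal sketch of that core-splitting idea using plain thread affinity; the core ranges are hypothetical (check `numactl --hardware` or `lscpu` for the real topology), and the model evaluation itself is elided:

```cpp
// Pin the "main model" threads to one socket and the "draft model" threads to
// the other, so each model's working set stays in local memory and the two
// only need to exchange draft tokens. pthread_setaffinity_np is a glibc
// extension (_GNU_SOURCE is defined by default when compiling with g++).
#include <pthread.h>
#include <sched.h>
#include <thread>

static void pin_self_to_cores(int first, int last) {
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int c = first; c <= last; ++c) CPU_SET(c, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

int main() {
    std::thread main_model([] {
        pin_self_to_cores(0, 31);     // hypothetical: socket 0 = cores 0..31
        // ... evaluate the big model here ...
    });
    std::thread draft_model([] {
        pin_self_to_cores(32, 63);    // hypothetical: socket 1 = cores 32..63
        // ... evaluate the small draft model here ...
    });
    main_model.join();
    draft_model.join();
    return 0;
}
```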
-
On my dual-socket machine, using https://github.com/intel/pcm, this is what it looks like during PP:
And during TG:
-
Just sharing: I tried all snoop modes on my X99 dual board and got a 200-300% boost vs. the stock BIOS settings. This setting is also available on Xeon Scalable, FWIW.
stock bios
home snoop w/ dir OSB
-
It seems to me that, with output generation being memory-bandwidth bound and LLMs requiring a lot of RAM, a cheap way to increase both RAM capacity and bandwidth is to go NUMA.
For instance, a dual-Epyc server can have 16 or 24 memory channels, and each CPU can also be configured with up to 4 NUMA domains for the best theoretical performance (also, on Gen 2 Epyc at least, the L3 cache is shared only amongst cores on the same CCX).
However, there are many pitfalls to efficient NUMA programming, especially when it comes to minimizing cross-NUMA-domain memory and PCIe access.
It is my understanding that llama.cpp tries to avoid the most basic problems (e.g. allocating everything in one NUMA domain), but more work needs to be done (one common mitigation is sketched after this post).
KTransformers just duplicates the matrices on each NUMA domain!
vLLM can do tensor parallelism on NUMA: "In general each NUMA node is treated as one GPU card."
Is ik_llama.cpp NUMA-aware? If not, are there plans to make it NUMA-aware?
Thx!
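On the "allocating everything in one NUMA domain" pitfall mentioned above, here is a minimal, generic sketch of one common mitigation: interleaving the weight allocation across all nodes with libnuma so that reads draw on every memory controller. This is an illustration of the technique, not a description of what ik_llama.cpp currently does:

```cpp
// Query the NUMA topology and allocate the "weights" buffer interleaved across
// all nodes, so its pages are spread round-robin over every memory controller
// instead of landing on the first node touched. Requires libnuma (-lnuma).
#include <numa.h>
#include <cstdio>

int main() {
    if (numa_available() < 0) { std::fprintf(stderr, "no NUMA support\n"); return 1; }

    const int n_nodes = numa_num_configured_nodes();
    std::printf("NUMA nodes: %d\n", n_nodes);
    for (int n = 0; n < n_nodes; ++n) {
        long long free_b = 0;
        long long size_b = numa_node_size64(n, &free_b);
        std::printf("  node %d: %lld MiB total, %lld MiB free\n",
                    n, size_b >> 20, free_b >> 20);
    }

    const size_t nbytes = 1ull << 30;                 // hypothetical 1 GiB of weights
    void * weights = numa_alloc_interleaved(nbytes);  // pages spread across all nodes
    // ... load tensor data into `weights` and run inference ...
    numa_free(weights, nbytes);
    return 0;
}
```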