-
Sorry, I don't follow. What specifically do you mean by "pairwise occurrences"? In any case, I think the next step in optimization would be to adapt the kernels in such a way that they can directly handle the Mixtral data layout without first making the data contiguous. Right now I am trying to make the mul_mat_q kernels faster than FP16 cuBLAS GEMM, since those are the kernels we can modify as needed.
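To make the "contiguous copy" point concrete, here is a minimal, hypothetical sketch (plain C++ with made-up names, not the actual llama.cpp kernels): the current path conceptually gathers the activation rows routed to one expert into a contiguous buffer and then runs a standard dense GEMM on it, while the adaptation described above would have the kernel follow the row indices directly.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Current path, conceptually: gather the rows assigned to one expert into a
// contiguous buffer so a standard GEMM kernel (which expects contiguous rows)
// can be used on it afterwards.
void gather_rows_for_expert(const float *src, float *dst,
                            const std::vector<int> &row_ids, int n_cols) {
    for (std::size_t i = 0; i < row_ids.size(); ++i) {
        std::memcpy(dst + i * n_cols, src + row_ids[i] * n_cols,
                    n_cols * sizeof(float));
    }
}

// The adaptation discussed above would instead have the kernel follow the
// row indices directly, skipping the copy entirely:
void mat_vec_indirect(const float *weights, const float *src, float *dst,
                      const std::vector<int> &row_ids,
                      int n_rows_w, int n_cols) {
    for (std::size_t i = 0; i < row_ids.size(); ++i) {
        const float *x = src + row_ids[i] * n_cols; // no contiguous copy
        for (int r = 0; r < n_rows_w; ++r) {
            float sum = 0.0f;
            for (int c = 0; c < n_cols; ++c) {
                sum += weights[r * n_cols + c] * x[c];
            }
            dst[i * n_rows_w + r] = sum;
        }
    }
}
```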
-
Yes, you want to work on as many tokens at the same time as possible: that way the weight matrices have to be loaded fewer times per token, so you are less I/O bound.
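As a rough, hypothetical example (my numbers, for intuition only): with a ~4 GB quantized model, a batch of 1 streams all ~4 GB of weights for a single token, while a batch of 32 amortizes the same ~4 GB over 32 tokens, i.e. roughly 125 MB of weight traffic per token.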
The whole model is kept in RAM/VRAM; if parts of the model had to be loaded from disk, the performance would be utterly unusable. So the way to speed things up is instead to utilize caches or shared memory (fast on-chip GPU memory). The biggest caches that I can think of are something like AMD's 7800X3D with 96 MB of cache. And even that is smaller than the smallest models by at least a factor of ~100. So keeping specific layers in cache just isn't viable as a way to speed up the model as a whole.
-
Right now, my understanding of how the CPU handles matrix multiplication for each expert in an MoE (for prompt processing / evaluation) is that the tokens in the batch are grouped per expert and processed one expert at a time.
Could you, in theory, sort the order of expert operations so that it is optimized in terms of pairwise occurrences, and would this be any better / more optimal than just grouping per expert across all layers in the current batch (as it does right now)?
As in: after the router, you sort the order of experts by pairwise occurrences, but you still do one expert at a time.
My thinking is that this would lead to better memory access patterns / caching, at least for CPU inference, but it might be pointless / a micro-optimization.
Any thoughts @JohannesGaessler ?
Even during generation time, some experts are accessed more rarely than others per token (this code is not tracking pairwise occurrences, it is just counting how many times in total each expert is accessed per token):
I'll try to look into counting pairwise occurrences, but it might be beyond my ability.
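A minimal sketch of what that counting could look like (hypothetical names, plain C++, not the actual llama.cpp code): per-expert totals as in the snippet referenced above, plus unordered pairwise co-occurrence counts of experts used by the same token.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

constexpr int N_EXPERTS = 8; // e.g. Mixtral 8x7B

// topk_per_token[t] holds the expert ids the router picked for token t.
void count_expert_stats(const std::vector<std::vector<int>> &topk_per_token,
                        long totals[N_EXPERTS],
                        long pairs[N_EXPERTS][N_EXPERTS]) {
    for (const std::vector<int> &experts : topk_per_token) {
        for (std::size_t i = 0; i < experts.size(); ++i) {
            totals[experts[i]]++; // per-expert total, as counted already
            for (std::size_t j = i + 1; j < experts.size(); ++j) {
                int a = experts[i];
                int b = experts[j];
                if (a > b) std::swap(a, b);
                pairs[a][b]++; // pairwise co-occurrence within one token
            }
        }
    }
}
```

An expert ordering could then be derived from `pairs`, e.g. by greedily chaining the experts with the highest co-occurrence counts.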