-
Sorry, I don't follow. What specifically do you mean by "pairwise occurrences"? In any case, I think the next step in optimization would be to adapt the kernels in such a way that they can directly handle the Mixtral data layout without first making the data contiguous. Right now I am trying to make the mul_mat_q kernels faster than FP16 cuBLAS GEMM, since those are the kernels we can modify as needed.
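To make the "contiguous copy" point concrete, here is a minimal, hypothetical sketch (plain C++ with made-up names, not the actual llama.cpp kernels): the current path conceptually gathers the activation rows routed to one expert into a contiguous buffer and then runs a standard dense GEMM on it, while the adaptation described above would have the kernel follow the row indices directly.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Current path, conceptually: gather the rows assigned to one expert into a
// contiguous buffer so a standard GEMM kernel (which expects contiguous rows)
// can be used on it afterwards.
void gather_rows_for_expert(const float *src, float *dst,
                            const std::vector<int> &row_ids, int n_cols) {
    for (std::size_t i = 0; i < row_ids.size(); ++i) {
        std::memcpy(dst + i * n_cols, src + row_ids[i] * n_cols,
                    n_cols * sizeof(float));
    }
}

// The adaptation discussed above would instead have the kernel follow the
// row indices directly, skipping the copy entirely:
void mat_vec_indirect(const float *weights, const float *src, float *dst,
                      const std::vector<int> &row_ids,
                      int n_rows_w, int n_cols) {
    for (std::size_t i = 0; i < row_ids.size(); ++i) {
        const float *x = src + row_ids[i] * n_cols; // no contiguous copy
        for (int r = 0; r < n_rows_w; ++r) {
            float sum = 0.0f;
            for (int c = 0; c < n_cols; ++c) {
                sum += weights[r * n_cols + c] * x[c];
            }
            dst[i * n_rows_w + r] = sum;
        }
    }
}
```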
-
Yes, you want to work on as many tokens at the same time as possible: that way the weight matrices have to be loaded fewer times per token, so you are less I/O bound.
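As a rough, hypothetical example (my numbers, for intuition only): with a ~4 GB quantized model, a batch of 1 streams all ~4 GB of weights for a single token, while a batch of 32 amortizes the same ~4 GB over 32 tokens, i.e. roughly 125 MB of weight traffic per token.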
The whole model is kept in RAM/VRAM; if parts of the model had to be loaded from disk, the performance would be utterly unusable. So the way to speed things up is instead to utilize caches or shared memory (fast on-chip GPU memory). The biggest caches that I can think of are something like AMD's 7800X3D with 96 MB of cache. And even that is smaller than the smallest models by at least a factor of ~100. So keeping specific layers in cache just isn't viable as a way to speed up the model as a whole.
-
Right now, my understanding of how the CPU handles matrix multiplication for each expert in an MoE (for prompt processing / evaluation) is that the tokens in the batch are grouped per expert and processed one expert at a time.
Could you, in theory, sort the order of expert operations so that it is optimized in terms of pairwise occurrences, and would this be any better / more optimal than just grouping per expert across all layers in the current batch (as it does right now)?
As in: after the router, you sort the order of experts by pairwise occurrences, but you still do one expert at a time.
My thinking is that this would lead to better memory access patterns / caching, at least for CPU inference, but it might be pointless / a micro-optimization.
Any thoughts @JohannesGaessler ?
Even during generation time, some experts are accessed more rarely than others per token (this code is not tracking pairwise occurrences, it is just counting how many times in total each expert is accessed per token):
I'll try to look into counting pairwise occurrences, but it might be beyond my ability.
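A minimal sketch of what that counting could look like (hypothetical names, plain C++, not the actual llama.cpp code): per-expert totals as in the snippet referenced above, plus unordered pairwise co-occurrence counts of experts used by the same token.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

constexpr int N_EXPERTS = 8; // e.g. Mixtral 8x7B

// topk_per_token[t] holds the expert ids the router picked for token t.
void count_expert_stats(const std::vector<std::vector<int>> &topk_per_token,
                        long totals[N_EXPERTS],
                        long pairs[N_EXPERTS][N_EXPERTS]) {
    for (const std::vector<int> &experts : topk_per_token) {
        for (std::size_t i = 0; i < experts.size(); ++i) {
            totals[experts[i]]++; // per-expert total, as counted already
            for (std::size_t j = i + 1; j < experts.size(); ++j) {
                int a = experts[i];
                int b = experts[j];
                if (a > b) std::swap(a, b);
                pairs[a][b]++; // pairwise co-occurrence within one token
            }
        }
    }
}
```

An expert ordering could then be derived from `pairs`, e.g. by greedily chaining the experts with the highest co-occurrence counts.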