Calls to decode slow down over time during parallel generation #3629
-
Is this expected? I assume it is, but I just thought I'd ask. The effect doesn't seem noticeable with single-sequence generation, but the total sequence lengths involved there are also a lot smaller. For example, running a 70B model with 8 parallel sequences and a shared prompt of 1,230 tokens:
Output from my hacked version of … (per-call timing output omitted)
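For anyone who wants to reproduce this kind of measurement, here is a minimal sketch in Python of timing each decode call. `decode_fn` is a hypothetical stand-in for whatever runs one batched decode step; it is not part of any real API.

```python
import time

def time_decode_steps(decode_fn, n_steps):
    """Minimal sketch: time successive decode calls to observe the
    slowdown as the KV cache fills up. decode_fn is a hypothetical
    callable that runs one batched decode step (e.g. one new token
    for each of the 8 parallel sequences)."""
    timings = []
    for step in range(n_steps):
        t0 = time.perf_counter()
        decode_fn()
        timings.append((step, time.perf_counter() - t0))
    return timings
```

Plotting step number against elapsed time should show roughly linear growth, which matches the explanation in the reply below.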
-
If the prompt is shared then the context is just `1230 + 40*8 = 1550` tokens: the 1,230 shared prompt tokens plus, at that point, 40 generated tokens in each of the 8 sequences.
The computation increases with the sequence length since the KQ and KQV operations grow with the number of tokens.
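To make that growth concrete, here is a back-of-the-envelope sketch of how the per-step KQ work scales with the number of cached tokens. The head count and head dimension are assumed for illustration, not taken from the run above.

```python
# Back-of-the-envelope sketch: per-step KQ work grows linearly with
# the number of tokens already in the KV cache. Head counts and
# dimensions are illustrative assumptions.
def kq_flops_per_step(n_kv, n_batch=8, n_head=64, head_dim=128):
    # For each of the n_batch new tokens, every head takes a dot
    # product of length head_dim against all n_kv cached K vectors.
    return 2 * n_batch * n_head * head_dim * n_kv

for n_kv in (1230, 1550, 3000, 6000):
    print(f"n_kv={n_kv}: {kq_flops_per_step(n_kv) / 1e9:.2f} GFLOPs")
```

Doubling the cached context doubles the KQ (and, analogously, KQV) work per decode call, which is why generation slows down as the sequences get longer.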
The current KV cache implementation computes KQ and KQV for all sequences in the batch in a single pass, masking the attention accordingly. The benefit of this is that we avoid the overhead of splitting the batch into separate attention streams and launching multiple kernels. The drawback is that we go through some extra cross-sequence computations that are technically not needed and are thrown away by the mask.
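To illustrate the trade-off, here is a toy NumPy sketch, not the actual ggml code: a single attention head, with random slot ownership standing in for a real shared prompt. The new tokens from all sequences attend over the whole unified cache in one KQ matmul, and the mask discards the cross-sequence scores that were computed anyway.

```python
import numpy as np

# Toy sketch of a unified KV cache with attention masking.
# n_seq new tokens (one per sequence) attend over all n_kv cached
# tokens in a single KQ matmul; cross-sequence scores are computed
# anyway and then masked out.
n_seq, n_kv, d = 8, 1550, 128
rng = np.random.default_rng(0)

q = rng.standard_normal((n_seq, d))          # one new token per sequence
k = rng.standard_normal((n_kv, d))           # unified cache for all sequences
v = rng.standard_normal((n_kv, d))
seq_id = rng.integers(0, n_seq, size=n_kv)   # which sequence owns each slot

kq = q @ k.T / np.sqrt(d)                    # (n_seq, n_kv): one big matmul
mask = seq_id[None, :] == np.arange(n_seq)[:, None]  # own-sequence slots only
kq = np.where(mask, kq, -np.inf)             # masked entries are thrown away

w = np.exp(kq - kq.max(axis=-1, keepdims=True))      # softmax over the cache
w /= w.sum(axis=-1, keepdims=True)
kqv = w @ v                                  # (n_seq, d) attention output
```

The single `q @ k.T` is the one-pass benefit; the entries set to `-inf` are the wasted cross-sequence work that the mask throws away.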