Calls to decode slow down over time during parallel generation #3629

Answered by ggerganov
KerfuffleV2 asked this question in Q&A

> With a shared prompt of 1,230 tokens (and 8 parallel sequences): at the last decode, decode[40], we're dealing with (1230 + 40) * 8 = 10,160 tokens' worth of context.

If the prompt is shared, then the context is just 1230 + 40*8 = 1550 tokens.
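
A quick back-of-the-envelope comparison of the two accountings, using the numbers from the question (a minimal Python sketch; the variable names are just for illustration):

```python
n_prompt, n_steps, n_seqs = 1230, 40, 8

# if every sequence carried its own copy of the prompt:
separate = (n_prompt + n_steps) * n_seqs   # 10160
# with the prompt shared, it occupies the KV cache once:
shared = n_prompt + n_steps * n_seqs       # 1550

print(separate, shared)
```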

The computation increases with the sequence length since the KQ and KQV operations grow with the number of tokens.
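
A rough model of why each decode call gets slower, assuming a shared prompt and one new token per sequence per call (the `n_kv` helper is hypothetical, just to make the growth visible):

```python
# per-step KQ work is roughly proportional to
# (tokens in the batch) x (KV cells attended over)
n_prompt, n_seqs = 1230, 8   # numbers from the question

def n_kv(step):
    # KV cells in use after `step` decode calls
    return n_prompt + step * n_seqs

for step in (1, 10, 20, 40):
    print(step, n_seqs * n_kv(step))   # relative KQ cost of that call
```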
The current KV cache implementation computes KQ and KQV for all sequences in a single pass by masking the attention accordingly. The benefit of this is that we avoid the overhead of splitting the batch into separate attention streams and launching multiple kernels. The drawback is that we go through some extra cross-sequence computations that are technically not needed and are thrown away by the mask.
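
A minimal NumPy sketch of that idea (not llama.cpp's actual kernels: scaling and causal/position masking are omitted, and shared-prompt cells, which belong to multiple sequences, are ignored for brevity):

```python
import numpy as np

n_kv, n_tokens, d = 16, 4, 8           # KV cells, batch tokens, head dim
rng = np.random.default_rng(0)

K = rng.standard_normal((n_kv, d))     # keys for all sequences, interleaved
V = rng.standard_normal((n_kv, d))     # values
Q = rng.standard_normal((n_tokens, d)) # queries for the current batch

cell_seq  = np.arange(n_kv) % 2        # sequence id owning each KV cell
token_seq = np.array([0, 1, 0, 1])     # sequence id of each batch token

# one big KQ matmul for the whole batch: cross-sequence scores are
# computed anyway, then discarded by the mask before the softmax
scores = Q @ K.T                                   # (n_tokens, n_kv)
mask = token_seq[:, None] != cell_seq[None, :]     # cross-sequence entries
scores[mask] = -np.inf

weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)
out = weights @ V                                  # (n_tokens, d), the KQV
```

A per-sequence implementation would instead split Q, K, and V by sequence and launch a separate, smaller matmul per sequence, avoiding the wasted cross-sequence work at the cost of extra kernel launches.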
