Question about chunked prefill #12145
wearegolden announced in Q&A
My understanding is that if I run my model with `enable_chunked_prefill=True` and `max_num_batched_tokens` set equal to the max_length of my model, this is equivalent to running with no chunked prefill, with decodes prioritized over prefills. So my assumption was that if I send requests one by one, there would be no prioritizing to do, since there is only one request to handle at a given time, and enabling or disabling chunked prefill should therefore give the same results. However, this was not the case.
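For concreteness, the setup described above looks roughly like this (the model name, token budget, and prompt are placeholders, not my exact setup):

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    enable_chunked_prefill=True,               # toggled on/off between runs
    max_num_batched_tokens=8192,               # set equal to the model's max length
)

# Requests are sent one by one, so the scheduler only ever sees a single
# sequence at a time and should have nothing to prioritize.
out = llm.generate(["Hello"], SamplingParams(temperature=0.0, max_tokens=16))
```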
I did some digging into the code, and I suspect this part is the reason for the difference in results. Here, `self.block_tables` ends up as an empty list when chunked prefill is disabled, and a non-empty list when it is enabled:

vllm/vllm/attention/backends/flash_attn.py, lines 437 to 444 in a6221a1
This further leads the code paths to diverge here, where the `block_table` argument passed to `flash_attn_varlen_func` differs:

vllm/vllm/attention/backends/flash_attn.py, lines 857 to 904 in a6221a1
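As a toy illustration of what I think is going on (this is my own sketch, not vLLM's actual scheduler; `BLOCK_SIZE` and `schedule_prefill` are invented names): with chunked prefill, every chunk after the first has to attend to KV that earlier chunks already wrote into paged cache blocks, so the attention kernel needs a block table to locate them, whereas a single-shot prefill has the whole prompt's K/V materialized in the current batch and no cached context to look up.

```python
BLOCK_SIZE = 16  # invented KV-cache block size for this sketch

def schedule_prefill(prompt_len: int, chunk_size: int):
    """Yield (chunk_start, chunk_end, block_table) per scheduler step.

    The block table lists the cache blocks holding KV from *earlier*
    chunks, which the current chunk must attend to.
    """
    steps = []
    for start in range(0, prompt_len, chunk_size):
        end = min(start + chunk_size, prompt_len)
        prev_blocks = -(-start // BLOCK_SIZE)  # ceil(start / BLOCK_SIZE)
        steps.append((start, end, list(range(prev_blocks))))
    return steps

# Chunked prefill disabled (one chunk = whole prompt): the block table
# is empty, since there is no previously cached KV to gather.
print(schedule_prefill(48, 48))

# Chunked prefill enabled: later chunks carry non-empty block tables,
# which is the divergence observed in the flash_attn backend.
print(schedule_prefill(48, 16))
```

In this model, the empty vs. non-empty `block_tables` the post observes falls out naturally: disabling chunked prefill makes the whole prompt the "first chunk", so there is never any cached KV to reference, while enabling it forces later chunks through the paged-KV kernel path.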
Can anyone explain why chunked prefill needs `block_tables` in the prefill phase? Thank you in advance.