[CB] Reduce wastage in prefill compute and pad blocks in homogeneous continuous batching #262

Open · wants to merge 15 commits into main from ysc-homog-tkv-opt-joshua
Conversation

@yannicks1 (Collaborator) commented Jun 24, 2025:


Implements the optimization idea by @JRosenkranz: do prefill only up to the next multiple of the block size, then during decode pad the block table with a (valid) block id. This reduces prefill compute and does not waste any valid block ids when whole blocks are padded to make the tkv homogeneous.

solves #255
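
For readers skimming the PR, a minimal sketch of the prefill side of the idea (BLOCK_SIZE and the helper name are placeholders for illustration, not identifiers from this PR): instead of padding a new prompt all the way up to the decode batch's tkv, prefill only covers the prompt right-padded to the next block boundary, and the remaining whole blocks are later represented by padding block ids.

import math

BLOCK_SIZE = 64  # hypothetical block size, for illustration only

def prefill_padded_length(prompt_len: int) -> int:
    # Length actually run through prefill: the prompt right-padded
    # to the next multiple of the block size.
    return math.ceil(prompt_len / BLOCK_SIZE) * BLOCK_SIZE

# Example: a 70-token prompt joining a decode batch with tkv = 256.
# Prefill now covers 128 tokens instead of the full 256; the remaining
# two blocks are covered by padding block ids during decode.
print(prefill_padded_length(70))  # 128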

Signed-off-by: Yannick Schnider <yannick.schnider1@ibm.com>

👋 Hi! Thank you for contributing to vLLM support on Spyre.
Just a reminder: make sure that your code passes all the linting checks, otherwise your PR can't be merged. To do so, first install the linting requirements, then run format.sh and commit the changes. This can be done with uv directly:

uv sync --frozen --group lint --active --inexact

Or this can be done with pip:

uv pip compile --group lint > requirements-lint.txt
pip install -r requirements-lint.txt
bash format.sh

Now you are good to go 🚀

Signed-off-by: Yannick Schnider <yannick.schnider1@ibm.com>
@yannicks1 yannicks1 self-assigned this Jun 26, 2025
Signed-off-by: Yannick Schnider <yannick.schnider1@ibm.com>
@yannicks1 yannicks1 force-pushed the ysc-homog-tkv-opt-joshua branch 2 times, most recently from a7e7ae9 to 49d92f5 Compare June 27, 2025 22:40
yannicks1 and others added 3 commits June 30, 2025 08:02
Signed-off-by: Yannick Schnider <yannick.schnider1@ibm.com>
Signed-off-by: Yannick Schnider <Yannick.Schnider1@ibm.com>
Signed-off-by: Yannick Schnider <Yannick.Schnider1@ibm.com>
@yannicks1 (Collaborator, Author) commented:

Great news: this runs on Spyre 🎉

I just ran cb_spyre_inference.py, which (with the parameters on this branch) exercises all of the new functionality.

cc: @tdoublep @JRosenkranz @joerunde @nikolaospapandreou @sducouedic

@yannicks1 (Collaborator, Author) commented:

bot:test
TEST_FILE=tests/e2e/test_spyre_cb.py MARKERS="spyre"

@yannicks1 (Collaborator, Author) commented:

bot:test
TEST_FILE=tests/e2e/test_spyre_cb.py MARKERS="spyre"

yannicks1 and others added 4 commits July 10, 2025 09:42
Signed-off-by: Yannick Schnider <Yannick.Schnider1@ibm.com>
Signed-off-by: Yannick Schnider <Yannick.Schnider1@ibm.com>
Signed-off-by: Yannick Schnider <yannick.schnider1@ibm.com>
@yannicks1 (Collaborator, Author) commented:

bot:test
TEST_FILE=tests/e2e/test_spyre_cb.py MARKERS="spyre"

@yannicks1 (Collaborator, Author) commented:

6/7 tests passed on the Spyre card! Looks like the failure is a known issue unrelated to this PR. 🥳

Signed-off-by: Yannick Schnider <yannick.schnider1@ibm.com>
@yannicks1 (Collaborator, Author) commented:

bot:test
TEST_FILE=tests/e2e/test_spyre_cb.py MARKERS="spyre"

@yannicks1 (Collaborator, Author) commented:

bot:test
MARKERS="spyre"

@yannicks1 (Collaborator, Author) commented:

bot:test
TEST_FILE=tests/e2e/test_spyre_cb_scheduler_step.py MARKERS="spyre"

Signed-off-by: Yannick Schnider <Yannick.Schnider1@ibm.com>
@yannicks1 yannicks1 changed the title [do not merge][CB] Reduce wastage in prefill compute and pad blocks in homogeneous continuous batching [CB] Reduce wastage in prefill compute and pad blocks in homogeneous continuous batching Jul 23, 2025
@yannicks1 yannicks1 marked this pull request as ready for review July 23, 2025 16:24
@@ -139,4 +139,4 @@
print("-----------------------------------")

if not any_differ:
print("\nAll results match!\n")
print("\nAll results match!\n")
@prashantgupta24 (Collaborator) commented Jul 23, 2025:

nit (should revert changes to this file)

Suggested change
print("\nAll results match!\n")
print("\nAll results match!\n")

Comment from a Collaborator:

worth adding some debug logs to the optimizations?

# Fill the block table with padding blocks at the front (reusing block id 0)
# so this request's table covers the same number of blocks as the batch tkv.
blocks = self.req_ids2blocks[req_id].copy()
for i in range(n_blocks - len(self.req_ids2blocks[req_id])):
    blocks.appendleft(0)
Comment from a Collaborator:

Perhaps this is an incredibly stupid question, but why is it OK to use block id 0? Does it make a difference whether it is free (i.e. it's in self.block_pool) or not?

Comment from a Collaborator:
Based on Josh's comment in an internal issue,

"The only requirement when padding this is that we choose a block ID that exists in the pool of allotted block ids at server start (this way it will map to a real location in the memory space). In this case, when performing paged attention compute, the placeholder block will be part of compute (we will still take a performance hit during decode - at least until heterogeneous tkv is available), but will not take up any extra space as part of the block allotment at the time of server start."

IIUC I think that's why block 0 makes sense?
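
To make that concrete, here is a rough sketch of the constraint Josh describes (allotted_block_ids and the helper name are hypothetical, not this repo's API): the padding id just has to be one of the block ids allotted at server start, so that it maps to a real location in memory; paged attention will still read it (the performance hit mentioned above), but no extra block is consumed on behalf of the padded request.

from collections import deque

def pad_block_table(blocks: deque, n_blocks: int, allotted_block_ids: set) -> deque:
    # Reuse an id allotted at server start (here simply 0) as a placeholder:
    # it maps to real memory, so attention compute over it is valid, but it
    # does not take an extra block out of this request's allotment.
    pad_id = 0
    assert pad_id in allotted_block_ids, "pad id must map to a real block"
    padded = blocks.copy()
    for _ in range(n_blocks - len(padded)):
        padded.appendleft(pad_id)
    return padded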

Comment from a Collaborator:

Although yeah I'm not sure what will happen if block 0 is actually being used by another request?

@prashantgupta24 (Collaborator) commented Jul 23, 2025:
I managed to create a scenario where the block table looks like this for 2 requests:

block table: [deque([0, 2]), deque([0, 4])]

From what I can see the output is still correct, although anyone looking at the block table could be confused.
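
A minimal reproduction of that block table with plain deques (block ids 2 and 4 stand in for the two requests' real blocks; this mirrors the scenario above rather than the scheduler code):

from collections import deque

block_table = []
for real_block in (2, 4):       # one real block per request
    blocks = deque([real_block])
    blocks.appendleft(0)        # block id 0 reused as padding by both requests
    block_table.append(blocks)

print(block_table)  # [deque([0, 2]), deque([0, 4])]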

@@ -884,8 +892,10 @@ def _prepare_prompt(
# applies left padding to align with tkv of current decode batch
# and right padding to align with the next block boundary
input_tokens, position_ids, mask =\
Comment from a Collaborator:
Still processing the changes, but wondering if the comment above needs rewording...
