Commit e34d130

[TPU] Temporary fix vmem oom for long model len by reducing page size (#20278)
Signed-off-by: Chenyaaang <chenyangli@google.com>
1 parent: 7721ef1

1 file changed: +6 -0 lines changed

vllm/v1/attention/backends/pallas.py

Lines changed: 6 additions & 0 deletions
@@ -86,6 +86,12 @@ def get_max_num_seqs(model_len: int, page_size: int) -> int:
     # spill less likely. Meanwhile we make sure the page size is in [16, 256].
     @staticmethod
     def get_page_size(vllm_config: VllmConfig) -> int:
+        # TODO: This is a temporary fix for vmem OOM.
+        # For long model length, we use 16 page-size to avoid too much
+        # VMEM spill. A more robust solution should be implemented to
+        # handle VREG spills.
+        if vllm_config.model_config.max_model_len > 8192:
+            return 16
         page_size = next_power_of_2(
             vllm_config.model_config.max_model_len) // 16
         if page_size <= 16:
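
For context, below is a minimal standalone sketch (not the vLLM source) of how get_page_size behaves after this change. It assumes next_power_of_2 rounds up to the nearest power of two (re-implemented here for illustration) and that the result is clamped to the [16, 256] range mentioned in the comment above the method; the example model lengths are hypothetical.

    # Sketch of the page-size selection after this commit.
    # Assumptions: next_power_of_2 rounds up to the nearest power of two,
    # and the result is clamped to [16, 256] as the surrounding comment states.

    def next_power_of_2(n: int) -> int:
        # Smallest power of two >= n (assumed semantics of vLLM's helper).
        return 1 if n <= 1 else 1 << (n - 1).bit_length()

    def get_page_size(max_model_len: int) -> int:
        # Temporary fix: for long model lengths, use the smallest page size
        # (16) so the kernel is less likely to spill VMEM.
        if max_model_len > 8192:
            return 16
        # Otherwise map the model length to a power-of-two page size and
        # clamp it to [16, 256].
        page_size = next_power_of_2(max_model_len) // 16
        if page_size <= 16:
            return 16
        if page_size >= 256:
            return 256
        return page_size

    if __name__ == "__main__":
        for model_len in (2048, 4096, 8192, 16384):
            print(model_len, get_page_size(model_len))
        # 2048 -> 128, 4096 -> 256, 8192 -> 256, 16384 -> 16 (the new early return)

In this sketch, model lengths up to 8192 keep the original power-of-two mapping, while anything above 8192 falls back to a 16-token page, which is exactly the behavior the added lines introduce.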
