Gemma3 v1.22 changes (Sliding_Window feature + few others) #1660

jiminha · 2025-07-25T18:10:03Z

This PR contains the following changes:

Port the Gemma3 SLIDING_WINDOW FusedSDPA feature from habana_main, plus a few extra fixes:

For the sliding FusedSDPA kernel, add a threshold variable that enables or disables the optimized kernel. The kernel improves performance and memory use for longer sequences. The threshold is exposed as an environment variable, per customer request.
Choose the prompt bucket step based on the threshold: for prompts shorter than the threshold, use PROMPT_BUCKET_STEP; otherwise use SLICE_SIZE.
Add mark_step before the sliding FusedSDPA kernel runs.
Misc fixes for bucket-related issues.
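
The bucket-step rule in the list above can be sketched as follows. This is a minimal illustration, not the code from this PR: the helper names `pick_prompt_bucket_step` and `round_up_to_bucket` are hypothetical, and the sketch assumes the value compared against the threshold is the prompt sequence length.

```python
def pick_prompt_bucket_step(seq_len: int, threshold: int,
                            prompt_bucket_step: int, slice_size: int) -> int:
    """Return the bucket step for a prompt of length seq_len (hypothetical helper)."""
    # Assumption: prompts shorter than the threshold keep the regular step,
    # while longer prompts are bucketed in multiples of SLICE_SIZE.
    return prompt_bucket_step if seq_len < threshold else slice_size


def round_up_to_bucket(seq_len: int, step: int) -> int:
    """Round seq_len up to the next multiple of step."""
    return -(-seq_len // step) * step


# Illustrative values only: threshold 8192, regular step 128, slice size 1024.
short_step = pick_prompt_bucket_step(3000, 8192, 128, 1024)
long_step = pick_prompt_bucket_step(9000, 8192, 128, 1024)
short_bucket = round_up_to_bucket(3000, short_step)
long_bucket = round_up_to_bucket(9000, long_step)
```

Bucketing long prompts in multiples of the slice size would keep their padded length aligned with what the sliding kernel expects, which is presumably why the step changes at the threshold.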

Upstream fixes:
[Model][Gemma3] Cast image pixel values already on CPU vllm-project/vllm#18732
[Model] Fix a check for None but the return value was empty list in Gemma3 MM vision_embeddings vllm-project/vllm#21479
Mark invariant normalizer in Gemma as non-persistent vllm-project/vllm#19788
Optimized Gemma3RMSNorm with FusedRMSNorm
Depends on: Fixes from PR#1635 applied to v1.22.0_next branch #1647

Run the command with the following environment variables:
VLLM_FUSEDSDPA_SLIDE_THLD=2048 VLLM_EXPONENTIAL_BUCKETING=false VLLM_PROMPT_BS_BUCKET_MAX=64 VLLM_PROMPT_SEQ_BUCKET_STEP=1024 VLLM_PROMPT_SEQ_BUCKET_MAX=20480 PT_HPU_SDPA_QKV_SLICE_MODE_FWD=1

The FusedSDPA kernel with window_size + causal only works when seq_len is a multiple of SLICE_SIZE. Otherwise, fall back to the original implementation, which creates an attention_mask with window_size.
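
A sketch of the fallback described above, written in plain Python for illustration only. The names `can_use_window_kernel` and `sliding_window_causal_mask` are hypothetical, and the mask convention (each query attends to itself and the previous `window_size - 1` positions) is an assumption; the kernel check and mask layout used in this PR may differ.

```python
def can_use_window_kernel(seq_len: int, slice_size: int) -> bool:
    """True when seq_len is a positive multiple of slice_size (hypothetical check)."""
    return seq_len > 0 and seq_len % slice_size == 0


def sliding_window_causal_mask(seq_len: int, window_size: int) -> list[list[bool]]:
    """Fallback path: build an explicit causal mask limited to window_size.

    mask[i][j] is True when query position i may attend to key position j.
    Assumed convention: the window includes the current position.
    """
    return [[(j <= i) and (i - j < window_size) for j in range(seq_len)]
            for i in range(seq_len)]


# Illustrative dispatch: use the fused kernel when the length is aligned,
# otherwise build the mask and take the original attention path.
seq_len, slice_size, window_size = 1500, 1024, 4
mask = None if can_use_window_kernel(seq_len, slice_size) \
    else sliding_window_causal_mask(seq_len, window_size)
```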

- Move the seq_len check for use_sdpa_window to attn_metadata
- Automatically set all environment variables

Additional changes from PR1597

remove print statement

For Sliding FusedSDPA kernel, add threshold variable to enable or disable.

jiminha · 2025-07-28T19:19:54Z

This has co-dependency with #1660

QKV with the sliding kernel in causal mode is not supported yet. However, when attn_mask is present, QKV is supported, so there is no need to repeat KV.

vllm/worker/hpu_model_runner.py

Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>

Signed-off-by: Hongmin Fan <fanhongmin@google.com>

michalkuligowski · 2025-07-30T06:56:51Z

vllm/worker/hpu_model_runner.py

+            self.slice_size = int(os.getenv("PT_HPU_SDPA_BC_FACTOR", "1024"))
+            self.sliding_window_thld = int(
+                os.environ.get('VLLM_FUSEDSDPA_SLIDE_THLD', '8192'))


Are those two new env variables? Shouldn't they be in the README?

The PT one is not new; it is related to the internal flashattn3 implementation. For vLLM, we try to provide the optimized value ourselves for now.
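
For reference, the two settings in the diff above could be read through a small helper like the one below. This is a sketch rather than code from this PR: `read_int_env` is a hypothetical name and the error message is an addition. Only the variable names and default values are taken from the diff.

```python
import os


def read_int_env(name: str, default: int) -> int:
    """Read an integer environment variable, falling back to default when unset."""
    raw = os.environ.get(name)
    if raw is None or raw.strip() == "":
        return default
    try:
        return int(raw)
    except ValueError as exc:
        # Fail with the variable name so a bad setting is easy to trace.
        raise ValueError(f"{name} must be an integer, got {raw!r}") from exc


# Names and defaults as they appear in the diff.
slice_size = read_int_env("PT_HPU_SDPA_BC_FACTOR", 1024)
sliding_window_thld = read_int_env("VLLM_FUSEDSDPA_SLIDE_THLD", 8192)
```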

michalkuligowski · 2025-07-31T06:30:08Z

requirements/hpu.txt

@@ -7,4 +7,4 @@ ray
 triton==3.1.0
 setuptools>=77.0.3
 setuptools-scm>=8
-vllm-hpu-extension @ git+https://github.com/HabanaAI/vllm-hpu-extension.git@61dafb3
+vllm-hpu-extension @ git+https://github.com/HabanaAI/vllm-hpu-extension.git@009adb2


Code looks good. Please update the SHA to the right one once HabanaAI/vllm-hpu-extension#315 is merged (it will differ from 009adb2, which is only visible on the PR branch).

michalkuligowski · 2025-07-31T06:30:30Z

/run-gaudi-tests

jiminha and others added 20 commits July 24, 2025 10:57

Added support for FusedSDPA with window_size

3131947

Add envi variable to check validity of window kernel

4247aa1

FusedSDPA kernel with window_size+causal only works when seq_len is multiple of SLICE_SIZE. If not, fallback to the original implementation which creates attention_mask with window_size

Changes after code review

5e2ee1f

-Move the seq_len check for use_sdpa_window to attn_metadata -Automatically set all environment variable

Update hpu extension version to dev commit

bed05d6

Update hpu extension requirement commit id

4f2edff

merge #PR1589 onto v1.22.0

1c45fe7

Update hpu_model_runner.py with PR1597

8769df2

Additional changes from PR1597

Update requirements/hpu.txt based on PR1614

7c049a0

added missing definitions

bffcc67

Gemma3 related changes for 1.22

6e059b6

Update utils.py

30ef70b

remove print statement

more fixes after merging 1597

c56d886

fix bypass_model_exec

25e9196

fix precommit error

917a861

change requirements/hpu.txt mode

22d0809

Modifications from PR#1635 (rebased on PR#1616)

f95cc0f

fixes for pre-commit checks

ad61fa0

minor pre-commit fix

d6d49d1

rebased on latest v1.22.0_next

518dab2

Add new variable VLLM_FUSEDSDPA_SLIDE_THLD(default 8192)

c5d4697

For Sliding FusedSDPA kernel, add threshold variable to enable or disable.

jiminha requested review from kzawora-intel, madamczyk-intel, michalkuligowski, mgawarkiewicz-intel, vivekgoe, afierka-intel, xuechendi, jikunshang, mswiniarsk and PatrykWo as code owners July 25, 2025 18:10

fix graph mode for image is missing when not warmup

74a991d

jiminha changed the title ~~Add new variable VLLM_FUSEDSDPA_SLIDE_THLD(default 8192)~~ Gemma3 Sliding_Window feature for v1.22 Jul 28, 2025

jiminha added 4 commits July 28, 2025 15:00

Adjust profiler for image models

239d20a

Only KV repeat when attn_mask is None(causal=True)

a080a20

QKV with sliding kernel with causal is not supported yet. However when attn_mask is there, QKV is supported, so no need for repeat KV.

pre-commit fix

11344f5

Update hpu requirement.txt

0a857b7

libinta reviewed Jul 29, 2025

vllm/worker/hpu_model_runner.py

jiminha and others added 8 commits July 28, 2025 21:29

Move SLIDE_WINDOW_RIGHT to hpu_attn only

a7aacfb

Updated based on the comments

559b049

[Model][Gemma3] Cast image pixel values already on CPU

e91c530

Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>

Fix a check for None but the return value was empty list in Gemma3 MM.

7da629a

Signed-off-by: Hongmin Fan <fanhongmin@google.com>

Mark invariant normalizer in Gemma as non-persistent#19788

49c55f3

code review changes

9a31806

Merge branch 'v1.22.0_next' into jha/slidingthld

875ffd2

Use FusedRMSNorm for gemma

4458a67

shepark force-pushed the jha/slidingthld branch from e4ed765 to 4458a67 Compare July 30, 2025 05:19

libinta changed the title ~~Gemma3 Sliding_Window feature for v1.22~~ Gemma3 v1.22 changes (Sliding_Window feature + few others) Jul 30, 2025

libinta added 2 commits July 29, 2025 23:13

Update layernorm.py

7f39ec1

Update layernorm.py

eeeb277

michalkuligowski reviewed Jul 30, 2025

Jianhong-Zhang force-pushed the jha/slidingthld branch from ffa1c79 to d33c55c Compare July 30, 2025 20:58

Use FusedRMSNorm for gemma

8eda4aa

Jianhong-Zhang force-pushed the jha/slidingthld branch from d33c55c to 8eda4aa Compare July 31, 2025 01:01

hsubramony and others added 3 commits July 30, 2025 18:32

[SW-235104][vLLM] pipeline_entrypoints - matmul(): argument 'input' (…

9c52acf

…position 1) must be Tensor, not NoneType

Merge branch 'v1.22.0_next' into jha/slidingthld

12777f1

pre-commit fix

eb888dc

michalkuligowski approved these changes Jul 31, 2025

Update hpu.txt with vllm-hpu-extension

1ec2cf5
