
Gemma3 v1.22 changes (Sliding_Window feature + few others) #1660

Open
wants to merge 46 commits into
base: v1.22.0_next

Conversation

@jiminha jiminha commented Jul 25, 2025

This PR contains the following changes:

  1. Port the Gemma3 SLIDING_WINDOW FusedSDPA feature from habana_main, plus a few extra fixes:
  • Sliding FusedSDPA kernel: add a threshold variable (VLLM_FUSEDSDPA_SLIDE_THLD) that enables or disables the optimized kernel. The kernel improves performance and memory use for longer sequences. The threshold is exposed as an environment variable per customer request.
  • Choose the prompt bucket based on the threshold: if the sequence length is smaller than the threshold, use PROMPT_BUCKET_STEP; otherwise use SLICE_SIZE.
  • Add mark_step before the sliding FusedSDPA kernel runs.
  • Misc fixes for bucket-related issues.
  2. Upstream fixes:
    [Model][Gemma3] Cast image pixel values already on CPU vllm-project/vllm#18732
    [Model] Fix a check for None but the return value was empty list in Gemma3 MM vision_embeddings vllm-project/vllm#21479
    Mark invariant normalizer in Gemma as non-persistent vllm-project/vllm#19788

  3. Optimize Gemma3RMSNorm with FusedRMSNorm
    Depends on: Fixes from PR#1635 applied to v1.22.0_next branch #1647

Run the command with the following environment variables:
VLLM_FUSEDSDPA_SLIDE_THLD=2048 VLLM_EXPONENTIAL_BUCKETING=false VLLM_PROMPT_BS_BUCKET_MAX=64 VLLM_PROMPT_SEQ_BUCKET_STEP=1024 VLLM_PROMPT_SEQ_BUCKET_MAX=20480 PT_HPU_SDPA_QKV_SLICE_MODE_FWD=1
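
The threshold-based bucket choice described in the list above could be sketched roughly as follows. This is only an illustration, not the code in this PR: the helper names `round_up` and `pick_prompt_bucket` are hypothetical, and rounding the sequence length up to a multiple of the chosen step is an assumption about how the bucket is formed. The two environment variable defaults are the ones shown in the diff under review.

```python
import os


def round_up(value: int, multiple: int) -> int:
    """Round value up to the nearest multiple of `multiple`."""
    return ((value + multiple - 1) // multiple) * multiple


def pick_prompt_bucket(seq_len: int, threshold: int,
                       bucket_step: int, slice_size: int) -> int:
    """Hypothetical helper: choose the padded prompt length.

    Below the threshold the regular prompt bucket step is used;
    at or above it the slice size is used instead.
    Padding to a multiple of the step is an assumption.
    """
    step = bucket_step if seq_len < threshold else slice_size
    return round_up(seq_len, step)


# Defaults as they appear in the reviewed diff.
threshold = int(os.environ.get("VLLM_FUSEDSDPA_SLIDE_THLD", "8192"))
slice_size = int(os.getenv("PT_HPU_SDPA_BC_FACTOR", "1024"))

# Explicit values so the example does not depend on the environment.
short_prompt = pick_prompt_bucket(1500, threshold=2048,
                                  bucket_step=128, slice_size=1024)
long_prompt = pick_prompt_bucket(3000, threshold=2048,
                                 bucket_step=128, slice_size=1024)
```

With these illustrative values, a 1500-token prompt is padded using the bucket step of 128, while a 3000-token prompt is padded using the slice size of 1024.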

Author

jiminha commented Jul 28, 2025

This has co-dependency with #1660

@jiminha jiminha changed the title Add new variable VLLM_FUSEDSDPA_SLIDE_THLD(default 8192) Gemma3 Sliding_Window feature for v1.22 Jul 28, 2025
jiminha added 4 commits July 28, 2025 15:00
QKV with the sliding kernel and causal masking is not supported yet. However,
when attn_mask is present, QKV is supported, so there is no need to repeat
KV.
@libinta libinta changed the title Gemma3 Sliding_Window feature for v1.22 Gemma3 v1.22 changes (Sliding_Window feature + few others) Jul 30, 2025
Comment on lines +356 to +358
self.slice_size = int(os.getenv("PT_HPU_SDPA_BC_FACTOR", "1024"))
self.sliding_window_thld = int(
os.environ.get('VLLM_FUSEDSDPA_SLIDE_THLD', '8192'))

Are those two new env variables? Shouldn't they be in the README?

The PT one is not new; it is related to the internal flashattn3 implementation. For vLLM, we try to set the optimized value ourselves for now.

@@ -7,4 +7,4 @@ ray
triton==3.1.0
setuptools>=77.0.3
setuptools-scm>=8
-vllm-hpu-extension @ git+https://github.com/HabanaAI/vllm-hpu-extension.git@61dafb3
+vllm-hpu-extension @ git+https://github.com/HabanaAI/vllm-hpu-extension.git@009adb2

@michalkuligowski michalkuligowski Jul 31, 2025

Code looks good. Please update the SHA to the correct one once HabanaAI/vllm-hpu-extension#315 is merged (it will differ from 009adb2, which is only visible on the PR branch).

@michalkuligowski

/run-gaudi-tests

10 participants