Gemma3 v1.22 changes (Sliding_Window feature + a few others) #1660
base: v1.22.0_next
Conversation
The FusedSDPA kernel with window_size + causal only works when seq_len is a multiple of SLICE_SIZE. If not, fall back to the original implementation, which creates an attention_mask with window_size.
- Move the seq_len check for use_sdpa_window to attn_metadata
- Automatically set all environment variables
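A minimal sketch of this gating and of the fallback mask (function names are hypothetical; only the seq_len % SLICE_SIZE condition and the windowed causal mask come from the commit message above):

```python
import torch

def use_sdpa_window(seq_len: int, slice_size: int, window_size: int) -> bool:
    # FusedSDPA with window_size + causal requires seq_len to be a
    # multiple of the slice size; otherwise the fallback path is taken.
    return window_size > 0 and seq_len % slice_size == 0

def sliding_window_mask(seq_len: int, window_size: int) -> torch.Tensor:
    # Fallback: a causal attention_mask restricted to the sliding window,
    # i.e. position i may attend to j only if 0 <= i - j < window_size.
    idx = torch.arange(seq_len)
    diff = idx.unsqueeze(1) - idx.unsqueeze(0)  # diff[i, j] = i - j
    return (diff >= 0) & (diff < window_size)
```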
Additional changes from PR #1597
remove print statement
For the sliding FusedSDPA kernel, add a threshold variable to enable or disable it.
This has a co-dependency with #1660.
QKV slicing with the sliding kernel in causal mode is not supported yet. However, when an attn_mask is present, QKV slicing is supported, so there is no need to repeat KV (see the sketch below).
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
Signed-off-by: Hongmin Fan <fanhongmin@google.com>
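For context, "repeat KV" refers to expanding grouped key/value heads to match the query heads. A sketch of the common GQA helper that becomes unnecessary on the attn_mask path (this is the standard formulation, not necessarily this PR's exact code):

```python
import torch

def repeat_kv(x: torch.Tensor, n_rep: int) -> torch.Tensor:
    # Expand KV heads to match the number of query heads (GQA).
    # Skipped when the kernel itself supports grouped KV, as in the
    # attn_mask case described above.
    if n_rep == 1:
        return x
    b, h_kv, s, d = x.shape
    return x[:, :, None].expand(b, h_kv, n_rep, s, d).reshape(b, h_kv * n_rep, s, d)
```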
self.slice_size = int(os.getenv("PT_HPU_SDPA_BC_FACTOR", "1024"))
self.sliding_window_thld = int(
    os.environ.get('VLLM_FUSEDSDPA_SLIDE_THLD', '8192'))
Are those two new env variables? Shouldn't they be in the README?
The PT one is not new; it's related to the internal flashattn3 implementation. For vLLM, we try to provide the optimized number ourselves for now.
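A sketch of how these two values might gate the sliding kernel; the exact comparison is an assumption, only the variable names and defaults come from the diff above:

```python
def should_use_sliding_fused_sdpa(self, seq_len: int, window_size: int) -> bool:
    # Assumed semantics: take the fused sliding-window path only for
    # sufficiently long sequences, and only when seq_len divides evenly
    # into PT_HPU_SDPA_BC_FACTOR-sized slices (see the commit message above).
    return (window_size > 0
            and seq_len >= self.sliding_window_thld
            and seq_len % self.slice_size == 0)
```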
…position 1) must be Tensor, not NoneType
requirements/hpu.txt (Outdated)
@@ -7,4 +7,4 @@ ray
 triton==3.1.0
 setuptools>=77.0.3
 setuptools-scm>=8
-vllm-hpu-extension @ git+https://github.com/HabanaAI/vllm-hpu-extension.git@61dafb3
+vllm-hpu-extension @ git+https://github.com/HabanaAI/vllm-hpu-extension.git@009adb2
Code looks good. Please update the sha to the right one once HabanaAI/vllm-hpu-extension#315 is merged (it will differ from 009adb2, which is only visible on the PR branch).
/run-gaudi-tests
This PR contains the following changes:
- Upstream fixes:
  - [Model][Gemma3] Cast image pixel values already on CPU (vllm-project/vllm#18732)
  - [Model] Fix a check for None but the return value was empty list in Gemma3 MM vision_embeddings (vllm-project/vllm#21479)
  - Mark invariant normalizer in Gemma as non-persistent (vllm-project/vllm#19788)
- Optimized Gemma3RMSNorm with FusedRMSNorm (see the sketch after this list)
- Dependent on "Fixes from PR #1635 applied to v1.22.0_next branch" (#1647)
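For reference, the unfused Gemma3 RMSNorm math that FusedRMSNorm replaces looks roughly like this; a sketch of the standard Gemma formulation (zero-initialized weight with a (1 + weight) scale), not this PR's exact code:

```python
import torch

class Gemma3RMSNorm(torch.nn.Module):
    # Reference (unfused) form; the PR swaps this math for FusedRMSNorm on HPU.
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = torch.nn.Parameter(torch.zeros(dim))  # scale is (1 + weight)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        var = x.float().pow(2).mean(-1, keepdim=True)
        x_norm = x.float() * torch.rsqrt(var + self.eps)
        return (x_norm * (1.0 + self.weight.float())).type_as(x)
```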
Run the command with:
VLLM_FUSEDSDPA_SLIDE_THLD=2048 VLLM_EXPONENTIAL_BUCKETING=false VLLM_PROMPT_BS_BUCKET_MAX=64 VLLM_PROMPT_SEQ_BUCKET_STEP=1024 VLLM_PROMPT_SEQ_BUCKET_MAX=20480 PT_HPU_SDPA_QKV_SLICE_MODE_FWD=1
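Equivalently, these knobs can be set programmatically (a sketch; the values are the ones from the run line above, and they must be in the environment before vLLM initializes so they are read at engine construction):

```python
import os

# Tuning knobs from the run line above.
os.environ.setdefault("VLLM_FUSEDSDPA_SLIDE_THLD", "2048")
os.environ.setdefault("VLLM_EXPONENTIAL_BUCKETING", "false")
os.environ.setdefault("VLLM_PROMPT_BS_BUCKET_MAX", "64")
os.environ.setdefault("VLLM_PROMPT_SEQ_BUCKET_STEP", "1024")
os.environ.setdefault("VLLM_PROMPT_SEQ_BUCKET_MAX", "20480")
os.environ.setdefault("PT_HPU_SDPA_QKV_SLICE_MODE_FWD", "1")
```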