Update fp8 paged attention for MI308 #592

Draft · wants to merge 18 commits into base: main

Conversation


@amd-xiaoyu12 commented Jul 9, 2025

Please direct your PRs to the upstream vLLM repository (https://github.com/vllm-project/vllm.git).

Accepting PRs into the ROCm fork (https://github.com/ROCm/vllm) will require a clear, previously communicated exception.

Summary:
Support full fp8 MFMA with warp-level dynamic query quantization to improve fp8 performance on MI308, which can also benefit other MI300-series accelerators and newer hardware.
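
As a rough illustration of the warp-level dynamic query quantization mentioned above, the PyTorch sketch below computes a per-token, per-head scale from the query's maximum magnitude and casts the query to fp8; in the kernel, the reduction producing that maximum is performed by the warp that owns the token. Function names, tensor shapes, and the fp8 variant are illustrative assumptions, not the actual kernel interface in this PR.

```python
import torch

# FP8 format assumed here: e4m3fn (max ~448). MI300-class GPUs use the
# e4m3fnuz variant (max 240); the quantization logic is the same.
FP8_MAX = torch.finfo(torch.float8_e4m3fn).max

def quantize_query_per_token(query: torch.Tensor):
    """Dynamically quantize each query token to fp8.

    query: [num_tokens, num_heads, head_size] in fp16/bf16.
    Returns (q_fp8, scales) with scales shaped [num_tokens, num_heads, 1].
    """
    # Per-token, per-head max magnitude (done by one warp per token in-kernel).
    amax = query.abs().amax(dim=-1, keepdim=True).clamp(min=1e-6)
    scales = amax / FP8_MAX
    q_fp8 = (query / scales).to(torch.float8_e4m3fn)
    return q_fp8, scales

# After QK^T runs on fp8 MFMA, the accumulator is rescaled by
# (query_scale * key_scale) before the softmax.
```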

  • Performance (results shown in attached image)
  • Unit test - attention output (results shown in attached image)
  • lm-eval-harness ppl test (results shown in attached image)

@amd-xiaoyu12 amd-xiaoyu12 changed the title Update fp8 paged attention Update fp8 paged attention for MI308 Jul 9, 2025