[Kernel][Hardware][AMD] Bf16 mfma opt for ROCm skinny GEMMs #17071
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
@amd-hhashemi hi.
Signed-off-by: tjtanaavllm <tunjian.tan@amd.com>
Hey, this optimization shows a 25% speedup for Llama 3 bf16 at batch-1 on MI300. The prior implementation does an expensive bf16-to-float conversion followed by FMA ops. This optimization avoids that by using MFMA instructions instead, which is much more efficient.
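For context on the "expensive conversion" the comment refers to: a bf16 value is just the top 16 bits of an f32, so the pre-MFMA path has to widen every operand to f32 before each FMA, whereas MFMA instructions consume bf16 packs directly on the matrix cores. A minimal host-side Python sketch of the widening step (function names here are illustrative, not taken from the kernel):

```python
import struct

def bf16_to_f32(bits: int) -> float:
    """Widen a bf16 bit pattern to f32: bf16 is the top 16 bits of an
    f32, so the conversion is a 16-bit left shift of the raw bits."""
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]

def f32_to_bf16(x: float) -> int:
    """Truncate an f32 to bf16 (round-toward-zero; illustration only)."""
    return struct.unpack("<I", struct.pack("<f", x))[0] >> 16

# The scalar fallback does this widening per element, then an f32 FMA:
a = bf16_to_f32(0x3F80)   # bf16 pattern for 1.0
b = bf16_to_f32(0x4000)   # bf16 pattern for 2.0
acc = a * b + 0.0
```

On the GPU the MFMA path replaces this per-element widening with matrix-core instructions that accept bf16 inputs directly, which is where the savings come from.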
Hi, @amd-hhashemi. Thanks for the contribution! Could you run a quick serving benchmark to make sure there are no obvious perf regressions? I'm somewhat fuzzy on the exact cases in which skinny GEMM is enabled, but I assume it will be used for Llama 3.1 8B. Additionally, can you post the benchmark you ran that gives the 25% speedup? Serving commands: …
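For readers unfamiliar with the term: a "skinny" GEMM here is a matmul where one output dimension is tiny, as in batch-1 decode, where a 1×K activation row multiplies a K×N weight matrix. That regime is memory-bound on the weights, which is why a dedicated kernel can beat the general GEMM. A pure-Python toy of the shape (shapes are illustrative, not the kernel's actual tiling):

```python
def skinny_gemm(x, w):
    """Toy M=1 GEMM: x is a length-K activation row, w is a K x N weight
    matrix; returns the length-N output row. At M=1 each weight element
    is read once and used once, so the op is bandwidth-bound on w."""
    k, n = len(w), len(w[0])
    assert len(x) == k
    return [sum(x[i] * w[i][j] for i in range(k)) for j in range(n)]

# A 1x3 row times a 3x2 weight matrix:
out = skinny_gemm([1.0, 2.0, 3.0], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```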
Hi @amd-hhashemi Is there a difference between this PR and https://github.com/ROCm/aiter/blob/main/csrc/kernels/custom_kernels.cu ? |
You mean this PR? There is no difference; I wrote that one first, and then the upstream version was merged into ROCm. So the ROCm one can be dropped, and we'll later merge with upstream.
Oh, sorry, I didn't realize you were pointing to AITER.
Hi SageMoore, I will run a serving benchmark. The latency benchmark used: https://github.com/amd-hhashemi/vllm/blob/main/benchmarks/benchmark_latency.py. Original reported latency: ~1.07 sec.
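To make the reported numbers concrete, here is the arithmetic relating the ~1.07 s baseline to the quoted 25% speedup. Note it is an assumption here whether "25% speedup" means a 1.25× throughput speedup or a 25% latency reduction; both readings are shown:

```python
baseline_s = 1.07   # reported batch-1 latency before this PR

# Reading 1: 1.25x throughput speedup -> new latency = old / 1.25
latency_if_throughput = baseline_s / 1.25   # = 0.856 s

# Reading 2: 25% latency reduction -> new latency = old * 0.75
latency_if_reduction = baseline_s * 0.75    # = 0.8025 s
```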
Signed-off-by: charlifu <charlifu@amd.com>
Are there plans to integrate this updated kernel into AITER?
Signed-off-by: charlifu <charlifu@amd.com>
Signed-off-by: charlifu <charlifu@amd.com>
Head branch was pushed to by a user without write access
Signed-off-by: charlifu <charlifu@amd.com>
…ject#17071) Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com> Signed-off-by: charlifu <charlifu@amd.com> Co-authored-by: charlifu <charlifu@amd.com>