[Kernel][Hardware][AMD] Bf16 mfma opt for ROCm skinny GEMMs #17071

amd-hhashemi · 2025-04-23T18:00:02Z

Bf16 mfma opt for ROCm skinny GEMMs

github-actions · 2025-04-23T18:00:12Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

tjtanaa · 2025-04-30T06:46:54Z

@amd-hhashemi hi.
Thank you for optimizing bf16. How much perf gain could we expect on mi300?

amd-hhashemi · 2025-04-30T09:06:38Z

> @amd-hhashemi hi. Thank you for optimizing bf16. How much perf gain could we expect on mi300?

Hey, this optimization shows a 25% speedup on Llama 3 bf16 at batch size 1 on MI300. The prior solution performs an expensive bf16->float conversion followed by FMA ops. This optimization avoids that by using MFMA instructions instead, which is much more efficient.
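To make the "bf16->float conversion followed by FMA ops" path concrete, here is a minimal scalar sketch in Python. It is not the kernel code from this PR: the function names are hypothetical, the conversion uses plain truncation rather than any particular rounding mode, and a Python float stands in for the fp32 accumulator. It only illustrates that every bf16 operand has to be widened before each multiply-add, which is the per-element work that a matrix instruction avoids.

```python
import struct


def float_to_bf16_bits(x: float) -> int:
    """Return the bf16 bit pattern of x by keeping the upper 16 bits of its
    float32 encoding (truncation, no rounding; an assumption for simplicity)."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16


def bf16_bits_to_float(b: int) -> float:
    """Widen a bf16 bit pattern to float32 by placing it in the upper 16 bits."""
    (x,) = struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))
    return x


def dot_convert_then_fma(a_bits, b_bits) -> float:
    """Dot product in the 'convert, then multiply-add' style:
    each element is widened to float before it is accumulated."""
    acc = 0.0  # Python float, standing in for an fp32 accumulator
    for a, b in zip(a_bits, b_bits):
        acc += bf16_bits_to_float(a) * bf16_bits_to_float(b)
    return acc


if __name__ == "__main__":
    # Values chosen to be exactly representable in bf16.
    a = [float_to_bf16_bits(v) for v in (1.0, 2.0, 0.5)]
    b = [float_to_bf16_bits(v) for v in (2.0, 0.5, 4.0)]
    print(dot_convert_then_fma(a, b))
```

An MFMA-based kernel instead feeds blocks of bf16 values to a matrix instruction that accumulates in fp32, so the explicit widening step in the inner loop goes away.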

SageMoore · 2025-05-01T13:23:56Z

Hi, @amd-hhashemi. Thanks for the contribution! Could you run a quick serving benchmark to make sure there are no obvious perf regressions? I'm somewhat fuzzy on the exact cases in which skinny GEMM is enabled, but I assume that it will be used in Llama 3.1 8B.

Additionally, can you post the benchmark you ran that is giving you 25% speedup?

Serving commands:
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 4444 --disable-log-requests
followed by:
python benchmarks/benchmark_serving.py --model meta-llama/Llama-3.1-8B-Instruct --dataset-name sharegpt --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json --ignore-eos --port 4444

tjtanaa · 2025-05-01T15:24:52Z

Hi @amd-hhashemi Is there a difference between this PR and https://github.com/ROCm/aiter/blob/main/csrc/kernels/custom_kernels.cu ?

amd-hhashemi · 2025-05-01T18:29:19Z

You mean this PR?
ROCm#520

There is no difference. I wrote that first, and then the upstream version was merged to ROCm. So the ROCm one can be dropped and we'll later merge with upstream.

amd-hhashemi · 2025-05-01T18:35:59Z

Oh sorry, I didn't realize you were pointing to Aiter.
I didn't know it had been pulled into Aiter.
Although it seems to be the original version, before fp8 or bf16 support was added.

amd-hhashemi · 2025-05-01T19:13:52Z

Hi SageMoore, I will run the serving benchmark.
This is what I ran:
python benchmarks/benchmark_latency.py --model /data/Meta-Llama-3-8B-Instruct --batch-size 1 --dtype bfloat16

[https://github.com/amd-hhashemi/vllm/blob/main/benchmarks/benchmark_latency.py]

Original reported latency: ~1.07 s
After this optimization: ~0.84 s
(so it's actually more like a ~22% speedup)
[Note: these numbers were measured on a downsized version of MI300, but since this is compute-bound, it should be the same on a full MI300. I will verify that too.]
The skinny gemms get most heavily used with low batch sizes.
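As a quick sanity check on the percentages, the two latency figures above can be turned into a relative latency reduction and an equivalent throughput gain. The short sketch below is plain arithmetic on the ~1.07 s and ~0.84 s numbers; the function names are made up for illustration. It gives roughly a 21.5% reduction in latency, which corresponds to about 27.4% more iterations per second, so the quoted "~22%" matches the latency-reduction reading.

```python
def latency_reduction(before_s: float, after_s: float) -> float:
    """Fraction of the original latency that was removed."""
    return (before_s - after_s) / before_s


def throughput_gain(before_s: float, after_s: float) -> float:
    """Relative increase in work done per unit time for the same workload."""
    return before_s / after_s - 1.0


if __name__ == "__main__":
    before_s, after_s = 1.07, 0.84  # approximate latencies reported above
    print(f"latency reduction: {latency_reduction(before_s, after_s):.1%}")
    print(f"throughput gain:   {throughput_gain(before_s, after_s):.1%}")
```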

amd-hhashemi · 2025-05-01T20:00:10Z

[corrected, with warmup runs]

I ran the server benchmark before and after the change. There is no change in the server throughput test (this is expected, since skinny GEMMs only come into play at low batch sizes):

Before:

After this code change:

tjtanaa · 2025-05-05T07:00:29Z

> Oh sorry, I didn't realize you were pointing to Aiter. I didn't know it had been pulled into Aiter. Although it seems to be the original version, before fp8 or bf16 support was added.

Will there be plans to integrate this updated kernel into AITER?

…ject#17071) Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com> Signed-off-by: charlifu <charlifu@amd.com> Co-authored-by: charlifu <charlifu@amd.com>

amd-hhashemi added 3 commits April 28, 2025 17:33

mfma optimization of wvspltk solution for bf16 skinny GEMMs

b2e2d43

Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com>

fix MI250

87dbab7

Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com>

lint fix

295d4d5

Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com>

amd-hhashemi force-pushed the bf16_mfma_opt branch from 2bcb865 to 295d4d5 April 28, 2025 17:36

Merge branch 'vllm-project:main' into bf16_mfma_opt

0adef12

tjtanaavllm added a commit to ROCm/vllm that referenced this pull request Apr 30, 2025

cherry pick upstream PR vllm-project#17071

63c99ce

Signed-off-by: tjtanaavllm <tunjian.tan@amd.com>

fix llmm on k size 6114

10437c9

Signed-off-by: charlifu <charlifu@amd.com>

charlifu requested review from tlrmchlsmth and WoosukKwon as code owners May 2, 2025 21:39

SageMoore approved these changes May 5, 2025

robertgshaw2-redhat approved these changes May 6, 2025

robertgshaw2-redhat enabled auto-merge (squash) May 6, 2025 17:04

github-actions bot added the ready label (ONLY add when PR is ready to merge/full CI is needed) May 6, 2025

charlifu added 2 commits May 7, 2025 14:48

Merge branch 'main' into bf16_mfma_opt

09b517d

Signed-off-by: charlifu <charlifu@amd.com>

enable skinny gemm for bs4

e796ad9

Signed-off-by: charlifu <charlifu@amd.com>

auto-merge was automatically disabled May 7, 2025 14:51
Head branch was pushed to by a user without write access

add cache to on_mi250_mi300

c558b03

Signed-off-by: charlifu <charlifu@amd.com>

vllm-bot merged commit 5a499e7 into vllm-project:main May 8, 2025
76 of 80 checks passed

gshtras mentioned this pull request May 9, 2025

Cherry pick skinny gemms ROCm/vllm#544

Merged

tanujtiwari1998 mentioned this pull request Jul 8, 2025

cached tokens completions character-tech/vllm#22

Merged

4 tasks
