add more generic kernel for fp8 blockwise scaling #2592


Open
wants to merge 1 commit into main from danielvegamyhre/stack/15
Conversation

@danielvegamyhre (Contributor) commented Jul 24, 2025

Stacked PRs:


add more generic kernel for fp8 blockwise scaling

  • Add generic FP8 blockwise quantization kernel from fbgemm_gpu
  • Add tests to verify numerics against the torch reference implementation (a minimal sketch of such a reference appears below).
  • Add a benchmarking script to bench the 3 blockwise quantization options (torch.compile, fbgemm_gpu kernels, and deepgemm kernels).
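
For reference, a minimal pure-PyTorch sketch of the kind of (block_m, block_k) blockwise fp8 quantization the numerics tests compare against is shown below. The function name, padding behavior, and returned scale layout are illustrative assumptions, not the exact kernel or test code in this PR.

```python
import torch
import torch.nn.functional as F

def ref_quantize_fp8_block(x: torch.Tensor, block_m: int = 128, block_k: int = 128):
    """Quantize a 2D tensor to fp8 (e4m3) with one scale per (block_m, block_k) block."""
    M, K = x.shape
    fp8_max = torch.finfo(torch.float8_e4m3fn).max
    # Pad so M and K are multiples of the block sizes, then view as blocks.
    pad_m = (block_m - M % block_m) % block_m
    pad_k = (block_k - K % block_k) % block_k
    x_pad = F.pad(x, (0, pad_k, 0, pad_m))
    Mp, Kp = x_pad.shape
    blocks = x_pad.reshape(Mp // block_m, block_m, Kp // block_k, block_k)
    # One absolute max, and hence one scale, per block.
    amax = blocks.abs().amax(dim=(1, 3), keepdim=True).clamp(min=1e-12)
    scale = fp8_max / amax.to(torch.float32)
    x_q = (blocks * scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
    x_q = x_q.reshape(Mp, Kp)[:M, :K]
    # Reciprocal per-block scales for dequantization, shape (ceil(M/block_m), ceil(K/block_k)).
    return x_q, scale.reciprocal().squeeze(1).squeeze(-1)
```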

Performance

TL;DR: fbgemm is the best overall right now. deepgemm and torch.compile are both fast for (1, 128) scaling but slow for (128, 128) scaling.

A_shape         block_shape      torch_us    fbgemm_us    deepgemm_us
--------------  -------------  ----------  -----------  -------------
(1024, 1024)    (1,128)            12.096       18.144         17.408
(1024, 1024)    (128,128)          22.432       12.288         17.408
(2048, 2048)    (1,128)            51.328       41.984         40.096
(2048, 2048)    (128,128)          84.192       15.264         40.096
(4096, 4096)    (1,128)           118.784      134.336        132.128
(4096, 4096)    (128,128)         241.68        25.728        132.096
(8192, 8192)    (1,128)           389.344      509.04         498.72
(8192, 8192)    (128,128)         874.56        62.464        498.656
(16384, 16384)  (1,128)          1456.16      1998.66        1964.96
(16384, 16384)  (128,128)        3377.31       183.296       1965.02
(32768, 32768)  (1,128)          5732.42      7960.21        7830.56
(32768, 32768)  (128,128)       13692.4        669.664       7831.14
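
As a note on methodology (an assumption, since the benchmarking script itself is not shown in this thread), per-kernel microsecond timings like the ones above can be collected with torch.utils.benchmark, which takes care of CUDA synchronization:

```python
import torch
from torch.utils import benchmark

def bench_us(fn, *args) -> float:
    # Median runtime in microseconds over a ~1 second measurement window.
    timer = benchmark.Timer(stmt="fn(*args)", globals={"fn": fn, "args": args})
    return timer.blocked_autorange(min_run_time=1.0).median * 1e6

if torch.cuda.is_available():
    # Reuses ref_quantize_fp8_block from the sketch above (hypothetical name).
    A = torch.randn(4096, 4096, dtype=torch.bfloat16, device="cuda")
    print(f"torch reference: {bench_us(ref_quantize_fp8_block, A, 128, 128):.2f} us")
```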

pytorch-bot bot commented Jul 24, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2592

❌ 8 New Failures

As of commit fa64d54 with merge base 0e00df3:

danielvegamyhre added a commit that referenced this pull request Jul 24, 2025
stack-info: PR: #2592, branch: danielvegamyhre/stack/15
@danielvegamyhre force-pushed the danielvegamyhre/stack/15 branch from d0cd3be to 3b36022 on July 24, 2025 03:25
@facebook-github-bot added the CLA Signed label Jul 24, 2025
@danielvegamyhre added the topic: not user facing label Jul 24, 2025
@danielvegamyhre requested review from vkuzo and drisspg July 24, 2025 04:01
@danielvegamyhre (Contributor Author)

cc @vkuzo @drisspg for review

# Relative Frobenius-norm error between the reference gemm output C and the
# quantized blockwise-fp8 gemm output C_q.
error = torch.norm(C - C_q) / torch.norm(C)
print(f"Relative Error: {error.item():.6f}")

assert error < 0.1, "Quantize gemm error is too high"
Contributor

Can you use SQNR everywhere to match the existing numerics testing?

Contributor Author

Updated to use SQNR
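
For context on the check, SQNR here is the usual signal-to-quantization-noise ratio in dB. A minimal sketch of how it can be computed is below; the helper name is illustrative, and the tests should use whatever SQNR utility torchao's existing numerics tests already rely on.

```python
import torch

def sqnr_db(ref: torch.Tensor, quant: torch.Tensor) -> torch.Tensor:
    # Signal-to-quantization-noise ratio in dB; higher means closer to the reference.
    signal_power = torch.norm(ref) ** 2
    noise_power = torch.norm(ref - quant) ** 2
    return 10 * torch.log10(signal_power / noise_power)

# Example usage (threshold is illustrative, not the value used in this PR):
# assert sqnr_db(C, C_q) > 25.0, "SQNR too low for blockwise fp8 gemm"
```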


# original implementation from fbgemm_gpu:
# https://github.com/pytorch/FBGEMM/blob/b19401e913fcdff536dc097fa3013a0a9d66256e/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py#L3091
def triton_quantize_fp8_block(
Contributor

@drisspg Jul 24, 2025

Since we have an optional runtime dependency on fbgemm, can we just call their kernel directly?

Contributor Author

@danielvegamyhre Jul 25, 2025

Yes, that is the desired end state. For now I have tried and had repeated problems getting it to work (fbgemm-gpu-genai), e.g. undefined symbols. I tried on both H100 and B200 and got different undefined symbol errors.
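
One possible shape for that end state, sketched under the assumption that the fbgemm import path matches the file linked above; the call signature is a guess, and the in-tree triton_quantize_fp8_block from this PR is the fallback:

```python
# Illustrative optional-dependency fallback, not a confirmed fbgemm-gpu-genai API.
try:
    from fbgemm_gpu.experimental.gemm.triton_gemm.fp8_gemm import (
        triton_quantize_fp8_block as fbgemm_quantize_fp8_block,
    )
    HAS_FBGEMM = True
except (ImportError, OSError):  # e.g. missing wheel or undefined-symbol errors
    HAS_FBGEMM = False

def quantize_fp8_block(x, block_m: int = 128, block_k: int = 128):
    # Prefer the fbgemm kernel when it imports cleanly; otherwise fall back to the
    # in-tree triton_quantize_fp8_block copied into torchao by this PR (assumed to
    # take the same arguments).
    if HAS_FBGEMM:
        return fbgemm_quantize_fp8_block(x, block_m, block_k)
    return triton_quantize_fp8_block(x, block_m, block_k)  # in-tree kernel from this PR
```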

danielvegamyhre added a commit that referenced this pull request Jul 24, 2025
stack-info: PR: #2592, branch: danielvegamyhre/stack/15
@danielvegamyhre force-pushed the danielvegamyhre/stack/15 branch from 3b36022 to 9821453 on July 24, 2025 04:13
@drisspg (Contributor) commented Jul 24, 2025

(32768, 32768)  (1,128)          5732.42      7960.21        7830.56
(32768, 32768)  (128,128)       13692.4        669.664       7831.14

This number is kinda weird to me; do you have memory bandwidth calcs? I don't immediately get why there is a 10x delta in groupwise vs blockwise.

stack-info: PR: #2592, branch: danielvegamyhre/stack/15
@danielvegamyhre force-pushed the danielvegamyhre/stack/15 branch from ee6ce03 to fa64d54 on July 25, 2025 03:11
@danielvegamyhre (Contributor Author)

this number is kinda weird to me, do you have memory bandwidth calcs? I dont immediately get why there is a 10x delta in group wise vs blockwise

Yeah, I agree it's odd. I'll try adding some memory bandwidth calcs, and I was also thinking about checking with Josh / the fbgemm team on whether perhaps there is a different kernel they use for activation quant.
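
For the memory bandwidth calcs, a back-of-the-envelope estimate can be read off the table already: bytes moved are roughly the bf16 input plus the fp8 output plus one fp32 scale per block, divided by the measured time. A hedged sketch with those dtype assumptions spelled out:

```python
# Rough achieved-bandwidth estimate for a quantization kernel, assuming a bf16
# input (2 B/elem), fp8 output (1 B/elem), and one fp32 scale per block. Compare
# the result against the device's peak HBM bandwidth.
def quantize_bw_gbps(M: int, K: int, block_m: int, block_k: int, time_us: float,
                     in_bytes: int = 2, out_bytes: int = 1, scale_bytes: int = 4) -> float:
    n_scales = -(-M // block_m) * -(-K // block_k)  # ceil-div for non-divisible shapes
    total_bytes = M * K * (in_bytes + out_bytes) + n_scales * scale_bytes
    return total_bytes / (time_us * 1e-6) / 1e9

# e.g. the (32768, 32768) / (128, 128) fbgemm row from the table above:
print(f"{quantize_bw_gbps(32768, 32768, 128, 128, 669.664):.1f} GB/s")
```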

Labels: CLA Signed, topic: not user facing
3 participants