add more generic kernel for fp8 blockwise scaling #2592
base: main
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2592
Note: Links to docs will display an error until the docs builds have been completed.
❌ 8 New Failures as of commit fa64d54 with merge base 0e00df3.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Force-pushed d0cd3be to 3b36022 (stack-info: PR: #2592, branch: danielvegamyhre/stack/15)
error = torch.norm(C - C_q) / torch.norm(C)
print(f"Relative Error: {error.item():.6f}")

assert error < 0.1, "Quantize gemm error is too high"
Can you use SQNR everywhere to match the existing numerics testing?
Updated to use SQNR
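For reference, this is roughly what an SQNR-based check looks like in place of the relative-error assert; the `sqnr` helper and the threshold below are a minimal sketch, not the exact utility used in torchao's numerics tests:

```python
import torch

def sqnr(ref: torch.Tensor, test: torch.Tensor) -> torch.Tensor:
    # Signal-to-quantization-noise ratio in dB; higher means closer to the reference.
    signal_power = torch.norm(ref) ** 2
    noise_power = torch.norm(ref - test) ** 2
    return 10 * torch.log10(signal_power / noise_power)

# Usage against reference and quantized gemm outputs (threshold is illustrative).
C = torch.randn(256, 256)
C_q = C + 0.01 * torch.randn(256, 256)  # stand-in for the quantized-gemm result
assert sqnr(C, C_q) > 20.0, "Quantized gemm SQNR is too low"
```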
# original implementation from fbgemm_gpu:
# https://github.com/pytorch/FBGEMM/blob/b19401e913fcdff536dc097fa3013a0a9d66256e/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py#L3091
def triton_quantize_fp8_block(
Since we have an optional runtime dependency on fbgemm, can we just call their kernel directly?
Yes, that is the desired end state. For now I have tried and had repeated problems getting it to work (fbgemm-gpu-genai), e.g. undefined symbols. I tried on both H100 and B200 and got different undefined symbol errors.
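For once the import issues are sorted out, a sketch of the guarded-import pattern being discussed; the module path comes from the URL above, but the argument order/keywords of the fbgemm kernel and the local fallback name are assumptions:

```python
import torch

try:
    # Optional runtime dependency: prefer fbgemm's Triton kernel when it imports cleanly.
    from fbgemm_gpu.experimental.gemm.triton_gemm.fp8_gemm import (
        triton_quantize_fp8_block as fbgemm_quantize_fp8_block,
    )
    HAS_FBGEMM = True
except (ImportError, OSError):  # OSError covers undefined-symbol failures at load time
    HAS_FBGEMM = False

def quantize_fp8_block(x: torch.Tensor, block_m: int = 128, block_k: int = 128):
    if HAS_FBGEMM:
        # Argument names/order assumed; check the fbgemm signature before relying on this.
        return fbgemm_quantize_fp8_block(x, block_m, block_k)
    # Fall back to the kernel added in this PR (placeholder reference).
    return triton_quantize_fp8_block(x, block_m, block_k)
```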
Force-pushed 3b36022 to 9821453 (stack-info: PR: #2592, branch: danielvegamyhre/stack/15)
This number is kind of weird to me; do you have memory bandwidth calcs? I don't immediately see why there is a 10x delta between groupwise and blockwise.
Force-pushed 9821453 to ee6ce03
Force-pushed ee6ce03 to fa64d54 (stack-info: PR: #2592, branch: danielvegamyhre/stack/15)
Yeah, I agree it's odd. I'll try adding some memory bandwidth calcs. I was also thinking of checking with Josh / the fbgemm team to see if perhaps there is a different kernel they use for activation quant.
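In case it helps, here is the kind of back-of-the-envelope memory-bandwidth calc being discussed; the shapes, dtypes (bf16 in, fp8 out, fp32 scales), and timings are illustrative assumptions, not measurements from this PR:

```python
def quantize_mem_bw_gbps(M: int, K: int, block_m: int, block_k: int, time_us: float) -> float:
    # Rough roofline-style estimate for a blockwise fp8 quantize kernel:
    # read M*K bf16 values, write M*K fp8 values plus one fp32 scale per block.
    bytes_read = M * K * 2
    bytes_written = M * K * 1
    num_blocks = ((M + block_m - 1) // block_m) * ((K + block_k - 1) // block_k)
    bytes_scales = num_blocks * 4
    total_bytes = bytes_read + bytes_written + bytes_scales
    return total_bytes / (time_us * 1e-6) / 1e9

# 1x128 (groupwise) and 128x128 (blockwise) scaling move nearly the same number of
# bytes for the same (M, K), so a ~10x runtime gap would point at kernel efficiency
# rather than a bandwidth limit. Timings below are placeholders.
print(f"{quantize_mem_bw_gbps(8192, 8192, 1, 128, time_us=50.0):.1f} GB/s")
print(f"{quantize_mem_bw_gbps(8192, 8192, 128, 128, time_us=500.0):.1f} GB/s")
```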
Stacked PRs:
add more generic kernel for fp8 blockwise scaling
Performance
TL;DR: fbgemm is the best overall right now. deepgemm and torch.compile are both fast for 1x128 scaling but slow for 128x128.
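For anyone reproducing the comparison, a minimal timing harness of the kind these numbers typically come from; the quantize function names in the commented calls are placeholders for whichever kernels are being compared:

```python
import torch
from torch.utils import benchmark

def bench(label: str, fn, x: torch.Tensor) -> None:
    # CUDA-aware timing via torch.utils.benchmark (handles synchronization and warmup).
    timer = benchmark.Timer(stmt="fn(x)", globals={"fn": fn, "x": x}, label=label)
    print(timer.blocked_autorange(min_run_time=1.0))

x = torch.randn(8192, 8192, device="cuda", dtype=torch.bfloat16)
# bench("fbgemm 128x128", fbgemm_quantize_128x128, x)          # placeholder kernel
# bench("torch.compile 1x128", torch.compile(ref_quantize_1x128), x)  # placeholder kernel
```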