[mxfp8 moe training] per group scale conversion to blocked format with groups along K dim (for 2d2d grouped gemm) #2956
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2956. Note: links to docs will display an error until the docs builds have been completed.
✅ No failures as of commit 213f19b with merge base f1acc1e. (This comment was automatically generated by Dr. CI and updates every 15 minutes.)
Force-pushed from df502e0 to 25f4c23 (compare)
```python
_, output_group_offsets = compute_per_group_blocked_scale_offsets_2d2d_lhs(
    input_group_offsets
)
assert torch.allclose(output_group_offsets, ref_start_cols_after_padding), (
```
torch.equal?
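A minimal sketch of the suggested change (assuming the offsets are integer tensors, so an exact comparison is appropriate; the helper and reference names come from the test snippet above):

```python
# Sketch only: torch.equal checks exact element-wise equality (and matching shape),
# which is the right check for integer offset tensors, whereas torch.allclose is
# intended for floating-point tolerance comparisons.
_, output_group_offsets = compute_per_group_blocked_scale_offsets_2d2d_lhs(
    input_group_offsets
)
assert torch.equal(output_group_offsets, ref_start_cols_after_padding), (
    f"expected {ref_start_cols_after_padding}, got {output_group_offsets}"
)
```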
```python
    return blocked_scales, start_row_after_padding


def torch_to_blocked_per_group_2d2d_lhs(
```
nit: maybe something like 2d_kmajor, since this only operates on 1 tensor
Hmm for 2d-3d and 2d-2d, the 2d tensors are all K major aren't they? I was actually thinking we should just remove the _lhs suffix from the name, since I think the kernel will actually work for both LHS and RHS operands in the 2d2d grouped gemm, given shapes (M, total_K) and (N, total_K)
How about: torch_to_blocked_per_group_for_2d2d (added "for")
Renamed this func and others for clarity that the difference is about grouping along M dim vs K dim
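To illustrate the point above about the conversion applying to both operands, a hedged usage sketch (the function name and return values here are illustrative, not the exact API added in this PR):

```python
# Hypothetical sketch: with groups along total_K, both operands' scales are
# K-major 2d tensors, so the same per-group blocked conversion applies to each.
# A: (M, total_K)  -> A_scales: (M, total_K // 32)
# B: (N, total_K)  -> B_scales: (N, total_K // 32)
A_scales_blocked, group_start_cols = to_blocked_per_group_along_k(A_scales, scale_group_offsets)
B_scales_blocked, _ = to_blocked_per_group_along_k(B_scales, scale_group_offsets)
```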
```python
        orig_offsets + group_pid - 1, mask=group_pid > 0, other=0
    )
    input_group_end_col = tl.load(
        orig_offsets + group_pid, mask=group_pid < num_groups, other=0
```
nit: we don't need a mask for this load, right?
Updated
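For context, a hedged sketch of what dropping the mask might look like, assuming the kernel is launched with exactly one program per group so `group_pid < num_groups` always holds (variable names may differ from the actual kernel):

```python
# Sketch only: with a launch grid of (num_groups,), group_pid is always in range,
# so loading this group's end offset needs no mask. The previous-group load still
# needs one, because group_pid - 1 is out of bounds for the first group.
input_group_start_col = tl.load(
    orig_offsets + group_pid - 1, mask=group_pid > 0, other=0
)
input_group_end_col = tl.load(orig_offsets + group_pid)
```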
```python
        orig_offsets + group_pid, mask=group_pid < num_groups, other=0
    )
    # Output scales start row we will begin writing to
    output_group_start_col = tl.load(
```
ditto
```python
        output_scales_group_offsets + group_pid, mask=group_pid < num_groups, other=0
    )

    # Calculate destination indices for each row and col in block swizzled layout.
```
can you make this a helper jit function and reuse it elsewhere? It's fine if you stack that commit on this.
good idea, done
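For reference, a hedged sketch of what such a helper could look like. It assumes the block-swizzled layout produced by the reference `to_blocked` conversion (128x4 tiles stored as contiguous 32x16 chunks); the name and signature are illustrative, not the actual helper added in this PR:

```python
import triton
import triton.language as tl


@triton.jit
def _dest_offset_in_blocked_layout(row, col, padded_num_cols):
    # Map (row, col) in the unswizzled scale matrix to its flat destination offset
    # in the block-swizzled layout: the matrix is tiled into 128x4 blocks, and
    # within each tile, element (r, c) lands at (r % 32) * 16 + (r // 32) * 4 + c.
    tile_row = row // 128
    tile_col = col // 4
    n_col_tiles = padded_num_cols // 4
    tile_base = (tile_row * n_col_tiles + tile_col) * 512  # 128 * 4 elems per tile
    r_in_tile = row % 128
    c_in_tile = col % 4
    return tile_base + (r_in_tile % 32) * 16 + (r_in_tile // 32) * 4 + c_in_tile
```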
Force-pushed from 25f4c23 to 213f19b (compare)
@drisspg I finished addressing your comments, ready for another look.
Summary
Adds a kernel to convert per-group scales to blocked format when groups are along the K dim, for use in the 2d-2d MXFP8 grouped gemm (M, total_K) @ (total_K, N). The existing kernels handle groups along the total_M dimension in (total_M, K) @ (E, K, N).
LHS operand for 2d-3d MXFP8 grouped gemm
This is the existing kernel; the layout is much simpler in the 2d-3d case, where the groups are along M:
LHS operand for 2d-2d MXFP8 grouped gemm
When groups are along the scaled dim being contracted, the memory layout is more complicated, as we have to represent separate standalone "row of blocks major" layouts in subtensors that are part of a larger parent tensor.
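As a concrete illustration of the bookkeeping this requires, here is a hedged reference sketch (plain PyTorch, not the actual kernel or helper in this PR) of computing where each group's block-swizzled scales start, assuming each group's scale column count must be padded up to a multiple of 4 to form complete 128x4 blocks:

```python
import torch


def start_cols_after_padding(scale_group_end_cols: torch.Tensor) -> torch.Tensor:
    # scale_group_end_cols: cumulative end-column offsets of each group's scales
    # along the K dim (i.e., cumsum of K_g // 32 for block_size=32).
    zero = torch.zeros(1, dtype=scale_group_end_cols.dtype, device=scale_group_end_cols.device)
    group_cols = torch.diff(scale_group_end_cols, prepend=zero)
    # Round each group's column count up to a multiple of 4 so its sub-tensor
    # forms complete 128x4 blocks in the swizzled output.
    padded_cols = ((group_cols + 3) // 4) * 4
    ends = torch.cumsum(padded_cols, dim=0)
    return ends - padded_cols  # exclusive prefix sum = start column of each group
```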
Other
Removed the Mg param from the torch_to_blocked_per_group_2d function (used for 2d-3d grouped gemms), since it is not used.
Test plan
pytest test/prototype/moe_training/test_kernels.py -k test_mxfp8_per_group_blocked_scales_2d2d -s