Add nd_loop and Enable block_n tiling for all_gather_lhs_matmul #29822

hanzlfs · 2025-06-27T21:09:09Z

Enable block_n tiling
use plgpu.nd_loop
[PAIR] justinfu@google.com

justinjfu

Thanks for the fixes!

justinjfu · 2025-06-27T21:20:44Z

jax/experimental/pallas/ops/gpu/collective_matmul_mgpu.py

@@ -150,7 +152,7 @@ def k_loop(idxs, lhs_smem, rhs_smem):
            # We only delay release by 1 step, so we need to wait for the
            # previous copies.
            plgpu.wait_smem_to_gmem(1, wait_read_only=True)
-          k_loop(scratch_ref.at[scratch_slot], rhs_ref)
+          k_loop(scratch_ref.at[scratch_slot], rhs_ref.at[:,n_tile_slice])


nit: Add a space for formatting: rhs_ref.at[:,n_tile_slice] -> rhs_ref.at[:, n_tile_slice]

apaszke

I don't think that's the right way to do it. The gathers happen only on the M dimension, so adding the N loop in the same place will perform the same gather n // block_n times. Instead, for every M chunk we gather, we should run an inner loop that steps over all the N blocks that need to be multiplied with it.
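
To make the loop-ordering point concrete, here is a minimal pure-Python sketch of the two structures. It only models the control flow and counts gathers; it does not use Pallas, `plgpu`, or any GPU API, and every name in it (`matmul_block`, `gather_chunk`, `all_gather_lhs_matmul`, `gather_once_per_chunk`) is hypothetical and not taken from `collective_matmul_mgpu.py`.

```python
# Pure-Python model of the loop structure only. All names are hypothetical;
# nothing here is the actual Pallas kernel or its API.

def matmul_block(lhs, rhs, n_start, n_stop):
    """Multiply lhs (m x k) by columns [n_start, n_stop) of rhs (k x n)."""
    k = len(rhs)
    return [
        [sum(row[kk] * rhs[kk][j] for kk in range(k)) for j in range(n_start, n_stop)]
        for row in lhs
    ]


def all_gather_lhs_matmul(lhs_shards, rhs, block_n, gather_once_per_chunk):
    """Return (output, number_of_gathers) for the chosen loop order."""
    n = len(rhs[0])
    assert n % block_n == 0, "this sketch assumes block_n divides n"
    num_devices = len(lhs_shards)
    m_shard = len(lhs_shards[0])
    out = [[0] * n for _ in range(num_devices * m_shard)]
    gathers = 0

    def gather_chunk(dev):
        # Stand-in for fetching one M chunk of the LHS from device `dev`.
        nonlocal gathers
        gathers += 1
        return lhs_shards[dev]

    def compute(chunk, dev, n_start):
        # Multiply the gathered chunk by one N block and store the result.
        block = matmul_block(chunk, rhs, n_start, n_start + block_n)
        for i, row in enumerate(block):
            out[dev * m_shard + i][n_start:n_start + block_n] = row

    n_starts = range(0, n, block_n)
    if gather_once_per_chunk:
        # Suggested order: gather each M chunk once, then loop over N blocks.
        for dev in range(num_devices):
            chunk = gather_chunk(dev)
            for n_start in n_starts:
                compute(chunk, dev, n_start)
    else:
        # N loop outside the gather: every chunk is gathered n // block_n times.
        for n_start in n_starts:
            for dev in range(num_devices):
                compute(gather_chunk(dev), dev, n_start)
    return out, gathers


# Two "devices", each holding a 2 x 2 LHS shard; RHS is 2 x 4, block_n = 2.
lhs_shards = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
rhs = [[1, 0, 2, 1], [0, 1, 1, 3]]
out_inner, gathers_inner = all_gather_lhs_matmul(lhs_shards, rhs, 2, True)
out_outer, gathers_outer = all_gather_lhs_matmul(lhs_shards, rhs, 2, False)
```

Both orders produce the same output in this model; they differ only in how many times each M chunk is gathered, which is the cost the comment above is pointing at.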

hanzlfs · 2025-06-30T08:35:59Z

> I don't think that's the right way to do it. The gathers happen only on the M dimension and adding the N loop in the same place will perform the same gather n // block_n times. Instead, for every M chunk we gather, we should run an inner loop that steps over all the N blocks that need to be multiplied with it

Thanks, I will update this part tomorrow. I ran into another issue: if I use a 2D mesh such as x, y: (2, 4), or even x, y: (1, 8), with axis_name = (x, y), I get the following error:

    out = core_map_p.bind(*consts, jaxpr=jaxpr, mesh=mesh,
    jax._src.source_info_util.JaxStackTraceBeforeTransformation: ValueError: Failed to recompute the async_copy peer id on the host

What corresponding change would be needed? cc @justinjfu

apaszke · 2025-06-30T09:04:52Z

Could you please send a small PR with the change needed to reproduce the problem? I can take a look

hanzlfs · 2025-06-30T14:28:24Z

> Could you please send a small PR with the change needed to reproduce the problem? I can take a look

This one should do: #29849

Add block_n and nd_loop for collective_matmul_mgpu.all_gather_lhs_matmul

9ca2f4f

justinjfu approved these changes Jun 27, 2025

google-ml-butler bot added the kokoro:force-run and pull ready (Ready for copybara import and testing) labels Jun 27, 2025

justinjfu requested a review from apaszke June 27, 2025 21:11

justinjfu reviewed Jun 27, 2025

apaszke requested changes Jun 30, 2025
