Question about b_aw in wy_fast.py #308

lucianyao · 2025-04-05T10:22:51Z

Hi Team,

I wonder if replacing b_aw = tl.sum(tl.where(mask[:, None], b_Aw, 0), 0) with b_aw = b_Aw[i, :] in fwd_prepare_wy_repr_kernel of wy_fast.py could reduce the cost of this step from O(BC^2) to O(BC), saving memory and compute?

Furthermore, I wonder if we could replace b_aw = b_aw + tl.sum(b_aw[:, None] * b_Aw, 0) * (tl.arange(0, BC) < i) with b_aw = b_aw + tl.dot(b_aw, b_Aw) to make the code more readable? The updated loop would look like:

for i in range(1, BC):
    mask = tl.arange(0, BC) == i
    b_aw = b_Aw[i, :]
    b_aw = b_aw + tl.dot(b_aw, b_Aw)
    b_Aw = tl.where(mask[:, None], b_aw, b_Aw)

It seems to maintain correctness while simplifying the logic and potentially improving GPU performance.
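
The equivalence claim can be checked numerically outside of Triton with a small NumPy emulation. This is only a sketch: the function names original_loop and proposed_loop are hypothetical, and it assumes b_Aw is strictly lower triangular when the loop starts, which is what makes the (tl.arange(0, BC) < i) column mask redundant. It says nothing about whether Triton can actually compile b_Aw[i, :] or a 1-D tl.dot.

```python
import numpy as np


def original_loop(A):
    """NumPy emulation of the masked-sum formulation quoted in the question."""
    A = A.copy()
    BC = A.shape[0]
    idx = np.arange(BC)
    for i in range(1, BC):
        mask = idx == i
        # Row i is extracted by a masked reduction over all BC x BC entries.
        a = np.where(mask[:, None], A, 0).sum(0)
        # Broadcast-multiply-and-reduce, then keep only columns < i.
        a = a + (a[:, None] * A).sum(0) * (idx < i)
        # Write the updated row back into position i.
        A = np.where(mask[:, None], a, A)
    return A


def proposed_loop(A):
    """NumPy emulation of the proposed loop: direct row read plus a mat-vec."""
    A = A.copy()
    BC = A.shape[0]
    for i in range(1, BC):
        a = A[i, :].copy()  # direct row read, O(BC) elements touched
        a = a + a @ A       # vector-matrix product, no column mask
        A[i, :] = a
    return A


rng = np.random.default_rng(0)
BC = 16
# Assumption: the input is strictly lower triangular, scaled to keep values small.
b_Aw = 0.1 * np.tril(rng.standard_normal((BC, BC)), k=-1)

out_original = original_loop(b_Aw)
out_proposed = proposed_loop(b_Aw)
same = np.allclose(out_original, out_proposed, rtol=1e-9, atol=1e-12)
print("loops agree:", same)
```

If the input had nonzero entries on or above the diagonal, the two loops would no longer be guaranteed to agree, because the dropped column mask would then matter.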

Any thoughts on these changes? Thanks for your great work!

Hong

yzhangcs · 2025-04-05T12:11:23Z

Maintainer

@lucianyao Hi, thank you for the great suggestion. This is indeed a big bottleneck, but sadly it cannot be done in Triton.

yzhangcs · 2025-04-05T12:12:48Z

Maintainer

We are looking for solutions; these PRs may help you:

#270
#279
#296
