Skip to content

Add indirect pointer to barrier support in hl.signal & hl.wait (as_ptrs) #261

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

joydddd
Copy link
Contributor

@joydddd joydddd commented Jul 10, 2025

@joydddd joydddd force-pushed the joydddd/stack/13 branch from dd98a26 to 4c4b3ae Compare July 10, 2025 18:52
@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Jul 10, 2025
@joydddd joydddd force-pushed the joydddd/stack/13 branch 2 times, most recently from 4eb10c4 to dac9b8f Compare July 10, 2025 23:13
@joydddd joydddd changed the base branch from main to joydddd/stack/14 July 10, 2025 23:13
@joydddd joydddd changed the base branch from joydddd/stack/14 to main July 11, 2025 00:32
@joydddd joydddd changed the base branch from main to joydddd/stack/14 July 11, 2025 00:32
@joydddd joydddd changed the base branch from joydddd/stack/14 to main July 11, 2025 01:08
@joydddd joydddd force-pushed the joydddd/stack/13 branch 2 times, most recently from 053923a to 1a74563 Compare July 11, 2025 01:08
@joydddd joydddd changed the base branch from main to joydddd/stack/15 July 11, 2025 01:09
@joydddd joydddd force-pushed the joydddd/stack/15 branch from bc6ea11 to 5348847 Compare July 11, 2025 01:16
@joydddd joydddd changed the base branch from joydddd/stack/15 to main July 11, 2025 01:17
@joydddd joydddd force-pushed the joydddd/stack/13 branch from 1a74563 to a82b355 Compare July 11, 2025 01:17
@joydddd joydddd changed the base branch from main to joydddd/stack/15 July 11, 2025 01:17
@joydddd joydddd changed the base branch from joydddd/stack/15 to main July 11, 2025 02:19
@joydddd joydddd force-pushed the joydddd/stack/13 branch from a82b355 to 9f46f56 Compare July 11, 2025 02:19
@joydddd joydddd changed the base branch from main to joydddd/stack/15 July 11, 2025 02:19
@joydddd joydddd marked this pull request as ready for review July 11, 2025 02:19
@joydddd joydddd requested review from jansel, yf225 and drisspg July 11, 2025 02:19
)
N = signal_pad_ptrs.size(0)
for i in hl.grid(pad_shape):
offset = i * 4 # number of btypes of each pointer (torch.int32)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calculating offset to pointers is kinda ugly here and it breaks the no-index calculation / pointer-handling abstraction. The tensor metadata (i.e. size, stride etc.) is lost when passed in as Tensor of data_ptrs, therefore offset calculation need to be explicit here for indirect barriers.

@joydddd joydddd changed the base branch from joydddd/stack/15 to main July 14, 2025 18:18
@joydddd joydddd force-pushed the joydddd/stack/13 branch from 9f46f56 to e646848 Compare July 14, 2025 18:35
Comment on lines +252 to +257
signal_pad_list = [
torch.ones(4, device=DEVICE, dtype=torch.int32) for _ in range(4)
]
signal_pad_ptrs = torch.as_tensor(
[p.data_ptr() for p in signal_pad_list], device=DEVICE, dtype=torch.uint64
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the motivation for multiple signal pads here?

Could we have a multi-dimensional pad instead like:

signal_pad_list = torch.ones([4, 4], device=DEVICE, dtype=torch.int32)

then make pass in an index like (multicast_tile, i)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The motivation is to implementation cross rank sync i.e. each barrier lives on a difference card, and we signal our peers and then wait for signal from our peers to make sure all cards have reached this kernel.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See test/test_distributed in #245

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I guess the tensor of pointers is required in this case -- let's figure out a higher level abstraction to make this cleaner.

@joydddd joydddd force-pushed the joydddd/stack/13 branch from e646848 to ab85dfe Compare July 15, 2025 18:40
@joydddd joydddd requested a review from jansel July 15, 2025 20:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Meta Open Source bot.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants