
Expose RDMA support through Python APIs #462

Open
wants to merge 1 commit into
base: main

allenwang28
Contributor

Summary:
This change brings in RDMA support from Rust to the Python APIs, overwriting the existing stand-in RDMABuffer APIs.

There are quite a few changes, so here's a high-level overview:

monarch_hyperactor (Rust)

Within PyProcMesh, adds an optional RdmaManagerActor attribute.

This mimics the prior functionality - before, a proc_mesh (in Python) would unconditionally spin up an RdmaManagerActor.

Now, a proc_mesh will spin up an RdmaManagerActor if and only if tensor_engine is enabled and ibverbs APIs are supported.
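
The gating rule above can be sketched in Python as follows. This is only an illustration of the "if and only if" condition: `should_spawn_rdma_manager`, `ProcMeshSketch`, and the two boolean parameters are hypothetical names, not part of the Monarch API.

```python
from typing import Optional


def should_spawn_rdma_manager(tensor_engine_enabled: bool, ibverbs_supported: bool) -> bool:
    """Hypothetical helper: both conditions must hold for the manager to be spawned."""
    return tensor_engine_enabled and ibverbs_supported


class ProcMeshSketch:
    """Stand-in for a proc mesh; NOT the real PyProcMesh."""

    def __init__(self, tensor_engine_enabled: bool, ibverbs_supported: bool) -> None:
        # The manager is optional: None means no RDMA manager was spawned.
        self.rdma_manager: Optional[str] = (
            "RdmaManagerActor"
            if should_spawn_rdma_manager(tensor_engine_enabled, ibverbs_supported)
            else None
        )
```

Modelling the attribute as optional means callers have to handle the "no RDMA manager" case explicitly instead of assuming one always exists.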

A few other pieces must be accessible within the RDMA extension library, specifically Mailbox and the send capabilities ("caps").

RDMA extension

Adds an extension folder within monarch_rdma which contains the actual bindings.

Consistent with other bindings, provides a blocking and non-blocking version.
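
As a rough illustration of the blocking / non-blocking pairing, here is a minimal sketch using `asyncio`. The class `RdmaBufferSketch` and its method names are assumptions made for this example; the real bindings may be shaped differently.

```python
import asyncio


class RdmaBufferSketch:
    """Illustrative in-memory stand-in; NOT the real RDMABuffer binding."""

    def __init__(self, data: bytes) -> None:
        self._data = bytearray(data)

    async def read(self) -> bytes:
        # Non-blocking variant: yields to the event loop before returning.
        await asyncio.sleep(0)
        return bytes(self._data)

    def read_blocking(self) -> bytes:
        # Blocking variant: drives the same coroutine to completion.
        # Must be called from code that is not already inside an event loop.
        return asyncio.run(self.read())
```

In this sketch the blocking method is a thin wrapper over the coroutine, so the two variants cannot drift apart in behavior.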

RDMA components

Corrects the behavior of a few edge cases, e.g. ibverbs_supported is now based on the number of devices, and a "loopback" case is supported, wherein two actors share an RDMA buffer but are spawned on the same proc.
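
The two edge cases can be sketched as below. Both functions are assumptions for illustration: the description only says that `ibverbs_supported` depends on the number of devices, so "at least one device" is a guess at the rule, and `ActorRef` / `is_loopback` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Sequence


def ibverbs_supported(devices: Sequence[str]) -> bool:
    """Assumed rule: supported only when at least one device is visible."""
    return len(devices) > 0


@dataclass(frozen=True)
class ActorRef:
    """Hypothetical actor handle: the proc it runs on plus its own name."""

    proc_id: str
    name: str


def is_loopback(owner: ActorRef, peer: ActorRef) -> bool:
    """Loopback: two distinct actors sharing a buffer on the same proc."""
    return owner.proc_id == peer.proc_id and owner.name != peer.name
```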

Python

Removes the stand-in RDMAManagerActor.

In rdma.py, adds basic CPU support for RDMABuffer.

While CUDA<>CPU|CUDA is implemented in monarch_rdma, it ran into issues with memory region (MR) registration.

Logging shows that the code path correctly interprets the address as CUDA, but cuMemGetHandleForAddressRange returns the handle/fd as -1, and the MR comes back as a null pointer.

RDMABuffer will only support CPU for now, with GPU support as a follow-up. cc dstaay-fb
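
A CPU-only restriction like the one described could be enforced with a guard along these lines. This is a sketch under assumptions: `check_rdma_device`, the device-type strings, and the choice of `ValueError` are hypothetical, not the actual `rdma.py` code.

```python
def check_rdma_device(device_type: str) -> None:
    """Hypothetical guard: accept CPU memory, reject everything else for now."""
    if device_type != "cpu":
        raise ValueError(
            f"RDMABuffer sketch only supports CPU memory for now, got {device_type!r}"
        )
```

Failing early at buffer creation surfaces the limitation to the caller directly, rather than as a registration failure deeper in the stack.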

Test

Updates tests for anything touching RDMABuffer. Splits test_python_actor into test_python_actor and test_rdma now that RDMA uses the backend network.

Differential Revision: D76937776

@facebook-github-bot added the CLA Signed label Jul 8, 2025
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D76937776

allenwang28 added a commit to allenwang28/monarch-1 that referenced this pull request Jul 9, 2025

allenwang28 added a commit to allenwang28/monarch-1 that referenced this pull request Jul 9, 2025
Summary:

This change introduces an `extension` folder for `monarch_rdma`, in preparation for the Python<>RDMA APIs:
- Consistent with other bindings, provides a blocking and a non-blocking version.
- Corrects the behavior of a few edge cases for RDMA components, e.g. `ibverbs_supported` is now based on the number of devices, and a "loopback" case is supported, wherein two actors share an RDMA buffer but are spawned on the same proc.

Differential Revision: D76937776

allenwang28 added a commit to allenwang28/monarch-1 that referenced this pull request Jul 10, 2025
Reviewed By: dstaay-fb

Differential Revision: D76937776

Labels
CLA Signed, fb-exported