Expose RDMA support through Python APIs #462
This pull request was exported from Phabricator. Differential Revision: D76937776
be4affa to 63ce0c9
Summary: This change brings RDMA support from Rust to the Python APIs, replacing the existing stand-in RDMABuffer APIs. There are quite a few changes, so here is a high-level overview:

## `monarch_hyperactor` (Rust)

Within `PyProcMesh`, adds an optional `RdmaManagerActor` attribute. This mimics the prior functionality: before, a `proc_mesh` (in Python) would unconditionally spin up an `RdmaManagerActor`. Now, a `proc_mesh` will spin up an `RdmaManagerActor` if and only if `tensor_engine` is enabled and `ibverbs` APIs are supported.

A few other pieces must be accessible within the RDMA extension library, specifically `Mailbox` / the send caps capabilities.

## RDMA extension

Adds an `extension` folder within `monarch_rdma` which contains the actual bindings. Consistent with other bindings, provides a blocking and non-blocking version.

## RDMA components

Corrects behavior of a few edge cases, i.e. `ibverbs_supported` based on the number of devices, and supporting a "loopback" case, wherein two actors share an RDMA buffer but are spawned on the same proc.

## Python

Removes the stand-in RDMAManagerActor.

In rdma.py, adds basic support for RDMABuffer on CPU. While CUDA<>CPU|CUDA is implemented in `monarch_rdma`, this was running into issues with MR registration. Logging shows that the code path correctly interprets the address as CUDA, but `cuMemGetHandleForAddressRange` returns the handle/fd as `-1` and a null pointer MR. RDMABuffer will only support CPU for now, with GPU support as a follow-up. cc dstaay-fb

# Test

Updates tests for anything touching RDMABuffer. Splits out `test_python_actor` into `test_python_actor` and `test_rdma` now that RDMA is utilizing the backend network.

Differential Revision: D76937776
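For readers skimming the summary, the spawn rule for `proc_mesh` (manager created if and only if `tensor_engine` is enabled and `ibverbs` is supported) could be sketched roughly as below. This is an illustration only: `RdmaManagerSketch`, `maybe_spawn_rdma_manager`, and the two boolean inputs are hypothetical names, not the actual `monarch_hyperactor` or Python API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RdmaManagerSketch:
    """Hypothetical stand-in for a handle to an RdmaManagerActor."""
    name: str = "rdma_manager"


def maybe_spawn_rdma_manager(
    tensor_engine_enabled: bool, ibverbs_supported: bool
) -> Optional[RdmaManagerSketch]:
    # Both inputs are assumed to be computed elsewhere; the real checks
    # live in the Rust code and are not reproduced here.
    if tensor_engine_enabled and ibverbs_supported:
        return RdmaManagerSketch()
    # Otherwise the optional attribute stays unset.
    return None
```

Returning `None` instead of raising mirrors the "optional attribute" wording in the summary: callers would need to check for a missing manager before using RDMA.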
Summary: This change introduces an `extension` folder for `monarch_rdma`, prepping for Python<>RDMA APIs:

- Consistent with other bindings, provides a blocking and non-blocking version.
- Corrects behavior of a few edge cases for RDMA components, i.e. `ibverbs_supported` based on the number of devices, supporting a "loopback" case, wherein two actors share an RDMA buffer but are spawned on the same proc.

Differential Revision: D76937776
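One common way to offer a blocking and a non-blocking version of the same operation is to write the async variant once and have the blocking variant drive it to completion. The sketch below shows that pattern with plain `asyncio`; the function names and the byte-copy body are hypothetical placeholders, not the bindings added in this PR.

```python
import asyncio


async def read_into_nonblocking(src: bytes, dst: bytearray) -> int:
    # Hypothetical async variant: yield to the event loop once, then copy
    # as many bytes as fit into the destination buffer.
    await asyncio.sleep(0)
    n = min(len(src), len(dst))
    dst[:n] = src[:n]
    return n


def read_into_blocking(src: bytes, dst: bytearray) -> int:
    # Hypothetical blocking variant: runs the same coroutine to completion.
    # asyncio.run must not be called from inside a running event loop.
    return asyncio.run(read_into_nonblocking(src, dst))
```

Sharing one implementation keeps the two variants from drifting apart; whether the actual bindings are structured this way is not stated in the summary.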
63ce0c9 to 47406f1
47406f1 to aca7f36
Summary: This change introduces an `extension` folder for `monarch_rdma`, prepping for Python<>RDMA APIs:

- Consistent with other bindings, provides a blocking and non-blocking version.
- Corrects behavior of a few edge cases for RDMA components, i.e. `ibverbs_supported` based on the number of devices, supporting a "loopback" case, wherein two actors share an RDMA buffer but are spawned on the same proc.

Reviewed By: dstaay-fb

Differential Revision: D76937776
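The two edge cases named in the summary can be pictured with a small sketch. Both rules below are assumptions made for illustration: that `ibverbs_supported` reports support only when at least one device is present, and that "loopback" means the buffer owner and the caller sit on the same proc. The parameter names and the `is_loopback` helper are hypothetical and are not taken from `monarch_rdma`.

```python
def ibverbs_supported(num_devices: int) -> bool:
    # Assumption: support is reported only when at least one device exists.
    return num_devices > 0


def is_loopback(owner_proc_id: str, caller_proc_id: str) -> bool:
    # Assumption: a "loopback" transfer is one where both actors that
    # share the buffer were spawned on the same proc.
    return owner_proc_id == caller_proc_id
```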
aca7f36 to ef4acef
ef4acef to 1e20ff9
1e20ff9 to f2d0f91