Summary:
This change exposes the Rust RDMA support through the Python APIs, replacing the existing stand-in `RDMABuffer` APIs.
There are quite a few changes, so here's a high-level overview:
## `monarch_hyperactor` (rust)
Within `PyProcMesh`, adds an optional `RdmaManagerActor` attribute.
This mimics the prior functionality - before, a `proc_mesh` (in Python) would unconditionally spin up an `RdmaManagerActor`.
Now, a `proc_mesh` will spin up an `RdmaManagerActor` if and only if `tensor_engine` is enabled and `ibverbs` APIs are supported.
A few other pieces must be accessible within the RDMA extension library, specifically `Mailbox` and the send capabilities (caps).
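
As a rough illustration of the spawn condition described above, here is a minimal Python sketch. The function names and the `spawn` callable are hypothetical and exist only for this example; the actual logic lives in Rust inside `PyProcMesh` and may be structured differently.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def should_spawn_rdma_manager(tensor_engine_enabled: bool, ibverbs_supported: bool) -> bool:
    """Hypothetical helper: True only when both preconditions hold."""
    return tensor_engine_enabled and ibverbs_supported


def maybe_spawn_rdma_manager(
    tensor_engine_enabled: bool,
    ibverbs_supported: bool,
    spawn: Callable[[], T],
) -> Optional[T]:
    # `spawn` is a placeholder for whatever actually creates the manager actor.
    if should_spawn_rdma_manager(tensor_engine_enabled, ibverbs_supported):
        return spawn()
    # Returning None models the optional attribute being left unset.
    return None
```

Modeling the result as an optional value keeps callers honest: any code path that needs RDMA has to handle the case where no manager was spawned.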
## RDMA extension
Adds an `extension` folder within `monarch_rdma` which contains the actual bindings.
Consistent with other bindings, provides both blocking and non-blocking versions.
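
To illustrate the blocking/non-blocking pairing in general terms, the sketch below uses a made-up `FakeRdmaBuffer` class backed by an in-memory byte array. The class and its method names are assumptions for illustration only and are not the bindings added by this change.

```python
import asyncio


class FakeRdmaBuffer:
    """Hypothetical stand-in for a remote buffer; not the real binding."""

    def __init__(self, data: bytes) -> None:
        self._data = bytearray(data)

    async def read_into(self, dst: bytearray) -> int:
        # Non-blocking variant: yields to the event loop while the
        # (simulated) transfer is in flight.
        await asyncio.sleep(0)
        n = min(len(dst), len(self._data))
        dst[:n] = self._data[:n]
        return n

    def read_into_blocking(self, dst: bytearray) -> int:
        # Blocking variant: drives the same coroutine to completion.
        # asyncio.run raises if an event loop is already running in this thread.
        return asyncio.run(self.read_into(dst))
```

Having the blocking variant wrap the non-blocking one keeps a single implementation of the transfer, at the cost that the blocking call cannot be used from inside a running event loop.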
## RDMA components
Corrects the behavior of a few edge cases, e.g. `ibverbs_supported` is now based on the number of devices, and a "loopback" case is now supported, wherein two actors share an RDMA buffer but are spawned on the same proc.
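
A minimal sketch of the two edge cases, under stated assumptions: it assumes "supported" means at least one device was enumerated, and that loopback can be detected by comparing proc identifiers. Both helpers and the string proc IDs are hypothetical; the real checks are implemented in Rust in `monarch_rdma`.

```python
def ibverbs_supported(num_devices: int) -> bool:
    # Assumption: support is reported when at least one device is present.
    return num_devices > 0


def is_loopback(local_proc_id: str, remote_proc_id: str) -> bool:
    # Assumption: two actors are in the loopback case when they were
    # spawned on the same proc, modeled here as equal proc identifiers.
    return local_proc_id == remote_proc_id
```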
## Python
Removes the stand-in `RDMAManagerActor`.
In `rdma.py`, adds basic CPU support for `RDMABuffer`.
While CUDA<>CPU|CUDA is implemented in `monarch_rdma`, it ran into issues with memory region (MR) registration.
Logging shows that the code path correctly interprets the address as CUDA, but `cuMemGetHandleForAddressRange` returns the handle/fd as `-1` and a null pointer MR.
`RDMABuffer` will only support CPU for now, with GPU support as a follow-up. cc dstaay-fb
# Test
Updates tests for anything touching RDMABuffer. Splits out `test_python_actor` into `test_python_actor` and `test_rdma` now that RDMA is utilizing the backend network.
Differential Revision: D76937776