Skip to content

Complete rework of the UCT BTL #13335

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 13 commits into
base: main
Choose a base branch
from

Conversation

hjelmn
Copy link
Member

@hjelmn hjelmn commented Jul 15, 2025

This PR contains a number of changes related to the UCT btl, these include:

  • Allow alternate connection methods for wiring up connect-to-endpoint transports. This includes using tcp to wire up ib. This may be needed depending on the level of UD support provided by a transport.
  • Clean up of modex code to improve readability.
  • Fix for a connection management issue that could cause accessing already-freed resources. This is separate from the other CLs but was discovered in the middle of the rework. I can take the time to extract if if needed.
  • Split the initialization code into a discovery and initialization phase. This is used to 1) check what memory domains are available, and 2) figure out network defaults. This code enables support for new MCA variables that allow controlling memory-domain specific behavior (since each memory domain may be a different btl).

It is possible that the current memory domain does not have an adequate
transport for forming endpoint to endpoint connections. When this is the case
the btl will fail to function. To support these situations this CL adds
support for using an alternate transport (usually tcp) which can be used to
make the endpoint connections.

Signed-off-by: Nathan Hjelm <hjelmn@google.com>
Copy link

Hello! The Git Commit Checker CI bot found a few problems with this PR:

c089b67: btl/uct: move device context code to a new file

  • check_signed_off: does not contain a valid Signed-off-by line

b3c3646: btl/uct: complete rework of descovery and initiali...

  • check_signed_off: does not contain a valid Signed-off-by line

114c30a: btl/uct: move uct_component from the module to mca...

  • check_signed_off: does not contain a valid Signed-off-by line

9abd9a1: btl/uct: keep UCT component list around after init...

  • check_signed_off: does not contain a valid Signed-off-by line

3a47090: btl/uct: put active tls in a list on the mca_btl_u...

  • check_signed_off: does not contain a valid Signed-off-by line

ab6f5be: btl/uct: module md_name to mca_btl_uct_md_t

  • check_signed_off: does not contain a valid Signed-off-by line

dcdf135: btl/uct: remove the need for a BTL module for conn...

  • check_signed_off: does not contain a valid Signed-off-by line

53d1b4d: btl/uct: move the async context from the module to...

  • check_signed_off: does not contain a valid Signed-off-by line

5a4f8a0: btl/uct: downgrade endpoing lock from recursive

  • check_signed_off: does not contain a valid Signed-off-by line

7b87df1: btl/uct: fix some errors around connection endpoin...

  • check_signed_off: does not contain a valid Signed-off-by line

5fb0d62: btl/uct: change mca_btl_uct_tl_t uct_dev_contexts ...

  • check_signed_off: does not contain a valid Signed-off-by line

a0f60ae: btl/uct: move tl attributes off of tl context stru...

  • check_signed_off: does not contain a valid Signed-off-by line

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

hjelmn added 12 commits July 15, 2025 15:11
In theory the tl attributes do not differ betweeen contexts so query them once
when the tl is created not once per context. This removes the need to allocate
the first context so that code has also been removed.

Signed-off-by: Nathan Hjelm <hjelmn@google.com>
The btl always allocates the maximum number of contexts. This is not a
significant amount of memory. Rather than reduce it to be based on the
configured maximum number of contexts it makes sense to just make it an array
and remove the extra indirection when accessing the contexts.

Signed-off-by: Nathan Hjelm <hjelmn@google.com>
This commit fixes a couple of bugs discovered using rma-mt:

 - Do not call mca_btl_uct_endpoint_set_flag before sending a message on the
   connection endpoint. This method may cause the release of the connection
   endpoint (cached on the BTL endpoint). If this happens it would lead to a
   SEGV.

 - Flush the endpoint only when it is being released. There is no need to do so
   on every send. Releasing the endpoint without flushing it may lead to it
   being destroyed while still processing data.

Signed-off-by: Nathan Hjelm <hjelmn@google.com>
Recursive locks are not needed for the endpoint lock. This commit modifies the
lock to be opal_mutex_t.

Signed-off-by: Nathan Hjelm <hjelmn@google.com>
There is no real benefit from sharing the async context between tls. Given this
and some other changes that will be made it makes sense to move it from the
module to the tl.

Signed-off-by: Nathan Hjelm <hjelmn@google.com>
Connection TLs are only used to form connections for connect-to-endpoint TLs.
They do not need to belong to the same memory domain as the one they are used
with so there is no need to rely on a BTL module. This commit moves the
pending_connection_reqs to the tl and changes the code to support a NULL
module for the connection tl.

Signed-off-by: Nathan Hjelm <hjelmn@google.com>
More cleanup preparing to separate connection-only tls from their module.

Signed-off-by: Nathan Hjelm <hjelmn@google.com>
This simplifies the code a bit by moving mca_btl_uct_tl_t ownership to the
mca_btl_uct_md_t class.

Signed-off-by: Nathan Hjelm <hjelmn@google.com>
From what I can tell it is not ok to be releasing this list while the components
within the list are still in use. To be safe keep the list around until the uct
component is closed.

Signed-off-by: Nathan Hjelm <hjelmn@google.com>
Signed-off-by: Nathan Hjelm <hjelmn@google.com>
There is an issue with btl/uct which prevents the usage of the standard
btl_uct_ MCA variables (eager limit, flags, etc). Because of the way the btl was
written these values are all determined directly from UCT and can not be changed
using the MCA variable interface. To address this issue this commit breaks apart
the initialization code and separates out the pieces that are necessary for
discovery only. The discovery pieces now use a new set of variables that include
the memory domain name and directly control the behavior for BTLs on that
memory domain as well as enabling the usage of the btl_uct variable to control
the defaults for these variables.

Example, using memory domain irdma0 will create variables:
btl_uct_irdma0_eager_limit, btl_uct_irdma0_max_send_size, etc.

The defaults will be based on what is reported by UCT and the user can set the
values to a subset of what UCT reports. For example, if the max send size for
the hardware is 8192B then it can be set to anything up to and including that
value. The same is true for feature flags, if the hardware supports only some
btl atomics or operations the user can specify a subset of them (others will
be ignored).

Signed-off-by: Nathan Hjelm <hjelmn@google.com>
There is a specific header for device contexts so it makes sense to move the
context-specific code to a matching C file. No changes in this other than moving
code around.

Signed-off-by: Nathan Hjelm <hjelmn@google.com>
@hjelmn hjelmn force-pushed the complete_rework_of_the_btl_uct_initialization_and_discovery_code_to_support_additional_configuration_methods_and_improve_overall_code branch from c089b67 to 349db3b Compare July 15, 2025 22:12
@hjelmn
Copy link
Member Author

hjelmn commented Jul 16, 2025

I know this will be a bit difficult to review. Looking at how I might be able to make it more reviewable. The first and second commit might be able to go in on their own but I improved the "used a different tl" logic in another commit. Was kind of hard working on this with the VMs being deleted once every 5 days and doing more defensive commits than having a good breakdown. I can probably split off creating the new logic vs removing the old. Will see.

What I really need is some verification of this on Mellanox hardware. This code greatly improves what we need for our RoCE but I don't want to break IB and I don't have anything to test with.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant