Commit 3aba0bb

opal/ofi: update nic selection function doc

The documentation needs an update to reflect latest implementation. The original cpuset matching logic has been replaced with a new distance calculation algorithm. This change also clarifies the round-robin selection process when we need to break a tie.

Signed-off-by: Wenduo Wang <wenduwan@amazon.com>
1 parent b061f96 commit 3aba0bb

2 files changed, +50 -46 lines changed

opal/mca/common/ofi/common_ofi.c

Lines changed: 8 additions & 4 deletions
@@ -623,10 +623,10 @@ static int get_provider_distance(struct fi_info *provider, hwloc_topology_t topo
 /**
  * @brief Get the nearest device to the current thread
  *
- * Use the PMIx server or calculate the device distances, then out of the set of
- * returned distances find the subset of the nearest devices. This can be
- * 0 or more.
- * If there are multiple equidistant devices, break the tie using the rank.
+ * Compute the distances from the current thread to each NIC in provider_list,
+ * and select the NIC with the shortest distance.
+ * If there are multiple equidistant devices, break the tie using local rank
+ * to balance NIC utilization.
  *
  * @param[in] topoloy hwloc topology
  * @param[in] provider_list List of providers to select from
@@ -898,6 +898,10 @@ struct fi_info *opal_common_ofi_select_provider(struct fi_info *provider_list,
     package_rank = get_package_rank(process_info);
 
 #if OPAL_OFI_PCI_DATA_AVAILABLE
+    /**
+     * If provider PCI BDF information is available, we calculate its physical distance
+     * to the current process, and select the provider with the shortest distance.
+     */
     ret = get_nearest_nic(opal_hwloc_topology, provider_list, num_providers, package_rank,
                           &provider);
     if (OPAL_SUCCESS == ret) {
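
The updated comment on get_nearest_nic describes a shortest-distance search with a rank-based tie-break. The following is a minimal sketch of that idea only, not the Open MPI implementation: it assumes the distances have already been computed into a plain array, and the helper name `pick_nearest_nic` and its parameters are hypothetical.

```c
#include <limits.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of Open MPI): given one precomputed
 * distance per NIC, return the index of the NIC to use, or -1 if the
 * list is empty. Ties are broken round-robin using the caller's rank.
 */
static int pick_nearest_nic(const unsigned *distances, size_t num_nics,
                            unsigned rank)
{
    unsigned min_distance = UINT_MAX;
    size_t num_nearest = 0;

    /* Pass 1: find the shortest distance and count the NICs sharing it. */
    for (size_t i = 0; i < num_nics; i++) {
        if (distances[i] < min_distance) {
            min_distance = distances[i];
            num_nearest = 1;
        } else if (distances[i] == min_distance) {
            num_nearest++;
        }
    }
    if (0 == num_nearest) {
        return -1;
    }

    /* Pass 2: walk the equidistant NICs and stop at (rank % count). */
    size_t target = rank % num_nearest;
    for (size_t i = 0; i < num_nics; i++) {
        if (distances[i] == min_distance) {
            if (0 == target) {
                return (int) i;
            }
            target--;
        }
    }
    return -1; /* not reached when num_nearest > 0 */
}
```

With distances {3, 1, 1, 2}, ranks 0 and 2 map to index 1 and rank 1 maps to index 2, so processes with consecutive ranks alternate between the two nearest NICs.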

opal/mca/common/ofi/common_ofi.h

Lines changed: 42 additions & 42 deletions
@@ -103,47 +103,47 @@ OPAL_DECLSPEC int opal_common_ofi_is_in_list(char **list, char *item);
 /**
  * Selects NIC (provider) based on hardware locality
  *
- * In multi-nic situations, use hardware topology to pick the "best"
- * of the selected NICs.
- * There are 3 main cases that this covers:
- *
- *   1. If the first provider passed into this function is the only valid
- *      provider, this provider is returned.
- *
- *   2. If there is more than 1 provider that matches the type of the first
- *      provider in the list, and the BDF data
- *      is available then a provider is selected based on locality of device
- *      cpuset and process cpuset and tries to ensure that processes
- *      are distributed evenly across NICs. This has two separate
- *      cases:
- *
- *      i. There is one or more provider local to the process:
- *
- *         (local rank % number of providers of the same type
- *         that share the process cpuset) is used to select one
- *         of these providers.
- *
- *      ii. There is no provider that is local to the process:
- *
- *         (local rank % number of providers of the same type)
- *         is used to select one of these providers
- *
- *   3. If there is more than 1 providers of the same type in the
- *      list, and the BDF data is not available (the ofi version does
- *      not support fi_info.nic or the provider does not support BDF)
- *      then (local rank % number of providers of the same type) is
- *      used to select one of these providers
- *
- * @param provider_list (IN) struct fi_info* An initially selected
- *                           provider NIC. The provider name and
- *                           attributes are used to restrict NIC
- *                           selection. This provider is returned if the
- *                           NIC selection fails.
- *
- * @param provider (OUT) struct fi_info* object with the selected
- *                       provider if the selection succeeds
- *                       if the selection fails, returns the fi_info
- *                       object that was initially provided.
+ * The selection is based on the following priority:
+ *
+ * Single-NIC:
+ *
+ *   If only 1 provider is available, always return that provider.
+ *
+ * Multi-NIC:
+ *
+ *   1. If the process is NOT bound, pick a NIC using (local rank % number
+ *      of providers of the same type). This gives a fair chance to each
+ *      qualified NIC and balances overall utilization.
+ *
+ *   2. If the process is bound, we compare providers in the list that have
+ *      the same type as the first provider, and find the provider with the
+ *      shortest distance to the current process.
+ *
+ *      i. If the provider has PCI BDF data, we attempt to compute the
+ *         distance between the NIC and the current process cpuset. The NIC
+ *         with the shortest distance is returned.
+ *
+ *         * For equidistant NICs, we select a NIC in round-robin fashion
+ *           using the package rank of the current process, i.e. (package
+ *           rank % number of providers with the same distance).
+ *
+ *      ii. If we cannot compute the distance between the NIC and the
+ *          current process, e.g. PCI BDF data is not available, a NIC will be
+ *          selected in a round-robin fashion using package rank, i.e. (package
+ *          rank % number of providers of the same type).
+ *
+ * @param[in] provider_list struct fi_info* An initially selected
+ *                          provider NIC. The provider name and
+ *                          attributes are used to restrict NIC
+ *                          selection. This provider is returned if the
+ *                          NIC selection fails.
+ *
+ * @param[in] process_info opal_process_info_t* The current process info
+ *
+ * @param[out] provider struct fi_info* object with the selected
+ *                      provider if the selection succeeds
+ *                      if the selection fails, returns the fi_info
+ *                      object that was initially provided.
  *
  * All errors should be recoverable and will return the initially provided
  * provider. However, if an error occurs we can no longer guarantee
@@ -152,7 +152,7 @@ OPAL_DECLSPEC int opal_common_ofi_is_in_list(char **list, char *item);
  *
  */
 OPAL_DECLSPEC struct fi_info *opal_common_ofi_select_provider(struct fi_info *provider_list,
-    opal_process_info_t *process_info);
+    opal_process_info_t *process_info);
 
 /**
  * Obtain EP endpoint name
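
The selection priority documented in the new header comment can be read as a small decision tree. The sketch below is one possible rendering under stated assumptions, not the code in common_ofi.c: the `selection_input` struct, its fields, and `select_nic_index` are all hypothetical stand-ins for the real `struct fi_info` list and `opal_process_info_t`, and distances are assumed to be precomputed.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified inputs (not Open MPI types). */
struct selection_input {
    size_t num_nics;           /* providers of the same type as the first */
    bool process_is_bound;     /* whether the process has a CPU binding */
    bool distances_available;  /* false when, e.g., PCI BDF data is missing */
    const unsigned *distances; /* one entry per NIC if distances_available */
    unsigned local_rank;
    unsigned package_rank;
};

/* Returns the index of the chosen NIC, following the documented priority. */
static size_t select_nic_index(const struct selection_input *in)
{
    /* Single-NIC: nothing to choose. */
    if (in->num_nics <= 1) {
        return 0;
    }
    /* Multi-NIC case 1: unbound process, round-robin on local rank. */
    if (!in->process_is_bound) {
        return in->local_rank % in->num_nics;
    }
    /* Multi-NIC case 2.ii: bound, but no distance can be computed. */
    if (!in->distances_available) {
        return in->package_rank % in->num_nics;
    }
    /* Multi-NIC case 2.i: bound, pick the nearest NIC. */
    unsigned min_distance = in->distances[0];
    size_t num_nearest = 1;
    for (size_t i = 1; i < in->num_nics; i++) {
        if (in->distances[i] < min_distance) {
            min_distance = in->distances[i];
            num_nearest = 1;
        } else if (in->distances[i] == min_distance) {
            num_nearest++;
        }
    }
    /* Equidistant NICs: round-robin on package rank. */
    size_t target = in->package_rank % num_nearest;
    for (size_t i = 0; i < in->num_nics; i++) {
        if (in->distances[i] == min_distance && 0 == target--) {
            return i;
        }
    }
    return 0; /* not reached */
}
```

Keeping the unbound check ahead of the distance logic mirrors the ordering in the comment: a distance to the process cpuset is only meaningful once the process is bound.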
