Commit abd3462

Merge pull request #5972 from aravindksg/ofi_sep_master

MTL/OFI: Add OFI Scalable Endpoint support

2 parents 3113d20 + 109d056

File tree: 5 files changed (+793 additions, -194 deletions)


ompi/mca/mtl/ofi/README

Lines changed: 79 additions & 3 deletions

@@ -1,5 +1,5 @@
OFI MTL:
--------
The OFI MTL supports Libfabric (a.k.a. Open Fabrics Interfaces OFI,
https://ofiwg.github.io/libfabric/) tagged APIs (fi_tagged(3)). At
initialization time, the MTL queries libfabric for providers supporting tag matching

@@ -9,19 +9,22 @@ The user may modify the OFI provider selection with mca parameters
mtl_ofi_provider_include or mtl_ofi_provider_exclude.

PROGRESS:
---------
The MTL registers a progress function to opal_progress. There is currently
no support for asynchronous progress. The progress function reads multiple events
from the OFI provider Completion Queue (CQ) per iteration (defaults to 100, can be
modified with the MCA variable mtl_ofi_progress_event_cnt) and iterates until the
completion queue is drained.
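
Conceptually, one pass of that progress function resembles the sketch below.
This is an illustration only, not the MTL's actual code: the names
example_ofi_progress and event_cnt are made up for the example, and error
handling is omitted.

    #include <rdma/fi_eq.h>   /* fi_cq_read(), struct fi_cq_tagged_entry */

    /* Read up to event_cnt completions per fi_cq_read() call and keep reading
       until the CQ is drained (fi_cq_read() returns -FI_EAGAIN when empty). */
    static int example_ofi_progress(struct fid_cq *cq, size_t event_cnt)
    {
        struct fi_cq_tagged_entry wc[100];   /* 100 matches the default count */
        ssize_t ret;
        int completed = 0;

        if (event_cnt > 100)
            event_cnt = 100;

        do {
            ret = fi_cq_read(cq, wc, event_cnt);
            if (ret > 0) {
                /* Each wc[i].op_context identifies the posted request; the
                   real MTL invokes that request's completion callback here. */
                completed += (int) ret;
            }
        } while (ret > 0);   /* -FI_EAGAIN (CQ empty) or an error ends the loop */

        return completed;
    }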

COMPLETIONS:
------------
Each operation uses a request type ompi_mtl_ofi_request_t which includes a reference
to an operation-specific completion callback, an MPI request, and a context. The
context (fi_context) is used to map completion events to MPI requests when reading the
CQ.
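
As an illustration of that pairing, a hypothetical, simplified layout is shown
below; it is not the actual ompi_mtl_ofi_request_t definition, which carries
additional protocol state.

    #include <mpi.h>
    #include <rdma/fabric.h>   /* struct fi_context */
    #include <rdma/fi_eq.h>    /* struct fi_cq_tagged_entry */

    /* Hypothetical, simplified request layout (illustration only). */
    struct example_ofi_request {
        struct fi_context ctx;      /* handed to fi_tsend()/fi_trecv(); the
                                       completion's op_context field points
                                       back to this request */
        int (*event_callback)(struct fi_cq_tagged_entry *wc,
                              struct example_ofi_request *req);
        MPI_Request mpi_req;        /* MPI-level request completed by the callback */
    };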

OFI TAG:
--------
MPI needs to send 96 bits of information per message (a 32-bit communicator ID,
a 32-bit source rank, and a 32-bit MPI tag) but OFI only offers 64-bit tags. In
addition, the OFI MTL uses 2 bits of the OFI tag for the synchronous send protocol.

@@ -67,3 +70,76 @@ This is signaled in mem_tag_format (see fi_endpoint(3)) by setting higher order
to zero. In such cases, the OFI MTL will reduce the number of communicator ids supported
by reducing the bits available for the communicator ID field in the OFI tag.
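
As an illustration of how 96 bits can be folded into a 64-bit OFI tag, the
sketch below uses hypothetical field widths (2 protocol bits, 18
communicator-ID bits, 20 source-rank bits, 24 tag bits); the widths actually
used by the MTL are derived from the provider's mem_tag_format and will
differ. A provider that zeroes high-order bits of mem_tag_format effectively
shrinks the communicator-ID field, as described above.

    #include <stdint.h>

    /* Hypothetical widths, for illustration only: 2 + 18 + 20 + 24 = 64 bits.
       Callers must ensure each field fits its width. */
    #define EX_PROTO_BITS  2
    #define EX_CID_BITS   18
    #define EX_RANK_BITS  20
    #define EX_TAG_BITS   24

    static uint64_t example_pack_ofi_tag(uint64_t proto, uint64_t cid,
                                         uint64_t rank, uint64_t tag)
    {
        return (proto << (EX_CID_BITS + EX_RANK_BITS + EX_TAG_BITS)) |
               (cid   << (EX_RANK_BITS + EX_TAG_BITS))               |
               (rank  <<  EX_TAG_BITS)                               |
               (tag   & ((UINT64_C(1) << EX_TAG_BITS) - 1));
    }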

SCALABLE ENDPOINTS:
-------------------
The OFI MTL supports the OFI Scalable Endpoints feature as a means to improve
multi-threaded application throughput and message rate. Currently the feature
is designed to utilize multiple TX/RX contexts exposed by the OFI provider in
conjunction with a multi-communicator MPI application model. New OFI contexts
are therefore created lazily, as communicators are duplicated, rather than all
at once at initialization time, so that only as many contexts as needed are
created.
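
Conceptually, the lazy creation just described (and the round-robin re-use
covered in the notes below) can be sketched with the libfabric
scalable-endpoint calls from fi_endpoint(3). This is an illustration only: it
assumes a scalable endpoint sep has already been created, EX_MAX_CTX stands in
for the provider's context limit, the names and bookkeeping are made up, and
the real MTL additionally binds each new context to a CQ and enables it before
first use.

    #include <rdma/fabric.h>
    #include <rdma/fi_endpoint.h>   /* fi_tx_context(), fi_rx_context() */

    #define EX_MAX_CTX 16           /* stand-in for the provider's context limit */

    static struct fid_ep *ex_tx[EX_MAX_CTX], *ex_rx[EX_MAX_CTX];

    /* Lazy lookup: the first communicator mapped to a given index creates the
       TX/RX context pair on the scalable endpoint; later communicators that
       map to the same index (round robin) re-use it. */
    static int example_ctx_for_comm(struct fid_ep *sep, int comm_idx,
                                    struct fid_ep **tx, struct fid_ep **rx)
    {
        int idx = comm_idx % EX_MAX_CTX;
        int ret;

        if (ex_tx[idx] == NULL) {
            ret = fi_tx_context(sep, idx, NULL, &ex_tx[idx], NULL);
            if (ret != 0)
                return ret;
            ret = fi_rx_context(sep, idx, NULL, &ex_rx[idx], NULL);
            if (ret != 0)
                return ret;
            /* Real code binds each context to a CQ and enables it here. */
        }

        *tx = ex_tx[idx];
        *rx = ex_rx[idx];
        return 0;
    }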

1. Multi-communicator model:
   With this approach, the application first duplicates the communicators it
   wants to use with MPI operations (ideally creating as many communicators as
   the number of threads it wants to use to call into MPI). The duplicated
   communicators are then used by the corresponding threads to perform MPI
   operations. A possible usage scenario could be an MPI + OpenMP application
   as follows (example limited to 2 ranks):

        MPI_Comm dup_comm[n];

        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
        for (i = 0; i < n; i++) {
            MPI_Comm_dup(MPI_COMM_WORLD, &dup_comm[i]);
        }

        if (rank == 0) {
        #pragma omp parallel for private(status, host_sbuf, host_rbuf) num_threads(n)
            for (i = 0; i < n; i++) {
                MPI_Send(host_sbuf, MYBUFSIZE, MPI_CHAR,
                         1, MSG_TAG, dup_comm[i]);
                MPI_Recv(host_rbuf, MYBUFSIZE, MPI_CHAR,
                         1, MSG_TAG, dup_comm[i], &status);
            }
        } else if (rank == 1) {
        #pragma omp parallel for private(status, host_sbuf, host_rbuf) num_threads(n)
            for (i = 0; i < n; i++) {
                MPI_Recv(host_rbuf, MYBUFSIZE, MPI_CHAR,
                         0, MSG_TAG, dup_comm[i], &status);
                MPI_Send(host_sbuf, MYBUFSIZE, MPI_CHAR,
                         0, MSG_TAG, dup_comm[i]);
            }
        }

2. MCA variable:
   To utilize the feature, the following MCA variable needs to be set:

   mtl_ofi_thread_grouping:
      This MCA variable is at the OFI MTL level and needs to be set to switch
      the feature on.

      Default: 0

   It is not recommended to set the MCA variable for:
   - Multi-threaded MPI applications not following the multi-communicator
     approach.
   - Applications that have multiple threads using a single communicator, as
     this may degrade performance.

   Command-line syntax to set the MCA variable:
      "-mca mtl_ofi_thread_grouping 1"
   (a complete example launch line is shown after the performance notes below)

3. Notes on performance:
   - The OFI MTL will create as many TX/RX contexts as allowed by the
     underlying provider (each provider may have a different threshold). Once
     the threshold is exceeded, contexts are used in a round-robin fashion,
     which leads to resource sharing among threads. Therefore, locks are
     required to guard against race conditions. For performance, it is
     recommended to have

          Number of communicators = Number of contexts

     For example, when using the PSM2 provider, the number of contexts is
     dictated by the Intel Omni-Path HFI1 driver module.

   - For applications using a single thread with multiple communicators and
     the MCA variable "mtl_ofi_thread_grouping" set to 1, the MTL will use
     multiple contexts, but the benefits may be negligible as only one thread
     is driving progress.
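
   An example launch line for the 2-rank scenario in section 1 (a sketch only:
   "./your_app" and the process count are placeholders, and "-mca pml cm
   -mca mtl ofi" simply forces the OFI MTL path):

        mpirun -np 2 -mca pml cm -mca mtl ofi -mca mtl_ofi_thread_grouping 1 ./your_app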

ompi/mca/mtl/ofi/help-mtl-ofi.txt

Lines changed: 16 additions & 0 deletions

@@ -24,3 +24,19 @@ fi_info -v -p %s

Local host: %s
Location: %s:%d

[SEP unavailable]
Scalable Endpoint feature is required for Thread Grouping feature to work
but it is not supported by %s provider. Try disabling this feature.

Local host: %s
Location: %s:%d

[SEP ctxt limit]
Reached limit (%d) for number of OFI contexts that can be opened with the
provider. Creating new communicators beyond this limit is possible but
they will re-use existing contexts in round-robin fashion.
Using new communicators beyond the limit will impact performance.

Local host: %s
Location: %s:%d
