
Commit 37f9aff

mtl/ofi: Add MCA variables to enable SEP and to request number of OFI contexts
Moving to a model where users actively _enable_ the SEP feature rather than opening SEPs by default whenever the provider supports them. This avoids regressing (either functionally or for performance reasons) any apps that were working correctly on regular endpoints.

Also provides an MCA variable to specify the number of OFI contexts to create, defaulting to 1. (Since btl/ofi also creates one context by default, this reduces the incidence of a scenario where we allocate all available contexts by default and the provider then breaks when btl/ofi asks for one more, as it cannot support it.)

While at it, spruce up the README's SEP content.

Signed-off-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@intel.com>
1 parent d1fd1f4 commit 37f9aff

File tree: 5 files changed, +203 −121 lines


ompi/mca/mtl/ofi/README

Lines changed: 44 additions & 8 deletions
@@ -111,11 +111,22 @@ favours only creating as many contexts as needed.
 }
 }
 
-2. MCA variable:
+2. MCA variables:
 To utilize the feature, the following MCA variable needs to be set:
+mtl_ofi_enable_sep:
+  This MCA variable needs to be set to enable the use of Scalable Endpoints
+  feature in the OFI MTL. The underlying provider is also checked to ensure the
+  feature is supported. If the provider chosen does not support it, user needs
+  to either set this variable to 0 or select different provider which supports
+  the feature.
+
+  Default: 0
+
+  Command-line syntax:
+  "-mca mtl_ofi_enable_sep 1"
+
 mtl_ofi_thread_grouping:
-  This MCA variable is at the OFI MTL level and needs to be set to switch
-  the feature on.
+  This MCA variable needs to be set to switch Thread Grouping feature on.
 
   Default: 0
 
@@ -124,21 +135,46 @@ To utilize the feature, the following MCA variable needs to be set:
 - Applications that have multiple threads using a single communicator as
   it may degrade performance.
 
-Command-line syntax to set the MCA variable:
-"-mca mtl_ofi_thread_grouping 1"
+  Command-line syntax:
+  "-mca mtl_ofi_thread_grouping 1"
+
+mtl_ofi_num_ctxts:
+  MCA variable allows user to set the number of OFI contexts the applications
+  expects to use. For multi-threaded applications using Thread Grouping
+  feature, this number should be set to the number of user threads that will
+  call into MPI. For single-threaded applications one OFI context is
+  sufficient.
+
+  Default: 1
+
+  Command-line syntax:
+  "-mca mtl_ofi_num_ctxts N" [ N: number of OFI contexts required by
+                               application ]
 
 3. Notes on performance:
-- OFI MTL will create as many TX/RX contexts as allowed by an underlying
-  provider (each provider may have different thresholds). Once the threshold
+- OFI MTL will create as many TX/RX contexts as set by MCA mtl_ofi_num_ctxts.
+  The number of contexts that can be created is also limited by the underlying
+  provider as each provider may have different thresholds. Once the threshold
   is exceeded, contexts are used in a round-robin fashion which leads to
   resource sharing among threads. Therefore locks are required to guard
   against race conditions. For performance, it is recommended to have
 
-  Number of communicators = Number of contexts
+  Number of threads = Number of communicators = Number of contexts
 
   For example, when using PSM2 provider, the number of contexts is dictated
   by the Intel Omni-Path HFI1 driver module.
 
+- OPAL layer allows for multiple threads to enter progress simultaneously. To
+  enable this feature, user needs to set MCA variable
+  "max_thread_in_progress". When using Thread Grouping feature, it is
+  recommended to set this MCA parameter to the number of threads expected to
+  call into MPI as it provides performance benefits.
+
+  Command-line syntax:
+  "-mca opal_max_thread_in_progress N" [ N: number of threads expected to
+                                         make MPI calls ]
+  Default: 1
+
 - For applications using a single thread with multiple communicators and MCA
   variable "mtl_ofi_thread_grouping" set to 1, the MTL will use multiple
   contexts, but the benefits may be negligible as only one thread is driving

ompi/mca/mtl/ofi/help-mtl-ofi.txt

Lines changed: 32 additions & 7 deletions
@@ -26,17 +26,42 @@ fi_info -v -p %s
   Location: %s:%d
 
 [SEP unavailable]
-Scalable Endpoint feature is required for Thread Grouping feature to work
-but it is not supported by %s provider. Try disabling this feature.
+Scalable Endpoint feature is enabled by the user but it is not supported by
+%s provider. Try disabling this feature or use a different provider that
+supports it using mtl_ofi_provider_include.
 
   Local host: %s
   Location: %s:%d
 
-[SEP ctxt limit]
-Reached limit (%d) for number of OFI contexts that can be opened with the
-provider. Creating new communicators beyond this limit is possible but
-they will re-use existing contexts in round-robin fashion.
-Using new communicators beyond the limit will impact performance.
+[SEP required]
+Scalable Endpoint feature is required for Thread Grouping feature to work.
+Please try enabling Scalable Endpoints using mtl_ofi_enable_sep.
+
+  Local host: %s
+  Location: %s:%d
+
+[SEP thread grouping ctxt limit]
+Reached limit (%d) for number of OFI contexts set by mtl_ofi_num_ctxts.
+Please set mtl_ofi_num_ctxts to a larger value if you need more contexts.
+If an MPI application creates more communicators than mtl_ofi_num_ctxts,
+OFI MTL will make the new communicators re-use existing contexts in
+round-robin fashion which will impact performance.
+
+  Local host: %s
+  Location: %s:%d
+
+[Local ranks exceed ofi contexts]
+Number of local ranks exceed the number of available OFI contexts in %s
+provider and we cannot provision enough contexts for each rank. Try disabling
+Scalable Endpoint feature.
+
+  Local host: %s
+  Location: %s:%d
+
+[Ctxts exceeded available]
+User requested for more than available contexts from provider. Limiting
+to max allowed (%d). Contexts will be re used in round-robin fashion if there
+are more threads than the available contexts.
 
   Local host: %s
   Location: %s:%d

ompi/mca/mtl/ofi/mtl_ofi.h

Lines changed: 36 additions & 43 deletions
@@ -327,16 +327,7 @@ ompi_mtl_ofi_isend_callback(struct fi_cq_tagged_entry *wc,
 
 #define MTL_OFI_MAP_COMM_TO_CONTEXT(comm_id, ctxt_id) \
     do { \
-        if (ompi_mtl_ofi.thread_grouping && \
-            (!ompi_mtl_ofi.threshold_comm_context_id || \
-             ((uint32_t) ompi_mtl_ofi.threshold_comm_context_id > comm_id))) { \
-            ctxt_id = ompi_mtl_ofi.comm_to_context[comm_id]; \
-        } else if (ompi_mtl_ofi.thread_grouping) { \
-            /* Round-robin assignment of contexts if threshold is reached */ \
-            ctxt_id = comm_id % ompi_mtl_ofi.total_ctxts_used; \
-        } else { \
-            ctxt_id = 0; \
-        } \
+        ctxt_id = ompi_mtl_ofi.comm_to_context[comm_id]; \
     } while (0);
 
 __opal_attribute_always_inline__ static inline int
@@ -348,40 +339,40 @@ ompi_mtl_ofi_ssend_recv(ompi_mtl_ofi_request_t *ack_req,
                         uint64_t *match_bits,
                         int tag)
 {
-     ssize_t ret = OMPI_SUCCESS;
-     int ctxt_id = 0;
+    ssize_t ret = OMPI_SUCCESS;
+    int ctxt_id = 0;
 
-     MTL_OFI_MAP_COMM_TO_CONTEXT(comm->c_contextid, ctxt_id);
-     set_thread_context(ctxt_id);
+    MTL_OFI_MAP_COMM_TO_CONTEXT(comm->c_contextid, ctxt_id);
+    set_thread_context(ctxt_id);
 
-     ack_req = malloc(sizeof(ompi_mtl_ofi_request_t));
-     assert(ack_req);
+    ack_req = malloc(sizeof(ompi_mtl_ofi_request_t));
+    assert(ack_req);
 
-     ack_req->parent = ofi_req;
-     ack_req->event_callback = ompi_mtl_ofi_send_ack_callback;
-     ack_req->error_callback = ompi_mtl_ofi_send_ack_error_callback;
+    ack_req->parent = ofi_req;
+    ack_req->event_callback = ompi_mtl_ofi_send_ack_callback;
+    ack_req->error_callback = ompi_mtl_ofi_send_ack_error_callback;
 
-     ofi_req->completion_count += 1;
+    ofi_req->completion_count += 1;
 
-     MTL_OFI_RETRY_UNTIL_DONE(fi_trecv(ompi_mtl_ofi.ofi_ctxt[ctxt_id].rx_ep,
-                                       NULL,
-                                       0,
-                                       NULL,
-                                       *src_addr,
-                                       *match_bits | ompi_mtl_ofi.sync_send_ack,
-                                       0, /* Exact match, no ignore bits */
-                                       (void *) &ack_req->ctx), ret);
-     if (OPAL_UNLIKELY(0 > ret)) {
-         opal_output_verbose(1, ompi_mtl_base_framework.framework_output,
-                             "%s:%d: fi_trecv failed: %s(%zd)",
-                             __FILE__, __LINE__, fi_strerror(-ret), ret);
-         free(ack_req);
-         return ompi_mtl_ofi_get_error(ret);
-     }
+    MTL_OFI_RETRY_UNTIL_DONE(fi_trecv(ompi_mtl_ofi.ofi_ctxt[ctxt_id].rx_ep,
+                                      NULL,
+                                      0,
+                                      NULL,
+                                      *src_addr,
+                                      *match_bits | ompi_mtl_ofi.sync_send_ack,
+                                      0, /* Exact match, no ignore bits */
+                                      (void *) &ack_req->ctx), ret);
+    if (OPAL_UNLIKELY(0 > ret)) {
+        opal_output_verbose(1, ompi_mtl_base_framework.framework_output,
+                            "%s:%d: fi_trecv failed: %s(%zd)",
+                            __FILE__, __LINE__, fi_strerror(-ret), ret);
+        free(ack_req);
+        return ompi_mtl_ofi_get_error(ret);
+    }
 
-     /* The SYNC_SEND tag bit is set for the send operation only.*/
-     MTL_OFI_SET_SYNC_SEND(*match_bits);
-     return OMPI_SUCCESS;
+    /* The SYNC_SEND tag bit is set for the send operation only.*/
+    MTL_OFI_SET_SYNC_SEND(*match_bits);
+    return OMPI_SUCCESS;
 }
 
 __opal_attribute_always_inline__ static inline int
@@ -1242,13 +1233,15 @@ static int ompi_mtl_ofi_init_contexts(struct mca_mtl_base_module_t *mtl,
     }
 
     /*
-     * We only create upto Max number of contexts allowed by provider.
+     * We only create upto Max number of contexts asked for by the user.
      * If user enables thread grouping feature and creates more number of
-     * communicators than we have contexts, then we set the threshold
-     * context_id so we know to use context 0 for operations involving these
-     * "extra" communicators.
+     * communicators than available contexts, then we set the threshold
+     * context_id so that new communicators created beyond the threshold
+     * will be assigned to contexts in a round-robin fashion.
      */
-    if (ompi_mtl_ofi.max_ctx_cnt <= ctxt_id) {
+    if (ompi_mtl_ofi.num_ofi_contexts <= ompi_mtl_ofi.total_ctxts_used) {
+        ompi_mtl_ofi.comm_to_context[comm->c_contextid] = comm->c_contextid %
+                                                          ompi_mtl_ofi.total_ctxts_used;
         if (!ompi_mtl_ofi.threshold_comm_context_id) {
             ompi_mtl_ofi.threshold_comm_context_id = comm->c_contextid;
