Commit dc6eb5d

Merge pull request #6226 from aravindksg/sep_mca
mtl/ofi: Add MCA variables to enable SEP and to request OFI contexts
2 parents d1fd1f4 + 37f9aff commit dc6eb5d

5 files changed: +203 -121 lines changed

ompi/mca/mtl/ofi/README

Lines changed: 44 additions & 8 deletions
@@ -111,11 +111,22 @@ favours only creating as many contexts as needed.
 }
 }
 
-2. MCA variable:
+2. MCA variables:
 To utilize the feature, the following MCA variable needs to be set:
+mtl_ofi_enable_sep:
+This MCA variable needs to be set to enable the use of Scalable Endpoints
+feature in the OFI MTL. The underlying provider is also checked to ensure the
+feature is supported. If the provider chosen does not support it, user needs
+to either set this variable to 0 or select different provider which supports
+the feature.
+
+Default: 0
+
+Command-line syntax:
+"-mca mtl_ofi_enable_sep 1"
+
 mtl_ofi_thread_grouping:
-This MCA variable is at the OFI MTL level and needs to be set to switch
-the feature on.
+This MCA variable needs to be set to switch Thread Grouping feature on.
 
 Default: 0
 
@@ -124,21 +135,46 @@ To utilize the feature, the following MCA variable needs to be set:
 - Applications that have multiple threads using a single communicator as
 it may degrade performance.
 
-Command-line syntax to set the MCA variable:
-"-mca mtl_ofi_thread_grouping 1"
+Command-line syntax:
+"-mca mtl_ofi_thread_grouping 1"
+
+mtl_ofi_num_ctxts:
+MCA variable allows user to set the number of OFI contexts the applications
+expects to use. For multi-threaded applications using Thread Grouping
+feature, this number should be set to the number of user threads that will
+call into MPI. For single-threaded applications one OFI context is
+sufficient.
+
+Default: 1
+
+Command-line syntax:
+"-mca mtl_ofi_num_ctxts N" [ N: number of OFI contexts required by
+application ]
 
 3. Notes on performance:
-- OFI MTL will create as many TX/RX contexts as allowed by an underlying
-provider (each provider may have different thresholds). Once the threshold
+- OFI MTL will create as many TX/RX contexts as set by MCA mtl_ofi_num_ctxts.
+The number of contexts that can be created is also limited by the underlying
+provider as each provider may have different thresholds. Once the threshold
 is exceeded, contexts are used in a round-robin fashion which leads to
 resource sharing among threads. Therefore locks are required to guard
 against race conditions. For performance, it is recommended to have
 
-Number of communicators = Number of contexts
+Number of threads = Number of communicators = Number of contexts
 
 For example, when using PSM2 provider, the number of contexts is dictated
 by the Intel Omni-Path HFI1 driver module.
 
+- OPAL layer allows for multiple threads to enter progress simultaneously. To
+enable this feature, user needs to set MCA variable
+"max_thread_in_progress". When using Thread Grouping feature, it is
+recommended to set this MCA parameter to the number of threads expected to
+call into MPI as it provides performance benefits.
+
+Command-line syntax:
+"-mca opal_max_thread_in_progress N" [ N: number of threads expected to
+make MPI calls ]
+Default: 1
+
 - For applications using a single thread with multiple communicators and MCA
 variable "mtl_ofi_thread_grouping" set to 1, the MTL will use multiple
 contexts, but the benefits may be negligible as only one thread is driving
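
For reference, the variables documented in this README diff could be combined
on a single command line. The sketch below assumes a 2-rank job with 4
MPI-calling threads per rank and a provider that supports Scalable Endpoints;
the psm2 provider and the ./app binary are illustrative placeholders, while
the pml/mtl selection flags are standard Open MPI options:

    mpirun -np 2 \
        -mca pml cm -mca mtl ofi \
        -mca mtl_ofi_provider_include psm2 \
        -mca mtl_ofi_enable_sep 1 \
        -mca mtl_ofi_thread_grouping 1 \
        -mca mtl_ofi_num_ctxts 4 \
        -mca opal_max_thread_in_progress 4 \
        ./app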

ompi/mca/mtl/ofi/help-mtl-ofi.txt

Lines changed: 32 additions & 7 deletions
@@ -26,17 +26,42 @@ fi_info -v -p %s
 Location: %s:%d
 
 [SEP unavailable]
-Scalable Endpoint feature is required for Thread Grouping feature to work
-but it is not supported by %s provider. Try disabling this feature.
+Scalable Endpoint feature is enabled by the user but it is not supported by
+%s provider. Try disabling this feature or use a different provider that
+supports it using mtl_ofi_provider_include.
 
 Local host: %s
 Location: %s:%d
 
-[SEP ctxt limit]
-Reached limit (%d) for number of OFI contexts that can be opened with the
-provider. Creating new communicators beyond this limit is possible but
-they will re-use existing contexts in round-robin fashion.
-Using new communicators beyond the limit will impact performance.
+[SEP required]
+Scalable Endpoint feature is required for Thread Grouping feature to work.
+Please try enabling Scalable Endpoints using mtl_ofi_enable_sep.
+
+Local host: %s
+Location: %s:%d
+
+[SEP thread grouping ctxt limit]
+Reached limit (%d) for number of OFI contexts set by mtl_ofi_num_ctxts.
+Please set mtl_ofi_num_ctxts to a larger value if you need more contexts.
+If an MPI application creates more communicators than mtl_ofi_num_ctxts,
+OFI MTL will make the new communicators re-use existing contexts in
+round-robin fashion which will impact performance.
+
+Local host: %s
+Location: %s:%d
+
+[Local ranks exceed ofi contexts]
+Number of local ranks exceed the number of available OFI contexts in %s
+provider and we cannot provision enough contexts for each rank. Try disabling
+Scalable Endpoint feature.
+
+Local host: %s
+Location: %s:%d
+
+[Ctxts exceeded available]
+User requested for more than available contexts from provider. Limiting
+to max allowed (%d). Contexts will be re used in round-robin fashion if there
+are more threads than the available contexts.
 
 Local host: %s
 Location: %s:%d

ompi/mca/mtl/ofi/mtl_ofi.h

Lines changed: 36 additions & 43 deletions
@@ -327,16 +327,7 @@ ompi_mtl_ofi_isend_callback(struct fi_cq_tagged_entry *wc,
 
 #define MTL_OFI_MAP_COMM_TO_CONTEXT(comm_id, ctxt_id) \
     do { \
-        if (ompi_mtl_ofi.thread_grouping && \
-            (!ompi_mtl_ofi.threshold_comm_context_id || \
-            ((uint32_t) ompi_mtl_ofi.threshold_comm_context_id > comm_id))) { \
-            ctxt_id = ompi_mtl_ofi.comm_to_context[comm_id]; \
-        } else if (ompi_mtl_ofi.thread_grouping) { \
-            /* Round-robin assignment of contexts if threshold is reached */ \
-            ctxt_id = comm_id % ompi_mtl_ofi.total_ctxts_used; \
-        } else { \
-            ctxt_id = 0; \
-        } \
+        ctxt_id = ompi_mtl_ofi.comm_to_context[comm_id]; \
     } while (0);
 
 __opal_attribute_always_inline__ static inline int
@@ -348,40 +339,40 @@ ompi_mtl_ofi_ssend_recv(ompi_mtl_ofi_request_t *ack_req,
                         uint64_t *match_bits,
                         int tag)
 {
-        ssize_t ret = OMPI_SUCCESS;
-        int ctxt_id = 0;
+    ssize_t ret = OMPI_SUCCESS;
+    int ctxt_id = 0;
 
-        MTL_OFI_MAP_COMM_TO_CONTEXT(comm->c_contextid, ctxt_id);
-        set_thread_context(ctxt_id);
+    MTL_OFI_MAP_COMM_TO_CONTEXT(comm->c_contextid, ctxt_id);
+    set_thread_context(ctxt_id);
 
-        ack_req = malloc(sizeof(ompi_mtl_ofi_request_t));
-        assert(ack_req);
+    ack_req = malloc(sizeof(ompi_mtl_ofi_request_t));
+    assert(ack_req);
 
-        ack_req->parent = ofi_req;
-        ack_req->event_callback = ompi_mtl_ofi_send_ack_callback;
-        ack_req->error_callback = ompi_mtl_ofi_send_ack_error_callback;
+    ack_req->parent = ofi_req;
+    ack_req->event_callback = ompi_mtl_ofi_send_ack_callback;
+    ack_req->error_callback = ompi_mtl_ofi_send_ack_error_callback;
 
-        ofi_req->completion_count += 1;
+    ofi_req->completion_count += 1;
 
-        MTL_OFI_RETRY_UNTIL_DONE(fi_trecv(ompi_mtl_ofi.ofi_ctxt[ctxt_id].rx_ep,
-                                          NULL,
-                                          0,
-                                          NULL,
-                                          *src_addr,
-                                          *match_bits | ompi_mtl_ofi.sync_send_ack,
-                                          0, /* Exact match, no ignore bits */
-                                          (void *) &ack_req->ctx), ret);
-        if (OPAL_UNLIKELY(0 > ret)) {
-            opal_output_verbose(1, ompi_mtl_base_framework.framework_output,
-                                "%s:%d: fi_trecv failed: %s(%zd)",
-                                __FILE__, __LINE__, fi_strerror(-ret), ret);
-            free(ack_req);
-            return ompi_mtl_ofi_get_error(ret);
-        }
+    MTL_OFI_RETRY_UNTIL_DONE(fi_trecv(ompi_mtl_ofi.ofi_ctxt[ctxt_id].rx_ep,
+                                      NULL,
+                                      0,
+                                      NULL,
+                                      *src_addr,
+                                      *match_bits | ompi_mtl_ofi.sync_send_ack,
+                                      0, /* Exact match, no ignore bits */
+                                      (void *) &ack_req->ctx), ret);
+    if (OPAL_UNLIKELY(0 > ret)) {
+        opal_output_verbose(1, ompi_mtl_base_framework.framework_output,
+                            "%s:%d: fi_trecv failed: %s(%zd)",
+                            __FILE__, __LINE__, fi_strerror(-ret), ret);
+        free(ack_req);
+        return ompi_mtl_ofi_get_error(ret);
+    }
 
-        /* The SYNC_SEND tag bit is set for the send operation only.*/
-        MTL_OFI_SET_SYNC_SEND(*match_bits);
-        return OMPI_SUCCESS;
+    /* The SYNC_SEND tag bit is set for the send operation only.*/
+    MTL_OFI_SET_SYNC_SEND(*match_bits);
+    return OMPI_SUCCESS;
 }
 
 __opal_attribute_always_inline__ static inline int
@@ -1242,13 +1233,15 @@ static int ompi_mtl_ofi_init_contexts(struct mca_mtl_base_module_t *mtl,
     }
 
     /*
-     * We only create upto Max number of contexts allowed by provider.
+     * We only create upto Max number of contexts asked for by the user.
      * If user enables thread grouping feature and creates more number of
-     * communicators than we have contexts, then we set the threshold
-     * context_id so we know to use context 0 for operations involving these
-     * "extra" communicators.
+     * communicators than available contexts, then we set the threshold
+     * context_id so that new communicators created beyond the threshold
+     * will be assigned to contexts in a round-robin fashion.
      */
-    if (ompi_mtl_ofi.max_ctx_cnt <= ctxt_id) {
+    if (ompi_mtl_ofi.num_ofi_contexts <= ompi_mtl_ofi.total_ctxts_used) {
+        ompi_mtl_ofi.comm_to_context[comm->c_contextid] = comm->c_contextid %
+                                                          ompi_mtl_ofi.total_ctxts_used;
         if (!ompi_mtl_ofi.threshold_comm_context_id) {
             ompi_mtl_ofi.threshold_comm_context_id = comm->c_contextid;
 
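The mtl_ofi.h changes above move the round-robin decision out of the
per-operation MTL_OFI_MAP_COMM_TO_CONTEXT macro and into context
initialization, so the lookup on the hot path becomes a single array read.
The following standalone C sketch models that assignment scheme; names such
as MAX_COMMS and assign_context, and all sizes, are illustrative assumptions
rather than the Open MPI source:

    #include <stdio.h>

    #define MAX_COMMS 8                /* hypothetical communicator count */

    static int comm_to_context[MAX_COMMS];
    static int total_ctxts_used = 0;
    static int num_ofi_contexts = 3;   /* stand-in for mtl_ofi_num_ctxts */

    /* Run once per new communicator, mirroring the updated
     * ompi_mtl_ofi_init_contexts(): hand out fresh contexts until the
     * user-requested count is exhausted, then re-use them round-robin. */
    static void assign_context(int comm_id)
    {
        if (num_ofi_contexts <= total_ctxts_used) {
            comm_to_context[comm_id] = comm_id % total_ctxts_used;
        } else {
            comm_to_context[comm_id] = total_ctxts_used++;
        }
    }

    int main(void)
    {
        for (int comm_id = 0; comm_id < MAX_COMMS; comm_id++) {
            assign_context(comm_id);
            /* The simplified macro is now equivalent to this plain read. */
            printf("comm %d -> ctxt %d\n", comm_id, comm_to_context[comm_id]);
        }
        return 0;
    }

With 3 requested contexts and 8 communicators this prints context ids
0 1 2 0 1 2 0 1: the first three communicators get dedicated contexts, and
later ones re-use them in the round-robin fashion described in the README's
notes on performance.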