Commit fe25097

mtl/ofi: Print descriptive error message on modex failure
With MTLs, there's no "other transport" when the remote side does not have an active NIC, so we should print a useful error message when the modex failed (indicating lack of a NIC on the remote side).

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
1 parent 352b667 commit fe25097

2 files changed, +14 -3 lines changed

ompi/mca/mtl/ofi/help-mtl-ofi.txt

Lines changed: 10 additions & 0 deletions
@@ -65,3 +65,13 @@ are more threads than the available contexts.
 
 Local host: %s
 Location: %s:%d
+
+[modex failed]
+The OFI MTL was not able to find endpoint information for a remote
+endpoint. Most likely, this means that the remote process was unable
+to initialize the Libfabric NIC correctly. This error is not
+recoverable and your application is likely to abort.
+
+Local host: %s
+Remote host: %s
+Error: %s (%d)

ompi/mca/mtl/ofi/mtl_ofi.c

Lines changed: 4 additions & 3 deletions
@@ -98,9 +98,10 @@ ompi_mtl_ofi_add_procs(struct mca_mtl_base_module_t *mtl,
                                   (void**)&ep_name,
                                   &size);
         if (OMPI_SUCCESS != ret) {
-            opal_output_verbose(1, ompi_mtl_base_framework.framework_output,
-                                "%s:%d: modex_recv failed: %d\n",
-                                __FILE__, __LINE__, ret);
+            opal_show_help("help-mtl-ofi.txt", "modex failed",
+                           true, ompi_process_info.nodename,
+                           procs[i]->super.proc_hostname,
+                           opal_strerror(ret), ret);
             goto bail;
         }
         memcpy(&ep_names[i*namelen], ep_name, namelen);
