Description
There seem to be some bugs in Open MPI 4.1.4 when it is used with Slurm 17.11.7 and Intel Omni-Path.
My CPUs are Intel Xeon E5-2695 v4 (Broadwell nodes with 15 GB /scratch).
When I use srun --mpi=pmi2 -n 37 -p bdw ./my_program,
the following errors appear:
[bdw-0165:11363] *** An error occurred in MPI_Init
[bdw-0165:11363] *** reported by process [1958739981,3]
[bdw-0165:11363] *** on a NULL communicator
[bdw-0165:11363] *** Unknown error
[bdw-0165:11363] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[bdw-0165:11363] *** and potentially your MPI job)
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
PML add procs failed
--> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
[bdw-0165:11364] *** An error occurred in MPI_Init
[bdw-0165:11364] *** reported by process [18446744071373324301,4]
[bdw-0165:11364] *** on a NULL communicator
[bdw-0165:11364] *** Unknown error
[bdw-0165:11364] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[bdw-0165:11364] *** and potentially your MPI job)
[bdw-0165:11366] *** An error occurred in MPI_Init
[bdw-0165:11366] *** reported by process [18446744071373324301,6]
[bdw-0165:11366] *** on a NULL communicator
[bdw-0165:11366] *** Unknown error
[bdw-0165:11366] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[bdw-0165:11366] *** and potentially your MPI job)
[bdw-0165:11368] *** An error occurred in MPI_Init
[bdw-0165:11368] *** reported by process [18446744071373324301,8]
[bdw-0165:11368] *** on a NULL communicator
[bdw-0165:11368] *** Unknown error
[bdw-0165:11368] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[bdw-0165:11368] *** and potentially your MPI job)
bdw-0165.11364hfi_userinit_internal: assign_context command failed: Device or resource busy
bdw-0165.11364hfp_gen1_context_open: hfi_userinit_internal: failed, trying again (1/3)
bdw-0165.11364hfi_userinit_internal: assign_context command failed: Device or resource busy
bdw-0165.11364hfp_gen1_context_open: hfi_userinit_internal: failed, trying again (2/3)
bdw-0165.11364hfi_userinit_internal: assign_context command failed: Device or resource busy
bdw-0165.11364hfp_gen1_context_open: hfi_userinit_internal: failed, trying again (3/3)
bdw-0165.11364hfi_userinit_internal: assign_context command failed: Device or resource busy
slurmstepd: error: *** STEP 2520256.13 ON bdw-0165 CANCELLED AT 2022-07-23T18:15:04 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: bdw-0165: tasks 0-4,6,8: Exited with exit code 1
srun: error: bdw-0165: tasks 5,7,9-18: Killed
srun: error: bdw-0175: tasks 19-36: Killed
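Since the failure happens inside MPI_Init itself, before any application code runs, I would expect any MPI program to hit it under the same conditions. For reference, this is the kind of minimal init/finalize test I have in mind (a hypothetical sketch, not the actual ./my_program):

/* minimal_mpi_test.c - bare MPI init/finalize test (hypothetical reproducer) */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;

    /* The reported failure ("PML add procs failed") occurs inside MPI_Init. */
    MPI_Init(&argc, &argv);

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("rank %d of %d initialized\n", rank, size);

    MPI_Finalize();
    return 0;
}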
I searched the internet and found that someone had already run into a similar problem using Broadwell nodes, InfiniBand, Slurm 18.08.3, and Open MPI. Here is the link: https://bugs.schedmd.com/show_bug.cgi?id=5956.
The last message in that ticket says they were contacting the Open MPI team to resolve the problem, but I doubt the bug has actually been fixed.
It does seem to work with 36 processes. I think that is because with 36 or fewer processes all ranks fit on a single node: each Broadwell node has 36 cores (two 18-core Intel Xeon E5-2695 v4 sockets), so only 37 or more processes spill onto a second node.
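If that is right, the difference should show up when comparing a run that stays on one node with one that crosses the node boundary, something like the following (assuming each bdw node exposes 36 task slots; -N/--nodes is only used here to force the placement):

srun --mpi=pmi2 -n 36 -N 1 -p bdw ./my_program    # all ranks fit on one node: works
srun --mpi=pmi2 -n 37 -p bdw ./my_program         # ranks spread over two nodes: fails in MPI_Init as shown above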