Description
There seem to be some bugs in Open MPI 4.1.4 when it is used with Slurm 17.11.7 and Intel Omni-Path.
My CPUs are Intel Xeon E5-2695 v4 (Broadwell nodes with 15 GB /scratch).
When I use srun --mpi=pmi2 -n 37 -p bdw ./my_program,
the following errors appear:
[bdw-0165:11363] *** An error occurred in MPI_Init
[bdw-0165:11363] *** reported by process [1958739981,3]
[bdw-0165:11363] *** on a NULL communicator
[bdw-0165:11363] *** Unknown error
[bdw-0165:11363] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[bdw-0165:11363] *** and potentially your MPI job)
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
PML add procs failed
--> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
[bdw-0165:11364] *** An error occurred in MPI_Init
[bdw-0165:11364] *** reported by process [18446744071373324301,4]
[bdw-0165:11364] *** on a NULL communicator
[bdw-0165:11364] *** Unknown error
[bdw-0165:11364] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[bdw-0165:11364] *** and potentially your MPI job)
[bdw-0165:11366] *** An error occurred in MPI_Init
[bdw-0165:11366] *** reported by process [18446744071373324301,6]
[bdw-0165:11366] *** on a NULL communicator
[bdw-0165:11366] *** Unknown error
[bdw-0165:11366] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[bdw-0165:11366] *** and potentially your MPI job)
[bdw-0165:11368] *** An error occurred in MPI_Init
[bdw-0165:11368] *** reported by process [18446744071373324301,8]
[bdw-0165:11368] *** on a NULL communicator
[bdw-0165:11368] *** Unknown error
[bdw-0165:11368] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[bdw-0165:11368] *** and potentially your MPI job)
bdw-0165.11364hfi_userinit_internal: assign_context command failed: Device or resource busy
bdw-0165.11364hfp_gen1_context_open: hfi_userinit_internal: failed, trying again (1/3)
bdw-0165.11364hfi_userinit_internal: assign_context command failed: Device or resource busy
bdw-0165.11364hfp_gen1_context_open: hfi_userinit_internal: failed, trying again (2/3)
bdw-0165.11364hfi_userinit_internal: assign_context command failed: Device or resource busy
bdw-0165.11364hfp_gen1_context_open: hfi_userinit_internal: failed, trying again (3/3)
bdw-0165.11364hfi_userinit_internal: assign_context command failed: Device or resource busy
slurmstepd: error: *** STEP 2520256.13 ON bdw-0165 CANCELLED AT 2022-07-23T18:15:04 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: bdw-0165: tasks 0-4,6,8: Exited with exit code 1
srun: error: bdw-0165: tasks 5,7,9-18: Killed
srun: error: bdw-0175: tasks 19-36: Killed
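Since the failure happens inside MPI_Init itself, before any application code runs, I would expect any MPI program to hit it under the same conditions. For reference, this is the kind of minimal init/finalize test I have in mind (a hypothetical sketch, not the actual ./my_program):

/* minimal_mpi_test.c - bare MPI init/finalize test (hypothetical reproducer) */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;

    /* The reported failure ("PML add procs failed") occurs inside MPI_Init. */
    MPI_Init(&argc, &argv);

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("rank %d of %d initialized\n", rank, size);

    MPI_Finalize();
    return 0;
}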
I searched the internet and found that someone had already run into a similar problem using Broadwell nodes, InfiniBand, Slurm 18.08.3, and Open MPI. Here is the link: https://bugs.schedmd.com/show_bug.cgi?id=5956.
The last message in that ticket says they were contacting the Open MPI team to resolve the problem, but I doubt the bug has actually been fixed.
It does seem to work with 36 processes. I think that is because with 36 or fewer processes all ranks fit on a single node: each Broadwell node has 36 cores (two 18-core Intel Xeon E5-2695 v4 sockets), so only 37 or more processes spill onto a second node.
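If that is right, the difference should show up when comparing a run that stays on one node with one that crosses the node boundary, something like the following (assuming each bdw node exposes 36 task slots; -N/--nodes is only used here to force the placement):

srun --mpi=pmi2 -n 36 -N 1 -p bdw ./my_program    # all ranks fit on one node: works
srun --mpi=pmi2 -n 37 -p bdw ./my_program         # ranks spread over two nodes: fails in MPI_Init as shown above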