Description
Background information
What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)
branch main, hash 4038fd6, with changes to ompi/mca/coll/base/coll_base_comm_select.c according to this
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
from a git clone, installing the openpmix and prrte submodules separately:
for openpmix (installed hash: 255b55dc32347889664877a6c147a88315ab9e75):
git clone https://github.com/openpmix/openpmix.git
cd openpmix/
git submodule update --init
./autogen.pl
./configure --prefix=$OPENPMIX_PREFIX --disable-debug --disable-assertions
make install
for prrte (installed hash: a809f3e68a6bd676616daaa21d388e92bad30bf8):
git clone https://github.com/openpmix/prrte.git
cd prrte
git submodule update --init
./autogen.pl
./configure --prefix=$PRRTE_PREFIX --disable-debug --disable-assertions --with-pmix=$OPENPMIX_PREFIX --without-slurm --without-pbs
make install
lastly, I compiled Open MPI from source:
git clone https://github.com/open-mpi/ompi.git
cd ompi
git submodule update --init config/oac 3rd-party/pympistandard
./autogen.pl --no-3rdparty openpmix,prrte
./configure --prefix=$OMPI_PREFIX --disable-debug --disable-assertions --with-libevent=external --with-hwloc=external --with-pmix=$OPENPMIX_PREFIX --with-prrte=$PRRTE_PREFIX --with-ucx --with-ft=ulfm
make install
If you are building/installing from a git clone, please copy-n-paste the output from git submodule status
-08e41ed5629b51832f5708181af6d89218c7a74e 3rd-party/openpmix
-30cadc6746ebddd69ea42ca78b964398f782e4e3 3rd-party/prrte
6032f68dd9636b48977f59e986acc01a746593a6 3rd-party/pympistandard (heads/develop)
dfff67569fb72dbf8d73a1dcf74d091dad93f71b config/oac (dfff675)
Please describe the system on which you are running
- Operating system/version: RHEL 9.4
- Computer hardware: 2x AMD EPYC 7543 32-core processors (2 threads per core), for a maximum of 128 processes per node
- Network type: single node
Details of the problem
I am trying to test the combination of dynamic process management and ULFM, but it seems that faults in processes created by MPI_Comm_spawn cause the whole execution to stop instead of raising an error that can be handled through ULFM. Here is a small example that shows the erroneous behaviour:
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <signal.h>
#include <unistd.h>
#include <mpi.h>
#include <mpi-ext.h>

#define SPAWN_SIZE 6

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
    MPI_Comm_set_errhandler(MPI_COMM_SELF, MPI_ERRORS_RETURN);

    int size, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Comm parent_comm;
    MPI_Comm_get_parent(&parent_comm);

    if (parent_comm == MPI_COMM_NULL)
    {
        /* Parent side: spawn SPAWN_SIZE copies of this binary. */
        MPI_Comm intercomm;
        printf("[%d] First execution, spawning new processes\n", rank);
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, SPAWN_SIZE, MPI_INFO_NULL, 0,
                       MPI_COMM_WORLD, &intercomm, MPI_ERRCODES_IGNORE);
        printf("[%d] After spawn\n", rank);
        fflush(stdout);
        MPI_Barrier(intercomm);
    }
    else
    {
        /* Spawned side: ranks 4 and 5 kill themselves before the barrier. */
        printf("(%d) created, waiting to start\n", rank);
        if (rank > 3)
        {
            printf("(%d) committing suicide\n", rank);
            fflush(stdout);
            raise(SIGINT);
        }

        int rc;
        printf("(%d) About to barrier\n", rank);
        fflush(stdout);
        rc = MPI_Barrier(MPI_COMM_WORLD);

        /* Agree on whether the barrier succeeded on every surviving rank. */
        int flag = (rc == MPI_SUCCESS);
        MPIX_Comm_agree(MPI_COMM_WORLD, &flag);
        printf("(%d) rc -> %d\n", rank, rc);
        fflush(stdout);

        if (!flag)
        {
            /* At least one rank saw a failure: shrink away the dead processes. */
            printf("(%d) repairing communicator\n", rank);
            MPI_Comm new_world;
            MPIX_Comm_shrink(MPI_COMM_WORLD, &new_world);
            MPI_Comm_size(new_world, &size);
            MPI_Comm_rank(new_world, &rank);
            MPI_Barrier(new_world);
            printf("(%d) repaired communicator size is %d\n", rank, size);
            MPI_Comm_free(&new_world);
        }
        else
        {
            printf("(%d) no failure detected, nothing to repair\n", rank);
        }
        fflush(stdout);
        MPI_Barrier(parent_comm);
    }

    MPI_Finalize();
    return 0;
}
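For reference, this is roughly how I expect the failed barrier to surface on a surviving spawned rank. This is a sketch only, not part of the reproducer above: barrier_reports_peer_failure is just an illustrative helper name, while MPI_Error_class, MPIX_ERR_PROC_FAILED and MPIX_Comm_failure_ack are the standard MPI/ULFM calls from mpi.h and mpi-ext.h.

```c
/* Sketch, not part of the reproducer: how I expect a failed collective to
 * surface on a surviving spawned rank. */
#include <mpi.h>
#include <mpi-ext.h>

static int barrier_reports_peer_failure(MPI_Comm comm)
{
    int rc = MPI_Barrier(comm);
    if (rc == MPI_SUCCESS)
        return 0;                        /* no failure observed */

    int eclass;
    MPI_Error_class(rc, &eclass);
    if (eclass == MPIX_ERR_PROC_FAILED) {
        MPIX_Comm_failure_ack(comm);     /* acknowledge the dead peers */
        return 1;                        /* a peer in comm has failed */
    }
    return -1;                           /* some other error */
}
```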
I am expecting the processes spawned by the MPI_Comm_spawn call to reach the line printf("(%d) rc -> %d\n", rank, rc); regardless of the value of rc. However, if I compile the code with -g and run it with mpirun --with-ft ulfm -np 2 ./spawn_kill, I get this output:
[1] First execution, spawning new processes
[0] First execution, spawning new processes
[1] After spawn
[0] After spawn
(1) created, waiting to start
(1) About to barrier
(3) created, waiting to start
(3) About to barrier
(5) created, waiting to start
(5) committing suicide
(0) created, waiting to start
(0) About to barrier
(2) created, waiting to start
(2) About to barrier
(4) created, waiting to start
(4) committing suicide
[awnode05.e4red:1003300] OPAL ERROR: (null) in file errhandler/errhandler.c at line 444
--------------------------------------------------------------------------
prterun noticed that process rank 4 with PID 1003308 on node awnode05 exited on
signal 2 (Interrupt).
--------------------------------------------------------------------------
If I run again with --mca mpi_ft_verbose 1 added, I get this output instead:
[1] First execution, spawning new processes
[0] First execution, spawning new processes
[1] After spawn
[0] After spawn
(4) created, waiting to start
(4) committing suicide
(2) created, waiting to start
(2) About to barrier
(3) created, waiting to start
(3) About to barrier
(5) created, waiting to start
(5) committing suicide
(0) created, waiting to start
(0) About to barrier
(1) created, waiting to start
(1) About to barrier
[awnode05.e4red:1003331] [[2153,1],0] ompi: Process [[2153,2],0] failed (state = -200 PMIX_ERR_PROC_TERM_WO_SYNC).
[awnode05.e4red:1003331] OPAL ERROR: (null) in file errhandler/errhandler.c at line 444
--------------------------------------------------------------------------
prterun noticed that process rank 4 with PID 1003339 on node awnode05 exited on
signal 2 (Interrupt).
--------------------------------------------------------------------------
[awnode05.e4red:1003332] [[2153,1],1] ompi: Process [[2153,2],0] failed (state = -200 PMIX_ERR_PROC_TERM_WO_SYNC).
I was expecting the other spawned processes to recognise the fault in their peers and proceed with the shrink, but they do not even complete the barrier that should surface the fault: they never print the error code.
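For context, the recovery pattern I am ultimately aiming for on the spawned side is the usual ULFM revoke-then-shrink sequence, roughly as sketched below. This is untested against this build: recover_world is just an illustrative helper name, and MPIX_Comm_revoke / MPIX_Comm_shrink are the ULFM calls from mpi-ext.h.

```c
/* Sketch (untested against this build): the usual ULFM recovery sequence I
 * would expect the surviving spawned ranks to execute once MPI_Barrier
 * reports MPIX_ERR_PROC_FAILED. */
#include <mpi.h>
#include <mpi-ext.h>

static MPI_Comm recover_world(MPI_Comm comm)
{
    /* Revoke the communicator so peers still blocked in collectives on it
     * get MPIX_ERR_REVOKED instead of hanging forever. */
    MPIX_Comm_revoke(comm);

    /* Build a new communicator containing only the surviving processes. */
    MPI_Comm shrunk;
    MPIX_Comm_shrink(comm, &shrunk);
    return shrunk;
}
```

In the runs above, however, prterun tears the whole job down before the survivors get a chance to revoke or shrink anything.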
Thanks in advance for your time.