ULFM cannot handle faults in MPI_Comm_spawn-ed processes #13325

Open
@Robyroc

Background information

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

branch main, hash 4038fd6, with changes to ompi/mca/coll/base/coll_base_comm_select.c according to this

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

from git clone, installing submodules openpmix and prrte separately:

for openpmix (installed hash: 255b55dc32347889664877a6c147a88315ab9e75):

git clone https://github.com/openpmix/openpmix.git
cd openpmix/
git submodule update --init
./autogen.pl
./configure --prefix=$OPENPMIX_PREFIX --disable-debug --disable-assertions
make install

for prrte (installed hash: a809f3e68a6bd676616daaa21d388e92bad30bf8):

git clone https://github.com/openpmix/prrte.git
cd prrte
git submodule update --init
./autogen.pl
./configure --prefix=$PRRTE_PREFIX --disable-debug --disable-assertions --with-pmix=$OPENPMIX_PREFIX --without-slurm --without-pbs 
make install

Lastly, I compiled Open MPI from source:

git clone https://github.com/open-mpi/ompi.git
cd ompi
git submodule update --init config/oac 3rd-party/pympistandard
./autogen.pl --no-3rdparty openpmix,prrte
./configure --prefix=$OMPI_PREFIX --disable-debug --disable-assertions --with-libevent=external --with-hwloc=external --with-pmix=$OPENPMIX_PREFIX --with-prrte=$PRRTE_PREFIX --with-ucx --with-ft=ulfm
make install

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

-08e41ed5629b51832f5708181af6d89218c7a74e 3rd-party/openpmix
-30cadc6746ebddd69ea42ca78b964398f782e4e3 3rd-party/prrte
6032f68dd9636b48977f59e986acc01a746593a6 3rd-party/pympistandard (heads/develop)
dfff67569fb72dbf8d73a1dcf74d091dad93f71b config/oac (dfff675)

Please describe the system on which you are running

  • Operating system/version: RHEL 9.4
  • Computer hardware: 2x AMD EPYC 7543 32-Core Processor, 2 threads per core, for a total maximum of 128 processes per node
  • Network type: single node

Details of the problem

I am trying to test the combination of dynamic process management and ULFM, but it seems that a fault in a process created by MPI_Comm_spawn stops the whole execution instead of raising an error that can be handled through ULFM. Here is a small example that shows the erroneous behaviour:

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <signal.h>
#include <unistd.h>
#include <mpi.h>
#include <mpi-ext.h>

#define SPAWN_SIZE 6

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
    MPI_Comm_set_errhandler(MPI_COMM_SELF, MPI_ERRORS_RETURN);

    int size, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Comm parent_comm;
    MPI_Comm_get_parent(&parent_comm);

    if(parent_comm == MPI_COMM_NULL)
    {
        MPI_Comm intercomm;
        printf("[%d] First execution, spawning new processes\n", rank);
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, SPAWN_SIZE, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &intercomm, MPI_ERRCODES_IGNORE);
        printf("[%d] After spawn\n", rank);
        fflush(stdout);
        MPI_Barrier(intercomm);
    }
    else
    {
        printf("(%d) created, waiting to start\n", rank);
        
        if (rank > 3)
        {
            printf("(%d) committing suicide\n", rank);
            fflush(stdout);
            raise(SIGINT);
        }

        int rc;
        printf("(%d) About to barrier\n", rank);
        fflush(stdout);
        rc = MPI_Barrier(MPI_COMM_WORLD);
        int flag = (rc == MPI_SUCCESS);
        MPIX_Comm_agree(MPI_COMM_WORLD, &flag);
        printf("(%d) rc -> %d\n", rank, rc);
        fflush(stdout);

        if (!flag)
        {
            printf("(%d) repairing communicator\n", rank);

            MPI_Comm new_world;
            MPIX_Comm_shrink(MPI_COMM_WORLD, &new_world);
            
            MPI_Comm_size(new_world, &size);
            MPI_Comm_rank(new_world, &rank);
            
            MPI_Barrier(new_world);
            printf("(%d) repaired communicator size is %d\n", rank, size);
            
            MPI_Comm_free(&new_world);
        } else {
            /* flag is still true: every rank completed the barrier, so there is nothing to shrink */
            printf("(%d) no failure detected, nothing to repair\n", rank);
        }
        fflush(stdout);
        MPI_Barrier(parent_comm);
    }
    MPI_Finalize();
    return 0;
}

I expect the processes spawned by the MPI_Comm_spawn call to reach the line printf("(%d) rc -> %d\n", rank, rc); regardless of the value of rc. However, if I compile the code with -g and run it with mpirun --with-ft ulfm -np 2 ./spawn_kill, I get this output:

[1] First execution, spawning new processes
[0] First execution, spawning new processes
[1] After spawn
[0] After spawn
(1) created, waiting to start
(1) About to barrier
(3) created, waiting to start
(3) About to barrier
(5) created, waiting to start
(5) committing suicide
(0) created, waiting to start
(0) About to barrier
(2) created, waiting to start
(2) About to barrier
(4) created, waiting to start
(4) committing suicide
[awnode05.e4red:1003300] OPAL ERROR: (null) in file errhandler/errhandler.c at line 444
--------------------------------------------------------------------------

prterun noticed that process rank 4 with PID 1003308 on node awnode05 exited on
signal 2 (Interrupt).
--------------------------------------------------------------------------

If I run adding --mca mpi_ft_verbose 1 I get this output instead:

[1] First execution, spawning new processes
[0] First execution, spawning new processes
[1] After spawn
[0] After spawn
(4) created, waiting to start
(4) committing suicide
(2) created, waiting to start
(2) About to barrier
(3) created, waiting to start
(3) About to barrier
(5) created, waiting to start
(5) committing suicide
(0) created, waiting to start
(0) About to barrier
(1) created, waiting to start
(1) About to barrier
[awnode05.e4red:1003331] [[2153,1],0] ompi: Process [[2153,2],0] failed (state = -200 PMIX_ERR_PROC_TERM_WO_SYNC).
[awnode05.e4red:1003331] OPAL ERROR: (null) in file errhandler/errhandler.c at line 444
--------------------------------------------------------------------------

prterun noticed that process rank 4 with PID 1003339 on node awnode05 exited on
signal 2 (Interrupt).
--------------------------------------------------------------------------
[awnode05.e4red:1003332] [[2153,1],1] ompi: Process [[2153,2],0] failed (state = -200 PMIX_ERR_PROC_TERM_WO_SYNC).

I was expecting the other spawned processes to recognise the fault in their peers and proceed with the shrink, but they never even return from the barrier that should report the fault: none of them prints the error code.

Thanks in advance for your time
