Description
Background information
We found that when using the yalla or ucx PML, the message queue display in DDT no longer works, and we have to fall back to the ob1 PML. This seems to be an acceptable workaround for now, but with newer Open MPI releases removing the openib BTL, does that mean we will need to use TCP/IP for debugging, or perhaps the new, still somewhat experimental uct BTL?
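For reference, the fallback amounts to forcing the ob1 PML through an MCA parameter before launching under DDT, e.g. (illustrative, not copied verbatim from our job scripts):

export OMPI_MCA_pml=ob1

Passing the equivalent --mca pml ob1 option on the mpirun command line works as well.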
What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)
Tested with Open MPI 2.1.1 (patched to work with DDT in general) using the yalla and ob1 PMLs, and with Open MPI 3.1.1 and 3.1.2 using the ucx and ob1 PMLs. Tested with DDT (Arm Forge) 7.1, 18.2, and 18.3.
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
It was compiled from a source tarball.
Please describe the system on which you are running
- Operating system/version: CentOS 7.4
- Computer hardware: x86_64 Skylake SP and Broadwell.
- Network type: InfiniBand (Mellanox ConnectX-5).
Details of the problem
We expect to see what is shown in the first screenshot (taken with ob1), but instead see the second with ucx and yalla.
The test case is a simple MPI deadlock program compiled with mpicc -g deadlock_ring.c -o deadlock_ring. The choice of compiler (we used GCC 5.4.0, GCC 7.3.0, and Intel 2016 Update 4) does not matter.
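To reproduce the screenshots, the program is launched under DDT with at least two ranks (any count >= 2 deadlocks the ring; the 4 below is arbitrary) and the PML selected explicitly, along these lines:

mpirun -np 4 --mca pml ucx ./deadlock_ring      (3.1.x, message queue display broken)
mpirun -np 4 --mca pml yalla ./deadlock_ring    (2.1.1, message queue display broken)
mpirun -np 4 --mca pml ob1 ./deadlock_ring      (message queue display works)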
/******************************************************************************
Complex deadlock bug (loop over all ranks).
Solutions:
MPI_Sendrecv(Arr, N, MPI_INT, rank_next, tag, Arr, N, MPI_INT, rank_prev, tag, MPI_COMM_WORLD, &status);
MPI_Sendrecv_replace(Arr, N, MPI_INT, rank_next, tag, rank_prev, tag, MPI_COMM_WORLD, &status);
******************************************************************************/
#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
// Try 1 and 10000
#define N 10000
int main(int argc, char *argv[])
{
    int numtasks, rank, tag = 0, rank_prev, rank_next;
    int Arr[N];
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("Task %d starting...\n", rank);

    // Neighboring tasks:
    rank_prev = rank - 1;
    rank_next = rank + 1;

    // Imposing periodic boundaries on ranks:
    if (rank_prev < 0)
        rank_prev = numtasks - 1;
    if (rank_next == numtasks)
        rank_next = 0;

    // Every rank blocks in the synchronous send, since the matching
    // receive is only posted afterwards: this is the intended deadlock.
    MPI_Ssend(Arr, N, MPI_INT, rank_next, tag, MPI_COMM_WORLD);
    MPI_Recv(Arr, N, MPI_INT, rank_prev, tag, MPI_COMM_WORLD, &status);

    printf("Finished\n");
    MPI_Finalize();

    return 0;
}
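For completeness, either of the solutions listed in the header comment removes the deadlock; a minimal sketch of the second one (MPI_Sendrecv_replace), replacing the MPI_Ssend/MPI_Recv pair above, would be:

/* Combined send/receive: the library performs the exchange internally,
   so no rank sits in a bare synchronous send waiting for a receive. */
MPI_Sendrecv_replace(Arr, N, MPI_INT, rank_next, tag,
                     rank_prev, tag, MPI_COMM_WORLD, &status);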