v4.1: UCX onesided crash (ibm/onesided/1sided) #10410

@jjhursey

Description

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

v4.1.x branch

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

git clone of the v4.1.x branch at 0987319

Please describe the system on which you are running

  • Operating system/version: RHEL 8.4
  • Computer hardware: ppc64le
  • Network type: Infiniband with mlx5 cards

Details of the problem

MTT found this problem while testing the v4.1.x branch and running the ibm/onesided/1sided test. I'm using an older UCX (1.11.2) because I have an older MOFED (MLNX_OFED_LINUX-4.9-4.1.1.1) and that is what's supported. So this might be a UCX issue, but I'm not sure.

Open MPI was configured with:

./configure --enable-mpirun-prefix-by-default --disable-dlopen --enable-io-romio \
   --disable-io-ompio --enable-mpi1-compatibility \
   --with-ucx=/opt/ucx-1.11.2/ --without-hcoll \
   --enable-debug --enable-picky

The test case was run with 3 nodes and 2 processes per node:

mpirun --host f5n18:20,f5n17:20,f5n16:20 --npernode 2 -mca pml ucx -mca osc ucx,sm -mca btl ^openib ./1sided

The test runs for a while, but in phase 8 one or more of the processes will crash:

seed value: 1610634988
[mesgsize 5976]
phase 5 part 1 (loop(i){fence;get;fence}) c-int chk [st:fence]
phase 5 part 1 (loop(i){fence;get;fence}) c-int chk [st:post]
  iter: 61525, time: 3.000061 sec
phase 5 part 1 (loop(i){fence;get;fence}) c-int chk [st:test]
  iter: 61133, time: 3.000064 sec
...
phase 8 part 3 (fence;loop(i){accum};fence) nc-int chk [st:fence]
[f5n16:2646755:0:2646755] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xfa000000fa0)
==== backtrace (tid:2646755) ====
 0  /smpi_dev/jjhursey/local/ucx-1.11.2/lib/libucs.so.0(ucs_handle_error+0x324) [0x7fff825273b4]
 1  /smpi_dev/jjhursey/local/ucx-1.11.2/lib/libucs.so.0(+0x37560) [0x7fff82527560]
 2  /smpi_dev/jjhursey/local/ucx-1.11.2/lib/libucs.so.0(+0x37990) [0x7fff82527990]
 3  linux-vdso64.so.1(__kernel_sigtramp_rt64+0) [0x7fff835e04d8]
 4  [0xfa000000fa0]
 5  /smpi_dev/jjhursey/dev/ompi/install/ompi-v4.1-debug/lib/libmpi_ftw.so.40(PMPI_Win_fence+0x1a0) [0x7fff8325bdc8]
 6  ./1sided() [0x10004538]
 7  ./1sided() [0x10004934]
 8  ./1sided() [0x100049ac]
 9  /lib64/libc.so.6(+0x24c78) [0x7fff82e84c78]
10  /lib64/libc.so.6(__libc_start_main+0xb4) [0x7fff82e84e64]
=================================
[f5n17:1073992:0:1073992] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7d0000008a0)
==== backtrace (tid:1073992) ====
 0  /smpi_dev/jjhursey/local/ucx-1.11.2/lib/libucs.so.0(ucs_handle_error+0x324) [0x7fffb73073b4]
 1  /smpi_dev/jjhursey/local/ucx-1.11.2/lib/libucs.so.0(+0x37560) [0x7fffb7307560]
 2  /smpi_dev/jjhursey/local/ucx-1.11.2/lib/libucs.so.0(+0x37990) [0x7fffb7307990]
 3  linux-vdso64.so.1(__kernel_sigtramp_rt64+0) [0x7fffb83c04d8]
 4  /smpi_dev/jjhursey/local/ucx-1.11.2/lib/ucx/libuct_ib.so.0(uct_rc_iface_flush+0xb0) [0x7fffb4401e00]
 5  /smpi_dev/jjhursey/local/ucx-1.11.2/lib/libucp.so.0(+0x6a4e4) [0x7fffb743a4e4]
 6  /smpi_dev/jjhursey/local/ucx-1.11.2/lib/libucp.so.0(ucp_worker_flush_nbx+0x1c4) [0x7fffb743cd94]
 7  /smpi_dev/jjhursey/local/ucx-1.11.2/lib/libucp.so.0(ucp_worker_flush_nb+0x54) [0x7fffb743cf04]
 8  /smpi_dev/jjhursey/dev/ompi/install/ompi-v4.1-debug/lib/libmpi_ftw.so.40(+0x2e3dc4) [0x7fffb81a3dc4]
 9  /smpi_dev/jjhursey/dev/ompi/install/ompi-v4.1-debug/lib/libmpi_ftw.so.40(ompi_osc_ucx_fence+0xcc) [0x7fffb81a41c4]
10  /smpi_dev/jjhursey/dev/ompi/install/ompi-v4.1-debug/lib/libmpi_ftw.so.40(PMPI_Win_fence+0x1a0) [0x7fffb803bdc8]
11  ./1sided() [0x10004538]
12  ./1sided() [0x10004934]
13  ./1sided() [0x100049ac]
14  /lib64/libc.so.6(+0x24c78) [0x7fffb7c64c78]
15  /lib64/libc.so.6(__libc_start_main+0xb4) [0x7fffb7c64e64]
=================================
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 4 with PID 2646755 on node f5n16 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
2 total processes killed (some possibly by mpirun during cleanup)
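
For reference, the phase-8 label (fence;loop(i){accum};fence) suggests an access pattern roughly like the sketch below. This is not the actual 1sided.c source, just a simplified illustration of the pattern that crashes; the buffer size and iteration count are placeholders.

/* Sketch of the phase 8 pattern: fence; loop(i){accum}; fence.
 * Hypothetical simplification of what the test label describes,
 * not the real ibm/onesided/1sided.c code. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, nprocs, i;
    const int count = 1024;   /* placeholder message size */
    int *winbuf, *locbuf;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    MPI_Win_allocate(count * sizeof(int), sizeof(int), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &winbuf, &win);
    locbuf = malloc(count * sizeof(int));
    for (i = 0; i < count; i++) locbuf[i] = rank;

    MPI_Win_fence(0, win);
    for (i = 0; i < 1000; i++) {
        int target = (rank + 1) % nprocs;
        MPI_Accumulate(locbuf, count, MPI_INT, target, 0, count, MPI_INT,
                       MPI_SUM, win);
    }
    MPI_Win_fence(0, win);    /* the reported segfault is under this fence */

    free(locbuf);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}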

The stack I got out of gdb is:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007fffb1f21e00 in uct_ep_flush (comp=0x0, flags=0, ep=0x10014d454e0)
    at /opt/ucx-1.11.2/src/uct/api/uct.h:3050
3050	    return ep->iface->ops.ep_flush(ep, flags, comp);
[Current thread is 1 (Thread 0x7fffb5f5e7f0 (LWP 2635823))]
Missing separate debuginfos, use: yum debuginfo-install glibc-2.28-151.el8.ppc64le libblkid-2.32.1-27.el8.ppc64le libevent-2.1.8-5.el8.ppc64le libgcc-8.4.1-1.el8.ppc64le libibverbs-41mlnx1-OFED.4.9.3.0.0.49411.ppc64le libmlx4-41mlnx1-OFED.4.7.3.0.3.49411.ppc64le libmlx5-41mlnx1-OFED.4.9.0.1.2.49411.ppc64le libmount-2.32.1-27.el8.ppc64le libnl3-3.5.0-1.el8.ppc64le librdmacm-41mlnx1-OFED.4.7.3.0.6.49411.ppc64le libselinux-2.9-5.el8.ppc64le libuuid-2.32.1-27.el8.ppc64le numactl-libs-2.0.12-11.el8.ppc64le openssl-libs-1.1.1g-15.el8_3.ppc64le pcre2-10.32-2.el8.ppc64le systemd-libs-239-45.el8_4.3.ppc64le zlib-1.2.11-17.el8.ppc64le
(gdb) bt
#0  0x00007fffb1f21e00 in uct_ep_flush (comp=0x0, flags=0, ep=0x10014d454e0)
    at /opt/ucx-1.11.2/src/uct/api/uct.h:3050
#1  uct_rc_iface_flush (tl_iface=0x10014c3b4b0, flags=<optimized out>, comp=<optimized out>) at rc/base/rc_iface.c:294
#2  0x00007fffb4f5a4e4 in uct_iface_flush (comp=0x0, flags=0, iface=<optimized out>)
    at /opt/ucx-1.11.2/src/uct/api/uct.h:2627
#3  ucp_worker_flush_check (worker=0x10014bcc150) at rma/flush.c:411
#4  0x00007fffb4f5cd94 in ucp_worker_flush_nbx_internal (param=0x7fffedc528f0, worker=0x10014bcc150) at rma/flush.c:554
#5  ucp_worker_flush_nbx (worker=0x10014bcc150, param=0x7fffedc528f0) at rma/flush.c:596
#6  0x00007fffb4f5cf04 in ucp_worker_flush_nb (worker=<optimized out>, flags=<optimized out>, cb=<optimized out>) at rma/flush.c:586
#7  0x00007fffb5cc3dc4 in opal_common_ucx_worker_flush (worker=0x10014bcc150) at ../../../../opal/mca/common/ucx/common_ucx.h:179
#8  0x00007fffb5cc41c4 in ompi_osc_ucx_fence (assert=0, win=0x10014b5cb60) at osc_ucx_active_target.c:77
#9  0x00007fffb5b5bdc8 in PMPI_Win_fence (assert=0, win=0x10014b5cb60) at pwin_fence.c:60
#10 0x0000000010004538 in main_test_fn (comm=0x7fffb5ec39a8 <ompi_mpi_comm_world>, tid=1) at 1sided.c:702
#11 0x0000000010004934 in runtest (comm=0x7fffb5ec39a8 <ompi_mpi_comm_world>, tid=1) at 1sided.c:762
#12 0x00000000100049ac in main () at 1sided.c:772
(gdb) l
3045	 *                               upon completion of these operations.
3046	 */
3047	UCT_INLINE_API ucs_status_t uct_ep_flush(uct_ep_h ep, unsigned flags,
3048	                                         uct_completion_t *comp)
3049	{
3050	    return ep->iface->ops.ep_flush(ep, flags, comp);
3051	}
3052	
3053	
3054	/**
(gdb) p ep
$1 = (uct_ep_h) 0x10014d454e0
(gdb) p ep->iface
$2 = (uct_iface_h) 0x138800001388
(gdb) p ep->iface->ops
Cannot access memory at address 0x138800001388
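
So at the time of the crash, uct_ep_flush() is dereferencing an endpoint whose iface pointer (0x138800001388) is bogus, which suggests the uct endpoint is stale or has been corrupted by the time the fence flushes the worker. For context, frames #7/#8 (opal_common_ucx_worker_flush called from ompi_osc_ucx_fence) are a blocking worker flush; the usual UCP idiom for that is roughly the sketch below. This is not the literal common_ucx.h code, just the standard flush-and-progress pattern the backtrace points at.

/* Rough sketch of a blocking UCP worker flush, i.e. the kind of
 * flush-and-progress loop that frames #7/#8 point at
 * (opal_common_ucx_worker_flush in common_ucx.h).  Illustrative only,
 * not the literal Open MPI code. */
#include <ucp/api/ucp.h>

static void flush_cb(void *request, ucs_status_t status)
{
    (void)request;
    (void)status;   /* completion is detected by polling below */
}

static ucs_status_t worker_flush_blocking(ucp_worker_h worker)
{
    /* In the gdb backtrace above, the crash happens inside this call,
     * while UCP flushes the underlying uct ifaces/endpoints and hits an
     * ep whose iface pointer is garbage. */
    void *req = ucp_worker_flush_nb(worker, 0, flush_cb);

    if (req == NULL) {
        return UCS_OK;                    /* flush completed immediately */
    }
    if (UCS_PTR_IS_ERR(req)) {
        return UCS_PTR_STATUS(req);       /* flush could not be started */
    }

    /* Progress the worker until the flush request completes. */
    ucs_status_t status;
    do {
        ucp_worker_progress(worker);
        status = ucp_request_check_status(req);
    } while (status == UCS_INPROGRESS);

    ucp_request_free(req);
    return status;
}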

MTT hit a slightly different failure signature than my manual run above (note that the MTT build was configured with --disable-debug):

   phase 8 part 2 (fence;loop(i){accum};fence) c-int nochk [st:lock]
  iter: 1000, time: 0.016633 sec
   phase 8 part 3 (fence;loop(i){accum};fence) nc-int chk [st:fence]
[1652847615.199257] [gnu-ompi-mtt-cn-1:57362:0]        rma_send.c:277  UCX  ERROR cannot use a remote key on a different endpoint than it was unpacked on
[1652847615.199260] [gnu-ompi-mtt-cn-1:57363:0]        rma_send.c:277  UCX  ERROR cannot use a remote key on a different endpoint than it was unpacked on
[gnu-ompi-mtt-cn-1:57362] *** An error occurred in MPI_Accumulate
   [gnu-ompi-mtt-cn-1:57362] *** reported by process [1298268161,2]
   [gnu-ompi-mtt-cn-1:57362] *** on win ucx window 3
   [gnu-ompi-mtt-cn-1:57362] *** MPI_ERR_OTHER: known error not in list
   [gnu-ompi-mtt-cn-1:57362] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
   [gnu-ompi-mtt-cn-1:57362] ***    and potentially your MPI job)
[gnu-ompi-mtt-cn-0:57755:0:57755] ib_mlx5_log.c:174  Remote access on mlx5_0:1/IB (synd 0x13 vend 0x88 hw_synd 0/0)
   [gnu-ompi-mtt-cn-0:57755:0:57755] ib_mlx5_log.c:174  RC QP 0x19921 wqe[235]: CSWAP s-- [rva 0x10032a51760 rkey 0x2400] [cmp 0 swap 4294967296] [va 0x7fff9c17fd78 len 8 lkey 0x9466e] [rqpn 0x384c7 dlid=13 sl=0 port=1 src_path_bits=0]
   ==== backtrace (tid:  57755) ====
    0  /opt/ucx-1.11.2/lib/libucs.so.0(ucs_handle_error+0x324) [0x7fffb89373b4]
    1  /opt/ucx-1.11.2/lib/libucs.so.0(ucs_fatal_error_message+0x118) [0x7fffb8932258]
    2  /opt/ucx-1.11.2/lib/libucs.so.0(ucs_log_default_handler+0x1388) [0x7fffb8939768]
    3  /opt/ucx-1.11.2/lib/libucs.so.0(ucs_log_dispatch+0xc0) [0x7fffb8939a30]
    4  /opt/ucx-1.11.2/lib/ucx/libuct_ib.so.0(uct_ib_mlx5_completion_with_err+0x624) [0x7fffb5e4a134]
    5  /opt/ucx-1.11.2/lib/ucx/libuct_ib.so.0(+0x4ee0c) [0x7fffb5e6ee0c]
    6  /opt/ucx-1.11.2/lib/ucx/libuct_ib.so.0(uct_ib_mlx5_check_completion+0xc4) [0x7fffb5e4af64]
    7  /opt/ucx-1.11.2/lib/ucx/libuct_ib.so.0(+0x51e1c) [0x7fffb5e71e1c]
    8  /opt/ucx-1.11.2/lib/libucp.so.0(ucp_worker_progress+0x64) [0x7fffb8a4c074]
    9  /opt/mtt_scratch/ompi-4.1.x_gcc/installs/zyga/install/lib/libmpi_ftw.so.40(+0x2d98cc) [0x7fffb97f98cc]
   10  /opt/mtt_scratch/ompi-4.1.x_gcc/installs/zyga/install/lib/libmpi_ftw.so.40(+0x2d9a10) [0x7fffb97f9a10]
   11  /opt/mtt_scratch/ompi-4.1.x_gcc/installs/zyga/install/lib/libmpi_ftw.so.40(+0x2d9aa0) [0x7fffb97f9aa0]
   12  /opt/mtt_scratch/ompi-4.1.x_gcc/installs/zyga/install/lib/libmpi_ftw.so.40(+0x2dac3c) [0x7fffb97fac3c]
   13  /opt/mtt_scratch/ompi-4.1.x_gcc/installs/zyga/install/lib/libmpi_ftw.so.40(ompi_osc_ucx_accumulate+0xe0) [0x7fffb97fba88]
   14  /opt/mtt_scratch/ompi-4.1.x_gcc/installs/zyga/install/lib/libmpi_ftw.so.40(PMPI_Accumulate+0x5cc) [0x7fffb9695ba8]
   15  onesided/1sided() [0x10002f38]
   16  onesided/1sided() [0x1000442c]
   17  onesided/1sided() [0x10004914]
   18  onesided/1sided() [0x1000498c]
   19  /lib64/libc.so.6(+0x24c78) [0x7fffb92c4c78]
   20  /lib64/libc.so.6(__libc_start_main+0xb4) [0x7fffb92c4e64]
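
The MTT failure mode ("cannot use a remote key on a different endpoint than it was unpacked on") is UCX enforcing its rkey contract: a ucp_rkey_h produced by ucp_ep_rkey_unpack() is bound to the endpoint it was unpacked on and must not be used for RMA through a different endpoint. A minimal sketch of that contract is below (illustrative only; if osc/ucx mixed up its per-target endpoints and cached rkeys, this is the error that would fire).

/* Illustrative sketch of the UCP rkey contract behind the MTT error.
 * An rkey unpacked on one endpoint may only be used with that endpoint. */
#include <ucp/api/ucp.h>

void rkey_contract_example(ucp_ep_h ep_a, ucp_ep_h ep_b,
                           const void *packed_rkey,
                           void *src, size_t len, uint64_t remote_addr)
{
    ucp_rkey_h rkey;

    /* The rkey returned here is bound to ep_a ... */
    ucp_ep_rkey_unpack(ep_a, packed_rkey, &rkey);

    /* ... so RMA with it is only valid on ep_a (a real caller would also
     * flush before destroying the rkey): */
    ucp_put_nbi(ep_a, src, len, remote_addr, rkey);    /* OK */

    /* Using the same rkey on a different endpoint is exactly what UCX
     * rejects with "cannot use a remote key on a different endpoint than
     * it was unpacked on" (rma_send.c):
     *
     *     ucp_put_nbi(ep_b, src, len, remote_addr, rkey);   -- wrong
     */
    (void)ep_b;

    ucp_rkey_destroy(rkey);
}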
