Background information
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
v4.1.x branch
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
git clone of the v4.1.x branch at 0987319
Please describe the system on which you are running
- Operating system/version: RHEL 8.4
- Computer hardware: ppc64le
- Network type: Infiniband with mlx5 cards
Details of the problem
MTT found this problem while testing the v4.1.x branch and running the ibm/onesided/1sided test. I'm using an older UCX (1.11.2) because I have an older MOFED (MLNX_OFED_LINUX-4.9-4.1.1.1), and that is the UCX version it supports. So this might be a UCX issue, but I'm not sure.
Open MPI was configured with:
./configure --enable-mpirun-prefix-by-default --disable-dlopen --enable-io-romio \
--disable-io-ompio --enable-mpi1-compatibility \
--with-ucx=/opt/ucx-1.11.2/ --without-hcoll \
--enable-debug --enable-picky
The test case was run with 3 nodes and 2 processes per node:
mpirun --host f5n18:20,f5n17:20,f5n16:20 --npernode 2 -mca pml ucx -mca osc ucx,sm -mca btl ^openib ./1sided
The test runs for a while, but in phase 8 one or more of the processes will crash:
seed value: 1610634988
[mesgsize 5976]
phase 5 part 1 (loop(i){fence;get;fence}) c-int chk [st:fence]
phase 5 part 1 (loop(i){fence;get;fence}) c-int chk [st:post]
iter: 61525, time: 3.000061 sec
phase 5 part 1 (loop(i){fence;get;fence}) c-int chk [st:test]
iter: 61133, time: 3.000064 sec
...
phase 8 part 3 (fence;loop(i){accum};fence) nc-int chk [st:fence]
[f5n16:2646755:0:2646755] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xfa000000fa0)
==== backtrace (tid:2646755) ====
0 /smpi_dev/jjhursey/local/ucx-1.11.2/lib/libucs.so.0(ucs_handle_error+0x324) [0x7fff825273b4]
1 /smpi_dev/jjhursey/local/ucx-1.11.2/lib/libucs.so.0(+0x37560) [0x7fff82527560]
2 /smpi_dev/jjhursey/local/ucx-1.11.2/lib/libucs.so.0(+0x37990) [0x7fff82527990]
3 linux-vdso64.so.1(__kernel_sigtramp_rt64+0) [0x7fff835e04d8]
4 [0xfa000000fa0]
5 /smpi_dev/jjhursey/dev/ompi/install/ompi-v4.1-debug/lib/libmpi_ftw.so.40(PMPI_Win_fence+0x1a0) [0x7fff8325bdc8]
6 ./1sided() [0x10004538]
7 ./1sided() [0x10004934]
8 ./1sided() [0x100049ac]
9 /lib64/libc.so.6(+0x24c78) [0x7fff82e84c78]
10 /lib64/libc.so.6(__libc_start_main+0xb4) [0x7fff82e84e64]
=================================
[f5n17:1073992:0:1073992] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7d0000008a0)
==== backtrace (tid:1073992) ====
0 /smpi_dev/jjhursey/local/ucx-1.11.2/lib/libucs.so.0(ucs_handle_error+0x324) [0x7fffb73073b4]
1 /smpi_dev/jjhursey/local/ucx-1.11.2/lib/libucs.so.0(+0x37560) [0x7fffb7307560]
2 /smpi_dev/jjhursey/local/ucx-1.11.2/lib/libucs.so.0(+0x37990) [0x7fffb7307990]
3 linux-vdso64.so.1(__kernel_sigtramp_rt64+0) [0x7fffb83c04d8]
4 /smpi_dev/jjhursey/local/ucx-1.11.2/lib/ucx/libuct_ib.so.0(uct_rc_iface_flush+0xb0) [0x7fffb4401e00]
5 /smpi_dev/jjhursey/local/ucx-1.11.2/lib/libucp.so.0(+0x6a4e4) [0x7fffb743a4e4]
6 /smpi_dev/jjhursey/local/ucx-1.11.2/lib/libucp.so.0(ucp_worker_flush_nbx+0x1c4) [0x7fffb743cd94]
7 /smpi_dev/jjhursey/local/ucx-1.11.2/lib/libucp.so.0(ucp_worker_flush_nb+0x54) [0x7fffb743cf04]
8 /smpi_dev/jjhursey/dev/ompi/install/ompi-v4.1-debug/lib/libmpi_ftw.so.40(+0x2e3dc4) [0x7fffb81a3dc4]
9 /smpi_dev/jjhursey/dev/ompi/install/ompi-v4.1-debug/lib/libmpi_ftw.so.40(ompi_osc_ucx_fence+0xcc) [0x7fffb81a41c4]
10 /smpi_dev/jjhursey/dev/ompi/install/ompi-v4.1-debug/lib/libmpi_ftw.so.40(PMPI_Win_fence+0x1a0) [0x7fffb803bdc8]
11 ./1sided() [0x10004538]
12 ./1sided() [0x10004934]
13 ./1sided() [0x100049ac]
14 /lib64/libc.so.6(+0x24c78) [0x7fffb7c64c78]
15 /lib64/libc.so.6(__libc_start_main+0xb4) [0x7fffb7c64e64]
=================================
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 4 with PID 2646755 on node f5n16 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
2 total processes killed (some possibly by mpirun during cleanup)
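For reference, the phase 8 pattern named in the output above (fence;loop(i){accum};fence) boils down to something like the sketch below. This is only an illustration of the access pattern with made-up names and counts, not the actual ibm/onesided/1sided source:

```c
#include <mpi.h>
#include <stdlib.h>

/* Minimal sketch of the "fence;loop(i){accum};fence" access pattern from the
 * failing phase.  Hypothetical buffers/counts; not the real 1sided test. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int count = 5976 / sizeof(int);   /* roughly the reported mesgsize */
    int *winbuf = calloc(count, sizeof(int));
    int *src    = calloc(count, sizeof(int));

    MPI_Win win;
    MPI_Win_create(winbuf, count * sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    int target = (rank + 1) % size;

    MPI_Win_fence(0, win);
    for (int i = 0; i < 1000; i++) {
        MPI_Accumulate(src, count, MPI_INT, target, 0, count, MPI_INT,
                       MPI_SUM, win);
    }
    MPI_Win_fence(0, win);   /* the reported crash is inside this fence call */

    MPI_Win_free(&win);
    free(winbuf);
    free(src);
    MPI_Finalize();
    return 0;
}
```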
The stack I got out of gdb is:
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00007fffb1f21e00 in uct_ep_flush (comp=0x0, flags=0, ep=0x10014d454e0)
at /opt/ucx-1.11.2/src/uct/api/uct.h:3050
3050 return ep->iface->ops.ep_flush(ep, flags, comp);
[Current thread is 1 (Thread 0x7fffb5f5e7f0 (LWP 2635823))]
Missing separate debuginfos, use: yum debuginfo-install glibc-2.28-151.el8.ppc64le libblkid-2.32.1-27.el8.ppc64le libevent-2.1.8-5.el8.ppc64le libgcc-8.4.1-1.el8.ppc64le libibverbs-41mlnx1-OFED.4.9.3.0.0.49411.ppc64le libmlx4-41mlnx1-OFED.4.7.3.0.3.49411.ppc64le libmlx5-41mlnx1-OFED.4.9.0.1.2.49411.ppc64le libmount-2.32.1-27.el8.ppc64le libnl3-3.5.0-1.el8.ppc64le librdmacm-41mlnx1-OFED.4.7.3.0.6.49411.ppc64le libselinux-2.9-5.el8.ppc64le libuuid-2.32.1-27.el8.ppc64le numactl-libs-2.0.12-11.el8.ppc64le openssl-libs-1.1.1g-15.el8_3.ppc64le pcre2-10.32-2.el8.ppc64le systemd-libs-239-45.el8_4.3.ppc64le zlib-1.2.11-17.el8.ppc64le
(gdb) bt
#0 0x00007fffb1f21e00 in uct_ep_flush (comp=0x0, flags=0, ep=0x10014d454e0)
at /opt/ucx-1.11.2/src/uct/api/uct.h:3050
#1 uct_rc_iface_flush (tl_iface=0x10014c3b4b0, flags=<optimized out>, comp=<optimized out>) at rc/base/rc_iface.c:294
#2 0x00007fffb4f5a4e4 in uct_iface_flush (comp=0x0, flags=0, iface=<optimized out>)
at /opt/ucx-1.11.2/src/uct/api/uct.h:2627
#3 ucp_worker_flush_check (worker=0x10014bcc150) at rma/flush.c:411
#4 0x00007fffb4f5cd94 in ucp_worker_flush_nbx_internal (param=0x7fffedc528f0, worker=0x10014bcc150) at rma/flush.c:554
#5 ucp_worker_flush_nbx (worker=0x10014bcc150, param=0x7fffedc528f0) at rma/flush.c:596
#6 0x00007fffb4f5cf04 in ucp_worker_flush_nb (worker=<optimized out>, flags=<optimized out>, cb=<optimized out>) at rma/flush.c:586
#7 0x00007fffb5cc3dc4 in opal_common_ucx_worker_flush (worker=0x10014bcc150) at ../../../../opal/mca/common/ucx/common_ucx.h:179
#8 0x00007fffb5cc41c4 in ompi_osc_ucx_fence (assert=0, win=0x10014b5cb60) at osc_ucx_active_target.c:77
#9 0x00007fffb5b5bdc8 in PMPI_Win_fence (assert=0, win=0x10014b5cb60) at pwin_fence.c:60
#10 0x0000000010004538 in main_test_fn (comm=0x7fffb5ec39a8 <ompi_mpi_comm_world>, tid=1) at 1sided.c:702
#11 0x0000000010004934 in runtest (comm=0x7fffb5ec39a8 <ompi_mpi_comm_world>, tid=1) at 1sided.c:762
#12 0x00000000100049ac in main () at 1sided.c:772
(gdb) l
3045 * upon completion of these operations.
3046 */
3047 UCT_INLINE_API ucs_status_t uct_ep_flush(uct_ep_h ep, unsigned flags,
3048 uct_completion_t *comp)
3049 {
3050 return ep->iface->ops.ep_flush(ep, flags, comp);
3051 }
3052
3053
3054 /**
(gdb) p ep
$1 = (uct_ep_h) 0x10014d454e0
(gdb) p ep->iface
$2 = (uct_iface_h) 0x138800001388
(gdb) p ep->iface->ops
Cannot access memory at address 0x138800001388
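So the endpoint pointer itself is readable, but ep->iface points at an unmapped address, which is why the dereference in uct_ep_flush faults; it looks as though the endpoint has been freed or overwritten by the time the fence flushes the worker. The flush path in frames #6-#8 is just a non-blocking worker flush driven to completion, roughly the pattern below (a sketch assuming a valid ucp_worker_h, not the actual opal/mca/common/ucx source):

```c
#include <ucp/api/ucp.h>

/* Completion callback required by ucp_worker_flush_nb(); nothing to do here
 * because completion is detected by polling the returned request. */
static void empty_flush_cb(void *request, ucs_status_t status)
{
    (void)request;
    (void)status;
}

/* Rough sketch of the flush-and-progress pattern behind
 * opal_common_ucx_worker_flush() -> ucp_worker_flush_nb() in the backtrace:
 * start a non-blocking flush, then drive the worker until it completes. */
static ucs_status_t flush_worker(ucp_worker_h worker)
{
    ucs_status_ptr_t req = ucp_worker_flush_nb(worker, 0, empty_flush_cb);
    ucs_status_t status;

    if (req == NULL) {
        return UCS_OK;               /* flush completed immediately */
    }
    if (UCS_PTR_IS_ERR(req)) {
        return UCS_PTR_STATUS(req);  /* flush could not be started */
    }

    do {
        ucp_worker_progress(worker);           /* progress outstanding ops */
        status = ucp_request_check_status(req);
    } while (status == UCS_INPROGRESS);

    ucp_request_free(req);
    return status;
}
```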
MTT had a slightly different signature than my manual run above (though it ran with --disable-debug):
phase 8 part 2 (fence;loop(i){accum};fence) c-int nochk [st:lock]
iter: 1000, time: 0.016633 sec
phase 8 part 3 (fence;loop(i){accum};fence) nc-int chk [st:fence]
[1652847615.199257] [gnu-ompi-mtt-cn-1:57362:0] rma_send.c:277 UCX ERROR cannot use a remote key on a different endpoint than it was unpacked on
[1652847615.199260] [gnu-ompi-mtt-cn-1:57363:0] rma_send.c:277 UCX ERROR cannot use a remote key on a different endpoint than it was unpacked on
[gnu-ompi-mtt-cn-1:57362] *** An error occurred in MPI_Accumulate
[gnu-ompi-mtt-cn-1:57362] *** reported by process [1298268161,2]
[gnu-ompi-mtt-cn-1:57362] *** on win ucx window 3
[gnu-ompi-mtt-cn-1:57362] *** MPI_ERR_OTHER: known error not in list
[gnu-ompi-mtt-cn-1:57362] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
[gnu-ompi-mtt-cn-1:57362] *** and potentially your MPI job)
[gnu-ompi-mtt-cn-0:57755:0:57755] ib_mlx5_log.c:174 Remote access on mlx5_0:1/IB (synd 0x13 vend 0x88 hw_synd 0/0)
[gnu-ompi-mtt-cn-0:57755:0:57755] ib_mlx5_log.c:174 RC QP 0x19921 wqe[235]: CSWAP s-- [rva 0x10032a51760 rkey 0x2400] [cmp 0 swap 4294967296] [va 0x7fff9c17fd78 len 8 lkey 0x9466e] [rqpn 0x384c7 dlid=13 sl=0 port=1 src_path_bits=0]
==== backtrace (tid: 57755) ====
0 /opt/ucx-1.11.2/lib/libucs.so.0(ucs_handle_error+0x324) [0x7fffb89373b4]
1 /opt/ucx-1.11.2/lib/libucs.so.0(ucs_fatal_error_message+0x118) [0x7fffb8932258]
2 /opt/ucx-1.11.2/lib/libucs.so.0(ucs_log_default_handler+0x1388) [0x7fffb8939768]
3 /opt/ucx-1.11.2/lib/libucs.so.0(ucs_log_dispatch+0xc0) [0x7fffb8939a30]
4 /opt/ucx-1.11.2/lib/ucx/libuct_ib.so.0(uct_ib_mlx5_completion_with_err+0x624) [0x7fffb5e4a134]
5 /opt/ucx-1.11.2/lib/ucx/libuct_ib.so.0(+0x4ee0c) [0x7fffb5e6ee0c]
6 /opt/ucx-1.11.2/lib/ucx/libuct_ib.so.0(uct_ib_mlx5_check_completion+0xc4) [0x7fffb5e4af64]
7 /opt/ucx-1.11.2/lib/ucx/libuct_ib.so.0(+0x51e1c) [0x7fffb5e71e1c]
8 /opt/ucx-1.11.2/lib/libucp.so.0(ucp_worker_progress+0x64) [0x7fffb8a4c074]
9 /opt/mtt_scratch/ompi-4.1.x_gcc/installs/zyga/install/lib/libmpi_ftw.so.40(+0x2d98cc) [0x7fffb97f98cc]
10 /opt/mtt_scratch/ompi-4.1.x_gcc/installs/zyga/install/lib/libmpi_ftw.so.40(+0x2d9a10) [0x7fffb97f9a10]
11 /opt/mtt_scratch/ompi-4.1.x_gcc/installs/zyga/install/lib/libmpi_ftw.so.40(+0x2d9aa0) [0x7fffb97f9aa0]
12 /opt/mtt_scratch/ompi-4.1.x_gcc/installs/zyga/install/lib/libmpi_ftw.so.40(+0x2dac3c) [0x7fffb97fac3c]
13 /opt/mtt_scratch/ompi-4.1.x_gcc/installs/zyga/install/lib/libmpi_ftw.so.40(ompi_osc_ucx_accumulate+0xe0) [0x7fffb97fba88]
14 /opt/mtt_scratch/ompi-4.1.x_gcc/installs/zyga/install/lib/libmpi_ftw.so.40(PMPI_Accumulate+0x5cc) [0x7fffb9695ba8]
15 onesided/1sided() [0x10002f38]
16 onesided/1sided() [0x1000442c]
17 onesided/1sided() [0x10004914]
18 onesided/1sided() [0x1000498c]
19 /lib64/libc.so.6(+0x24c78) [0x7fffb92c4c78]
20 /lib64/libc.so.6(__libc_start_main+0xb4) [0x7fffb92c4e64]
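For context on the "cannot use a remote key on a different endpoint than it was unpacked on" error: UCP ties an rkey produced by ucp_ep_rkey_unpack() to the endpoint it was unpacked on, and the rkey may only be used for RMA operations issued on that same endpoint. A minimal sketch of that rule (hypothetical helper and arguments, not the osc/ucx code):

```c
#include <ucp/api/ucp.h>

/* Sketch of the constraint behind the UCX error above: an rkey unpacked on
 * one endpoint may only be used with that endpoint.  Hypothetical helper. */
static ucs_status_t start_put(ucp_ep_h ep, const void *packed_rkey,
                              const void *src, size_t len,
                              uint64_t remote_addr, ucp_rkey_h *rkey_out)
{
    ucs_status_t status = ucp_ep_rkey_unpack(ep, packed_rkey, rkey_out);
    if (status != UCS_OK) {
        return status;
    }

    /* OK: same 'ep' the rkey was unpacked on.  Passing *rkey_out to a
     * different endpoint is what UCX rejects in the MTT log above.  The
     * caller keeps the rkey until the operation completes, then releases it
     * with ucp_rkey_destroy(). */
    return ucp_put_nbi(ep, src, len, remote_addr, *rkey_out);
}
```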