segfault when trying to open significantly too many contexts #10370

Open
@joshfisher-cornelisnetworks

Description

Thank you for taking the time to submit an issue!

Background information

Found this issue on a 2-HFI system with one HFI disabled, which caused the command to try to open far too many contexts for a single HFI.

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

v4.1.2

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

git clone

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

No output

Please describe the system on which you are running

  • Operating system/version: RHEL 7.9
  • Computer hardware: x86_64
  • Network type: Back-to-back pair

Details of the problem

Please describe, in detail, the problem that you are having, including the behavior you expect to see, the actual behavior that you are seeing, steps to reproduce the problem, etc. It is most helpful if you can attach a small program that a developer can use to reproduce your problem.

Note: If you include verbatim output (or a code block), please use a GitHub Markdown code block like below:

shell$ mpirun -n 2 ./hello_world

On a 2-HFI system with one HFI disabled, I ran a test that is sized to work with both HFIs. I expected a failure due to too many contexts, but got a segfault instead of a graceful abort. When running with np and ppr closer to the limit, but still over it, the run does abort more gracefully.

Command run:
openmpi-v4.1.2/bin/mpirun -np 192 --map-by ppr:96:node -host hostA:96,hostB:96 --bind-to core --display-map --tag-output --allow-run-as-root --mca mtl ofi --mca btl ofi -x LD_LIBRARY_PATH=path/to/opx/build -x FI_PROVIDER=opx FI_LOG_LEVEL=warn -x IMB-MPI1 -include Uniband,Biband -npmin 192 -iter 10000 -msglog 0:15

Backtrace found:

#0 0x00007fd9e35eb6e8 in mca_btl_ofi_context_finalize () from /mnt/ci/mpi/gcc/openmpi-v4.1.2/lib/openmpi/mca_btl_ofi.so
#1 0x00007fd9e35ebab9 in mca_btl_ofi_context_alloc_scalable () from /mnt/ci/mpi/gcc/openmpi-v4.1.2/lib/openmpi/mca_btl_ofi.so
#2 0x00007fd9e35e7f9f in mca_btl_ofi_component_init () from /mnt/ci/mpi/gcc/openmpi-v4.1.2/lib/openmpi/mca_btl_ofi.so
#3 0x00007fd9f3a74d16 in mca_btl_base_select () from /mnt/ci/mpi/gcc/openmpi-v4.1.2/lib/libopen-pal.so.40
#4 0x00007fd9e37f2441 in mca_bml_r2_component_init () from /mnt/ci/mpi/gcc/openmpi-v4.1.2/lib/openmpi/mca_bml_r2.so
#5 0x00007fd9f4e5f3ce in mca_bml_base_init () from /mnt/ci/mpi/gcc/openmpi-v4.1.2/lib/libmpi.so.40
#6 0x00007fd9f4e9d4fd in ompi_mpi_init () from /mnt/ci/mpi/gcc/openmpi-v4.1.2/lib/libmpi.so.40
#7 0x00007fd9f4e46875 in PMPI_Init_thread () from /mnt/ci/mpi/gcc/openmpi-v4.1.2/lib/libmpi.so.40
#8 0x0000000000405265 in main (argc=9, argv=0x7ffdfc1a2938) at imb.cpp:295
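
Frames #0 and #1 show mca_btl_ofi_context_finalize being called from mca_btl_ofi_context_alloc_scalable, which would be consistent with an error-path cleanup running over contexts that were never fully set up. The issue does not confirm that this is the cause. The sketch below is not Open MPI code; it only illustrates the general pattern of a cleanup routine that tolerates partial initialization. All names (demo_context_t, demo_context_alloc, demo_context_finalize, demo_oversubscribed) and the limit of 8 are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for one context's resources; not Open MPI's types. */
typedef struct {
    int *tx; /* stand-in for a transmit context handle */
    int *rx; /* stand-in for a receive context handle */
} demo_context_t;

/* Release whatever was allocated. Safe to call with a NULL array and with
 * slots that were never opened, because free(NULL) is a no-op. */
void demo_context_finalize(demo_context_t *ctxs, size_t count)
{
    if (ctxs == NULL) {
        return;
    }
    for (size_t i = 0; i < count; i++) {
        free(ctxs[i].tx);
        free(ctxs[i].rx);
    }
    free(ctxs);
}

/* Try to open `requested` contexts. `limit` simulates the number of
 * contexts the hardware can actually provide (an assumed value). */
demo_context_t *demo_context_alloc(size_t requested, size_t limit)
{
    assert(requested > 0);

    demo_context_t *ctxs = malloc(requested * sizeof(*ctxs));
    if (ctxs == NULL) {
        return NULL;
    }

    /* Mark every slot as unopened before any step that can fail. */
    for (size_t i = 0; i < requested; i++) {
        ctxs[i].tx = NULL;
        ctxs[i].rx = NULL;
    }

    for (size_t i = 0; i < requested; i++) {
        if (i >= limit) {
            fprintf(stderr, "demo: requested %zu contexts, only %zu available\n",
                    requested, limit);
            demo_context_finalize(ctxs, requested); /* partial cleanup */
            return NULL;
        }
        ctxs[i].tx = malloc(sizeof(int));
        ctxs[i].rx = malloc(sizeof(int));
        if (ctxs[i].tx == NULL || ctxs[i].rx == NULL) {
            demo_context_finalize(ctxs, requested);
            return NULL;
        }
    }
    return ctxs;
}

/* Returns 0 when over-subscription is reported as an error, not a crash. */
int demo_oversubscribed(void)
{
    demo_context_t *ctxs = demo_context_alloc(96, 8);
    if (ctxs != NULL) {
        demo_context_finalize(ctxs, 96);
        return 1;
    }
    return 0;
}
```

Calling demo_oversubscribed() exercises the failing path: the allocator reports the shortfall on stderr, cleans up the slots it did open, and returns NULL so the caller can abort cleanly.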
