Background information
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
Open MPI v4.1.x branch
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Built from current v4.1.x branch (3/22/22)
If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.
git submodule status does not display anything.
Please describe the system on which you are running
- Operating system/version: RHEL 8.4
- Computer hardware: Single Power8 node
- Network type: Localhost
Details of the problem
I ran the set of self-checking tests from ompi-tests-public/collective-big-count with collective components specified as --mca coll_adapt_priority 100 --mca coll adapt,basic,sm,self,inter,libnbc
The following environment variables were set for all tests:
BIGCOUNT_HOSTS : -np 3
BIGCOUNT_MEMORY_PERCENT : 70
BIGCOUNT_MEMORY_DIFF : 10
For instance, I ran this command:
mpirun ${BIGCOUNT_HOSTS} -x BIGCOUNT_MEMORY_PERCENT=${BIGCOUNT_MEMORY_PERCENT} -x BIGCOUNT_MEMORY_DIFF=${BIGCOUNT_MEMORY_DIFF} --mca btl ^openib --mca coll_adapt_priority 100 --mca coll adapt,basic,sm,self,inter,libnbc /u/dwootton/bigcount-41x/BigCountUpstream/ompi-tests-public/collective-big-count/./test_allgather_uniform_count
The command failed with this assertion and traceback:
test_allgather_uniform_count: ../../../../ompi/mca/coll/base/coll_base_util.h:73: ompi_coll_base_nbc_reserve_tags: Assertion `reserve > 0' failed.
[c656f6n01:2537658] *** Process received signal ***
[c656f6n01:2537658] Signal: Aborted (6)
[c656f6n01:2537658] Signal code: (-6)
[c656f6n01:2537658] [ 0] linux-vdso64.so.1(__kernel_sigtramp_rt64+0x0)[0x2000000604d8]
[c656f6n01:2537658] [ 1] /lib64/libc.so.6(gsignal+0xd8)[0x2000003c44d8]
[c656f6n01:2537658] [ 2] /lib64/libc.so.6(abort+0x164)[0x2000003a462c]
[c656f6n01:2537658] [ 3] /lib64/libc.so.6(+0x37c70)[0x2000003b7c70]
[c656f6n01:2537658] [ 4] /lib64/libc.so.6(__assert_fail+0x64)[0x2000003b7d14]
[c656f6n01:2537658] [ 5] /u/dwootton/ompi-4-1-x/lib/openmpi/mca_coll_adapt.so(+0x544c)[0x200002ef544c]
[c656f6n01:2537658] [ 6] /u/dwootton/ompi-4-1-x/lib/openmpi/mca_coll_adapt.so(+0x76a4)[0x200002ef76a4]
[c656f6n01:2537658] [ 7] /u/dwootton/ompi-4-1-x/lib/openmpi/mca_coll_adapt.so(ompi_coll_adapt_ibcast+0x12c)[0x200002ef7118]
[c656f6n01:2537658] [ 8] /u/dwootton/ompi-4-1-x/lib/openmpi/mca_coll_adapt.so(ompi_coll_adapt_bcast+0x70)[0x200002ef3a30]
[c656f6n01:2537658] [ 9] /u/dwootton/ompi-4-1-x/lib/libmpi.so.40(ompi_coll_base_allgather_intra_basic_linear+0x22c)[0x2000001fc32c]
[c656f6n01:2537658] [10] /u/dwootton/ompi-4-1-x/lib/libmpi.so.40(MPI_Allgather+0x3c0)[0x200000129ec4]
[c656f6n01:2537658] [11] /u/dwootton/bigcount-41x/BigCountUpstream/ompi-tests-public/collective-big-count/./test_allgather_uniform_count[0x10002fc4]
[c656f6n01:2537658] [12] /u/dwootton/bigcount-41x/BigCountUpstream/ompi-tests-public/collective-big-count/./test_allgather_uniform_count[0x10002814]
[c656f6n01:2537658] [13] /lib64/libc.so.6(+0x24c78)[0x2000003a4c78]
[c656f6n01:2537658] [14] /lib64/libc.so.6(__libc_start_main+0xb4)[0x2000003a4e64]
[c656f6n01:2537658] *** End of error message ***
The following testcases had this failure:
- test_allgather_uniform_count
- test_allreduce_uniform_count
- test_bcast_uniform_count
- test_reduce_uniform_count
The tests were compiled by running make in the directory containing the source files.
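To make the reproduction easier, here is a minimal sketch of a script that rebuilds the command above for each of the four failing tests. The `TEST_DIR` and `DRY_RUN` variables and the `build_cmd` helper are hypothetical names introduced only for this example, and the test names are spelled with underscores on the assumption that they all follow the same pattern as test_allgather_uniform_count. By default it only prints the commands; it is not part of the test suite.

```shell
#!/bin/sh
# Sketch: rebuild the mpirun command from this report for each failing test.
# TEST_DIR, DRY_RUN and build_cmd are hypothetical names used for illustration.
TEST_DIR="${TEST_DIR:-.}"   # directory holding the compiled test binaries
DRY_RUN="${DRY_RUN:-1}"     # 1 = print the commands only, 0 = execute them

# Values taken from the report.
BIGCOUNT_HOSTS="-np 3"
BIGCOUNT_MEMORY_PERCENT=70
BIGCOUNT_MEMORY_DIFF=10

# Print the full command line for one test binary ($1).
build_cmd() {
    echo "mpirun ${BIGCOUNT_HOSTS}" \
         "-x BIGCOUNT_MEMORY_PERCENT=${BIGCOUNT_MEMORY_PERCENT}" \
         "-x BIGCOUNT_MEMORY_DIFF=${BIGCOUNT_MEMORY_DIFF}" \
         "--mca btl ^openib" \
         "--mca coll_adapt_priority 100" \
         "--mca coll adapt,basic,sm,self,inter,libnbc" \
         "${TEST_DIR}/$1"
}

# Test names assumed to use underscores throughout.
for t in test_allgather_uniform_count test_allreduce_uniform_count \
         test_bcast_uniform_count test_reduce_uniform_count; do
    cmd=$(build_cmd "$t")
    if [ "$DRY_RUN" = "1" ]; then
        echo "$cmd"
    else
        echo "== $t =="
        # eval is used for brevity; it assumes TEST_DIR contains no spaces.
        eval "$cmd" || echo "FAILED: $t"
    fi
done
```

Running it with `DRY_RUN=0 TEST_DIR=/path/to/collective-big-count` would execute the tests instead of printing the commands.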