Assertion `reserve > 0' failed running collective-big-count tests using v4.1.x branch and --mca coll adapt,basic,sm,self,inter,libnbc option #10221

Open
@drwootton

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

OpenMPI v4.1.x branch

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Built from current v4.1.x branch (3/22/22)

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

git submodule status does not display anything.

Please describe the system on which you are running

  • Operating system/version: RHEL 8.4
  • Computer hardware: Single Power8 node
  • Network type: Localhost

Details of the problem

I ran the self-checking tests from ompi-tests-public/collective-big-count, selecting the collective components with --mca coll_adapt_priority 100 --mca coll adapt,basic,sm,self,inter,libnbc

The following environment variables were set for all tests:

BIGCOUNT_HOSTS : -np 3
BIGCOUNT_MEMORY_PERCENT : 70
BIGCOUNT_MEMORY_DIFF : 10

For instance, I ran this command:

mpirun ${BIGCOUNT_HOSTS} -x BIGCOUNT_MEMORY_PERCENT=${BIGCOUNT_MEMORY_PERCENT} -x BIGCOUNT_MEMORY_DIFF=${BIGCOUNT_MEMORY_DIFF} --mca btl ^openib --mca coll_adapt_priority 100 --mca coll adapt,basic,sm,self,inter,libnbc /u/dwootton/bigcount-41x/BigCountUpstream/ompi-tests-public/collective-big-count/./test_allgather_uniform_count

The command failed with this assertion and traceback:

test_allgather_uniform_count: ../../../../ompi/mca/coll/base/coll_base_util.h:73: ompi_coll_base_nbc_reserve_tags: Assertion `reserve > 0' failed.
[c656f6n01:2537658] *** Process received signal ***
[c656f6n01:2537658] Signal: Aborted (6)
[c656f6n01:2537658] Signal code:  (-6)
[c656f6n01:2537658] [ 0] linux-vdso64.so.1(__kernel_sigtramp_rt64+0x0)[0x2000000604d8]
[c656f6n01:2537658] [ 1] /lib64/libc.so.6(gsignal+0xd8)[0x2000003c44d8]
[c656f6n01:2537658] [ 2] /lib64/libc.so.6(abort+0x164)[0x2000003a462c]
[c656f6n01:2537658] [ 3] /lib64/libc.so.6(+0x37c70)[0x2000003b7c70]
[c656f6n01:2537658] [ 4] /lib64/libc.so.6(__assert_fail+0x64)[0x2000003b7d14]
[c656f6n01:2537658] [ 5] /u/dwootton/ompi-4-1-x/lib/openmpi/mca_coll_adapt.so(+0x544c)[0x200002ef544c]
[c656f6n01:2537658] [ 6] /u/dwootton/ompi-4-1-x/lib/openmpi/mca_coll_adapt.so(+0x76a4)[0x200002ef76a4]
[c656f6n01:2537658] [ 7] /u/dwootton/ompi-4-1-x/lib/openmpi/mca_coll_adapt.so(ompi_coll_adapt_ibcast+0x12c)[0x200002ef7118]
[c656f6n01:2537658] [ 8] /u/dwootton/ompi-4-1-x/lib/openmpi/mca_coll_adapt.so(ompi_coll_adapt_bcast+0x70)[0x200002ef3a30]
[c656f6n01:2537658] [ 9] /u/dwootton/ompi-4-1-x/lib/libmpi.so.40(ompi_coll_base_allgather_intra_basic_linear+0x22c)[0x2000001fc32c]
[c656f6n01:2537658] [10] /u/dwootton/ompi-4-1-x/lib/libmpi.so.40(MPI_Allgather+0x3c0)[0x200000129ec4]
[c656f6n01:2537658] [11] /u/dwootton/bigcount-41x/BigCountUpstream/ompi-tests-public/collective-big-count/./test_allgather_uniform_count[0x10002fc4]
[c656f6n01:2537658] [12] /u/dwootton/bigcount-41x/BigCountUpstream/ompi-tests-public/collective-big-count/./test_allgather_uniform_count[0x10002814]
[c656f6n01:2537658] [13] /lib64/libc.so.6(+0x24c78)[0x2000003a4c78]
[c656f6n01:2537658] [14] /lib64/libc.so.6(__libc_start_main+0xb4)[0x2000003a4e64]
[c656f6n01:2537658] *** End of error message ***
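Frames 5 and 6 in the traceback above have no symbol names, only offsets into mca_coll_adapt.so. A rough sketch of how those offsets might be mapped back to source lines with addr2line from GNU binutils follows. It assumes the library was built with debug information (otherwise addr2line prints `??`), and the helper names `unresolved_frames` and `resolve_frames` are made up for this illustration.

```shell
#!/bin/sh
# Print "library offset" for every backtrace frame that has no symbol name,
# e.g. ".../mca_coll_adapt.so(+0x544c)[0x200002ef544c]".
# Frames that already carry a symbol (such as ompi_coll_adapt_ibcast+0x12c)
# are skipped.
unresolved_frames() {
    sed -n 's|^.*\] \(/[^(]*\)(+\(0x[0-9a-fA-F]*\))\[.*$|\1 \2|p'
}

# Feed each library/offset pair to addr2line.
# -f prints the function name, -p puts function and file:line on one line,
# -e selects the object file to inspect.
resolve_frames() {
    unresolved_frames | while read -r lib off; do
        printf '%s %s: ' "$lib" "$off"
        addr2line -f -p -e "$lib" "$off"
    done
}
```

With the traceback saved to a file (here a hypothetical backtrace.txt), `resolve_frames < backtrace.txt` would print one line per unresolved frame, including the libc frames.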

The following test cases failed with this assertion:

  • test_allgather_uniform_count
  • test_allreduce_uniform_count
  • test_bcast_uniform_count
  • test_reduce_uniform_count

The tests were compiled by running make in the directory containing the source files
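To reproduce all four failures in one pass, the mpirun invocation above could be wrapped in a small script along these lines. This is only a sketch: `TEST_DIR` and `MPIRUN` are hypothetical variables introduced here (the build directory and a hook for dry runs), and the binary names assume every test follows the test_<collective>_uniform_count pattern of the first entry in the list.

```shell
#!/bin/sh
# Settings taken from the report; override them in the environment if needed.
: "${BIGCOUNT_HOSTS:=-np 3}"
: "${BIGCOUNT_MEMORY_PERCENT:=70}"
: "${BIGCOUNT_MEMORY_DIFF:=10}"
# Assumption: TEST_DIR is the directory where make built the test binaries.
: "${TEST_DIR:=.}"
# Assumption: MPIRUN can be set to "echo" to print commands instead of running them.
: "${MPIRUN:=mpirun}"

# Run one test binary with the component selection used in the report.
run_bigcount_test() {
    # $MPIRUN and $BIGCOUNT_HOSTS are left unquoted on purpose so that
    # values such as "-np 3" split into separate arguments.
    $MPIRUN $BIGCOUNT_HOSTS \
        -x BIGCOUNT_MEMORY_PERCENT="$BIGCOUNT_MEMORY_PERCENT" \
        -x BIGCOUNT_MEMORY_DIFF="$BIGCOUNT_MEMORY_DIFF" \
        --mca btl '^openib' \
        --mca coll_adapt_priority 100 \
        --mca coll adapt,basic,sm,self,inter,libnbc \
        "$TEST_DIR/$1"
}

# Run the four tests named in the report; the return status is the
# number of tests that failed.
run_failing_tests() {
    failed=0
    for t in test_allgather_uniform_count test_allreduce_uniform_count \
             test_bcast_uniform_count test_reduce_uniform_count; do
        if run_bigcount_test "$t"; then
            echo "PASS $t"
        else
            echo "FAIL $t"
            failed=$((failed + 1))
        fi
    done
    return "$failed"
}
```

After sourcing the script, calling `run_failing_tests` runs each test in turn and keeps going after a failure, so a single run shows which collectives still hit the assertion.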
