Commit 651ef79

OMPI/COMM: be more conservative about when a comm is ready
Only return a pointer to an ompi_communicator_t struct from

- ompi_comm_lookup
- ompi_comm_lookup_cid

when the communicator has a PML associated with it. The ompi_comm_lookup function will continue to return NULL if the entry designated by the c_index argument in the ompi_mpi_communicators table is OMPI_COMM_SENTINEL.

OLD COMMIT MESSAGE BEFORE REFACTOR

This patch addresses a race condition in OB1. One way to hit this race is to use MPI_Comm_spawn under oversubscribed conditions.

The fundamental cause of the race is that the CID allocation procedure for intercommunicators has no barrier in the ompi_comm_activate_nb procedure. As a result, a process can receive a fragment (message) from another process participating in the spawn while it is still in the CID allocation procedure (within ompi_comm_next_cid_nb). The process may have allocated a suitable slot in the ompi_mpi_communicators table but not yet associated it with a PML.

So in this code path it is necessary to check both that a valid CID for the incoming message header's context is present in ompi_mpi_communicators and that a PML is associated with this communicator.

At the time of this PR the problem is specific to intercommunicators, because intracommunicators have barrier-like behavior in ompi_comm_activate_nb.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
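The race described above can be illustrated with a small, self-contained mock. This is only a sketch under stated assumptions: every `mock_` name below is hypothetical and none of it is Open MPI code. It assumes that a receive path which gets NULL from the lookup parks the fragment, and that the fragment is replayed once a PML has been associated with the communicator.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MOCK_MAX_COMMS   8
#define MOCK_MAX_PENDING 16

/* Hypothetical stand-in for ompi_communicator_t; not the real struct. */
typedef struct {
    uint32_t cid;        /* index of this communicator in mock_table */
    bool     pml_added;  /* plays the role of OMPI_COMM_IS_PML_ADDED() */
    int      delivered;  /* number of fragments matched on this comm */
} mock_comm_t;

/* Stand-in for the ompi_mpi_communicators table. */
static mock_comm_t *mock_table[MOCK_MAX_COMMS];

/* Context ids of fragments that arrived before their comm was ready. */
static uint32_t mock_pending[MOCK_MAX_PENDING];
static size_t   mock_pending_count = 0;

/* Return NULL for an empty slot or for a comm that has no PML yet. */
static mock_comm_t *mock_comm_lookup(uint32_t c_index)
{
    if (c_index >= MOCK_MAX_COMMS) {
        return NULL;
    }
    mock_comm_t *comm = mock_table[c_index];
    if ((NULL != comm) && !comm->pml_added) {
        comm = NULL;
    }
    return comm;
}

/* Receive path: true if the fragment was delivered, false if it was
 * deferred (or dropped because the pending list is full). */
static bool mock_recv_frag(uint32_t ctx)
{
    mock_comm_t *comm = mock_comm_lookup(ctx);
    if (NULL == comm) {
        if (mock_pending_count < MOCK_MAX_PENDING) {
            mock_pending[mock_pending_count++] = ctx;
        }
        return false;
    }
    comm->delivered++;
    return true;
}

/* Mark the comm as ready and replay the fragments deferred for it.
 * Returns the number of replayed fragments. */
static int mock_comm_add_pml(mock_comm_t *comm)
{
    assert(NULL != comm);
    comm->pml_added = true;

    int    replayed = 0;
    size_t kept     = 0;
    for (size_t i = 0; i < mock_pending_count; i++) {
        if (mock_pending[i] == comm->cid) {
            comm->delivered++;
            replayed++;
        } else {
            mock_pending[kept++] = mock_pending[i];  /* keep others queued */
        }
    }
    mock_pending_count = kept;
    return replayed;
}
```

In this mock, a fragment that arrives while `pml_added` is still false is never matched against the half-built communicator. It is counted only after `mock_comm_add_pml` runs, which is the ordering the patch is meant to guarantee.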
1 parent 0f17273 commit 651ef79

1 file changed: +15 -1 lines changed

ompi/communicator/communicator.h

Lines changed: 15 additions & 1 deletion
@@ -22,7 +22,7 @@
  * Copyright (c) 2015 Research Organization for Information Science
  * and Technology (RIST). All rights reserved.
  * Copyright (c) 2016-2017 IBM Corporation. All rights reserved.
- * Copyright (c) 2018-2022 Triad National Security, LLC. All rights
+ * Copyright (c) 2018-2024 Triad National Security, LLC. All rights
  * reserved.
  * Copyright (c) 2023 Advanced Micro Devices, Inc. All rights reserved.
  * $COPYRIGHT$
@@ -568,6 +568,13 @@ static inline ompi_communicator_t *ompi_comm_lookup (const uint32_t c_index)
         comm = NULL;
     }
 
+    /*
+     * return NULL if comm doesn't yet have an associated PML
+     */
+    if ((NULL != comm) && !OMPI_COMM_IS_PML_ADDED(comm)) {
+        comm = NULL;
+    }
+
     return comm;
 }
 
@@ -584,6 +591,13 @@ static inline ompi_communicator_t *ompi_comm_lookup_cid (const ompi_comm_extende
 {
     ompi_communicator_t *comm = NULL;
     (void) opal_hash_table_get_value_ptr (&ompi_comm_hash, &cid, sizeof (cid), (void **) &comm);
+    /*
+     * return NULL if the comm does not yet have an associated PML
+     */
+    if ((NULL != comm) && !OMPI_COMM_IS_PML_ADDED(comm)) {
+        comm = NULL;
+    }
+
     return comm;
 }
 
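The guard added in both hunks has the same shape, and its effect on callers can be checked with a small stand-alone mock. This is a sketch only: the `demo_` names, the flag bit, and the sentinel object are hypothetical stand-ins, not the real Open MPI definitions of `OMPI_COMM_IS_PML_ADDED` or `OMPI_COMM_SENTINEL`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bit and table size; the real definitions differ. */
#define DEMO_FLAG_PML_ADDED 0x1u
#define DEMO_TABLE_SIZE     4

/* Hypothetical stand-in for ompi_communicator_t. */
typedef struct {
    uint32_t flags;
} demo_comm_t;

/* Stand-in for OMPI_COMM_SENTINEL: a placeholder table entry that is
 * not a usable communicator. */
static demo_comm_t demo_sentinel;

/* Stand-in for the communicator table indexed by c_index. */
static demo_comm_t *demo_table[DEMO_TABLE_SIZE];

/* Plays the role of the OMPI_COMM_IS_PML_ADDED() check. */
static int demo_is_pml_added(const demo_comm_t *comm)
{
    return (comm->flags & DEMO_FLAG_PML_ADDED) != 0;
}

/* Lookup with the same three outcomes as the patched function:
 * NULL for an empty or sentinel slot, NULL for a comm without a PML,
 * and the comm pointer only once the PML flag is set. */
static demo_comm_t *demo_lookup(uint32_t c_index)
{
    demo_comm_t *comm = NULL;

    if (c_index < DEMO_TABLE_SIZE) {
        comm = demo_table[c_index];
    }
    if (&demo_sentinel == comm) {
        comm = NULL;
    }
    /* the guard this commit adds */
    if ((NULL != comm) && !demo_is_pml_added(comm)) {
        comm = NULL;
    }
    return comm;
}
```

Callers of such a lookup cannot tell "no such communicator" from "communicator not ready yet", since both come back as NULL. That matches the commit's intent of being conservative about when a communicator counts as ready.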