[5.0.0] MQD library gives corrupted communicator name #12063

@david-edwards-linaro

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

5.0.0

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

git clone

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

57c405c52ad76bab0be9f95e29a6df660673081e 3rd-party/openpmix (v4.2.7)
1552e36f0852bbc6d901ec95983369f0a3c283f6 3rd-party/prrte (v3.0.2)
c1cfc910d92af43f8c27807a9a84c9c13f4fbc65 config/oac (heads/main)

Please describe the system on which you are running

  • Operating system/version: Ubuntu 22.04
  • Computer hardware: x86_64

Details of the problem

Communicator names retrieved via the Message Queue Debugging library libompi_dbg_msgq.so appear corrupted, for example "0��^A" instead of "MPI_COMM_WORLD".
This appears to be because in 5.0.x the c_name field of ompi_communicator_t is a char*, whereas it was previously a char[MPI_MAX_OBJECT_NAME]. rebuild_communicator_list() in ompi_msgq_dll.c still assumes that the string content is stored inline as a field of the communicator structure, so it copies the bytes of the pointer (and whatever follows it) as if they were the name.
The rebuild_communicator_list() function would instead need to fetch the c_name field as a pointer and use it to obtain the null-terminated string.
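
To make the layout change concrete, here is a minimal self-contained sketch in plain C. The struct names `comm_inline_name` and `comm_pointer_name`, the constant `NAME_LEN`, and both helper functions are hypothetical stand-ins, not Open MPI definitions, and the reads happen in the local process rather than through the debugger. It only illustrates that copying bytes at the field offset works for the array layout, while the pointer layout needs the pointer to be read first and then followed.

```c
#include <stddef.h>
#include <string.h>

#define NAME_LEN 64 /* hypothetical stand-in for the name buffer size */

/* Pre-5.0 style layout: the name is stored inside the structure. */
struct comm_inline_name {
    int  id;
    char c_name[NAME_LEN];
};

/* 5.0.x style layout: the structure only holds a pointer to the name. */
struct comm_pointer_name {
    int   id;
    char *c_name;
};

/* Read the name assuming its characters live at `offset` inside the structure. */
static void read_name_inline(const void *comm, size_t offset, char out[NAME_LEN])
{
    memcpy(out, (const char *)comm + offset, NAME_LEN);
    out[NAME_LEN - 1] = '\0';
}

/* Read the pointer stored at `offset`, then copy the string it points to. */
static void read_name_via_pointer(const void *comm, size_t offset, char out[NAME_LEN])
{
    const char *name = NULL;
    memcpy(&name, (const char *)comm + offset, sizeof name);
    memset(out, 0, NAME_LEN);            /* zero-fill so the result is always terminated */
    if (name != NULL) {
        strncpy(out, name, NAME_LEN - 1);
    }
}
```

Applying `read_name_inline` to the pointer layout would return the raw pointer bytes, which matches the kind of garbage shown in the report.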
The mqs_ interface does not provide for fetching null-terminated strings, so the options are either to fetch (via the debugger) a fixed-size chunk of memory that is an upper bound on the length of the string, or to (inefficiently) fetch the string one byte at a time until the null terminator is found.
An illustrative patch for the first is as follows:

diff --git a/ompi/debuggers/ompi_msgq_dll.c b/ompi/debuggers/ompi_msgq_dll.c
index 4516b8df23..b79fa5886f 100644
--- a/ompi/debuggers/ompi_msgq_dll.c
+++ b/ompi/debuggers/ompi_msgq_dll.c
@@ -682,8 +682,11 @@ static int rebuild_communicator_list (mqs_process *proc)
                                     p_info );
             old->group = find_or_create_group( proc, group_base );
         }
-        mqs_fetch_data( proc, comm_ptr + i_info->ompi_communicator_t.offset.c_name,
-                        64, old->comm_info.name );
+        mqs_taddr_t name_addr = ompi_fetch_pointer( proc, comm_ptr + i_info->ompi_communicator_t.offset.c_name, p_info );
+        mqs_fetch_data( proc, name_addr, 64, old->comm_info.name );
+        old->comm_info.name[63] = '\0';
+        size_t name_len = strlen(old->comm_info.name);
+        memset(&old->comm_info.name[name_len], 0, 64-name_len);
 
         if( NULL != old->group ) {
             old->comm_info.size = old->group->entries;
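
For comparison, the second option (fetching one byte at a time) could look roughly like the sketch below. It is self-contained and does not use the real mqs_ API: the `fetch_fn` callback type, `fetch_cstring`, `fake_target`, and `fake_fetch` are all hypothetical names introduced for illustration, with `fake_fetch` standing in for the debugger's memory-fetch routine. The potential benefit over a fixed 64-byte fetch is that it never reads past the terminator, so it may avoid touching memory beyond the end of a short string, at the cost of one fetch per character.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical fetch callback: copies `size` bytes from target address `addr`
 * into `buf` and returns 0 on success. Not the real mqs_ signature. */
typedef int (*fetch_fn)(void *ctx, size_t addr, size_t size, void *buf);

/* Fetch a NUL-terminated string one byte at a time, writing at most `cap`
 * bytes (terminator included) into `out`. Returns the number of characters
 * stored, or -1 if a fetch fails or `cap` is 0. */
static long fetch_cstring(fetch_fn fetch, void *ctx, size_t addr,
                          char *out, size_t cap)
{
    if (cap == 0) {
        return -1;
    }
    memset(out, 0, cap);
    for (size_t i = 0; i + 1 < cap; i++) {
        char c;
        if (fetch(ctx, addr + i, 1, &c) != 0) {
            memset(out, 0, cap);         /* do not leave a partial name behind */
            return -1;
        }
        if (c == '\0') {
            return (long)i;
        }
        out[i] = c;
    }
    return (long)(cap - 1);              /* name truncated to fit the buffer */
}

/* Toy "target memory" used to exercise fetch_cstring without a debugger. */
struct fake_target {
    const char *mem;
    size_t      len;
};

static int fake_fetch(void *ctx, size_t addr, size_t size, void *buf)
{
    const struct fake_target *t = ctx;
    if (addr > t->len || size > t->len - addr) {
        return -1;                       /* read would leave the mapped range */
    }
    memcpy(buf, t->mem + addr, size);
    return 0;
}
```

A middle ground, not shown here, would be to fetch in small chunks and scan each chunk for the terminator, which trades fewer debugger round trips for a bounded over-read.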