Accessing the HWLOC topology tree
There are several mechanisms by which OMPI may obtain an HWLOC topology, depending on the environment within which the application is executing and the method by which the application was started. As PMIx continues to roll out across the environment, the variations in how OMPI deals with the topology will hopefully simplify. In the interim, however, OMPI must deal with a variety of use-cases. This document attempts to capture those situations and explain how OMPI interacts with the topology.
Note: this document pertains to version 5.0 and above - while elements of the following discussion can be found in earlier OMPI versions, there may be nuances that modify how they apply there. In v5.0 and above, OMPI uses PRRTE (the PMIx Reference RunTime Environment) as its RTE, and PRRTE is built with PMIx as its core foundation. Key to the discussion, therefore, is that OMPI v5.0 and above requires PRRTE 2.0 or above, which in turn requires PMIx v4.01 or above.
Note that it is PMIx, not PRRTE itself, that most often provides the HWLOC topology to the application. This is definitely the case for mpirun launch, and other environments have (so far) followed that model. When PMIx provides the topology, it does so in several forms:
- If HWLOC 2.x or above is used, then the primary form will be via HWLOC's shmem feature. The shmem rendezvous information is provided in a set of three PMIx keys (PMIX_HWLOC_SHMEM_FILE, PMIX_HWLOC_SHMEM_ADDR, and PMIX_HWLOC_SHMEM_SIZE).
- If HWLOC 2.x or above is used, then PMIx will also provide the topology as a HWLOC v2 XML string. Although one could argue it is a duplication of information, it is provided by default to support environments where shmem may not be available or authorized between the server and client processes (more on that below).
- Regardless of HWLOC version, PMIx also provides the topology as a HWLOC v1 XML string to support client applications that are linked against an older HWLOC version.
Should none of those be available, or if the user has specified a topology file that is to be used in place of whatever the environment provides, then OMPI will either read the topology from the file or perform its own local discovery. The latter is highly discouraged as it leads to significant scaling issues (both in terms of startup time and memory footprint) on complex machines with many cores and multiple layers in their memory hierarchy.
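The precedence described above can be summarized in a small decision function. This is only an illustrative sketch in Python, not OMPI's actual code: the `TopologySources` class, its field names, and the returned labels are all hypothetical, and the rule that the v1 XML string is used only by clients linked against a pre-2.x HWLOC is an assumption made to keep the example simple.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TopologySources:
    """What is on offer to the client (all field names are hypothetical)."""
    user_file: Optional[str] = None   # topology file specified by the user
    shmem_available: bool = False     # PMIX_HWLOC_SHMEM_* keys present and usable
    xml_v2: Optional[str] = None      # HWLOC v2 XML string provided by PMIx
    xml_v1: Optional[str] = None      # HWLOC v1 XML string provided by PMIx


def choose_topology_source(src: TopologySources, client_hwloc_major: int) -> str:
    """Return a label naming where the topology would be taken from."""
    # A user-specified file replaces whatever the environment provides.
    if src.user_file is not None:
        return "file"
    if client_hwloc_major >= 2:
        # shmem is the primary form; the v2 XML string is the fallback
        # when shmem is unavailable or not authorized.
        if src.shmem_available:
            return "shmem"
        if src.xml_v2 is not None:
            return "xml-v2"
    elif src.xml_v1 is not None:
        # Assumption: only clients linked against an older HWLOC use v1 XML.
        return "xml-v1"
    # Nothing usable was provided: fall back to (discouraged) local discovery.
    return "local-discovery"
```

For example, a client built against HWLOC 2.x that is offered both shmem and XML would pick `"shmem"`, while the same client with nothing on offer would end up at `"local-discovery"`, the case the text warns against.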
Once the topology has been obtained, the next question one must consider is: what does that topology represent? Is it the topology assigned to the application itself (e.g., via cgroup)? Or is it the overall topology as seen by the base OS? OMPI is designed to utilize the former - i.e., it expects to see the topology assigned to the application, and thus considers any resources present in the topology to be available for its use. It is therefore important to be able to identify the scope of the topology, and to appropriately filter it when necessary.
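To make the idea of filtering concrete, here is a toy model in Python. It is a sketch under stated assumptions rather than anything OMPI does verbatim: the topology is reduced to a hypothetical mapping from package names to sets of processing-unit (PU) indices, and `filter_topology` is an invented helper. HWLOC itself offers `hwloc_topology_restrict()` for restricting a loaded topology to a given set; the sketch does not call it.

```python
from typing import Dict, FrozenSet, Set


def filter_topology(packages: Dict[str, Set[int]],
                    allowed: Set[int]) -> Dict[str, FrozenSet[int]]:
    """Keep only the PUs in `allowed`; drop packages that end up empty.

    `packages` stands in for the base-OS view of the node, `allowed` for
    the resources assigned to the application (e.g., via cgroup).
    """
    filtered: Dict[str, FrozenSet[int]] = {}
    for name, pus in packages.items():
        kept = frozenset(pus & allowed)
        if kept:  # a package with no assigned PUs is not visible at all
            filtered[name] = kept
    return filtered
```

With a node described as `{"package0": {0, 1, 2, 3}, "package1": {4, 5, 6, 7}}` and an assignment of PUs `{0, 1}`, the filtered view contains only `package0` with PUs 0 and 1, which is the view OMPI expects to be handed.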
Unfortunately, the answer to that question depends upon the method of launch, and (in the case of direct launch) on the architecture of the host environment.
- mpirun launch
- direct launch
  - System daemon hosting PMIx server
  - Per-job (step) daemon hosting PMIx server
- singleton