-
Notifications
You must be signed in to change notification settings - Fork 908
V5.0.x: ofi: NIC selection update (revised) #11739
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Hello! The Git Commit Checker CI bot found a few problems with this PR: 52aa19b: Do not compute distances when unbound
7884ac2: ofi: NIC selection update
Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks! |
Hello! The Git Commit Checker CI bot found a few problems with this PR: 6626503: Do not compute distances when unbound
eb2c108: ofi: Fix Coverity warnings
7884ac2: ofi: NIC selection update
Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks! |
The existing code in compare_cpusets assumed that some non_io ancestor of a PCI object should intersect with the cpuset of the proc. However, this is not true. There is a case where the non IO ancestor can be an L3. If there exists two L3s on the same NUMA and the process is bound to one L3, but the PCI object is connected to the other L3, then compare_cpusets() will return false. A better way to determine the optimal interface is by finding the distances of the interfaces from the current process. Then find out which of these interfaces is nearest the process and select it. Use the PMIx distance generation for this purpose. Move away from using deprecated PMIX macros and use the functions directly instead. Signed-off-by: Amir Shehata <shehataa@ornl.gov> (cherry picked from commit d4e1ae5)
Fix two coverity warnings: - Uninitialized variables (UNINIT) in /opal/mca/common/ofi/common_ofi.c: 859 in opal_common_ofi_select_provider() - Null pointer dereferences (FORWARD_NULL) /opal/mca/common/ofi/common_ofi.c: 636 in is_near() Move the "near" parameter in opal_common_ofi_select_provider() outside the #if OPAL_OFI_PCI_DATA_AVAILABLE block, because later in the code it's being used outside of it. Signed-off-by: Amir Shehata <shehataa@ornl.gov> (cherry picked from commit 47f18dc)
Unbound procs cannot have a device distance Signed-off-by: Ralph Castain <rhc@pmix.org> (cherry picked from commit 8a446f6)
Revised PR for #11645 |
@amirshehataornl Could you hold on for a sec? I have a pending PR to add to this patch series #11728 |
This PR is meant to carry all the patches for the distances. So it will wait until 11728 lands on master, then I'll include it in the port. This way everything works. |
Gotcha. The change was just merged to main - it is needed to maintain backward compatibility for EFA, so I really appreciate if you could include those 2 commits as well. Or I can do it separately. let me know1 |
done |
*valin = val; | ||
return nearest; | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Curious why #if HWLOC_API_VERSION < 0x00020000
was removed in this change?
e.g. https://github.com/open-mpi/ompi/blob/main/opal/mca/common/ofi/common_ofi.c#LL608C1-L608C35
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like I missed porting
"Don't use hwloc v2 structures unless we have that version or above"
Trying to do that now
Encountered another bug. Need to add this PR as well #11744 |
Protect the logic that uses v2 structures from compiling with hwloc v1.x Signed-off-by: Ralph Castain <rhc@pmix.org> (cherry picked from commit c5a8a78)
This patch fixes a bug in device matching. uuid alone is not sufficient to match openfabrics devices, and therefore we need to also match osname. Signed-off-by: Wenduo Wang <wenduwan@amazon.com> (cherry picked from commit 09fd7a1)
Adjust string lengths according to the uuid format Signed-off-by: Wenduo Wang <wenduwan@amazon.com> (cherry picked from commit 2b282c1)
@amirshehataornl Could you cherry pick these changes? #11744 There are latent bugs in the get package rank function. |
This patch fixes a bug in package rank calculation. The peer rank returned via PMIX_LOCAL_PEERS is the global rank. Matching it with local rank was wrong. Signed-off-by: Wenduo Wang <wenduwan@amazon.com> (cherry picked from commit 88077e1)
Fix a double free bug on the error path. This causes a harsh crash instead of returning an error code. Signed-off-by: Wenduo Wang <wenduwan@amazon.com> (cherry picked from commit 002f5b4)
PR #11739 Updated |
bot:ibm:retest |
bot:aws:retest |
@wenduwan and @amirshehataornl please review, and we'll get it in. |
@gpaulsen Yes I plan to test out the change today/tomo and report back. |
@amirshehataornl Got a compile error with internal (old) hwloc and libevent
Need to pick up #11693 as well |
I cherry picked 9046276 and git rebased - ran through a couple 2 nodes tests and the change looks good. @amirshehataornl Thank you! |
Is this ready to merge? |
It looks good from my side. |
@gpaulsen I think @amirshehataornl missed 9046276 I will open a PR to cherry-pick that change |
The |
not sure I follow ompi/opal/mca/common/ofi/common_ofi.c Line 723 in 9717dbf
|
@amirshehataornl Sorry I meant |
hmmm how did that build in the first place? Are you creating a PR to fix that, or should I? |
Unfortunately Open MPI CI does not build OFI so this was not caught. I have opened a backport PR #11775 |
Port of the distances patches to OMPI 5.0.x release