-
Notifications
You must be signed in to change notification settings - Fork 908
coll/sync: add support for alltoallv and alltoallw #3262
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Adds alltoallv to coll.sync analogous to other collectives. Two additional control variables: barrier_before_alltoallv and barrier_after_alltoallv analogous to barrier_before and barrier_after, respectively. This way sync may be activated for alltoallv only. However calls to alltoallv also increment the other counters. Signed-off-by: Harald Braun harald.braun@atos.net
Signed-off-by: Harald Braun harald.braun@atos.net
@DrHaraldBraun do you have an app that is helped by having this sync option for alltoallv/w? |
Can one of the admins verify this patch? |
okay to test |
@hppritcha My understanding of the status of this pr was that we could not identify a technical reason why this patch would be necessary, since Alltoall(v/w) act synchronizing anywyay. That's why we asked for an example from the author where this was an issue. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm in favor of this change in general. coll_sync can be quite useful in a number of diagnostic situations, even for synchronizing collectives.
Just a few minor issues around the new mca parameters compatibility with existing parameters.
OBJ_RELEASE(module->c_coll.coll_reduce_scatter_module); | ||
OBJ_RELEASE(module->c_coll.coll_scatter_module); | ||
OBJ_RELEASE(module->c_coll.coll_scatterv_module); | ||
} else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this should be an 'else' just a close bracket. Couldn't someone run with both? Of course we don't want to obj_release twice, so might need to rework the logic a bit.
sync_module->super.coll_allgatherv = NULL; | ||
sync_module->super.coll_allreduce = NULL; | ||
sync_module->super.coll_alltoall = NULL; | ||
if (0 != mca_coll_sync_component.barrier_before_nops_alltoallv || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, can users only use barrier_[before|after]_nops_alltoall[v|w] if they don't use barrier_before_nops mechanism?
CHECK_AND_RETAIN(reduce_scatter); | ||
CHECK_AND_RETAIN(scatter); | ||
CHECK_AND_RETAIN(scatterv); | ||
} else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
again, perhaps just a close bracket instead of the 'else', and some protection to only CHECK_AND_RETAIN alltoall[v|w] once if it goes through both sections of code.
@DrHaraldBraun, are you still interested in this? Would you mind if I cleaned this up? |
@gpaulsen / @DrHaraldBraun do we still want this? |
@gpaulsen let's discuss the fate of this for 5.0. |
Re-ping. 5.0 branching is targeted for April 30th. If you want this in for 5.0, please target to get it in master by then. Thanks! |
I'll re-review this and try to get in for v5.0.0 |
Rebased version of @DrHaraldBraun's PR open-mpi#3262 [PATCH 1/2] alltoallv desync fix via coll.sync component [PATCH 2/2] coll/sync: add alltoallw analoguous to alltoallv Adds alltoallv to coll.sync analogous to other collectives. Two additional control variables: barrier_before_alltoallv and barrier_after_alltoallv analogous to barrier_before and barrier_after, respectively. This way sync may be activated for alltoallv only. However calls to alltoallv also increment the other counters. Signed-off-by: Harald Braun harald.braun@atos.net
It looks like this PR was auto-closed when we renamed the Can you re-open and/or re-submit this to the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR makes the support for sync alltoall[v,w] widely different from the others collectives. I can't see why and there are no comments justifying this need. We need either to fix the code or justify the need to have a different implementation.
However, there is a bigger issue with this component. The in_operation field protects only in a single thread scenario, but can lead to non-deterministic applications if used in a multithreaded application. The sync should disqualify itself if the library is working in THREAD_MULTIPLE mode.
Refers to / fixes: #3255
Adds
alltoallv
tocoll/sync
analogous to other collectives.Two additional control variables: barrier_before_alltoallv and
barrier_after_alltoallv analogous to barrier_before and barrier_after,
respectively. This way
sync
may be activated foralltoallv
only.However calls to
alltoallv
also increment the other counters.edit:
and the same for
alltoallw