Skip to content

coll/sync: add support for alltoallv and alltoallw #3262

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from
Closed

coll/sync: add support for alltoallv and alltoallw #3262

wants to merge 2 commits into from

Conversation

haraldBraun-atos
Copy link

@haraldBraun-atos haraldBraun-atos commented Mar 31, 2017

Refers to / fixes: #3255

Adds alltoallv to coll/sync analogous to other collectives.
Two additional control variables: barrier_before_alltoallv and
barrier_after_alltoallv analogous to barrier_before and barrier_after,
respectively. This way sync may be activated for alltoallv only.
However calls to alltoallv also increment the other counters.

edit:
and the same for alltoallw

Adds alltoallv to coll.sync analogous to other collectives.
Two additional control variables: barrier_before_alltoallv and
barrier_after_alltoallv analogous to barrier_before and barrier_after,
respectively. This way sync may be activated for alltoallv only.
However calls to alltoallv also increment the other counters.

Signed-off-by: Harald Braun harald.braun@atos.net
Signed-off-by: Harald Braun harald.braun@atos.net
@haraldBraun-atos haraldBraun-atos changed the title Alltoallv small to medium size messages desync, if used in gather-alike manner coll/sync: add support for alltoallv and alltoallw Apr 13, 2017
@hppritcha hppritcha requested a review from edgargabriel March 21, 2018 15:32
@hppritcha
Copy link
Member

@DrHaraldBraun do you have an app that is helped by having this sync option for alltoallv/w?

@ompiteam-bot
Copy link

Can one of the admins verify this patch?

@hppritcha
Copy link
Member

okay to test

@hppritcha hppritcha requested review from gpaulsen and removed request for edgargabriel July 16, 2018 19:16
@hppritcha
Copy link
Member

@gpaulsen

@edgargabriel
Copy link
Member

@hppritcha My understanding of the status of this pr was that we could not identify a technical reason why this patch would be necessary, since Alltoall(v/w) act synchronizing anywyay. That's why we asked for an example from the author where this was an issue.

Copy link
Member

@gpaulsen gpaulsen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm in favor of this change in general. coll_sync can be quite useful in a number of diagnostic situations, even for synchronizing collectives.
Just a few minor issues around the new mca parameters compatibility with existing parameters.

OBJ_RELEASE(module->c_coll.coll_reduce_scatter_module);
OBJ_RELEASE(module->c_coll.coll_scatter_module);
OBJ_RELEASE(module->c_coll.coll_scatterv_module);
} else {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this should be an 'else' just a close bracket. Couldn't someone run with both? Of course we don't want to obj_release twice, so might need to rework the logic a bit.

sync_module->super.coll_allgatherv = NULL;
sync_module->super.coll_allreduce = NULL;
sync_module->super.coll_alltoall = NULL;
if (0 != mca_coll_sync_component.barrier_before_nops_alltoallv ||
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, can users only use barrier_[before|after]_nops_alltoall[v|w] if they don't use barrier_before_nops mechanism?

CHECK_AND_RETAIN(reduce_scatter);
CHECK_AND_RETAIN(scatter);
CHECK_AND_RETAIN(scatterv);
} else {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again, perhaps just a close bracket instead of the 'else', and some protection to only CHECK_AND_RETAIN alltoall[v|w] once if it goes through both sections of code.

@gpaulsen
Copy link
Member

@DrHaraldBraun, are you still interested in this? Would you mind if I cleaned this up?

@awlauria
Copy link
Contributor

@gpaulsen / @DrHaraldBraun do we still want this?

@awlauria awlauria added this to the v5.0.0 milestone Mar 6, 2020
@awlauria
Copy link
Contributor

awlauria commented Mar 6, 2020

@gpaulsen let's discuss the fate of this for 5.0.

@awlauria
Copy link
Contributor

Re-ping. 5.0 branching is targeted for April 30th. If you want this in for 5.0, please target to get it in master by then. Thanks!

@gpaulsen
Copy link
Member

I'll re-review this and try to get in for v5.0.0

@gpaulsen gpaulsen self-assigned this Feb 19, 2021
gpaulsen added a commit to gpaulsen/ompi that referenced this pull request Mar 15, 2021
Rebased version of @DrHaraldBraun's PR open-mpi#3262

[PATCH 1/2] alltoallv desync fix via coll.sync component
[PATCH 2/2] coll/sync: add alltoallw analoguous to alltoallv

Adds alltoallv to coll.sync analogous to other collectives.
Two additional control variables: barrier_before_alltoallv and
barrier_after_alltoallv analogous to barrier_before and barrier_after,
respectively. This way sync may be activated for alltoallv only.
However calls to alltoallv also increment the other counters.

Signed-off-by: Harald Braun harald.braun@atos.net
@awlauria awlauria changed the title coll/sync: add support for alltoallv and alltoallw coll/sync: add support for alltoallv and alltoallw Jul 2, 2021
@awlauria awlauria changed the title coll/sync: add support for alltoallv and alltoallw coll/sync: add support for alltoallv and alltoallw Oct 25, 2021
@jsquyres jsquyres deleted the branch open-mpi:master March 25, 2022 15:21
@jsquyres jsquyres closed this Mar 25, 2022
@jsquyres
Copy link
Member

It looks like this PR was auto-closed when we renamed the master branch to main. I'm not sure why GitHub auto-closed this PR due to the renaming -- all other open PRs were automatically re-targeted to main. ☹️

Can you re-open and/or re-submit this to the main branch?

Copy link
Member

@bosilca bosilca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR makes the support for sync alltoall[v,w] widely different from the others collectives. I can't see why and there are no comments justifying this need. We need either to fix the code or justify the need to have a different implementation.

However, there is a bigger issue with this component. The in_operation field protects only in a single thread scenario, but can lead to non-deterministic applications if used in a multithreaded application. The sync should disqualify itself if the library is working in THREAD_MULTIPLE mode.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Alltoallv small to medium size messages desync, if used in gather-alike manner
9 participants