
Conversation


@MarcelKoch MarcelKoch commented Feb 14, 2024

Based on #1544

This PR adds a class that handles the communication of a local vector. It uses the non-blocking neighborhood all-to-all for this. The sparse communicator is constructed from an index_map.

In another PR this will be used in the distributed matrix.

Note: This requires C++17, which is used for the type erasure in the sparse_communicator.
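
As a rough orientation, here is a minimal usage sketch. It is not code from this PR: only the communicate() signature is taken from the diff quoted further down in this review; the function name, namespace qualifications, and variable names are assumptions.

#include <ginkgo/ginkgo.hpp>

// Hedged sketch: starts the non-blocking neighborhood all-to-all for a local
// vector and waits for it; buffer ownership stays with the caller.
template <typename ValueType>
void exchange(const sparse_communicator& sparse_comm,
              const gko::matrix::Dense<ValueType>* local_vector,
              const gko::detail::DenseCache<ValueType>& send_buffer,
              const gko::detail::DenseCache<ValueType>& recv_buffer)
{
    // communicate() posts the neighborhood all-to-all and returns immediately.
    auto req = sparse_comm.communicate(local_vector, send_buffer, recv_buffer);
    // The receive buffer may only be read after the request has completed.
    req.wait();
}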

PR Stack:

@MarcelKoch MarcelKoch self-assigned this Feb 14, 2024
@ginkgo-bot ginkgo-bot added the reg:build, reg:testing, and mod:core labels Feb 14, 2024
@MarcelKoch
Member Author

@upsj I think this PR might be a good opportunity to discuss your distributed::RowGather idea. In principle, this class already is that, just not part of the LinOp hierarchy.


@pratikvn pratikvn left a comment


I think this class structure mixes a few things together: the communicator class idea and the specific dense communicate (which involves an all_to_all_v). I think it would make sense to separate the two.

I think it should be possible to make the sparse_communicator a first-class communicator and overload only those functions where necessary, for example all_to_all_v in this case. The distributed object can then just use the sparse communicator as a normal communicator and inherit the better all_to_all_v, since it uses the sparse communicator for communication. For all other functions where the sparse communicator does not make sense (MPI_Send etc.), you transparently use the underlying default communicator.
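
A minimal sketch of that idea, purely for illustration (the member names, the all_to_all_v signature, and the double-only value type are assumptions, not existing Ginkgo API):

#include <mpi.h>

// Sketch: a communicator wrapper that specializes all_to_all_v via the
// neighborhood topology and transparently exposes the default communicator
// for everything that has no sparse counterpart.
class sparse_communicator {
public:
    sparse_communicator(MPI_Comm dense_comm, MPI_Comm neighbor_comm)
        : dense_comm_(dense_comm), neighbor_comm_(neighbor_comm)
    {}

    // Specialized collective: neighborhood all-to-all instead of MPI_Alltoallv.
    void all_to_all_v(const double* send_buf, const int* send_counts,
                      const int* send_offsets, double* recv_buf,
                      const int* recv_counts, const int* recv_offsets) const
    {
        MPI_Neighbor_alltoallv(send_buf, send_counts, send_offsets, MPI_DOUBLE,
                               recv_buf, recv_counts, recv_offsets, MPI_DOUBLE,
                               neighbor_comm_);
    }

    // Fallback for MPI_Send etc.: hand out the underlying default communicator.
    MPI_Comm get() const { return dense_comm_; }

private:
    MPI_Comm dense_comm_;
    MPI_Comm neighbor_comm_;
};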

/**
 * Simplified MPI communicator that handles only neighborhood all-to-all
 * communication.
 */
class sparse_communicator
Member


If it only provides a neighborhood all-to-all communication, then I think calling it a sparse communicator might be misleading.

Comment on lines +55 to +56
    const detail::DenseCache<ValueType>& send_buffer,
    const detail::DenseCache<ValueType>& recv_buffer) const;
Member


A public method of a public class should not take objects from the detail namespace. It probably needs to be generalized to array or Dense?

Member Author


The issue is that if the communicator stores the caches (which I would also prefer), then the communicator can handle only one asynchronous communication at a time. Any new communication would have to wait until the internal buffers are available again.
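
To illustrate the trade-off (sketch only; the names follow the signatures quoted in this review, and the communicator and vectors are placeholders):

// With caller-provided caches, two exchanges can be in flight at the same
// time, each using its own buffers. A single internal buffer pair would force
// the second communicate() to wait until the first request has completed.
gko::detail::DenseCache<double> send_a, recv_a;
gko::detail::DenseCache<double> send_b, recv_b;

auto req_a = sparse_comm.communicate(vec_a.get(), send_a, recv_a);
auto req_b = sparse_comm.communicate(vec_b.get(), send_b, recv_b);

req_a.wait();
req_b.wait();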

Comment on lines +81 to +87
MPI_Info info;
GKO_ASSERT_NO_MPI_ERRORS(MPI_Info_create(&info));
GKO_ASSERT_NO_MPI_ERRORS(MPI_Dist_graph_create_adjacent(
    comm.get(), send_ids.get_size(), send_ids.get_const_data(),
    MPI_UNWEIGHTED, recv_ids.get_size(), recv_ids.get_const_data(),
    MPI_UNWEIGHTED, info, false, &sparse_comm));
GKO_ASSERT_NO_MPI_ERRORS(MPI_Info_free(&info));
Member


Why not

Suggested change: replace

MPI_Info info;
GKO_ASSERT_NO_MPI_ERRORS(MPI_Info_create(&info));
GKO_ASSERT_NO_MPI_ERRORS(MPI_Dist_graph_create_adjacent(
    comm.get(), send_ids.get_size(), send_ids.get_const_data(),
    MPI_UNWEIGHTED, recv_ids.get_size(), recv_ids.get_const_data(),
    MPI_UNWEIGHTED, info, false, &sparse_comm));
GKO_ASSERT_NO_MPI_ERRORS(MPI_Info_free(&info));

with

GKO_ASSERT_NO_MPI_ERRORS(MPI_Dist_graph_create_adjacent(
    comm.get(), send_ids.get_size(), send_ids.get_const_data(),
    MPI_UNWEIGHTED, recv_ids.get_size(), recv_ids.get_const_data(),
    MPI_UNWEIGHTED, MPI_INFO_NULL, false, &sparse_comm));

since no key-value pairs are being set anyway?

Member Author


One MPI implementation (I can't remember on which system) crashed without it, so I just kept it.

Comment on lines +196 to +199
mpi::request sparse_communicator::communicate(
    const matrix::Dense<ValueType>* local_vector,
    const detail::DenseCache<ValueType>& send_buffer,
    const detail::DenseCache<ValueType>& recv_buffer) const
Member


Looking at this now, I don't get why the sparse_communicator class handles this specific communicate function.

@MarcelKoch
Member Author

@pratikvn you are right, the class mixes a few responsibilities. I will split it up into one class that handles the all-to-all communication and another one that does the distributed row gather.
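
A rough sketch of how such a split could look (placeholder names only, not the code from the follow-up PRs):

#include <mpi.h>

// Sketch: one type that wraps nothing but the non-blocking neighborhood
// all-to-all on a graph communicator ...
class neighborhood_communicator {
public:
    explicit neighborhood_communicator(MPI_Comm graph_comm) : comm_(graph_comm) {}

    MPI_Request i_all_to_all_v(const double* send_buf, const int* send_counts,
                               const int* send_offsets, double* recv_buf,
                               const int* recv_counts, const int* recv_offsets) const
    {
        MPI_Request req;
        MPI_Ineighbor_alltoallv(send_buf, send_counts, send_offsets, MPI_DOUBLE,
                                recv_buf, recv_counts, recv_offsets, MPI_DOUBLE,
                                comm_, &req);
        return req;
    }

private:
    MPI_Comm comm_;
};

// ... and a separate type, built from an index_map, that packs the required
// rows of the local vector, triggers the exchange through the communicator
// above, and unpacks the result (the distributed row gather).
class row_gatherer { /* uses a neighborhood_communicator internally */ };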

@MarcelKoch MarcelKoch force-pushed the optional-cxx17 branch 2 times, most recently from c46d056 to 9546c60 Compare April 4, 2024 10:29
@MarcelKoch MarcelKoch force-pushed the sparse-comm branch 2 times, most recently from 1e3d482 to 63ba23d Compare April 5, 2024 08:14
@MarcelKoch MarcelKoch force-pushed the optional-cxx17 branch 2 times, most recently from 4c69493 to 2aa609c Compare April 5, 2024 08:18
@MarcelKoch MarcelKoch added this to the Ginkgo 1.8.0 milestone Apr 5, 2024
@MarcelKoch
Member Author

This is superseded by #1588 and #1589.

@MarcelKoch MarcelKoch closed this Apr 19, 2024