Conversation

@MarcelKoch
Member

@MarcelKoch MarcelKoch commented Jan 31, 2022

This PR adds a distributed (multi-)vector class. I've opted to mostly use multivector-based descriptions instead of dense ones. This might clash a bit with the rest of the code until we separate dense into multivector and dense matrix.

The vector is a distributed global vector, i.e. it has the size global_num_rows x num_cols. Only the rows are distributed. Each process stores only the rows that belong to it in a Dense matrix, according to the partition in use.
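
As a rough illustration of this layout, here is a minimal standalone sketch in plain C++. It does not use Ginkgo's API: `RowPartition` and `extract_local_rows` are hypothetical names, and the partition is assumed to be a simple row-to-rank map. It only shows that rows are split across ranks while all columns of an owned row stay together.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a row partition (not Ginkgo's Partition class):
// part_of_row[i] is the rank that owns global row i.
struct RowPartition {
    std::vector<int> part_of_row;

    std::size_t global_num_rows() const { return part_of_row.size(); }

    // Number of rows owned by `rank`.
    std::size_t local_num_rows(int rank) const
    {
        std::size_t count = 0;
        for (int owner : part_of_row) {
            if (owner == rank) {
                ++count;
            }
        }
        return count;
    }
};

// Extracts the rows owned by `rank` from a row-major global multi-vector of
// size global_num_rows x num_cols. Only rows are distributed: every owned
// row keeps all of its columns.
std::vector<double> extract_local_rows(const RowPartition& partition,
                                       const std::vector<double>& global,
                                       std::size_t num_cols, int rank)
{
    assert(global.size() == partition.global_num_rows() * num_cols);
    std::vector<double> local;
    for (std::size_t row = 0; row < partition.global_num_rows(); ++row) {
        if (partition.part_of_row[row] == rank) {
            for (std::size_t col = 0; col < num_cols; ++col) {
                local.push_back(global[row * num_cols + col]);
            }
        }
    }
    return local;
}
```

In this sketch, the local block of a rank has `local_num_rows(rank)` rows and `num_cols` columns, which corresponds to the Dense matrix each process would hold.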

Regarding the tests, I've put the tests that require MPI into /core/test/mpi/distributed, not sure if that is the best place.

Partially addresses #907

Note: should we use distributed-develop as target branch again? That way we could make sure that we have all basic things (vectors, matrices, solvers) together for the next release.

Main contributions are from @upsj and @pratikvn.

@MarcelKoch MarcelKoch added is:new-feature A request or implementation of a feature that does not exist yet. 1:ST:ready-for-review This PR is ready for review mod:mpi This is related to the MPI module type:distributed-functionality labels Jan 31, 2022
@MarcelKoch MarcelKoch added this to the Ginkgo 1.5.0 milestone Jan 31, 2022
@MarcelKoch MarcelKoch self-assigned this Jan 31, 2022
@ginkgo-bot ginkgo-bot added mod:all This touches all Ginkgo modules. reg:build This is related to the build system. reg:testing This is related to testing. type:matrix-format This is related to the Matrix formats labels Jan 31, 2022
@MarcelKoch
Member Author

format-rebase!

@ginkgo-bot
Member

Error: Rebase failed, see the related Action for details

@MarcelKoch
Member Author

format!

@MarcelKoch MarcelKoch requested a review from a team February 3, 2022 15:05
Member

@upsj upsj left a comment

A first look over the code. The class structure and implementation look good to me; I haven't checked the tests yet.

A fundamental discussion point I would like to raise is whether Vector should store Partition. In my original design, both Matrix and Vector exclusively used local indexing after they have been filled, and I believe this is a property we could use generally. If users need to translate global indices and local indices, they have the original partition available from their setup code. Using the vector and matrix as a convenient vehicle to carry them around seems like it might overload the functionality a bit.

@MarcelKoch
Member Author

Regarding the partition, my use case for that was repartitioning, which we should support in the long term. This could be achieved without storing the partition in the vector class, since the repartition would need to store it anyway. However, I don't see how we could make sure that the vector we are repartitioning is valid for that operation if the vector does not store its partition. I think the additional safety we get from storing it is worth the downside. (I'm not entirely sure what you see as a downside, except that it is not necessary; perhaps you could elaborate on that?)

Also, I think there is a strong connection between vector and partition. The vector is only valid in combination with this one partition, which is expressed in that way.

Member

@pratikvn pratikvn left a comment

Mostly looks good. Some minor questions on function interfaces.

@pratikvn
Member

pratikvn commented Feb 3, 2022

Regarding the base branch to merge this into, distributed-develop might be a good choice. If we decide that the interface more or less looks good, we can instead try to merge this directly into develop.

I think storing a partition in Vector also allows us to perform operations on Vectors alone (even on data with a different partition), without access to the matrix or other partition-storing classes. Therefore, I also think it would be useful to store the partition in the vector class.

@MarcelKoch MarcelKoch changed the base branch from develop to distributed-develop February 9, 2022 13:29
@MarcelKoch
Member Author

I've worked in most of the review comments. There are two open questions so far (1 minor, 1 major):

  1. Do we agree on the current order of the constructor parameters and their default values?
  2. Do we keep the partition in the vector?

For 2. the arguments against (as I understand it) are:

  • There is no use case where the partition information is required but won't be provided elsewhere. This especially holds for the SpMV.

The arguments in favor are:

  • We can ensure that we always operate on consistent data, e.g. in the SpMV. For example, there could be cases where the local and global sizes all match up, but the partitions are not the same. (Note: this would require testing partitions for equality in constant time.)
  • Simple access to the partition, without relying on other classes that might store it.

@MarcelKoch
Member Author

I've added the DenseCache that is used in the distributed matrix also to this PR. If wanted, I could also extract that into its own PR, since the changes for that are quite small.

Member

@pratikvn pratikvn left a comment

Mostly looks good. Some comments on whether we want to make DenseCache public. Regarding the discussion points:

  1. I think the constructor order is okay.
  2. I think a distributed object should always have a partition member.
    • It should be something like a Concept requirement for distributed objects.
    • There are cases for vector operations where the partition makes sense, for example dot products between two vectors.
    • Additionally, the read function already takes a partition. In my opinion, it might make more sense to not have the read function take a partition, but to create the object with a partition and have the read function use the object's partition.

@MarcelKoch
Member Author

@pratikvn you raise a very important point. The current implementations are only correct if both vectors (think axpy, dot, ...) use the same partition. It is not enough that both local vectors have the same size. This is something that we should check for at some point.
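
To illustrate why matching local sizes are not sufficient, the following standalone sketch emulates a distributed dot product in plain C++. It is not Ginkgo code: `Ownership`, `local_part`, and `emulated_distributed_dot` are hypothetical names, ranks are emulated by a loop, and the final sum stands in for the all-reduce.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// part_of_row[i] is the rank owning global row i (hypothetical layout).
using Ownership = std::vector<int>;

// Local storage of one rank: its owned entries in ascending global order.
std::vector<double> local_part(const Ownership& part_of_row,
                               const std::vector<double>& global, int rank)
{
    std::vector<double> local;
    for (std::size_t row = 0; row < global.size(); ++row) {
        if (part_of_row[row] == rank) {
            local.push_back(global[row]);
        }
    }
    return local;
}

// Emulates a distributed dot product: every rank multiplies its local
// entries position by position, then the partial sums are added up
// (the all-reduce step). Only the local sizes are checked.
double emulated_distributed_dot(const Ownership& part_x,
                                const std::vector<double>& x,
                                const Ownership& part_y,
                                const std::vector<double>& y, int num_ranks)
{
    double result = 0.0;
    for (int rank = 0; rank < num_ranks; ++rank) {
        const auto local_x = local_part(part_x, x, rank);
        const auto local_y = local_part(part_y, y, rank);
        assert(local_x.size() == local_y.size());
        for (std::size_t i = 0; i < local_x.size(); ++i) {
            result += local_x[i] * local_y[i];
        }
    }
    return result;
}
```

With x = (1, 2, 3, 4) and y = (1, 10, 100, 1000), the true dot product is 4321. If both vectors use the ownership {0, 0, 1, 1}, the sketch returns 4321. If y instead uses {0, 1, 0, 1}, every rank still holds two entries and the size check passes, but the result is 4231.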

@upsj
Member

upsj commented Feb 10, 2022

I see, there seems to be a very fundamental difference between our approaches. In my original implementation, neither Matrix nor Vector store their partition. If you ignore the off-diagonal entries, you can imagine the matrix as a block-diagonal matrix with local entries in each row. I agree that having the partition available is useful in some cases, but in that case the current implementation has similar issues to what #753 is fixing: Copying the vector to another executor doesn't copy the Partition, which leads to issues when trying to read it from a kernel.

A very basic rule I try to follow in my designs is to try and keep the amount of state of each object as minimal as possible, and always prefer members with value semantics. This leads to the least weird empty states for objects. For communicator and executor it is strictly necessary to have them available to each object (also, we need to check that objects share the same communicator before having them interact!), but all current functionality of Vector and Matrix (pointwise operations, reductions, SpMV) can be implemented entirely without knowing the partition. You can imagine the read_distributed function doing a global reordering that groups rows/columns from the same block together. The whole point of the interface is for the partition to become unnecessary once you have everything set up. I would assume that applications also tend to work on local indices rather than global indices? For more advanced usage of the halo, maybe we should consider adding a matrix type that takes care of overlaps similar to the P/P^T matrix in libCEED's L-Vector and T-Vector?

@pratikvn
Member

While I agree that minimizing state stored by a class is a good idea, I think it is important to have enough information to enable the class to be independent, without having to rely on other classes to provide valid information on how the data is being stored in this class. To me, a partition is crucial to a distributed object and provides information on the object's distribution. Regarding the issues you raise:

  1. If copying across executors does not copy all the data of that class, then we need to either try to make sure we overload the copy_from to handle that, or disable copy_from and use the = operators with the correct behaviour.
  2. Similar to the communicator, we technically do need to verify that the partitions of two interacting objects are the same or at least commute with each other. An example is dot products between two vectors, which can be stored in different ways on different ranks. Right now, we just assume they have the same local sizes. Without a partition object, you disable dot products between vectors even if they have the same global size.
  3. The T-vector/L-vector is a good point, but that is also something the partition/index set provides, global to local and local to global maps.

@upsj
Member

upsj commented Feb 10, 2022

I tend to disagree with the premise that a vector needs to contain information that allows you to understand what its row indices mean. Take Dense for example, without context (e.g. a FEM mesh), you don't know how to interpret the different entries. That context (mapping mesh entries to DOFs) is provided by the application and not stored in the vector. I see the distributed vector in the same way: The application provides a way to interpret what the local indices mean based on its FEM mesh and partition.

Also a side note we haven't talked about: For the most efficient way to store a partition (i.e. one contiguous range for each rank), the local size is entirely sufficient to check for consistency (assuming the default MPI configuration to quit if one rank aborts). Instead of checking the global property "matching partition" everywhere, you check the local property "matching size" everywhere, which is equivalent, because in this case, the read_distributed function doesn't do any (implicit) global reordering.
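
The equivalence for contiguous partitions can be sketched as follows, in plain C++ with hypothetical helper names (`contiguous_ranges`, `to_local`) and under the assumption that rank r owns one contiguous range and ranges are ordered by rank. The ranges are then fully determined by the per-rank local sizes through an exclusive prefix sum, so two objects whose local sizes agree on every rank share the same ranges.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using Range = std::pair<std::size_t, std::size_t>;  // [begin, end)

// Rebuilds the contiguous row range of every rank from the local sizes
// alone, using an exclusive prefix sum.
std::vector<Range> contiguous_ranges(const std::vector<std::size_t>& local_sizes)
{
    std::vector<Range> ranges;
    std::size_t begin = 0;
    for (std::size_t size : local_sizes) {
        ranges.emplace_back(begin, begin + size);
        begin += size;
    }
    return ranges;
}

// Translates a global row owned by `rank` into its local index. No
// reordering is involved, only an offset.
std::size_t to_local(const std::vector<Range>& ranges, std::size_t rank,
                     std::size_t global_row)
{
    assert(global_row >= ranges[rank].first);
    assert(global_row < ranges[rank].second);
    return global_row - ranges[rank].first;
}
```

For non-contiguous partitions this reconstruction is not possible, which is the case where matching local sizes can hide mismatching partitions.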

Finally, I may not have made the fine distinction between our current approach and L/T vectors clear enough:

  • Our approach partitions the matrix by row only, storing the off-diagonal block separately and communicating only via a sparse gather before applying the off-diagonal matrix.
  • The CEED approach symmetrically stores a slightly larger matrix (local row + overlap) in that it extends both the rows and columns by its set of off-diagonal columns (that of course only works for pattern-symmetric matrices) and communicates both via a sparse gather before the (in this case) single SpMV and via a sparse scatter-add afterwards (to collect contributions from neighboring elements on the boundary). That means more communication, but also makes the local handling much nicer, because you basically get a local domain made out of complete elements (I am not 100% sure about the details of it, but I guess with your experience in RAS and domain decomposition, this might sound understandable/familiar?)
    The main difference is that there is no longer a separation between diagonal and off-diagonal matrix, and you can imagine the P and P^T matrices as communication operators that do the gather/scatter operations.
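
The first approach (row partition, separate off-diagonal block, sparse gather) can be sketched for a single rank as below. This is plain C++ with hypothetical types (`Entry`, `LocalMatrix`) and is not Ginkgo's distributed Matrix. The communication is replaced by a lookup into a global copy of x, which is an assumption made only to keep the sketch self-contained.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One nonzero in coordinate format.
struct Entry {
    std::size_t row;
    std::size_t col;
    double value;
};

// Hypothetical per-rank storage of a row-partitioned matrix.
struct LocalMatrix {
    std::size_t num_local_rows;
    std::vector<Entry> diag;      // columns index the rank's own x entries
    std::vector<Entry> off_diag;  // columns index the gathered buffer
    std::vector<std::size_t> gather_idxs;  // global rows of x owned elsewhere
};

// Computes y_local = diag * x_local + off_diag * gather(x).
std::vector<double> apply(const LocalMatrix& mat,
                          const std::vector<double>& x_local,
                          const std::vector<double>& x_global)
{
    // Stand-in for the sparse gather: a real implementation would receive
    // these values from the owning ranks instead of reading a global copy.
    std::vector<double> gathered;
    for (std::size_t idx : mat.gather_idxs) {
        gathered.push_back(x_global[idx]);
    }

    std::vector<double> y(mat.num_local_rows, 0.0);
    for (const Entry& e : mat.diag) {
        assert(e.row < y.size() && e.col < x_local.size());
        y[e.row] += e.value * x_local[e.col];
    }
    for (const Entry& e : mat.off_diag) {
        assert(e.row < y.size() && e.col < gathered.size());
        y[e.row] += e.value * gathered[e.col];
    }
    return y;
}
```

In the CEED-style variant described in the second bullet, there would be no separate `off_diag` block: the gather would extend `x_local` by the overlap, a single local SpMV would follow, and a scatter-add would return boundary contributions to their owners.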

EDIT: As a final thought, generally, it is always easier to add something later (store the partition after read_distributed) than it is to remove it later, since the latter involves an interface break. Because IMO the repartitioning is a really advanced use case, I would like to see that handled separately at least, because there may be smoother ways to express it?

EDIT2: I just remembered another issue: If we store the partition in Vector and Matrix, we also need to propagate it to all intermediate vectors in solvers, as well as distributed preconditioner wrappers. That functionality might impact a lot of classes with distributed-specific code.

EDIT3: After some discussions with Natalie, I believe the CEED case is pretty FEM-specific, since each nonzero involving the boundary of an element is computed from the sum of multiple submatrices involving the neighboring elements (see "E vector"), and such a decomposition into individual submatrices may not exist/be well-defined in the general case?

@codecov

codecov bot commented Feb 11, 2022

Codecov Report

Merging #961 (72944ef) into distributed-develop (cff196b) will decrease coverage by 1.15%.
The diff coverage is 92.85%.

@@                   Coverage Diff                   @@
##           distributed-develop     #961      +/-   ##
=======================================================
- Coverage                93.38%   92.23%   -1.16%     
=======================================================
  Files                      479      489      +10     
  Lines                    39702    40593     +891     
=======================================================
+ Hits                     37077    37439     +362     
- Misses                    2625     3154     +529     
| Impacted Files | Coverage Δ |
|---|---|
| core/device_hooks/common_kernels.inc.cpp | 0.00% <0.00%> (ø) |
| core/test/utils.hpp | 100.00% <ø> (ø) |
| include/ginkgo/core/base/mpi.hpp | 92.30% <0.00%> (-3.62%) ⬇️ |
| include/ginkgo/core/base/types.hpp | 92.59% <ø> (ø) |
| include/ginkgo/core/distributed/partition.hpp | 100.00% <ø> (+9.09%) ⬆️ |
| include/ginkgo/core/matrix/dense.hpp | 96.32% <ø> (-0.74%) ⬇️ |
| test/utils/executor.hpp | 22.22% <20.00%> (-2.78%) ⬇️ |
| include/ginkgo/core/distributed/base.hpp | 80.00% <80.00%> (ø) |
| core/distributed/vector.cpp | 83.13% <83.13%> (ø) |
| include/ginkgo/core/base/dense_cache.hpp | 88.88% <88.88%> (ø) |
| ... and 36 more | |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

MarcelKoch and others added 7 commits March 9, 2022 11:13
- small rename
- documentation
- cmake
- tests

Co-authored-by: Yuhsiang Tsai <yhmtsai@gmail.com>
Co-authored-by: Yuhsiang Tsai <yhmtsai@gmail.com>
Co-authored-by: Tobias Ribizel <ribizel@kit.edu>
- formatting
- fix DenseCache::init_from
- fix tests if comm.size != 3

Co-authored-by: Yuhsiang Tsai <yhmtsai@gmail.com>
this adds in turn mutable access through get_local_values and at_local

Co-authored-by: Tobias Ribizel <ribizel@kit.edu>
@MarcelKoch MarcelKoch force-pushed the distributed-vector branch from 6e9016e to 72944ef Compare March 9, 2022 10:14
@ginkgo-bot
Member

Note: This PR changes the Ginkgo ABI:

Functions changes summary: 0 Removed, 0 Changed (40 filtered out), 48 Added functions
Variables changes summary: 0 Removed, 0 Changed, 0 Added variable

For details check the full ABI diff under Artifacts here

@sonarqubecloud

sonarqubecloud bot commented Mar 9, 2022

Kudos, SonarCloud Quality Gate passed!

Bugs: 0 (rating A)
Vulnerabilities: 0 (rating A)
Security Hotspots: 0 (rating A)
Code Smells: 50 (rating A)

Coverage: 87.9%
Duplication: 4.1%

@MarcelKoch MarcelKoch merged commit 78a8ae7 into distributed-develop Mar 10, 2022
@MarcelKoch MarcelKoch deleted the distributed-vector branch March 10, 2022 10:05
@MarcelKoch MarcelKoch restored the distributed-vector branch April 21, 2022 11:04
MarcelKoch added a commit that referenced this pull request Apr 21, 2022
This PR adds support for a (row-wise) distributed (multi-)vector. It supports most operations of the dense class. These vector operations are supported on all devices that support the corresponding dense operation. The initialization through `read_distributed` is only supported on reference and openmp.

Related PR: #961
Related PR: #1030
MarcelKoch added a commit that referenced this pull request May 4, 2022
MarcelKoch added a commit that referenced this pull request May 23, 2022
MarcelKoch added a commit that referenced this pull request Jun 2, 2022
MarcelKoch added a commit that referenced this pull request Jul 8, 2022
MarcelKoch added a commit that referenced this pull request Aug 16, 2022
MarcelKoch added a commit that referenced this pull request Sep 28, 2022
This PR will enable using distributed matrices and vectors (#971 and #961) in the following iterative solvers:
- Bicgstab
- Cg
- Cgs
- Fcg
- Ir

Currently not supported are:
- Bicg
- [cb_]Gmres
- Idr
- Multigrid
- Lower/Upper_trs

The handling of the distributed/non-distributed data is done via additional dispatch routines that expand on precision_dispatch_real_complex, and helper routines to extract the underlying dense matrix from either a distributed or dense vector. Also, the residual norm stopping criteria implementation has been changed to also use a similar dispatch approach.

This also contains some fixes regarding the doxygen documentation for the other distributed classes.

Related PR: #976
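
The idea of a helper that extracts the underlying dense data from either a distributed or a dense vector can be sketched as below. This is a simplified illustration with hypothetical stand-in classes (`LinOp`, `Dense`, `DistributedVector`, `get_local_dense`), not the dispatch routines added in the PR, and it ignores the precision and real/complex dispatch mentioned above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for the class hierarchy; not Ginkgo's types.
struct LinOp {
    virtual ~LinOp() = default;
};

struct Dense : LinOp {
    std::vector<double> values;
};

struct DistributedVector : LinOp {
    Dense local;  // rows owned by this rank
};

// Returns the dense data a local kernel should operate on: the local block
// of a distributed vector, or the object itself if it is already dense.
// Returns nullptr for any other type.
const Dense* get_local_dense(const LinOp* op)
{
    if (auto dist = dynamic_cast<const DistributedVector*>(op)) {
        return &dist->local;
    }
    return dynamic_cast<const Dense*>(op);
}

// Example of solver-side code that works for both vector types.
std::size_t local_size(const LinOp* op)
{
    const Dense* dense = get_local_dense(op);
    assert(dense != nullptr);
    return dense->values.size();
}
```

With such a helper, the local part of an update can share one code path for both vector types, while global reductions still need the distributed type to trigger communication.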
MarcelKoch added a commit that referenced this pull request Oct 5, 2022
MarcelKoch added a commit that referenced this pull request Oct 5, 2022
MarcelKoch added a commit that referenced this pull request Oct 26, 2022
MarcelKoch added a commit that referenced this pull request Oct 26, 2022
MarcelKoch added a commit that referenced this pull request Oct 31, 2022
MarcelKoch added a commit that referenced this pull request Oct 31, 2022
MarcelKoch added a commit that referenced this pull request Oct 31, 2022
This PR will add basic, distributed data structures (matrix and vector), and enable some solvers for these types. This PR contains the following PRs:
- #961
- #971 
- #976 
- #985 
- #1007 
- #1030 
- #1054

# Additional Changes

- moves new types into experimental namespace
- moves existing Partition class into experimental namespace
- moves existing mpi namespace into experimental namespace
- makes generic_scoped_device_id_guard destructor noexcept by terminating if restoring the original device id fails
- switches to blocking communication in the SpMV if OpenMPI version 4.0.x is used
- disables Horeka mpi tests and uses nla-gpu instead

Related PR: #1133
tcojean pushed a commit that referenced this pull request Nov 12, 2022
Advertise release 1.5.0 and last changes

+ Add changelog,
+ Update third party libraries
+ A small fix to a CMake file

See PR: #1195

The Ginkgo team is proud to announce the new Ginkgo minor release 1.5.0. This release brings many important new features such as:
- MPI-based multi-node support for all matrix formats and most solvers,
- full DPC++/SYCL support,
- functionality and interface for GPU-resident sparse direct solvers,
- an interface for wrapping solvers with scaling and reordering applied,
- a new algebraic Multigrid solver/preconditioner,
- improved mixed-precision support,
- support for device matrix assembly,

and much more.

If you face an issue, please first check our [known issues page](https://github.com/ginkgo-project/ginkgo/wiki/Known-Issues) and the [open issues list](https://github.com/ginkgo-project/ginkgo/issues) and if you do not find a solution, feel free to [open a new issue](https://github.com/ginkgo-project/ginkgo/issues/new/choose) or ask a question using the [github discussions](https://github.com/ginkgo-project/ginkgo/discussions).

Supported systems and requirements:
+ For all platforms, CMake 3.13+
+ C++14 compliant compiler
+ Linux and macOS
  + GCC: 5.5+
  + clang: 3.9+
  + Intel compiler: 2018+
  + Apple LLVM: 8.0+
  + NVHPC: 22.7+
  + Cray Compiler: 14.0.1+
  + CUDA module: CUDA 9.2+ or NVHPC 22.7+
  + HIP module: ROCm 4.0+
  + DPC++ module: Intel OneAPI 2021.3 with oneMKL and oneDPL. Set the CXX compiler to `dpcpp`.
+ Windows
  + MinGW and Cygwin: GCC 5.5+
  + Microsoft Visual Studio: VS 2019
  + CUDA module: CUDA 9.2+, Microsoft Visual Studio
  + OpenMP module: MinGW or Cygwin.


Algorithm and important feature additions:
+ Add MPI-based multi-node support for all matrix formats and solvers (except GMRES and IDR). ([#676](#676), [#908](#908), [#909](#909), [#932](#932), [#951](#951), [#961](#961), [#971](#971), [#976](#976), [#985](#985), [#1007](#1007), [#1030](#1030), [#1054](#1054), [#1100](#1100), [#1148](#1148))
+ Port the remaining algorithms (preconditioners like ISAI, Jacobi, Multigrid, ParILU(T) and ParIC(T)) to DPC++/SYCL, update to SYCL 2020, and improve support and performance ([#896](#896), [#924](#924), [#928](#928), [#929](#929), [#933](#933), [#943](#943), [#960](#960), [#1057](#1057), [#1110](#1110),  [#1142](#1142))
+ Add a Sparse Direct interface supporting GPU-resident numerical LU factorization, symbolic Cholesky factorization, improved triangular solvers, and more ([#957](#957), [#1058](#1058), [#1072](#1072), [#1082](#1082))
+ Add a ScaleReordered interface that can wrap solvers and automatically apply reorderings and scalings ([#1059](#1059))
+ Add a Multigrid solver and improve the aggregation based PGM coarsening scheme ([#542](#542), [#913](#913), [#980](#980), [#982](#982),  [#986](#986))
+ Add infrastructure for unified, lambda-based, backend agnostic, kernels and utilize it for some simple kernels ([#833](#833), [#910](#910), [#926](#926))
+ Merge different CUDA, HIP, DPC++ and OpenMP tests under a common interface ([#904](#904), [#973](#973), [#1044](#1044), [#1117](#1117))
+ Add a device_matrix_data type for device-side matrix assembly ([#886](#886), [#963](#963), [#965](#965))
+ Add support for mixed real/complex BLAS operations ([#864](#864))
+ Add a FFT LinOp for all but DPC++/SYCL ([#701](#701))
+ Add FBCSR support for NVIDIA and AMD GPUs and CPUs with OpenMP ([#775](#775))
+ Add CSR scaling ([#848](#848))
+ Add array::const_view and equivalent to create constant matrices from non-const data ([#890](#890))
+ Add a RowGatherer LinOp supporting mixed precision to gather dense matrix rows ([#901](#901))
+ Add mixed precision SparsityCsr SpMV support ([#970](#970))
+ Allow creating CSR submatrix including from (possibly discontinuous) index sets ([#885](#885), [#964](#964))
+ Add a scaled identity addition (M <- aI + bM) feature interface and impls for Csr and Dense ([#942](#942))


Deprecations and important changes:
+ Deprecate AmgxPgm in favor of the new Pgm name. ([#1149](#1149)).
+ Deprecate specialized residual norm classes in favor of a common `ResidualNorm` class ([#1101](#1101))
+ Deprecate CamelCase non-polymorphic types in favor of snake_case versions (like array, machine_topology, uninitialized_array, index_set) ([#1031](#1031), [#1052](#1052))
+ Bug fix: restrict gko::share to rvalue references (*possible interface break*) ([#1020](#1020))
+ Bug fix: when using cuSPARSE's triangular solvers, specifying the factory parameter `num_rhs` is now required when solving for more than one right-hand side, otherwise an exception is thrown ([#1184](#1184)).
+ Drop official support for old CUDA < 9.2 ([#887](#887))


Improved performance additions:
+ Reuse tmp storage in reductions in solvers and add a mutable workspace to all solvers ([#1013](#1013), [#1028](#1028))
+ Add HIP unsafe atomic option for AMD ([#1091](#1091))
+ Prefer vendor implementations for Dense dot, conj_dot and norm2 when available ([#967](#967)).
+ Tuned OpenMP SellP, COO, and ELL SpMV kernels for a small number of RHS ([#809](#809))


Fixes:
+ Fix various compilation warnings ([#1076](#1076), [#1183](#1183), [#1189](#1189))
+ Fix issues with hwloc-related tests ([#1074](#1074))
+ Fix include headers for GCC 12 ([#1071](#1071))
+ Fix for simple-solver-logging example ([#1066](#1066))
+ Fix for potential memory leak in Logger ([#1056](#1056))
+ Fix logging of mixin classes ([#1037](#1037))
+ Improve value semantics for LinOp types, like moved-from state in cross-executor copy/clones ([#753](#753))
+ Fix some matrix SpMV and conversion corner cases ([#905](#905), [#978](#978))
+ Fix uninitialized data ([#958](#958))
+ Fix CUDA version requirement for cusparseSpSM ([#953](#953))
+ Fix several issues within bash-script ([#1016](#1016))
+ Fixes for `NVHPC` compiler support ([#1194](#1194))


Other additions:
+ Simplify and properly name GMRES kernels ([#861](#861))
+ Improve pkg-config support for non-CMake libraries ([#923](#923), [#1109](#1109))
+ Improve gdb pretty printer ([#987](#987), [#1114](#1114))
+ Add a logger highlighting inefficient allocation and copy patterns ([#1035](#1035))
+ Improved and optimized test random matrix generation ([#954](#954), [#1032](#1032))
+ Better CSR strategy defaults ([#969](#969))
+ Add `move_from` to `PolymorphicObject` ([#997](#997))
+ Remove unnecessary device_guard usage ([#956](#956))
+ Improvements to the generic accessor for mixed-precision ([#727](#727))
+ Add a naive lower triangular solver implementation for CUDA ([#764](#764))
+ Add support for int64 indices from CUDA 11 onward with SpMV and SpGEMM ([#897](#897))
+ Add a L1 norm implementation ([#900](#900))
+ Add reduce_add for arrays ([#831](#831))
+ Add utility to simplify Dense View creation from an existing Dense vector ([#1136](#1136)).
+ Add a custom transpose implementation for Fbcsr and Csr transpose for unsupported vendor types ([#1123](#1123))
+ Make IDR random initialization deterministic ([#1116](#1116))
+ Move the algorithm choice for triangular solvers from Csr::strategy_type to a factory parameter ([#1088](#1088))
+ Update CUDA archCoresPerSM ([#1175](#1116))
+ Add kernels for Csr sparsity pattern lookup ([#994](#994))
+ Differentiate between structural and numerical zeros in Ell/Sellp ([#1027](#1027))
+ Add a binary IO format for matrix data ([#984](#984))
+ Add a tuple zip_iterator implementation ([#966](#966))
+ Simplify kernel stubs and declarations ([#888](#888))
+ Simplify GKO_REGISTER_OPERATION with lambdas ([#859](#859))
+ Simplify copy to device in tests and examples ([#863](#863))
+ More verbose output to array assertions ([#858](#858))
+ Allow parallel compilation for Jacobi kernels ([#871](#871))
+ Change clang-format pointer alignment to left ([#872](#872))
+ Various improvements and fixes to the benchmarking framework ([#750](#750), [#759](#759), [#870](#870), [#911](#911), [#1033](#1033), [#1137](#1137))
+ Various documentation improvements ([#892](#892), [#921](#921), [#950](#950), [#977](#977), [#1021](#1021), [#1068](#1068), [#1069](#1069), [#1080](#1080), [#1081](#1081), [#1108](#1108), [#1153](#1153), [#1154](#1154))
+ Various CI improvements ([#868](#868), [#874](#874), [#884](#884), [#889](#889), [#899](#899), [#903](#903),  [#922](#922), [#925](#925), [#930](#930), [#936](#936), [#937](#937), [#958](#958), [#882](#882), [#1011](#1011), [#1015](#1015), [#989](#989), [#1039](#1039), [#1042](#1042), [#1067](#1067), [#1073](#1073), [#1075](#1075), [#1083](#1083), [#1084](#1084), [#1085](#1085), [#1139](#1139), [#1178](#1178), [#1187](#1187))