-
Notifications
You must be signed in to change notification settings - Fork 99
Merge dot products in PIPECG #1908
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
part 1/2. Still need to look at the algorithm itself.
f24265c
to
23a35a3
Compare
Addressed the suggestions yesterday, rebased now. |
23a35a3
to
4c7a3d5
Compare
b3aee64
to
c2be7cb
Compare
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## develop #1908 +/- ##
===========================================
+ Coverage 88.74% 89.11% +0.36%
===========================================
Files 857 857
Lines 71844 71839 -5
===========================================
+ Hits 63761 64020 +259
+ Misses 8083 7819 -264 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM in general.
- use submatrix creation
Performance concern: - z1 z2 can be one vector after we have conjugate spmv
- some access will be strided access
core/solver/pipe_cg.cpp
Outdated
|
||
LocalVector* rw = this->template create_workspace_op<LocalVector>( | ||
GKO_SOLVER_TRAITS::rw, conjoined_size); | ||
auto r_unique = LocalVector::create( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use create_submatrix for right setup.
second one create array view include the non-existed memory
core/solver/pipe_cg.cpp
Outdated
rw->get_values()), | ||
b_stride * 2); | ||
auto* r = r_unique.get(); | ||
auto w_unique = LocalVector::create( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here and following
beta, gko::detail::get_local(p), gko::detail::get_local(q), | ||
gko::detail::get_local(f), gko::detail::get_local(g), | ||
gko::detail::get_local(z), gko::detail::get_local(w), | ||
gko::detail::get_local(z1), gko::detail::get_local(w), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note z1 is strided access which might lower your performance
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you think there's a way to avoid this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
currently no under this storage. You need change z storage and it will also require change the other to fit the need
reference/solver/pipe_cg_kernels.cpp
Outdated
z1->at(i, j) -= tmp * f->at(i, j); | ||
z2->at(i, j) -= tmp * f->at(i, j); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
because z2 is always sync with z1, z2 = z1 should be better
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it still use z2 -= tmp * f
ab40d16
to
995969d
Compare
6a73634
to
66f4ebe
Compare
66f4ebe
to
1da0064
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work! LGTM!
7c24aea
to
781631d
Compare
@pratikvn thanks and also thanks for helping me! I rebased the PR |
Is it ready to review? and do you have performance comparison? |
781631d
to
c510304
Compare
// GKO_SOLVER_VECTOR(r, dense_b); | ||
// GKO_SOLVER_VECTOR(w, dense_b); | ||
// into rw that we later slice for efficient dot product computation | ||
auto b_stride = dense_b->get_stride(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you need to use the stride from b?
the rw vectors actually have 2 * b_stride not b_stride.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, b_stride is used, say, here:
auto* r = r_unique.get();
auto w_unique = rw->create_submatrix(
local_span{0, local_original_size[0]},
local_span{b_stride, b_stride + local_original_size[1]},
global_original_size);
auto* w = w_unique.get();
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we multiply by 2 when needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I know where b_stride is from.
My question is more on whether to use b_stride or use the #vectors as stride.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as Pratik told me today, the b_stride is fine bc of padding and it's probably equivalent to #vectors most of the time. I used b_stride because dense_b is a vector so mathematically it made sense to me to use the dimensions of the original vector as reference for a size variable used to create other vectors
reference/solver/pipe_cg_kernels.cpp
Outdated
z1->at(i, j) -= tmp * f->at(i, j); | ||
z2->at(i, j) -= tmp * f->at(i, j); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it still use z2 -= tmp * f
test/solver/solver.cpp
Outdated
struct PipeCg : SimpleSolverTest<gko::solver::PipeCg<solver_value_type>> { | ||
static double tolerance() { return 1e7 * r<value_type>::value; } | ||
|
||
static constexpr bool will_not_allocate() { return false; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why will it allocate some in the second round?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what do you mean by "the second round"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when will_not_allocate
is false, ApplyDoesntAllocateRepeatedly will be skipped.
The test is to check whether we have additional allocation in the second round of apply.
The first apply usually has the workspace apply but we try to avoid allocate/free in the second round of apply
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll remove this so that we can see and debug if possible unwanted behaviour here
core/solver/pipe_cg.cpp
Outdated
auto& reduction_tmp = this->template create_workspace_array<char>( | ||
GKO_SOLVER_TRAITS::tmp, 2 * global_original_size[1]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
auto& reduction_tmp = this->template create_workspace_array<char>( | |
GKO_SOLVER_TRAITS::tmp, 2 * global_original_size[1]); | |
auto& reduction_tmp = this->template create_workspace_array<char>( | |
GKO_SOLVER_TRAITS::tmp); |
reduction tmp will need to go through the reduction step to know proper size, so we do not need to initialize a size for that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good work! The code looks good to me.
I am still not sure whether it gives the performance in practice by this version between the stride and communication cost.
|
the typo check fails but I don't think it has anything to do with this PR, so I'm not sure if this needs to be fixed here |
I think you can ignore the typo check now. I will create another pr to fix them |
This PR continues on the work done in PIPECG by making it fulfill its initial purpose of reducing the number of dot products in the algorithm by merging them into a singe dot operation.