Skip to content

Conversation

gojakuch
Copy link
Collaborator

@gojakuch gojakuch commented Aug 6, 2025

This PR continues on the work done in PIPECG by making it fulfill its initial purpose of reducing the number of dot products in the algorithm by merging them into a singe dot operation.

@gojakuch gojakuch added the 1:ST:WIP This PR is a work in progress. Not ready for review. label Aug 6, 2025
@ginkgo-bot ginkgo-bot added reg:testing This is related to testing. mod:core This is related to the core module. mod:cuda This is related to the CUDA module. mod:reference This is related to the reference module. type:solver This is related to the solvers mod:hip This is related to the HIP module. labels Aug 6, 2025
@pratikvn pratikvn self-requested a review August 7, 2025 07:53
Copy link
Member

@pratikvn pratikvn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

part 1/2. Still need to look at the algorithm itself.

@gojakuch gojakuch force-pushed the feat/pipe-cg-dotmerge branch 2 times, most recently from f24265c to 23a35a3 Compare August 9, 2025 19:23
@gojakuch
Copy link
Collaborator Author

gojakuch commented Aug 9, 2025

Addressed the suggestions yesterday, rebased now.
It fails the test comparing the reference implementation to omp on my machine, which is very odd, as they seem to do the same thing. the reference implementation is now totally correct though, it passes the tests. somehow smth is wrong with step_1, still debugging it

@gojakuch gojakuch force-pushed the feat/pipe-cg-dotmerge branch from 23a35a3 to 4c7a3d5 Compare August 15, 2025 18:34
@gojakuch gojakuch requested a review from pratikvn August 15, 2025 18:36
@gojakuch gojakuch added 1:ST:ready-for-review This PR is ready for review 1:ST:run-full-test and removed 1:ST:WIP This PR is a work in progress. Not ready for review. labels Aug 18, 2025
@gojakuch gojakuch changed the title Merge the dot products in PIPECG Merge dot products in PIPECG Aug 18, 2025
@gojakuch gojakuch force-pushed the feat/pipe-cg-dotmerge branch from b3aee64 to c2be7cb Compare August 30, 2025 20:10
Copy link

codecov bot commented Sep 2, 2025

Codecov Report

❌ Patch coverage is 96.87500% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.11%. Comparing base (6e06b0a) to head (734b3fb).
⚠️ Report is 5 commits behind head on develop.

Files with missing lines Patch % Lines
core/solver/pipe_cg.cpp 93.87% 3 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #1908      +/-   ##
===========================================
+ Coverage    88.74%   89.11%   +0.36%     
===========================================
  Files          857      857              
  Lines        71844    71839       -5     
===========================================
+ Hits         63761    64020     +259     
+ Misses        8083     7819     -264     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Member

@yhmtsai yhmtsai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM in general.

  • use submatrix creation
    Performance concern:
  • z1 z2 can be one vector after we have conjugate spmv
  • some access will be strided access


LocalVector* rw = this->template create_workspace_op<LocalVector>(
GKO_SOLVER_TRAITS::rw, conjoined_size);
auto r_unique = LocalVector::create(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use create_submatrix for right setup.
second one create array view include the non-existed memory

rw->get_values()),
b_stride * 2);
auto* r = r_unique.get();
auto w_unique = LocalVector::create(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here and following

beta, gko::detail::get_local(p), gko::detail::get_local(q),
gko::detail::get_local(f), gko::detail::get_local(g),
gko::detail::get_local(z), gko::detail::get_local(w),
gko::detail::get_local(z1), gko::detail::get_local(w),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note z1 is strided access which might lower your performance

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you think there's a way to avoid this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

currently no under this storage. You need change z storage and it will also require change the other to fit the need

Comment on lines 103 to 104
z1->at(i, j) -= tmp * f->at(i, j);
z2->at(i, j) -= tmp * f->at(i, j);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because z2 is always sync with z1, z2 = z1 should be better

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it still use z2 -= tmp * f

yhmtsai
yhmtsai previously requested changes Sep 3, 2025
Copy link
Member

@pratikvn pratikvn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! LGTM!

@gojakuch gojakuch force-pushed the feat/pipe-cg-dotmerge branch from 7c24aea to 781631d Compare September 23, 2025 22:24
@gojakuch
Copy link
Collaborator Author

Nice work! LGTM!

@pratikvn thanks and also thanks for helping me! I rebased the PR

@yhmtsai
Copy link
Member

yhmtsai commented Sep 24, 2025

Is it ready to review? and do you have performance comparison?

@gojakuch gojakuch force-pushed the feat/pipe-cg-dotmerge branch from 781631d to c510304 Compare October 6, 2025 07:29
// GKO_SOLVER_VECTOR(r, dense_b);
// GKO_SOLVER_VECTOR(w, dense_b);
// into rw that we later slice for efficient dot product computation
auto b_stride = dense_b->get_stride();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you need to use the stride from b?
the rw vectors actually have 2 * b_stride not b_stride.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, b_stride is used, say, here:

auto* r = r_unique.get();
    auto w_unique = rw->create_submatrix(
        local_span{0, local_original_size[0]},
        local_span{b_stride, b_stride + local_original_size[1]},
        global_original_size);
    auto* w = w_unique.get();

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we multiply by 2 when needed

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I know where b_stride is from.
My question is more on whether to use b_stride or use the #vectors as stride.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as Pratik told me today, the b_stride is fine bc of padding and it's probably equivalent to #vectors most of the time. I used b_stride because dense_b is a vector so mathematically it made sense to me to use the dimensions of the original vector as reference for a size variable used to create other vectors

Comment on lines 103 to 104
z1->at(i, j) -= tmp * f->at(i, j);
z2->at(i, j) -= tmp * f->at(i, j);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it still use z2 -= tmp * f

struct PipeCg : SimpleSolverTest<gko::solver::PipeCg<solver_value_type>> {
static double tolerance() { return 1e7 * r<value_type>::value; }

static constexpr bool will_not_allocate() { return false; }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why will it allocate some in the second round?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what do you mean by "the second round"?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when will_not_allocate is false, ApplyDoesntAllocateRepeatedly will be skipped.
The test is to check whether we have additional allocation in the second round of apply.
The first apply usually has the workspace apply but we try to avoid allocate/free in the second round of apply

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll remove this so that we can see and debug if possible unwanted behaviour here

Comment on lines 173 to 174
auto& reduction_tmp = this->template create_workspace_array<char>(
GKO_SOLVER_TRAITS::tmp, 2 * global_original_size[1]);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
auto& reduction_tmp = this->template create_workspace_array<char>(
GKO_SOLVER_TRAITS::tmp, 2 * global_original_size[1]);
auto& reduction_tmp = this->template create_workspace_array<char>(
GKO_SOLVER_TRAITS::tmp);

reduction tmp will need to go through the reduction step to know proper size, so we do not need to initialize a size for that

Copy link
Member

@yhmtsai yhmtsai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good work! The code looks good to me.
I am still not sure whether it gives the performance in practice by this version between the stride and communication cost.

Copy link

@gojakuch
Copy link
Collaborator Author

@yhmtsai @pratikvn I'm flagging this as ready to merge then

@gojakuch gojakuch added 1:ST:ready-to-merge This PR is ready to merge. and removed 1:ST:ready-for-review This PR is ready for review labels Oct 17, 2025
@gojakuch
Copy link
Collaborator Author

the typo check fails but I don't think it has anything to do with this PR, so I'm not sure if this needs to be fixed here

@yhmtsai
Copy link
Member

yhmtsai commented Oct 20, 2025

I think you can ignore the typo check now. I will create another pr to fix them

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

1:ST:ready-to-merge This PR is ready to merge. 1:ST:run-full-test mod:core This is related to the core module. mod:cuda This is related to the CUDA module. mod:hip This is related to the HIP module. mod:reference This is related to the reference module. reg:testing This is related to testing. type:solver This is related to the solvers

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants