Skip to content

Conversation

gojakuch
Copy link
Collaborator

@gojakuch gojakuch commented Apr 9, 2025

This PR adds the reference implementation for the pipelined CG method.

@gojakuch gojakuch added mod:reference This is related to the reference module. type:solver This is related to the solvers 1:ST:ready-for-review This PR is ready for review labels Apr 9, 2025
@gojakuch
Copy link
Collaborator Author

gojakuch commented Apr 9, 2025

I've put the "Ready for review" label, but I'm not sure if I should've put the "Requires feedback" one instead, since this is my first PR here and it's hard for me to assess the state of it. any feedback would be considered of course :)

@pratikvn pratikvn self-requested a review April 10, 2025 08:47
Copy link
Member

@pratikvn pratikvn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! Mostly looks good to me. Only the documentation needs to be updated.

Copy link
Member

@pratikvn pratikvn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! Thanks for making the changes. One last thing before we merge it is that you add yourself to the contributors list here after Marcel Koch: https://github.com/ginkgo-project/ginkgo/blob/develop/contributors.txt#L20 with the following info. LastName FirstName <email> Affiliation

Copy link
Member

@pratikvn pratikvn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thank you for your contribution!

@pratikvn pratikvn added 1:ST:ready-to-merge This PR is ready to merge. and removed 1:ST:ready-for-review This PR is ready for review labels Apr 11, 2025
Copy link
Member

@yhmtsai yhmtsai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to add small section into test/test_install/test_install.cpp

#include "core/test/utils.hpp"


namespace {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
namespace {

we do not need the anonymous namespace for test now

}


} // namespace
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

this->small_n.get(), this->small_prev_rho.get(), this->small_rho.get(),
this->small_delta.get(), &this->small_stop);

GKO_ASSERT_MTX_NEAR(this->small_beta, this->small_delta, 0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also need to test the other vector

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still miss the other vector

Comment on lines 172 to 180
this->small_x->fill(1);
this->small_r->fill(2);
this->small_z->fill(3);
this->small_w->fill(4);

this->small_p->fill(4);
this->small_q->fill(3);
this->small_f->fill(2);
this->small_g->fill(1);

this->small_rho->at(0) = 2;
this->small_rho->at(1) = 3;
this->small_beta->at(0) = 8;
this->small_beta->at(1) = 3;
this->small_stop.get_data()[1] = this->stopped;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

follow AAA rules, but you can use comment to distinguish the group

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same for the rest of test in this file

gko::stop::ResidualNorm<value_type>::build()
.with_reduction_factor(r<value_type>::value))
.on(exec)),
mtx_big(gko::initialize<Mtx>(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the actual initiation follows the member declaration.
here you need to make it to follow the same order of member.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not fixed yet. you need to move pipe_cg_factory after mtx_big

solver->apply(b, x);

GKO_ASSERT_MTX_NEAR(x, l({33.0, -56.0, 81.0, -30.0, 21.0, 40.0}),
r<value_type>::value * 2 * 1e5);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is 2000 times of CG test. it sounds like we need to check something again.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a weird thing that I don't get. the numbers that my PIPECG implementation outputs here are the same as the tested values, and yet it fails this check. I explained this to Pratik and he fixed it by adjusting these tolerances. I've got no other idea what that could be. the output numbers are literally the same here. any suggestions?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you mean x is [33.0, -56.0, 81.0, -30.0, 21.0, 40.0]' on output?

Copy link
Collaborator Author

@gojakuch gojakuch Apr 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah nevermind, I misinterpreted some things in my test example, it's out of the tolerance bounds if we don't change them

}


} // namespace
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

&stop_status));
// r = r - Ax
this->get_system_matrix()->apply(neg_one_op, dense_x, one_op, r);
// z = preconditioner * r
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if there is no specific reason, I will suggest the variable follows the paper notation.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you're talking about variable naming, Ginkgo I chose names compatible with the CG implementation in Ginkgo. the naming in the paper contradicts the naming in CG in Ginkgo

// delta = dot(w, z)
w->compute_conj_dot(z, delta, reduction_tmp);
// check
++iter;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the criterion check should be earlier?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in my understanding, the check depends on the value of rho, which is computed right before the delta calculation. since the computation of delta and rho will be merged in upcoming PRs, the check is placed after we compute both of them. or I don't get why should it be placed earlier and where? the results seem to be correct with this

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the first rho is also happened in the line 139 and 141.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so are you suggesting that I put the check before computing rho in the loop?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one additional check after the first rho computation

@yhmtsai yhmtsai added 1:ST:ready-for-review This PR is ready for review and removed 1:ST:ready-to-merge This PR is ready to merge. labels Apr 11, 2025
@pratikvn
Copy link
Member

It seems nvhpc compilers still have some issues with the pipecg tolerances, so it might need to be increased even more: https://gitlab.com/ginkgo-project/ginkgo-public-ci/-/jobs/9693465904

@gojakuch
Copy link
Collaborator Author

gojakuch commented Apr 11, 2025

It seems nvhpc compilers still have some issues with the pipecg tolerances, so it might need to be increased even more: https://gitlab.com/ginkgo-project/ginkgo-public-ci/-/jobs/9693465904

I think @yhmtsai suggests that we find another solution. I don't know what can be done

@pratikvn
Copy link
Member

The tolerance problem is not limited to PipeCG, but is prevalent across all solvers in Ginkgo. We have arbitrarily increased the tolerances for different solvers to ensure that they pass the tests. The matrices we use to test the solvers are also for the most part arbitrary (we have some control: hpd-ness, sparsity, but not over condition number for example.)

A proper solution IMO would be to have full control over the matrix and vector inputs:

  1. Generate matrices with specific condition numbers and eigenvalue distributions.
  2. Define vectors with controlled vector norms.

Then we can correlate solvers with inputs and tailor the inputs rather than tailor the output tolerances. This approach requires a more sophisticated sparse matrix generation, which we currently do not have.

Therefore, IMO, the current approach should be okay, and we should revisit the controlled inputs approach at a later point (with a new PR) and update all the solvers and their tests.

@gojakuch
Copy link
Collaborator Author

@yhmtsai please consider @pratikvn's comment when reviewing. I rebased this and tried to address all the other things you've mentioned except for the tolerances

Copy link
Member

@yhmtsai yhmtsai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not fully agree Pratik's comments.
Pratik's comments should only for the general testing case for different executor.
In the reference test, everything is initialized manually, so everything is under control.
I have checked the implementation with the paper and I do not find any difference between them. Because the paper also showed less convergence than CG, I do not hold this PR due to the relaxed error check on big system.

The change request is for the rest of my comments.

gko::stop::ResidualNorm<value_type>::build()
.with_reduction_factor(r<value_type>::value))
.on(exec)),
mtx_big(gko::initialize<Mtx>(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not fixed yet. you need to move pipe_cg_factory after mtx_big

Comment on lines +30 to +39
stopped{},
non_stopped{},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stopped and non_stopped need to between matrix and factory according to the member declaration

solver->apply(b, x);

GKO_ASSERT_MTX_NEAR(x, l({81.0, 55.0, 45.0, 5.0, 85.0, -10.0}),
r<value_type>::value * 5 * 1e3);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the condition number of mtx_big is around 315.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, so what should I put as the tolerance then?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, I just put it here for the condition reference. from this, we can expect around 315 * r.

this->small_rho->at(1) = 3;
this->small_beta->at(0) = 8;
this->small_beta->at(1) = 3;
this->small_stop.get_data()[1] = this->stopped;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also need to initialize small_stop.get_data()[0]

this->small_w->fill(8);
this->small_m->fill(8);
this->small_n->fill(24);
this->small_rho->fill(8);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unused

this->small_beta->at(1) = 3;
this->small_delta->at(0) = 5;
this->small_delta->at(1) = 6;
this->small_stop.get_data()[1] = this->stopped;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to initialize the small_stop.get_data()[0]

this->small_n.get(), this->small_prev_rho.get(), this->small_rho.get(),
this->small_delta.get(), &this->small_stop);

GKO_ASSERT_MTX_NEAR(this->small_beta, this->small_delta, 0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still miss the other vector


solver->apply(b, x);

GKO_ASSERT_MTX_NEAR(x, l({33.0, -56.0, 81.0, -30.0, 21.0, 40.0}),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest to add a TODO here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so like this or about what?

// TODO: the tolerance is too big. We might need to design better tests by generating matrices with specific condition numbers and eigenvalue distributions and defining vectors with controlled vector norms.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the sentence until ...design better tests should be good enough

@yhmtsai
Copy link
Member

yhmtsai commented Apr 22, 2025

still miss the first check when having the first rho and the test_install

@gojakuch
Copy link
Collaborator Author

still miss the first check when having the first rho and the test_install

sorry, I don't think I understood what you mean here

Copy link
Member

@yhmtsai yhmtsai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it still misses the part in test/test_install/test_install.cpp.
It needs a section like

// core/solver/pipe_cg.hpp
{
    using Solver = gko::solver::PipeCg<>;
    check_solver<Solver>(exec, A_raw, b, x);
}

I also add the place which should contains the check.
you can use only one check but you need to rearrange the algorithm like CG
Otherwise, use one out of the loop and another in the loop after you get the updated residual norm.

Comment on lines 164 to 176
bool all_stopped =
stop_criterion->update()
.num_iterations(iter)
.residual(r)
.implicit_sq_residual_norm(rho)
.solution(dense_x)
.check(RelativeStoppingId, true, &stop_status, &one_changed);
this->template log<log::Logger::iteration_complete>(
this, dense_b, dense_x, iter, r, nullptr, rho, &stop_status,
all_stopped);
if (all_stopped) {
break;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the original place + additional check is better if you do not rearrange the check

@gojakuch
Copy link
Collaborator Author

gojakuch commented May 5, 2025

I think it still misses the part in test/test_install/test_install.cpp. It needs a section like

// core/solver/pipe_cg.hpp
{
    using Solver = gko::solver::PipeCg<>;
    check_solver<Solver>(exec, A_raw, b, x);
}

@yhmtsai As I've mentioned in my comment before: "I think another check is failing now, I suspect that it's because of me adding PipeCG to test_install like Mike suggested, while the unified kernels are not implemented in this PR. I think I'll just add it to the test_install in my next PR with kernel implementation"

meaning that I've tried adding exactly this block of code but a new check started failing after this. I'll try this again now, but I can just do this in my next PR where I implement the kernels. (I've now seen your comment under the next PR suggesting this as well, so I think that's what we are going for 👍 )

as for the check (checking twice), I'm adding this now

Copy link
Member

@yhmtsai yhmtsai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. sorry for missing your comment and confusing suggestion.
Yes, test_install indeed also tests for different backend, so it will be failed when the other backend are enabled.
Only one thing left is the iteration count needs to be moved a little.
Thanks for implementing it!

@gojakuch
Copy link
Collaborator Author

gojakuch commented May 5, 2025

LGTM. sorry for missing your comment and confusing suggestion. Yes, test_install indeed also tests for different backend, so it will be failed when the other backend are enabled. Only one thing left is the iteration count needs to be moved a little. Thanks for implementing it!

@yhmtsai thanks! I've rebased the PR and moved the iter counter before the second check, so that we don't check with iter==0 twice

@MarcelKoch MarcelKoch added this to the Ginkgo 1.10.0 milestone May 6, 2025
@pratikvn pratikvn added 1:ST:ready-to-merge This PR is ready to merge. and removed 1:ST:ready-for-review This PR is ready for review labels May 6, 2025
@pratikvn
Copy link
Member

pratikvn commented May 6, 2025

Unfortunately, it seems nvhpc might be enabling fastmath flags and hence can produce results with lower accuracy: https://gitlab.com/ginkgo-project/ginkgo-public-ci/-/jobs/9940245846

We probably need to further increase the tolerance.

@pratikvn pratikvn merged commit 4015c54 into develop May 8, 2025
13 of 15 checks passed
@pratikvn pratikvn deleted the feat/pipe-cg branch May 8, 2025 09:20
@pratikvn
Copy link
Member

pratikvn commented May 8, 2025

Thank you @gojakuch for your contribution!

Copy link

sonarqubecloud bot commented May 8, 2025

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

1:ST:ready-to-merge This PR is ready to merge. 1:ST:run-full-test mod:reference This is related to the reference module. type:solver This is related to the solvers

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants