-
Notifications
You must be signed in to change notification settings - Fork 90
Optimizing StokesFOResid for LandIce 3D #1048
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. I'm assuming the numbers are on pmgpu? Do you have the numbers on frontier?
Yes, this is on pmgpu. I haven't tried frontier. |
Thanks @OscarAntepara ! |
Right, it's better to write locally in the internal loop than write to device memory outside the loop. That might not be the case on CPU... Oscar could you check cpu performance?
We can probably do that with an inline device function but it might get ugly. We thought by keeping the old implementation it might improve readability (similar to what we do for the optimized gradient). |
Yeah, doing the accumulation locally improves data locality in the GPU which gives a better performance. I will check the cpu performance. |
void StokesFOResid<EvalT, Traits>:: | ||
operator() (const LandIce_3D_Opt_Tag<NumNodes>& tag, const int& cell) const{ | ||
static constexpr int num_nodes = tag.num_nodes; | ||
ScalarT res0[num_nodes]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the code compile if you do the following?
ScalarT res0[num_nodes] = {0};
It should do value-initialization, and init all entries with 0's. Maybe it could save you the loop right below. However, I'm not sure if it works correctly with FadTypes...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it should be okay, we do it in another evaluator and it hasn't caused issues with FadTypes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or default initialization, which probably is the same as initializing it to zero.
ScalarT res0[num_nodes] = {};
Yes, FAD types have default initialization.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this would work too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's right. Both options work! So I'll push ScalarT res0[num_nodes] = {};
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me, thanks Oscar!
OK, it's just that if we need to modify that computation kernel we have to do it in multiple places. |
Right, probably okay to turn into an inline function then. I don't think it will be too complicated in terms of readability. |
For the 16km test with CPUs there is not much difference between the original and the new code. New: |
…o avoid code duplication for the operator not using StereographicMap
I have modified the code to avoid the code duplication mentioned before. |
Thanks! I think it's a reasonable approach. Have you tested it already? I'm not sure if we still have some tests using Tets, in which case you would have to add a specialization for |
I just have tested it for the ant-16km that is just with numNodes=8. Is there a test for LandIce3D with numNodes=4? is that using just tetrahedrons? |
We used to have a lot, but e converted most of them to Wedges. I couldn't find one with a quick search. Can you simply run Albany landice ctests? |
Yeah, I got this:
70% tests passed, 6 tests failed out of 20 Label Time Summary: Total Test time (real) = 890.49 sec The following tests FAILED: |
for 19 and 20, you probably need to put the trilinos libs in your |
Trueee, now I got this: 80% tests passed, 4 tests failed out of 20 Label Time Summary: Total Test time (real) = 521.31 sec The following tests FAILED: |
OK, I don't think that your changes are affecting the unit tests, so I think you are good to go. In case you want to understand what's going on, you can run single tests with verbose output doing: |
If people are curious, I'm seeing the same errors that you have here: https://my.cdash.org/viewTest.php?buildid=2560428 |
Thanks, we should open an issue about those tests failing. |
Those tests are currently not expected to pass since these are uvm-free builds. If you want I can start an issue that tracks the status of uvm-free tests and which are currently known to fail. |
Oh, OK. Should we disable them in UVM-free builds? Anyway, I'm fine either way, and it's OK to do nothing if you plan to make these tests work in UVM-free builds. |
Frontier numbers from Oscar:
New:
|
Optimizing StokesFOResid for LandIce 3D by cleaning the code, removing if statements inside the kernel, doing local accumulation and using compile time variables for the loops.
For the ant-16km test, the original code timings are:
New code timings are: