spiluk: Limit memory usage for iw work buffer based on input matrix s… #2753
base: develop
Conversation
  KokkosKernels::Impl::kk_get_free_total_memory<memory_space>(free_byte, total_byte);
- avail_byte = static_cast<size_t>(0.85 * static_cast<double>(free_byte) / static_cast<double>(nstreams));
+ double orig_matrix_bytes = entries.extent(0) * 16;
+ avail_byte = static_cast<size_t>(std::min(0.85 * static_cast<double>(free_byte), orig_matrix_bytes) / static_cast<double>(nstreams));
@vbrunini I am just concerned that when we have a matrix with few nnz, so that orig_matrix_bytes is always smaller than free_byte, we will end up with a lot of chunks, which may slow down the spiluk numeric. That said, I have not thought of a better estimate than using orig_matrix_bytes. Have you seen any performance change on your side with this change?
I have not done any real performance comparison (just the "infinite speedup" of the case I was looking at, which previously failed to run due to OOM and now finishes).
A few other potential options I've thought about or tested:
- I did try removing the persistent iw View from the IlukHandle entirely and instead used a view from team.team_scratch(1). This also ran successfully, but the spiluk numeric was ~2x slower, I think because each team had to reset that scratch view to -1 at the start of each iteration, rather than only resetting the portion of iw that was modified for the L & U entry columns at the end of each iteration.
- Similar to using a View from team scratch, I think we could potentially use a persistent View sized based on the execution space concurrency rather than the available memory, then use Kokkos' UniqueToken for each team to get a unique index into that persistent View. Potentially that could still cause memory issues if nrows is large enough, though.
- If iw was local to the numeric phase rather than being persistent on the iluk handle, then I think the current heuristic based on the available memory could be used without causing memory usage issues for other portions of applications. The downside would be the cost of allocating, initializing, and freeing iw on every call to compute.
- We could add a floor here so that if the original matrix is, say, < 512 MB, we allow iw to be up to 512 MB, assuming that is small enough to not cause issues. I'm not sure exactly how to pick that threshold, though.
I see you picked a floor of 128 MB and I am fine with it. This PR looks good to me for now. Next FY, I will probably look into performance optimization for spiluk more carefully.
  size_t free_byte, total_byte;
  KokkosKernels::Impl::kk_get_free_total_memory<memory_space>(free_byte, total_byte);
  avail_byte = static_cast<size_t>(0.85 * static_cast<double>(free_byte) / static_cast<double>(nstreams));
  double orig_matrix_bytes = entries.extent(0) * 16;
@vbrunini Thanks for having the fix. Why did you choose 16? Is that based on the double complex type? I am fine with it, but since the iw view's value_type is nnz_lno_t, I feel like sizeof(nnz_lno_t) would be more reasonable.
It was a quick estimate of the memory usage for the input matrix assuming double values (8 bytes per value and 8 bytes per column index, since I wasn't sure offhand whether the column indices were 32- or 64-bit here).
Got it. Can we use 16 + 8 instead to account for complex type?
@vbrunini we will also need you to sign off on the commit. You can also apply the format diff printed out in the failing format check, or apply clang-format 16 yourself.
…ize. Observed some application use cases where this buffer was using > 10x the amount of memory needed for the matrix being factorized, which led to OOMs later in the run because insufficient memory was available. Signed-off-by: Victor Brunini <vebruni@sandia.gov>
Force-pushed from 1c939ca to be30169.
Looks good to me. Thanks for having this fix, @vbrunini !
Signed-off-by: Victor Brunini <vebruni@sandia.gov>
Force-pushed from 45511bd to e86e210.
@vqd8a @cwpearson @lucbv can someone add a new approval for the CI to run? I had to fix some issues in a couple of the builds.
Force-pushed from 1009492 to 24ee844.
That DCO check keeps biting me :(
Yeah, sorry about that. It becomes a habit eventually...
In … That should automatically apply the signoff for you on all your commits; that's how I handled this.
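For anyone hitting the same DCO failure, one common way to apply the signoff retroactively is a signoff rebase (a sketch; the commit range HEAD~3 is a hypothetical example, adjust it to cover the unsigned commits):

```shell
# Replay the last 3 commits with --signoff, which appends a
# "Signed-off-by:" trailer built from your git user.name / user.email.
git rebase --signoff HEAD~3
# Rewriting history requires a force push to update the PR branch.
git push --force-with-lease
```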
Based on the discussion above this looks fine to me. My only question is whether this could lead to not enough memory being allocated? I have not read the details of the spiluk algorithm so I have not assessed it; if that's the case, we will want to find a way to allow users to toggle between the new and old versions.
…ize.
Observed some application use cases where this buffer was using > 10x the amount of memory needed for the matrix being factorized, which led to OOMs later in the run because insufficient memory was available.
This is one potential solution to #2752 @vqd8a @lucbv