Skip to content

Conversation

@vbrunini
Copy link

@vbrunini vbrunini commented Sep 3, 2025

…ize.

Observed some application use cases where this buffer was using > 10x the amount of memory needed for the matrix being factorized which lead to OOM's later in the run because insufficient memory was available.

This is one potential solution to #2752 @vqd8a @lucbv

@lucbv lucbv requested review from lucbv and vqd8a September 3, 2025 19:05
KokkosKernels::Impl::kk_get_free_total_memory<memory_space>(free_byte, total_byte);
avail_byte = static_cast<size_t>(0.85 * static_cast<double>(free_byte) / static_cast<double>(nstreams));
double orig_matrix_bytes = entries.extent(0) * 16;
avail_byte = static_cast<size_t>(std::min(0.85 * static_cast<double>(free_byte), orig_matrix_bytes) / static_cast<double>(nstreams));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vbrunini I am just concerned that when we have a matrix with a few nnz so orig_matrix_bytes is always smaller than free_byte, we will end up having a lot of chunks which may slows down the spiluk numeric. Though I have not thought of an estimation better than using orig_matrix_bytes. Have you seen any performance change on your side with this change?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have not done any real performance comparison (just the infinite speedup of the case I was looking at failing to run due to OOM before and now it finishes).

A few other potential options I've thought about tested:

  • I did try removing the persistent iw View from the IlukHandle entirely and instead used a view from team.team_scratch(1). This also ran successfully, but the spiluk numeric was ~2x slower, I think because each team had to reset that scratch view to -1 at the start of each iteration rather than only resetting the portion of iw that was modified for the L & U entry columns at the end of each iteration.
  • Similar to using a View from team scratch I think we could potentially use a persistent View sized based on the execution space concurrency rather than the available memory, then using Kokkos' UniqueToken for each team to get a unique index into that persistent View. Potentially that could still cause memory issues if nrows is large enough though.
  • If iw was local to the numeric phase rather than being persistent on the iluk handle then I think the current heuristic based on the available memory could be used without causing memory usage issues for other portions of applications. The downside would be the cost of allocating, initializing, and freeing iw every call to compute.
  • We could add a floor here so that if the original matrix is say < 512 MB we allow iw to be up to 512 MB assuming that is small enough to not cause issues. Not sure exactly how to pick that threshold though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see you pick a floor of 128MB and I am fine with it. This PR looks good to me for now. Next FY, I will probably look into performance optimization for spiluk more carefully.

size_t free_byte, total_byte;
KokkosKernels::Impl::kk_get_free_total_memory<memory_space>(free_byte, total_byte);
avail_byte = static_cast<size_t>(0.85 * static_cast<double>(free_byte) / static_cast<double>(nstreams));
double orig_matrix_bytes = entries.extent(0) * 16;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vbrunini Thanks for having the fix. Why do you choose 16? Are you based on double complex type? I am fine with it. But since the iw view's value_type is nnz_lno_t, I feel like sizeof(nnz_lno_t) is more reasonable.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was a quick estimate at the memory usage for the input matrix assuming it was double (8 bytes per value and 8 bytes per column index since I wasn't sure off hand if the column indices were 32 or 64 bit here).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Can we use 16 + 8 instead to account for complex type?

@cwpearson cwpearson added the AT2-CI-APPROVAL Approve CI to run at SNL label Sep 4, 2025
@cwpearson
Copy link
Contributor

cwpearson commented Sep 4, 2025

@vbrunini we will also need you to sign off on the commit (git commit --amend -s)

You can also apply the format diff printed out in the failing format check, or apply clang-format 16 yourself.

…ize.

Observed some application use cases where this buffer was using > 10x the
amount of memory needed for the matrix being factorized which lead to OOM's
later in the run because insufficient memory was available.

Signed-off-by: Victor Brunini <vebruni@sandia.gov>
@vbrunini vbrunini force-pushed the spiluk_cuda_memory_usage branch from 1c939ca to be30169 Compare September 8, 2025 15:21
Copy link
Contributor

@vqd8a vqd8a left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Thanks for having this fix, @vbrunini !

@vbrunini vbrunini force-pushed the spiluk_cuda_memory_usage branch from 45511bd to e86e210 Compare September 15, 2025 13:37
@vbrunini
Copy link
Author

@vqd8a @cwpearson @lucbv can someone add a new approval for the CI to run? Had to fix some issues in a couple of the builds.

@cwpearson cwpearson added AT2-CI-APPROVAL Approve CI to run at SNL and removed AT2-CI-APPROVAL Approve CI to run at SNL labels Sep 19, 2025
Signed-off-by: Victor Brunini <vebruni@sandia.gov>
@vbrunini vbrunini force-pushed the spiluk_cuda_memory_usage branch from 1009492 to 24ee844 Compare September 19, 2025 18:57
@vbrunini
Copy link
Author

That DCO check keeps biting me :(

@cwpearson
Copy link
Contributor

Yeah, sorry about that. Becomes a habit eventually...

@cwpearson cwpearson added AT2-CI-APPROVAL Approve CI to run at SNL and removed AT2-CI-APPROVAL Approve CI to run at SNL labels Sep 19, 2025
@lucbv
Copy link
Contributor

lucbv commented Sep 22, 2025

In ~/.gitconfig you can add to the following:

[format]
        signoff = true

That should automatically apply the signoff for you on all your commits, that's how I handled this.
I think you can do the same with git config --global format.signoff true

Copy link
Contributor

@lucbv lucbv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the discussion above this looks fine to me, my only question is whether this could lead to not enough memory being allocated? I have not read the details of the spiluk algorithm so I have not assessed it... if that's the case we will want to find a way to allow for users to toggle between the new and old version.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

AT2-CI-APPROVAL Approve CI to run at SNL

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants