spiluk: Limit memory usage for iw work buffer based on input matrix s… #2753
base: develop
Conversation
  KokkosKernels::Impl::kk_get_free_total_memory<memory_space>(free_byte, total_byte);
- avail_byte = static_cast<size_t>(0.85 * static_cast<double>(free_byte) / static_cast<double>(nstreams));
+ double orig_matrix_bytes = entries.extent(0) * 16;
+ avail_byte = static_cast<size_t>(std::min(0.85 * static_cast<double>(free_byte), orig_matrix_bytes) / static_cast<double>(nstreams));
@vbrunini I am just concerned that when we have a matrix with few nnz, so that orig_matrix_bytes is always smaller than free_byte, we will end up with a lot of chunks, which may slow down the spiluk numeric. That said, I have not thought of a better estimate than using orig_matrix_bytes. Have you seen any performance change on your side with this change?
I have not done any real performance comparison (just the "infinite speedup" of the case I was looking at, which previously failed to run due to OOM and now finishes).
A few other potential options I've thought about or tested:
- I did try removing the persistent iw View from the IlukHandle entirely and instead used a view from team.team_scratch(1). This also ran successfully, but the spiluk numeric was ~2x slower, I think because each team had to reset that scratch view to -1 at the start of each iteration, rather than only resetting the portion of iw that was modified for the L & U entry columns at the end of each iteration.
- Similar to using a View from team scratch, I think we could potentially use a persistent View sized based on the execution space concurrency rather than the available memory, then use Kokkos' UniqueToken for each team to get a unique index into that persistent View. Potentially that could still cause memory issues if nrows is large enough, though.
- If iw was local to the numeric phase rather than being persistent on the iluk handle, then I think the current heuristic based on the available memory could be used without causing memory usage issues for other portions of applications. The downside would be the cost of allocating, initializing, and freeing iw on every call to compute.
- We could add a floor here so that if the original matrix is, say, < 512 MB, we allow iw to be up to 512 MB, assuming that is small enough to not cause issues. I'm not sure exactly how to pick that threshold, though.
I see you picked a floor of 128 MB and I am fine with it. This PR looks good to me for now. Next FY, I will probably look into performance optimization for spiluk more carefully.
  size_t free_byte, total_byte;
  KokkosKernels::Impl::kk_get_free_total_memory<memory_space>(free_byte, total_byte);
  avail_byte = static_cast<size_t>(0.85 * static_cast<double>(free_byte) / static_cast<double>(nstreams));
  double orig_matrix_bytes = entries.extent(0) * 16;
@vbrunini Thanks for having the fix. Why did you choose 16? Is that based on the double complex type? I am fine with it, but since the iw view's value_type is nnz_lno_t, I feel like sizeof(nnz_lno_t) would be more reasonable.
It was a quick estimate of the memory usage for the input matrix assuming double values (8 bytes per value and 8 bytes per column index, since I wasn't sure offhand whether the column indices were 32- or 64-bit here).
Got it. Can we use 16 + 8 instead to account for complex type?
@vbrunini we will also need you to sign off on the commit. You can also apply the format diff printed out in the failing format check, or apply clang-format 16 yourself.
…ize. Observed some application use cases where this buffer was using > 10x the amount of memory needed for the matrix being factorized, which led to OOMs later in the run because insufficient memory was available. Signed-off-by: Victor Brunini <vebruni@sandia.gov>
Force-pushed from 1c939ca to be30169.
Looks good to me. Thanks for having this fix, @vbrunini !
Signed-off-by: Victor Brunini <vebruni@sandia.gov>
Force-pushed from 45511bd to e86e210.
@vqd8a @cwpearson @lucbv can someone add a new approval for the CI to run? I had to fix some issues in a couple of the builds.
Force-pushed from 1009492 to 24ee844.
That DCO check keeps biting me :(
Yeah, sorry about that. It becomes a habit eventually...
In … That should automatically apply the signoff for you on all your commits; that's how I handled this.
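For anyone hitting the same DCO failure, one common way to apply the signoff retroactively is a signoff rebase (a sketch; the commit range HEAD~3 is a hypothetical example, adjust it to cover the unsigned commits):

```shell
# Replay the last 3 commits with --signoff, which appends a
# "Signed-off-by:" trailer built from your git user.name / user.email.
git rebase --signoff HEAD~3
# Rewriting history requires a force push to update the PR branch.
git push --force-with-lease
```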
Based on the discussion above this looks fine to me. My only question is whether this could lead to not enough memory being allocated? I have not read the details of the spiluk algorithm so I have not assessed it; if that's the case, we will want to find a way to allow users to toggle between the new and old versions.
…ize.
Observed some application use cases where this buffer was using > 10x the amount of memory needed for the matrix being factorized, which led to OOMs later in the run because insufficient memory was available.
This is one potential solution to #2752 @vqd8a @lucbv