Skip to content

fixed scalarIndexing performance issue #1215

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

riteshbhirud
Copy link

fixes #1209

Fixes performance regression in the NaN counting callback.

Replaced inefficient scalar operations with proper ClimaCore field operations:

  • Eliminated allowscalar block entirely
  • Used field broadcasting with @. ifelse(isnan(state), ...) instead of parent(state)
  • Reduced allocations by avoiding intermediate boolean arrays
  • Maintained identical mathematical behavior and accuracy

Results:

  • Eliminates GPU scalar indexing bottleneck
  • Reduces memory allocations
  • Should restore normal SYPD when using NaN callback
  • No change to functionality or accuracy

@kmdeck
Copy link
Member

kmdeck commented Jul 8, 2025

Thanks @riteshbhirud! Does this reduce allocations, though? I believe this line still allocates:
@. ifelse(isnan(state), 1, 0) (and similar broadcasted expressions).

Also, confusingly, the sum of a field does an integral, and not a similar arithmetic sum. But if we take the sum of the parent(field), it calls the method of sum for an array.

Comment on lines 596 to 601
if isnothing(mask)
num_nans = sum(@. ifelse(isnan(state), 1, 0))
else
num_nans = sum(@. ifelse(isnan(state), mask, 0))
end
num_nans = Int(num_nans)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can further optimize this and reduce some of the allocations. For the no mask case, we can go from 2 kernel launches to just one if we use mapreduce or count. I believe @kmdeck is correct, and line 597 will still allocate an intermediate. For the case with a mask, we can also get away with only one kernel launch and no allocations with mapreduce.

That could look something like

Suggested change
if isnothing(mask)
num_nans = sum(@. ifelse(isnan(state), 1, 0))
else
num_nans = sum(@. ifelse(isnan(state), mask, 0))
end
num_nans = Int(num_nans)
if isnothing(mask)
num_nans = count(isnan, parent(state)
else
num_nans = mapreduce((s, m) -> m != 0 && isnan(s), Base.add_sum, parent(state), parent(mask))
end

That would also avoid the type conversion.

I'm not sure if masks are guaranteed to be boolean valued, so I'm not sure if line 599 will always be valid

Copy link
Member

@kmdeck kmdeck Jul 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice!

The mask is not boolean, because we create it using this:

apply_threshold(field, value) =

I think we did this because you cant create a field of bools by broadcasting over a field of floats. But we can do something like:

apply_threshold(field, value) =
    field > value ? 0 : 1

and then later down in the landsea_mask function, change thishttps://github.com/CliMA/ClimaLand.jl/blob/7ac4cd452663559f20df22d2a60e8f1aaa92d90f/src/shared_utilities/Domains.jl#L1088

to

binary_mask = ClimaCore.Fields.Field(Bool, surface_space)
fill!(ClimaCore.Fields.field_values(binary_mask), 0)
@. binary_mask = apply_threshold(mask, threshold)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One other thought is that we call this function for each state variable in Y, but with one exception, if one is NaN, the entire state is NaN at that same location. so maybe we can also speed it up by calling it for just one field.

I'm not sure how to do that in a nice automated way though, without making an assumption about what is in Y

@riteshbhirud
Copy link
Author

apologies for the allocations, I have now replaced the multi-step array creation and summation with single-kernel count/mapreduce operations to eliminate allocations. Can you please take a look once you have chance if this is what we expected?

@kmdeck
Copy link
Member

kmdeck commented Jul 8, 2025

apologies for the allocations, I have now replaced the multi-step array creation and summation with single-kernel count/mapreduce operations to eliminate allocations. Can you please take a look once you have chance if this is what we expected?
Edit - I enabled CI to run so we can see if the tests pass

@kmdeck
Copy link
Member

kmdeck commented Jul 9, 2025

@riteshbhirud I am going to try running a simulation where we use the NaN check callback and see if it runs and if SYPD increases :) will report back!

Copy link
Member

@imreddyTeja imreddyTeja left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, this is great!

Looking at just the callback by itself, this change makes the first call slightly slower, but then the subsequent calls are about 5x faster. @allocations also reports no allocations now!

I'll run a benchmark comparing this to main.

Could you please run the .dev/climaformat.jl to format these changes? It looks like the formatting test is failing.

@riteshbhirud
Copy link
Author

@imreddyTeja I ran the .dev/climaformat.jl. Can you please run the CI if all good to merge?

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

NaN check is slow
3 participants