Skip to content

Use bit ops instead of integer modulo and divide in shaders #19994

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

atlv24
Copy link
Contributor

@atlv24 atlv24 commented Jul 7, 2025

Objective

  • Our shader compilation goes through a lot of steps, where any number of things can and do sometimes go wrong: We write our bespoke naga-oil flavor of wgsl, which then gets processed into actual wgsl, which naga then turns into naga-ir, which naga then turns into hlsl/spirv/msl, which the driver then turns into ISA, which the gpu hardware then actually runs. Some layers are lossy and not very good for performance, namely naga->hlsl seems to output some unfortunate polyfills such as:
int naga_mod(int lhs, int rhs) {
    int divisor = ((lhs == int(-2147483647 - 1) & rhs == -1) | (rhs == 0)) ? 1 : rhs;
    return lhs - (lhs / divisor) * divisor;
}

in place of %, even for constant arguments. Some driver toolchains such as FXC then go on to complain about and seemingly not realize they can optimize this.

This potentially actually ends up mattering for CI times, but we'll see.

Solution

  • Make the lives of the tools easier by sacrificing some readability. The bit ops are what we intend to happen in these cases anyways, not a full blown modulo, so in a way this is just being explicit about it.

Alternate solution considered

  • Make a naga ir process step which replaces modulo by constant power of two by the equivalent bit op, and do similar for divs and muls. This is more complicated and less explicit though

Testing

  • deferred_rendering (on all 3 modes), transmission, morph_targets, ssao, volumetric_fog

Copy link
Member

@alice-i-cecile alice-i-cecile left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a rendering expert, but I would appreciate comments above the bitops that show what the equivalent non-bitops are.

@alice-i-cecile alice-i-cecile added C-Bug An unexpected or incorrect behavior A-Rendering Drawing game state to the screen C-Performance A change motivated by improving speed, memory usage or compile times labels Jul 7, 2025
@alice-i-cecile alice-i-cecile added S-Needs-Benchmarking This set of changes needs performance benchmarking to double-check that they help X-Contentious There are nontrivial implications that should be thought through S-Needs-Review Needs reviewer attention (from anyone!) to move forward labels Jul 7, 2025
let material_id = vertex_input / 3u;
let vertex_index = vertex_input - material_id * 3u;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be worse. I thought compilers can see consecutive % and / and combine the instructions into one thing(?)

Copy link
Contributor Author

@atlv24 atlv24 Jul 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not when its split into a function call that looks like

int naga_mod(int lhs, int rhs) {
    int divisor = ((lhs == int(-2147483647 - 1) & rhs == -1) | (rhs == 0)) ? 1 : rhs;
    return lhs - (lhs / divisor) * divisor;
}
// ...
let vertex_index = naga_mod(vertex_input, 3u);
let material_id = vertex_input / 3u;

let sub_xy = remap_for_wave_reduction(local_invocation_index % 64u);
let x = sub_xy.x + 8u * ((local_invocation_index >> 6u) % 2u);
let sub_xy = remap_for_wave_reduction(local_invocation_index & 63u);
let x = sub_xy.x + 8u * ((local_invocation_index >> 6u) & 1u);
let y = sub_xy.y + 8u * (local_invocation_index >> 7u);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • 8u is << 3u. Or does naga deal with that properly?

let sub_xy = remap_for_wave_reduction(local_invocation_index % 64u);
let x = sub_xy.x + 8u * ((local_invocation_index >> 6u) % 2u);
let sub_xy = remap_for_wave_reduction(local_invocation_index & 63u);
let x = sub_xy.x + 8u * ((local_invocation_index >> 6u) & 1u);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • 8u is << 3u

@@ -103,6 +103,13 @@ fn henyey_greenstein(neg_LdotV: f32) -> f32 {
return FRAC_4_PI * (1.0 - g * g) / (denom * sqrt(denom));
}

fn simple_wrap_3(index: i32) -> i32 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can index be >=6? If so then this is wrong, if not then I think this function should be named/commented to indicate it is special purpose.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

its only used in this file, and its only called with numbers in range 0-5. I called it simple_wrap_3, not implying its modulo, because it is not an implementation of modulo, just something that works for this specific case

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a comment saying that it only works for values 0 to 5, which is fine for its use in this file?

Copy link
Contributor

@superdump superdump left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly looks fine to me.

I noted early in my review that multiplication by powers of two is not changed to left shifts. Does naga handle this or is there no value to it?

Aside from that, just one comment about the simple_wrap_3 function to address.

@atlv24
Copy link
Contributor Author

atlv24 commented Jul 7, 2025

Integer multiplication has a dedicated instruction on gpus, there's no need to replace it, bit shifts win us nothing in that case. Integer division and modulos are the real expensive ones, they are on the order of 100 cycles, whereas integer multiplies are usually a cycle or two.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-Rendering Drawing game state to the screen C-Bug An unexpected or incorrect behavior C-Performance A change motivated by improving speed, memory usage or compile times S-Needs-Benchmarking This set of changes needs performance benchmarking to double-check that they help S-Needs-Review Needs reviewer attention (from anyone!) to move forward X-Contentious There are nontrivial implications that should be thought through
Projects
Status: No status
Development

Successfully merging this pull request may close these issues.

5 participants