Skip to content

[inlining] increasing accuracy of the code size assessment model #7648

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

HerrCai0907
Copy link
Contributor

@HerrCai0907 HerrCai0907 commented Jun 10, 2025

Old code size measure is simply based on operation count in callee. It will miss the opportunity like

(func (param $ptr i32) (param $value i32)
  local.get  $ptr
  local.get $value
  i32.store
)

which is always happened for variable setter in OO language.

In this PR, we want to introduce inlining specific measurer to ignore this pattern for better inlining result.
The new measurer will ignore all occurrences of local.get that occur in sequence.
The behavior of the new measurer will be consistent with that of the old measurer in:

  1. When parameters are not retrieved in sequence, a new local variable will be introduced after inlining. At this point, to prevent an increase in code size,
  2. produce operand stack with global.get or xxx.const
  3. update the value of local with local.set

In a word, the new measure want to match this pattern: The function body only contain local.get $param_n and trivial consumer operation like xxx.add xxx.store if etc.

@kripken
Copy link
Member

kripken commented Jun 11, 2025

Interesting idea. Do you have measurements on some benchmark perhaps? I'm not sure I have a good guess about this without data.

@HerrCai0907
Copy link
Contributor Author

HerrCai0907 commented Jun 12, 2025

I use assemblyscript bootstrap as bench
wasm-opt flags: --shrink-level 2 --all-features -Oz

wasm-opt size (Bytes)
main@2b989ae 791963
after PR 792066 (+103)
after PR but treat store as invalid 791896 (-67)

The store will use at least 3 bytes which is 1 byte larger than call, so inline it will increase 1 byte in WASM size. But it is depends on the frequency... But if we only keep the normal algorithm operation, It is definitely profitable.

@HerrCai0907
Copy link
Contributor Author

And I think it is worth to inline if we want a balance in both shrink level and optimize level. It will not increase code size much and inlining can reduce the cost of function call and introduce more optimization opportunity later.

@osa1
Copy link
Contributor

osa1 commented Jun 23, 2025

I just realized that I'm fixing the same issue with #7669 and #7670. (note: #7670 is based on #7669, it includes changes in #7669)

My fix is different: I don't change code size measurements, instead improve the existing "trivial call" heuristic.

@HerrCai0907 it would be interesting to see how much different #7670 makes on the same benchmark you run above, and whether it optimizes the same way.

(Note: I need to rebase #7670, which I'll do now.)

@osa1
Copy link
Contributor

osa1 commented Jun 25, 2025

I think the PRs linked above should be "safer" in the sense that they won't cause accidentally inlining too much, because they don't change size calculations.

But even with those merged I think this PR may make sense. local.gets in the right order never increases call site sizes, and if the function size is an indicative of how large an inlined call site can get, then local.gets in the right order shouldn't contribute to the sizes.

Assuming we merge my PRs, I would make these changes in this PR:

  • Remove the part that makes the function size 0. With my PRs we already inline these functions regardless of their sizes.
  • Ignore ordered local.gets, but:
    • For unused arguments, add 1 to the code size, for the potential drop introduced at the call sites.
    • For constants between local.gets, add sizes of the constant instructions to the size.

(If we see a non-constant instruction between two local.gets, the rest of the argument need to be count as normal because new locals may need for each of the arguments after the non-constant instruction, to maintain evaluation order.)

So in a function like:

(func $f (param $x1) (param $x2) (param $x3)
  (call $g
    (i32.const 123)
    (local.get $x2)
    (i32.const 456)
    (local.get $x3)))

We count i32.const instructions, we add 1 for the missing x1 (accounting for potential drops at the call sites), and ignore x2 and x3.

This can potentially cover code like: f calls g but with an extra constant argument passed:

void f(arg1, arg2, arg3) {
    return g(arg1, arg2, arg3, 0);
}

This kind of thing can be generated when at the source code you have default arguments which some call sites omit, and the default value is a constant integer like 0 above.

In other words, consider "size" as the worst case code size contribution to call sites, rather than the size of the function being inlined.

(We could even consider the number of extra locals required when the function locals are not used in order, but I'm not bright enough to immediately see what kind of algorithm would be needed here.. and we're already probably getting very little benefit with all of this after the other two PRs linked above)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants