
windows: Use new hints.mostly-unused #3660


Merged: 2 commits into microsoft:master on Jul 23, 2025


joshtriplett
Contributor

@joshtriplett joshtriplett commented Jul 13, 2025

Most users of the windows crate will use a fraction of its API surface
area.

Nightly rustc provides an option -Zhint-mostly-unused to tell it to
defer as much compilation as possible, which provides a substantial
performance improvement if most of that compilation doesn't end up
happening. Cargo plumbs this option through using the new [hints]
table. This will cause users of the windows crate to default to
setting hint-mostly-unused. (Top-level crates can override this if
they wish, using a new profile option.)

Note that setting this hint does not increase the MSRV of the Windows
crate, as old versions of Cargo will ignore it. New versions of Cargo
will respect it automatically (and, until we stabilize it, Cargo will do
nothing unless you pass -Zprofile-hint-mostly-unused to cargo).
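
As a rough sketch of the Cargo side described above: the `hints.mostly-unused` key is the one Cargo names in its own warning later in this thread, while the profile key spelling `hint-mostly-unused` and the `[profile.dev.package.windows]` layout are my reading of the unstable feature and may change before stabilization.

```toml
# In the dependency's own Cargo.toml (here, the windows crate):
[hints]
mostly-unused = true

# In a top-level crate's Cargo.toml, to override the hint for that dependency.
# Assumed spelling of the profile option; check the Cargo unstable docs.
[profile.dev.package.windows]
hint-mostly-unused = false
```

Neither setting has any effect unless a nightly Cargo is invoked with `-Zprofile-hint-mostly-unused`.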

Some sample performance numbers:

| Dependency Crate | Before | hint-mostly-unused | Delta |
|---|---|---|---|
| windows, all Graphics/UI features | 18.3s | 10.7s | -42% |
| windows, all features | 3m 48s | 2m 55s | -23% |
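
The Delta figures can be re-derived from the quoted timings. A minimal sketch; the helper name `delta_percent` is made up for illustration:

```rust
/// Percentage change from `before_secs` to `after_secs`; negative means faster.
fn delta_percent(before_secs: f64, after_secs: f64) -> f64 {
    (after_secs - before_secs) / before_secs * 100.0
}

fn main() {
    // Timings quoted in the PR description, converted to seconds.
    let graphics_ui = delta_percent(18.3, 10.7);
    let all_features = delta_percent(3.0 * 60.0 + 48.0, 2.0 * 60.0 + 55.0);
    println!("Graphics/UI features: {graphics_ui:.0}%"); // -42%
    println!("all features: {all_features:.0}%"); // -23%
}
```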

@riverar
Collaborator

riverar commented Jul 13, 2025

Wow that's encouraging!

@kennykerr
Collaborator

Thanks! That seems like a great improvement. Two quick questions:

  • Why is this not enabled by default when available, as opposed to requiring the hint?
  • Would it make sense to add this hint to the windows-sys crate as well or are the savings mostly around function bodies?

@kennykerr
Collaborator

I kicked off #3661 to see whether we can identify an observable improvement with this flag. Perhaps I'm doing something wrong but it doesn't appear to help.

Without (master): https://github.com/microsoft/windows-rs/actions/runs/16221219482
With hint (this PR): https://github.com/microsoft/windows-rs/actions/runs/16272221637

Thoughts?

@joshtriplett
Contributor Author

@kennykerr That CI job is doing check, which already skips code generation. This hint matters for folks who are doing build (and especially build -r).

@kennykerr
Collaborator

Thanks Josh, I appreciate that clarification. Sorry if it is a little misleading, but that workflow runs cargo test individually on all of the crates in the repo (which is a lot). Does testing require code generation, or does this hint truly only benefit cargo build?


@joshtriplett
Contributor Author

> Why is this not enabled by default when available, as opposed to requiring the hint?

Because it's a performance loss if applied when it isn't a good fit.

If you apply it to a crate with 1000 items of which almost every user uses ~10, it's a big win. If you apply it to a crate with 10 items of which the average user uses most of them, it's not just neutral, it'll likely make compilation time worse.
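
A toy model of that trade-off. Every constant is hypothetical and chosen only to show the shape of the argument, and the names `CrateUsage`, `EAGER_COST`, and `DEFERRED_FACTOR` are invented; nothing here is measured from rustc.

```rust
/// Toy model of the trade-off; every constant here is hypothetical.
#[derive(Debug, Clone, Copy)]
struct CrateUsage {
    items: u32, // items the dependency defines
    used: u32,  // items a typical consumer actually calls
}

// Arbitrary unit of compile work per item when compiled eagerly.
const EAGER_COST: f64 = 1.0;
// Assumed extra cost factor for an item whose compilation is deferred.
const DEFERRED_FACTOR: f64 = 1.5;

/// Without the hint, every item is compiled whether or not it is used.
fn cost_without_hint(c: CrateUsage) -> f64 {
    f64::from(c.items) * EAGER_COST
}

/// With the hint, only used items are compiled, each at a higher assumed cost.
fn cost_with_hint(c: CrateUsage) -> f64 {
    f64::from(c.used) * EAGER_COST * DEFERRED_FACTOR
}

fn main() {
    let huge = CrateUsage { items: 1000, used: 10 };
    let small = CrateUsage { items: 10, used: 9 };
    for c in [huge, small] {
        println!(
            "{c:?}: without hint {:.1}, with hint {:.1}",
            cost_without_hint(c),
            cost_with_hint(c)
        );
    }
}
```

Under these made-up numbers the 1000-item crate drops from 1000 units to 15, while the 10-item crate rises from 10 to 13.5. That is the sense in which the hint can be a loss when it is a poor fit.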

@joshtriplett
Contributor Author

> Would it make sense to add this hint to the windows-sys crate as well or are the savings mostly around function bodies?

I don't know if it would make sense. It's worth testing.

@joshtriplett
Contributor Author

@kennykerr Ah, I see; I saw the titles all saying "check" and made an incorrect assumption.

It can benefit test, but only if the tests exercise only a small fraction of the API surface area. If the tests are anywhere near comprehensive, then they won't demonstrate any benefit.

The performance win comes from real-world crates using windows, which often pull it in and only call a few functions.

@riverar
Collaborator

riverar commented Jul 15, 2025

> It can benefit test, but only if the tests exercise only a small fraction of the API surface area. If the tests are anywhere near comprehensive, then they won't demonstrate any benefit.

The tests are tiny but so is the function space, due to the limited number of features they enable. I wonder if these speed improvements are specific to crates bringing in windows with features like Windows_Win32_UI_Shell enabled.

@joshtriplett
Contributor Author

Oh, I just realized the likely problem. This is being trialed in nightly, so you'll need to use a nightly rustc/cargo, and pass -Zprofile-hint-mostly-unused to cargo, or it'll do nothing.

@kennykerr
Collaborator

I used -Zhint-mostly-unused here:

https://github.com/microsoft/windows-rs/pull/3661/files

Should I instead use -Zprofile-hint-mostly-unused?

@joshtriplett
Contributor Author

> I used -Zhint-mostly-unused here:
>
> https://github.com/microsoft/windows-rs/pull/3661/files
>
> Should I instead use -Zprofile-hint-mostly-unused?

Setting RUSTFLAGS=-Zhint-mostly-unused will have the net effect of setting it for every dependency; that may not be a good idea. I would suggest setting the hint and then using cargo -Zprofile-hint-mostly-unused.

@kennykerr
Collaborator

Thanks, I made the suggested changes to #3661 but I don't see a noticeable improvement.

@joshtriplett
Contributor Author

joshtriplett commented Jul 16, 2025

@kennykerr 🤦 I just realized what the problem is here.

https://github.com/microsoft/windows-rs/actions/runs/16304789650/job/46048204434?pr=3661#step:162:7

warning: D:\a\windows-rs\windows-rs\crates\libs\windows\Cargo.toml: unused manifest key: hints

Before putting out a call for testing, it would have been good to make sure the change in cargo was synced to rust-lang/rust (which is a manual process).

rust-lang/rust#143998

It looks like this might take until the 2025-07-17 nightly. I'll update the blog post.

@kennykerr
Collaborator

No problem, we can kick that PR again when the latest nightly is available.

@joshtriplett
Contributor Author

@kennykerr Current nightly as of today should now work. Give it another try?

@kennykerr
Collaborator

I reran https://github.com/microsoft/windows-rs/actions/runs/16361241438 but still don't see any noticeable improvement.

@kennykerr
Collaborator

Can you share an example where this clearly helps?

@kennykerr
Collaborator

By example I mean something like this.

Before:

E:\git\windows-rs>cls && cargo clean && cargo build -p sample_direct2d
     Removed 644 files, 405.6MiB total
warning: windows@0.61.3: ignoring 'hints.mostly-unused', pass `-Zprofile-hint-mostly-unused` to enable it
<snip>
   Compiling sample_direct2d v0.0.0 (E:\git\windows-rs\crates\samples\windows\direct2d)
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 13.97s

After:

E:\git\windows-rs>cls && cargo clean && cargo build -p sample_direct2d -Zprofile-hint-mostly-unused
     Removed 681 files, 452.4MiB total
<snip>
   Compiling sample_direct2d v0.0.0 (E:\git\windows-rs\crates\samples\windows\direct2d)
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 13.86s

The change is harmless enough, but we need a compelling example that illustrates to early adopters how it might be beneficial for them in general. This example, which is very representative, just does not bring the advertised 20-40% improvement.

@joshtriplett
Contributor Author

joshtriplett commented Jul 22, 2025

@kennykerr The net effect of the change is larger the more feature flags you have enabled on the windows crate. If you enable very few features, the codegen time is already small enough that the savings are hard to measure (but it does no harm). If you enable more features (or features that gate large API surfaces), the savings become more obvious.

That said, you'll notice the effect more strongly in release builds:

~/src/windows-rs$ hyperfine -M 4 -p 'cargo clean' 'cargo +nightly build -r -p sample_direct2d --target x86_64-pc-windows-gnu -Zprofile-hint-mostly-unused' 'cargo +nightly build -r -p sample_direct2d --target x86_64-pc-windows-gnu'
Benchmark 1: cargo +nightly build -r -p sample_direct2d --target x86_64-pc-windows-gnu -Zprofile-hint-mostly-unused
  Time (mean ± σ):      8.458 s ±  0.086 s    [User: 10.598 s, System: 1.248 s]
  Range (min … max):    8.362 s …  8.555 s    4 runs
 
Benchmark 2: cargo +nightly build -r -p sample_direct2d --target x86_64-pc-windows-gnu
  Time (mean ± σ):      9.011 s ±  0.081 s    [User: 13.206 s, System: 1.287 s]
  Range (min … max):    8.903 s …  9.098 s    4 runs
 
Summary
  cargo +nightly build -r -p sample_direct2d --target x86_64-pc-windows-gnu -Zprofile-hint-mostly-unused ran
    1.07 ± 0.01 times faster than cargo +nightly build -r -p sample_direct2d --target x86_64-pc-windows-gnu

(Also note the difference in "User" time, which reflects CPU time used by all threads.)

The effect becomes larger the more features you enable; for instance, if I enable all of Graphics_*, I get a 1.43x difference rather than 1.07x. Data_* shows 1.28x.

@riverar
Collaborator

riverar commented Jul 22, 2025

@joshtriplett Are you testing on Windows? macOS? Other? Just wanted to follow along and make sure I'm on the same machine.

@joshtriplett
Contributor Author

> @joshtriplett Are you testing on Windows? macOS? Other? Just wanted to follow along and make sure I'm on the same machine.

I'm cross-compiling from Linux.

@kennykerr
Collaborator

I have tried release builds as well and it makes no difference. Perhaps it is unique to GNU or Linux builds.

@joshtriplett
Contributor Author

joshtriplett commented Jul 22, 2025

> I have tried release builds as well and it makes no difference. Perhaps it is unique to GNU or Linux builds.

Can you post the output from the same hyperfine command I ran (but for whichever target you prefer, e.g. -msvc)?

Also, how many CPUs are you building on?

@riverar
Collaborator

riverar commented Jul 22, 2025

Running with the correct branch this time (test-hint-mostly-unused).

Windows 26200.5702 / msvc 17.14.5-pre1
rustc 1.90.0-nightly (9748d87dc 2025-07-21)
32 virtual / 16 physical cores

Run without flag -Zprofile-hint-mostly-unused:

hyperfine -M 4 -p 'cargo clean' 'cargo +nightly build -r -p sample_direct2d --target x86_64-pc-windows-msvc' 'cargo +nightly build -r -p sample_direct2d --target x86_64-pc-windows-msvc'
Benchmark 1: cargo +nightly build -r -p sample_direct2d --target x86_64-pc-windows-msvc
  Time (mean ± σ):     18.961 s ±  4.717 s    [User: 12.921 s, System: 2.600 s]
  Range (min … max):   12.208 s … 23.201 s    4 runs

Benchmark 2: cargo +nightly build -r -p sample_direct2d --target x86_64-pc-windows-msvc
  Time (mean ± σ):     16.071 s ±  3.417 s    [User: 12.917 s, System: 2.600 s]
  Range (min … max):   12.405 s … 20.577 s    4 runs

Summary
  cargo +nightly build -r -p sample_direct2d --target x86_64-pc-windows-msvc ran
    1.18 ± 0.39 times faster than cargo +nightly build -r -p sample_direct2d --target x86_64-pc-windows-msvc

Run with flag against sample_direct2d:

hyperfine -M 4 -p 'cargo clean' 'cargo +nightly build -r -p sample_direct2d --target x86_64-pc-windows-msvc -Zprofile-hint-mostly-unused' 'cargo +nightly build -r -p sample_direct2d --target x86_64-pc-windows-msvc'
Benchmark 1: cargo +nightly build -r -p sample_direct2d --target x86_64-pc-windows-msvc -Zprofile-hint-mostly-unused
  Time (mean ± σ):     14.953 s ±  3.616 s    [User: 10.967 s, System: 2.299 s]
  Range (min … max):   12.395 s … 20.173 s    4 runs

Benchmark 2: cargo +nightly build -r -p sample_direct2d --target x86_64-pc-windows-msvc
  Time (mean ± σ):     19.158 s ±  4.786 s    [User: 12.862 s, System: 2.900 s]
  Range (min … max):   12.989 s … 23.831 s    4 runs

Summary
  cargo +nightly build -r -p sample_direct2d --target x86_64-pc-windows-msvc -Zprofile-hint-mostly-unused ran
    1.28 ± 0.45 times faster than cargo +nightly build -r -p sample_direct2d --target x86_64-pc-windows-msvc

Run with flag against modified sample_direct2d (including all Graphics_* features):

hyperfine -M 4 -p 'cargo clean' 'cargo +nightly build -r -p sample_direct2d --target x86_64-pc-windows-msvc -Zprofile-hint-mostly-unused' 'cargo +nightly build -r -p sample_direct2d --target x86_64-pc-windows-msvc'
Benchmark 1: cargo +nightly build -r -p sample_direct2d --target x86_64-pc-windows-msvc -Zprofile-hint-mostly-unused
  Time (mean ± σ):     21.136 s ±  7.984 s    [User: 12.952 s, System: 2.495 s]
  Range (min … max):   14.721 s … 32.063 s    4 runs

Benchmark 2: cargo +nightly build -r -p sample_direct2d --target x86_64-pc-windows-msvc
  Time (mean ± σ):     19.191 s ±  6.038 s    [User: 19.584 s, System: 3.147 s]
  Range (min … max):   16.033 s … 28.246 s    4 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Summary
  cargo +nightly build -r -p sample_direct2d --target x86_64-pc-windows-msvc ran
    1.10 ± 0.54 times faster than cargo +nightly build -r -p sample_direct2d --target x86_64-pc-windows-msvc -Zprofile-hint-mostly-unused

Re-run (to eliminate outliers warning) with flag against modified sample_direct2d (including all Graphics_* features):

hyperfine -M 4 -p 'cargo clean' 'cargo +nightly build -r -p sample_direct2d --target x86_64-pc-windows-msvc -Zprofile-hint-mostly-unused' 'cargo +nightly build -r -p sample_direct2d --target x86_64-pc-windows-msvc'
Benchmark 1: cargo +nightly build -r -p sample_direct2d --target x86_64-pc-windows-msvc -Zprofile-hint-mostly-unused
  Time (mean ± σ):     21.034 s ±  6.907 s    [User: 12.627 s, System: 2.999 s]
  Range (min … max):   15.290 s … 30.111 s    4 runs

Benchmark 2: cargo +nightly build -r -p sample_direct2d --target x86_64-pc-windows-msvc
  Time (mean ± σ):     22.919 s ±  6.668 s    [User: 19.834 s, System: 3.850 s]
  Range (min … max):   16.242 s … 32.130 s    4 runs

Summary
  cargo +nightly build -r -p sample_direct2d --target x86_64-pc-windows-msvc -Zprofile-hint-mostly-unused ran
    1.09 ± 0.48 times faster than cargo +nightly build -r -p sample_direct2d --target x86_64-pc-windows-msvc

@riverar
Collaborator

riverar commented Jul 22, 2025

Can't tell anything from this data, it's too noisy. Will try cranking up the number of runs.

@joshtriplett
Contributor Author

joshtriplett commented Jul 22, 2025

@riverar The "User" time numbers are pretty definitive already. Is it possible that the wall-clock numbers are being affected by other tasks happening on your system?

(Also, the run you're doing labeled "Run without flag" is testing the same build twice. Each hyperfine invocation is already comparing results with and without the flag.)

@riverar
Collaborator

riverar commented Jul 22, 2025

Those numbers are misleading: with a huge ~±0.50 variance, the faster results could actually be much slower (e.g., 30%). (The user time difference does look more promising, agreed.)

I'm trying to complete 100 runs but statistical outliers keep showing up. My dev drive (specialized ReFS) or system must be unstable/noisy.

> (Also, the run you're doing labeled "Run without flag" is testing the same build twice. Each hyperfine invocation is already comparing results with and without the flag.)

Understood. That was just a run to get an idea of how unstable the tests were. I was expecting that run to come out closer to 1.0x than it did.

(Done with edits.)
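
To make the noise concern concrete, here is a small sketch that recomputes the "±" figure in the hyperfine summaries from the reported means and standard deviations. It assumes the two measurements are independent and uses first-order error propagation for a ratio. Whether hyperfine does exactly this internally is an assumption on my part, though the results agree with the summaries quoted in this thread to two decimals. The function name `speedup` is invented for illustration.

```rust
/// Ratio of two mean timings and its standard deviation, using first-order
/// error propagation and assuming the two measurements are independent.
fn speedup(mean_fast: f64, sd_fast: f64, mean_slow: f64, sd_slow: f64) -> (f64, f64) {
    let ratio = mean_slow / mean_fast;
    let relative_sd = ((sd_fast / mean_fast).powi(2) + (sd_slow / mean_slow).powi(2)).sqrt();
    (ratio, ratio * relative_sd)
}

fn main() {
    // msvc run on the unmodified sample: 14.953 s ± 3.616 s vs 19.158 s ± 4.786 s.
    let (ratio, sd) = speedup(14.953, 3.616, 19.158, 4.786);
    println!("noisy run: {ratio:.2} ± {sd:.2}");

    // Linux cross-compile run: 8.458 s ± 0.086 s vs 9.011 s ± 0.081 s.
    let (ratio, sd) = speedup(8.458, 0.086, 9.011, 0.081);
    println!("quiet run: {ratio:.2} ± {sd:.2}");
}
```

For the noisy msvc run this gives 1.28 ± 0.45, so a single standard deviation already reaches below 1.0x. For the Linux cross-compile it gives 1.07 ± 0.01, a much tighter estimate.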

Collaborator

@kennykerr kennykerr left a comment


Thanks for the contribution.

@kennykerr kennykerr merged commit cff9e38 into microsoft:master Jul 23, 2025
29 checks passed