
Faster schoolbook multiplication for sparse inputs #867


Open · wants to merge 1 commit into master

Conversation

@dvdplm (Contributor) commented Jul 10, 2025

In the end this effort failed, but here's a report on what I tried so others can avoid repeating it.

I tried several different approaches:

  1. Un-nesting the loops. Instead of two nested loops (an outer loop over the LHS limbs and an inner loop over the RHS limbs, deciding on each iteration whether we're writing to the hi or lo part of the output), use two disjoint loops, writing first the lo parts and then the hi parts. I had high hopes for this, but no matter how I benchmarked it, it was always slower.
  2. Un-rolling, unsafe and manual inlining. Tried to see how close the original implementation comes to no-compromise, aggressive manual micro-optimization. Terrible-looking code, get_unchecked and mutation-in-place all over, that sort of thing. Result: a fair bit slower than the original, especially when multiplying numbers of different sizes, e.g. U512 * U256.
  3. Pre-calculate iteration bounds. The idea here was to see if some pre-processing of the inputs would help: check if there are leading or trailing zero-limbs and, if so, adjust the iteration bounds to skip unneeded work. One version of this idea is somewhat successful and is shown in this PR, but in prior versions I tried to avoid branches by checking the inputs outside of the loops (see the sketch after this list). That turned out to be slower than checking whether the current limb is zero inside the loop. This is an interesting result, suggesting that the CPU works more efficiently when it knows the exact iteration limits up-front than when it is given dynamically calculated bounds.
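
To make approach 3 concrete, here is a minimal sketch of the dynamic-bounds variant, written over raw u64 limb slices rather than the crate's actual implementation (names and layout are illustrative; only trailing zero-limbs are trimmed, for brevity):

```rust
/// Schoolbook multiplication over little-endian base-2^64 limb slices.
/// `out` must hold `lhs.len() + rhs.len()` zero-initialized limbs.
/// Approach 3: trim trailing zero limbs once, before the loops, and
/// iterate only over the dynamically computed non-zero prefixes.
fn schoolbook_mul_trimmed(lhs: &[u64], rhs: &[u64], out: &mut [u64]) {
    // Pre-calculated iteration bounds: skip trailing zero limbs.
    let lhs_len = lhs.len() - lhs.iter().rev().take_while(|&&l| l == 0).count();
    let rhs_len = rhs.len() - rhs.iter().rev().take_while(|&&l| l == 0).count();

    for i in 0..lhs_len {
        let mut carry = 0u64;
        for j in 0..rhs_len {
            // 64x64 -> 128-bit multiply-accumulate; the sum cannot overflow u128.
            let t = (lhs[i] as u128) * (rhs[j] as u128)
                + (out[i + j] as u128)
                + (carry as u128);
            out[i + j] = t as u64;
            carry = (t >> 64) as u64;
        }
        out[i + rhs_len] = carry;
    }
}
```

In my benchmarks this shape consistently lost to keeping the loop bounds fixed and branching on zero limbs inside the loop, as in the variant below.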

The version I finally settled on is much faster, ~4x, for "sparse" inputs, and about 10% slower for "dense" operands. It keeps the original's loop structure and adds two optimizations (sketched after the list):

  1. If the current LHS limb is zero, skip the iteration.
  2. If the current RHS limb is zero, speculatively check if the remaining RHS limbs are also zero and if they are, fast-forward the loop.
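
A minimal sketch of that final variant, again over raw u64 limb slices rather than the crate's actual code: the loop bounds stay fixed at the full limb counts, and all zero handling happens inside the loops. Note that both branches depend on limb values, which is what makes this variable-time (see the discussion below).

```rust
/// Schoolbook multiplication with the two sparse-input optimizations.
/// `out` must hold `lhs.len() + rhs.len()` zero-initialized limbs.
fn schoolbook_mul_sparse(lhs: &[u64], rhs: &[u64], out: &mut [u64]) {
    for i in 0..lhs.len() {
        // Optimization 1: a zero LHS limb contributes nothing, skip the row.
        if lhs[i] == 0 {
            continue;
        }
        let mut carry = 0u64;
        let mut j = 0;
        while j < rhs.len() {
            // Optimization 2: on a zero RHS limb, speculatively check whether
            // the rest of the RHS is zero too; if so, fast-forward the loop.
            if rhs[j] == 0 && rhs[j..].iter().all(|&l| l == 0) {
                break;
            }
            let t = (lhs[i] as u128) * (rhs[j] as u128)
                + (out[i + j] as u128)
                + (carry as u128);
            out[i + j] = t as u64;
            carry = (t >> 64) as u64;
            j += 1;
        }
        // The break index depends only on `rhs`, so it is the same for every
        // row and this slot has not been written by an earlier row.
        out[i + j] = carry;
    }
}
```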

The results, as hinted at above, are meager. When the inputs contain zero-limbs there's a significant speedup; how much obviously depends on the data, but 2x to 8x is the range I have seen in my benchmarks. This comes at a cost for more commonly seen inputs: for some input sizes the difference is small, less than 5%, but sometimes it's considerably larger than that.

When running the whole benchmark suite, the sparse-number optimization does show up, yielding a change of more than 5% in 27 out of 137 tests (mostly speedups, but some bad regressions too). Overall, though, the bag is too mixed, and not in the right way.

Overall I do not think this PR should be merged.

Benchmark results

Comparing the schoolbook multiplication routine of this PR to master. Here "small" means Uints with many zero limbs, showing off the proposed optimization.
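
The benchmark source isn't shown here, but a minimal criterion harness along these lines could generate such inputs, driving the schoolbook_mul_sparse sketch from above (group and input names are illustrative):

```rust
use criterion::{criterion_group, criterion_main, Criterion};
use std::hint::black_box;

fn bench_schoolbook(c: &mut Criterion) {
    let mut g = c.benchmark_group("schoolbook multiplication");

    // "small": only the lowest of four 64-bit limbs is non-zero.
    let small = [0xdead_beef_u64, 0, 0, 0];
    // "large": all four limbs are fully populated.
    let large = [u64::MAX; 4];

    g.bench_function("256_small", |b| {
        b.iter(|| {
            let mut out = [0u64; 8];
            schoolbook_mul_sparse(black_box(&small), black_box(&small), &mut out);
            out
        })
    });
    g.bench_function("256_large", |b| {
        b.iter(|| {
            let mut out = [0u64; 8];
            schoolbook_mul_sparse(black_box(&large), black_box(&large), &mut out);
            out
        })
    });
    g.finish();
}

criterion_group!(benches, bench_schoolbook);
criterion_main!(benches);
```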

    Finished `bench` profile [optimized] target(s) in 0.03s
     Running benches/schoolbook_optimization.rs (target/release/deps/schoolbook_optimization-f73306df8a7c43a1)
schoolbook multiplication/256_small
                        time:   [1.8621 ns 1.8653 ns 1.8702 ns]
                        change: [−64.020% −63.899% −63.781%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 17 outliers among 100 measurements (17.00%)
  5 (5.00%) low severe
  1 (1.00%) low mild
  3 (3.00%) high mild
  8 (8.00%) high severe
schoolbook multiplication/256_large
                        time:   [5.4027 ns 5.4105 ns 5.4206 ns]
                        change: [+5.0330% +5.2921% +5.5365%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 14 outliers among 100 measurements (14.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  10 (10.00%) high severe
schoolbook multiplication/512_small
                        time:   [5.1337 ns 5.1424 ns 5.1524 ns]
                        change: [−78.730% −78.666% −78.602%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) high mild
  8 (8.00%) high severe
schoolbook multiplication/512_large
                        time:   [36.978 ns 37.061 ns 37.158 ns]
                        change: [+45.556% +46.047% +46.505%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 16 outliers among 100 measurements (16.00%)
  3 (3.00%) low severe
  2 (2.00%) high mild
  11 (11.00%) high severe
schoolbook multiplication/1024_small
                        time:   [14.984 ns 15.084 ns 15.195 ns]
                        change: [−87.276% −87.187% −87.101%] (p = 0.00 < 0.05)
                        Performance has improved.
schoolbook multiplication/1024_large
                        time:   [177.53 ns 178.01 ns 178.63 ns]
                        change: [+35.184% +36.066% +36.891%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  6 (6.00%) high mild
schoolbook multiplication/1024x256_small
                        time:   [7.6910 ns 7.7112 ns 7.7340 ns]
                        change: [−78.357% −78.276% −78.193%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  5 (5.00%) high severe
schoolbook multiplication/1024x256_large
                        time:   [39.127 ns 39.156 ns 39.196 ns]
                        change: [+9.1693% +9.5518% +9.9466%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 11 outliers among 100 measurements (11.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  8 (8.00%) high severe
schoolbook multiplication/512x256_small
                        time:   [4.0481 ns 4.0678 ns 4.0909 ns]
                        change: [−60.056% −59.793% −59.479%] (p = 0.00 < 0.05)
                        Performance has improved.
schoolbook multiplication/512x256_large
                        time:   [20.488 ns 20.719 ns 21.009 ns]
                        change: [+97.759% +99.380% +101.23%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 12 outliers among 100 measurements (12.00%)
  5 (5.00%) high mild
  7 (7.00%) high severe
schoolbook multiplication/1024x512_large
                        time:   [75.598 ns 75.681 ns 75.791 ns]
                        change: [+13.272% +13.665% +14.074%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) high mild
  4 (4.00%) high severe
schoolbook multiplication/256_sparse
                        time:   [1.8942 ns 1.9061 ns 1.9192 ns]
                        change: [−63.378% −63.235% −63.074%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 17 outliers among 100 measurements (17.00%)
  7 (7.00%) high mild
  10 (10.00%) high severe

@tarcieri (Member)

> If the current LHS limb is zero, skip the iteration.
> If the current RHS limb is zero, speculatively check if the remaining RHS limbs are also zero and if they are, fast-forward the loop.

I assume this is about prospective vartime multiplication? That certainly doesn't sound constant-time!

@dvdplm (Contributor, Author) commented Jul 11, 2025

> I assume this is about prospective vartime multiplication?

Right, but the original isn't constant-time either, right?

@tarcieri (Member)

Which implementation are you talking about, and what isn't constant-time about it?

@dvdplm (Contributor, Author) commented Jul 11, 2025

You are right, the original code does appear to be CT. Ignore me.
