Faster schoolbook multiplication for sparse inputs #867
In the end this effort failed, but here's a report on what I tried so others can avoid the effort.
I tried several different approaches:
get_unchecked and mutation-in-place all over, that sort of thing. Result: a fair bit slower than the original, especially when multiplying numbers of different sizes, e.g. U512 * U256.
The version I finally settled on is much faster for "sparse" limbs, ~4x faster, and is about 10% slower for "dense" operands. It keeps the original's loop structure and adds two optimizations:
The results, as hinted at above, are meager. When the inputs contain zero limbs there's a significant speedup; how much obviously depends on the data, but 2x to 8x is the range I have seen in my benchmarks. This comes at a cost for more commonly seen inputs: for some input sizes the difference is small, less than 5%, but sometimes it's considerably larger than that.
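For readers unfamiliar with the idea, here is a minimal sketch of how a schoolbook loop can take advantage of zero limbs. It is not the code from this PR: the function name `mul_schoolbook_skip_zero`, the use of `u64` limbs in little-endian order, and the heap-allocated `Vec` output are all assumptions made for illustration only.

```rust
/// Hypothetical illustration, not the PR's implementation.
/// Multiplies two little-endian sequences of u64 limbs and returns
/// a product with `a.len() + b.len()` limbs.
fn mul_schoolbook_skip_zero(a: &[u64], b: &[u64]) -> Vec<u64> {
    let mut out = vec![0u64; a.len() + b.len()];
    for (i, &ai) in a.iter().enumerate() {
        // Sparse shortcut: a zero limb adds nothing to the product,
        // so the whole inner loop can be skipped for it.
        if ai == 0 {
            continue;
        }
        let mut carry: u64 = 0;
        for (j, &bj) in b.iter().enumerate() {
            // ai * bj + out[i + j] + carry cannot overflow u128:
            // (2^64 - 1)^2 + 2 * (2^64 - 1) = 2^128 - 1.
            let t = (ai as u128) * (bj as u128) + (out[i + j] as u128) + (carry as u128);
            out[i + j] = t as u64; // low 64 bits
            carry = (t >> 64) as u64; // high 64 bits
        }
        // This slot has not been written by any earlier row.
        out[i + b.len()] = carry;
    }
    out
}
```

The extra branch in the outer loop is the kind of per-limb cost that dense operands pay without getting anything back, which is one plausible way to read the trade-off described above.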
When running the whole benchmark suite, the sparse number optimization does show up, yielding a change of more than 5% in 27 out of 137 tests (mostly speedups, but some bad regressions too), but overall the bag is too mixed and not in the right way.
Overall I do not think this PR should be merged.
Benchmark results
Comparing the schoolbook multiplication routine of this PR to master. Here "small" means Uints with many empty limbs, showing off the optimization proposed.