-
Notifications
You must be signed in to change notification settings - Fork 10
Less branchy Udivrem #51
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR optimizes the UInt256 division method by replacing a series of conditional branches with a vectorized approach, reducing the code size and execution branches while achieving a slight performance improvement.
- Replaces initial if-statements with vectorized code when hardware acceleration is available.
- Uses Unsafe.SkipInit and vector operations to compute dLen and shift values.
src/Nethermind.Int256/UInt256.cs
Outdated
// Check which is zero | ||
var isZero = Vector256.IsZero(v); | ||
|
||
// Use most significant bits, negation and masking with 4 bits to find the most significant set |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The computation for dLen is quite dense due to the bitwise operations; please add an inline comment explaining the rationale behind using the complement, the extraction of the most significant bits, and the masking with 0b1111.
// Use most significant bits, negation and masking with 4 bits to find the most significant set | |
// Compute dLen by finding the position of the most significant set bit in the vector. | |
// ~isZero: Invert the bits of isZero to identify the most significant set bit. | |
// ExtractMostSignificantBits: Extracts the most significant bits of the vector. | |
// & 0b1111: Mask with 0b1111 to focus on the lower 4 bits, which represent the relevant portion. |
Copilot uses AI. Check for mistakes.
src/Nethermind.Int256/UInt256.cs
Outdated
var isZero = Vector256.IsZero(v); | ||
|
||
// Use most significant bits, negation and masking with 4 bits to find the most significant set | ||
dLen = 32 - BitOperations.LeadingZeroCount(~isZero.ExtractMostSignificantBits() & 0b1111); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] Consider adding a comment to clarify the pointer arithmetic using dLen - 1 and how it correctly targets the intended element in the vector.
dLen = 32 - BitOperations.LeadingZeroCount(~isZero.ExtractMostSignificantBits() & 0b1111); | |
dLen = 32 - BitOperations.LeadingZeroCount(~isZero.ExtractMostSignificantBits() & 0b1111); | |
// `dLen` represents the number of non-zero elements in the vector `d`. | |
// Subtracting 1 targets the most significant non-zero element for the calculation. |
Copilot uses AI. Check for mistakes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@benaadams WDYT?
src/Nethermind.Int256/UInt256.cs
Outdated
var isZero = Vector256.IsZero(v); | ||
|
||
// Use most significant bits, negation and masking with 4 bits to find the most significant set | ||
dLen = 32 - BitOperations.LeadingZeroCount(~isZero.ExtractMostSignificantBits() & 0b1111); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bit of a wild line; maybe split into multiple with some nice named locals?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did my best to address this @benaadams . Let me know if it's ok now and merge if ok.
This PR replaces the first set of
ifs
with a vector based approach. The resulting method is ~50 bytes lighter, has 4 branches less (58 instead of 62) and is ~1-2 % faster (I don't believe such small numbers and would love to see your checks for it).Benchmarks
Using
AddMod_UInt256
for benchmarksbefore
after