Skip to content

Less branchy Udivrem #51

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
May 14, 2025
Merged

Less branchy Udivrem #51

merged 3 commits into from
May 14, 2025

Conversation

Scooletz
Copy link
Contributor

@Scooletz Scooletz commented May 12, 2025

This PR replaces the first set of ifs with a vector based approach. The resulting method is ~50 bytes lighter, has 4 branches less (58 instead of 62) and is ~1-2 % faster (I don't believe such small numbers and would love to see your checks for it).

Benchmarks

Using AddMod_UInt256 for benchmarks

before

Method EnvironmentVariables C A B Mean Error StdDev Ratio RatioSD Code Size Allocated Alloc Ratio
AddMod_UInt256 Empty (619(...)658) [156] (619(...)658) [156] (619(...)658) [156] 46.182 ns 0.8056 ns 0.7142 ns 1.00 0.02 4,483 B - NA
AddMod_UInt256 DOTNET_EnableHWIntrinsic=0 (619(...)658) [156] (619(...)658) [156] (619(...)658) [156] 81.911 ns 0.4257 ns 0.3774 ns 1.77 0.03 6,280 B - NA
AddMod_UInt256 Empty (619(...)658) [156] (619(...)658) [156] (115(...)935) [160] 56.944 ns 0.6850 ns 0.6408 ns 1.00 0.02 4,057 B - NA
AddMod_UInt256 DOTNET_EnableHWIntrinsic=0 (619(...)658) [156] (619(...)658) [156] (115(...)935) [160] 97.714 ns 0.6452 ns 0.5388 ns 1.72 0.02 5,124 B - NA
AddMod_UInt256 Empty (619(...)658) [156] (115(...)935) [160] (619(...)658) [156] 57.312 ns 0.4676 ns 0.4145 ns 1.00 0.01 4,057 B - NA
AddMod_UInt256 DOTNET_EnableHWIntrinsic=0 (619(...)658) [156] (115(...)935) [160] (619(...)658) [156] 97.625 ns 0.7901 ns 0.7390 ns 1.70 0.02 5,124 B - NA
AddMod_UInt256 Empty (619(...)658) [156] (115(...)935) [160] (115(...)935) [160] 57.197 ns 0.5008 ns 0.4439 ns 1.00 0.01 4,057 B - NA
AddMod_UInt256 DOTNET_EnableHWIntrinsic=0 (619(...)658) [156] (115(...)935) [160] (115(...)935) [160] 97.634 ns 0.5893 ns 0.5512 ns 1.71 0.02 5,124 B - NA
AddMod_UInt256 Empty (115(...)935) [160] (619(...)658) [156] (619(...)658) [156] 6.200 ns 0.1172 ns 0.1096 ns 1.00 0.02 986 B - NA
AddMod_UInt256 DOTNET_EnableHWIntrinsic=0 (115(...)935) [160] (619(...)658) [156] (619(...)658) [156] 19.617 ns 0.3267 ns 0.3056 ns 3.17 0.07 2,280 B - NA
AddMod_UInt256 Empty (115(...)935) [160] (619(...)658) [156] (115(...)935) [160] 56.108 ns 0.4235 ns 0.3754 ns 1.00 0.01 4,057 B - NA
AddMod_UInt256 DOTNET_EnableHWIntrinsic=0 (115(...)935) [160] (619(...)658) [156] (115(...)935) [160] 98.180 ns 0.7490 ns 0.6255 ns 1.75 0.02 5,124 B - NA
AddMod_UInt256 Empty (115(...)935) [160] (115(...)935) [160] (619(...)658) [156] 55.214 ns 0.1613 ns 0.1347 ns 1.00 0.00 4,057 B - NA
AddMod_UInt256 DOTNET_EnableHWIntrinsic=0 (115(...)935) [160] (115(...)935) [160] (619(...)658) [156] 98.969 ns 0.7010 ns 0.6214 ns 1.79 0.01 5,124 B - NA
AddMod_UInt256 Empty (115(...)935) [160] (115(...)935) [160] (115(...)935) [160] 55.759 ns 0.2273 ns 0.1775 ns 1.00 0.00 4,057 B - NA
AddMod_UInt256 DOTNET_EnableHWIntrinsic=0 (115(...)935) [160] (115(...)935) [160] (115(...)935) [160] 98.028 ns 0.7539 ns 0.7052 ns 1.76 0.01 5,124 B - NA

after

| Method         | EnvironmentVariables       | C                   | A                   | B                   | Mean       | Error     | StdDev    | Ratio | RatioSD | Code Size | Allocated | Alloc Ratio |
|--------------- |--------------------------- |-------------------- |-------------------- |-------------------- |-----------:|----------:|----------:|------:|--------:|----------:|----------:|------------:|
| **AddMod_UInt256** | **Empty**                      | **(619(...)658) [156]** | **(619(...)658) [156]** | **(619(...)658) [156]** |  **45.197 ns** | **0.5057 ns** | **0.4731 ns** |  **1.00** |    **0.01** |   **4,430 B** |         **-** |          **NA** |
| AddMod_UInt256 | DOTNET_EnableHWIntrinsic=0 | (619(...)658) [156] | (619(...)658) [156] | (619(...)658) [156] |  81.147 ns | 0.6087 ns | 0.5694 ns |  1.80 |    0.02 |   6,259 B |         - |          NA |
|                |                            |                     |                     |                     |            |           |           |       |         |           |           |             |
| **AddMod_UInt256** | **Empty**                      | **(619(...)658) [156]** | **(619(...)658) [156]** | **(115(...)935) [160]** |  **55.656 ns** | **0.1960 ns** | **0.1637 ns** |  **1.00** |    **0.00** |   **4,004 B** |         **-** |          **NA** |
| AddMod_UInt256 | DOTNET_EnableHWIntrinsic=0 | (619(...)658) [156] | (619(...)658) [156] | (115(...)935) [160] |  98.319 ns | 0.8007 ns | 0.7490 ns |  1.77 |    0.01 |   5,103 B |         - |          NA |
|                |                            |                     |                     |                     |            |           |           |       |         |           |           |             |
| **AddMod_UInt256** | **Empty**                      | **(619(...)658) [156]** | **(115(...)935) [160]** | **(619(...)658) [156]** |  **55.658 ns** | **0.3238 ns** | **0.2704 ns** |  **1.00** |    **0.01** |   **4,004 B** |         **-** |          **NA** |
| AddMod_UInt256 | DOTNET_EnableHWIntrinsic=0 | (619(...)658) [156] | (115(...)935) [160] | (619(...)658) [156] |  97.076 ns | 0.3054 ns | 0.2707 ns |  1.74 |    0.01 |   5,103 B |         - |          NA |
|                |                            |                     |                     |                     |            |           |           |       |         |           |           |             |
| **AddMod_UInt256** | **Empty**                      | **(619(...)658) [156]** | **(115(...)935) [160]** | **(115(...)935) [160]** |  **56.254 ns** | **0.3116 ns** | **0.2914 ns** |  **1.00** |    **0.01** |   **4,004 B** |         **-** |          **NA** |
| AddMod_UInt256 | DOTNET_EnableHWIntrinsic=0 | (619(...)658) [156] | (115(...)935) [160] | (115(...)935) [160] |  97.067 ns | 0.3199 ns | 0.2836 ns |  1.73 |    0.01 |   5,103 B |         - |          NA |
|                |                            |                     |                     |                     |            |           |           |       |         |           |           |             |
| **AddMod_UInt256** | **Empty**                      | **(115(...)935) [160]** | **(619(...)658) [156]** | **(619(...)658) [156]** |   **5.869 ns** | **0.0581 ns** | **0.0515 ns** |  **1.00** |    **0.01** |     **986 B** |         **-** |          **NA** |
| AddMod_UInt256 | DOTNET_EnableHWIntrinsic=0 | (115(...)935) [160] | (619(...)658) [156] | (619(...)658) [156] |  19.572 ns | 0.1608 ns | 0.1426 ns |  3.34 |    0.04 |   2,280 B |         - |          NA |
|                |                            |                     |                     |                     |            |           |           |       |         |           |           |             |
| **AddMod_UInt256** | **Empty**                      | **(115(...)935) [160]** | **(619(...)658) [156]** | **(115(...)935) [160]** |  **55.467 ns** | **0.4505 ns** | **0.3993 ns** |  **1.00** |    **0.01** |   **4,004 B** |         **-** |          **NA** |
| AddMod_UInt256 | DOTNET_EnableHWIntrinsic=0 | (115(...)935) [160] | (619(...)658) [156] | (115(...)935) [160] | 100.209 ns | 0.6543 ns | 0.5801 ns |  1.81 |    0.02 |   5,103 B |         - |          NA |
|                |                            |                     |                     |                     |            |           |           |       |         |           |           |             |
| **AddMod_UInt256** | **Empty**                      | **(115(...)935) [160]** | **(115(...)935) [160]** | **(619(...)658) [156]** |  **55.399 ns** | **0.3073 ns** | **0.2875 ns** |  **1.00** |    **0.01** |   **4,004 B** |         **-** |          **NA** |
| AddMod_UInt256 | DOTNET_EnableHWIntrinsic=0 | (115(...)935) [160] | (115(...)935) [160] | (619(...)658) [156] |  98.620 ns | 0.6510 ns | 0.6090 ns |  1.78 |    0.01 |   5,103 B |         - |          NA |
|                |                            |                     |                     |                     |            |           |           |       |         |           |           |             |
| **AddMod_UInt256** | **Empty**                      | **(115(...)935) [160]** | **(115(...)935) [160]** | **(115(...)935) [160]** |  **55.067 ns** | **0.2233 ns** | **0.1864 ns** |  **1.00** |    **0.00** |   **4,004 B** |         **-** |          **NA** |
| AddMod_UInt256 | DOTNET_EnableHWIntrinsic=0 | (115(...)935) [160] | (115(...)935) [160] | (115(...)935) [160] |  97.706 ns | 0.7141 ns | 0.6680 ns |  1.77 |    0.01 |   5,103 B |         - |          NA |

</details>

@Scooletz Scooletz marked this pull request as draft May 12, 2025 10:29
@Scooletz Scooletz marked this pull request as ready for review May 12, 2025 10:36
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR optimizes the UInt256 division method by replacing a series of conditional branches with a vectorized approach, reducing the code size and execution branches while achieving a slight performance improvement.

  • Replaces initial if-statements with vectorized code when hardware acceleration is available.
  • Uses Unsafe.SkipInit and vector operations to compute dLen and shift values.

// Check which is zero
var isZero = Vector256.IsZero(v);

// Use most significant bits, negation and masking with 4 bits to find the most significant set
Copy link
Preview

Copilot AI May 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The computation for dLen is quite dense due to the bitwise operations; please add an inline comment explaining the rationale behind using the complement, the extraction of the most significant bits, and the masking with 0b1111.

Suggested change
// Use most significant bits, negation and masking with 4 bits to find the most significant set
// Compute dLen by finding the position of the most significant set bit in the vector.
// ~isZero: Invert the bits of isZero to identify the most significant set bit.
// ExtractMostSignificantBits: Extracts the most significant bits of the vector.
// & 0b1111: Mask with 0b1111 to focus on the lower 4 bits, which represent the relevant portion.

Copilot uses AI. Check for mistakes.

var isZero = Vector256.IsZero(v);

// Use most significant bits, negation and masking with 4 bits to find the most significant set
dLen = 32 - BitOperations.LeadingZeroCount(~isZero.ExtractMostSignificantBits() & 0b1111);
Copy link
Preview

Copilot AI May 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] Consider adding a comment to clarify the pointer arithmetic using dLen - 1 and how it correctly targets the intended element in the vector.

Suggested change
dLen = 32 - BitOperations.LeadingZeroCount(~isZero.ExtractMostSignificantBits() & 0b1111);
dLen = 32 - BitOperations.LeadingZeroCount(~isZero.ExtractMostSignificantBits() & 0b1111);
// `dLen` represents the number of non-zero elements in the vector `d`.
// Subtracting 1 targets the most significant non-zero element for the calculation.

Copilot uses AI. Check for mistakes.

Copy link
Member

@LukaszRozmej LukaszRozmej left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@benaadams WDYT?

var isZero = Vector256.IsZero(v);

// Use most significant bits, negation and masking with 4 bits to find the most significant set
dLen = 32 - BitOperations.LeadingZeroCount(~isZero.ExtractMostSignificantBits() & 0b1111);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit of a wild line; maybe split into multiple with some nice named locals?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did my best to address this @benaadams . Let me know if it's ok now and merge if ok.

@Scooletz Scooletz requested a review from benaadams May 13, 2025 07:53
@Scooletz Scooletz merged commit 3b88940 into main May 14, 2025
4 checks passed
@Scooletz Scooletz deleted the udivrem-vectored branch May 14, 2025 07:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants