Less branchy Udivrem #51

Scooletz · 2025-05-12T10:29:01Z

This PR replaces the first set of ifs with a vector based approach. The resulting method is ~50 bytes lighter, has 4 branches less (58 instead of 62) and is ~1-2 % faster (I don't believe such small numbers and would love to see your checks for it).

Benchmarks

Using AddMod_UInt256 for benchmarks

before

Method	EnvironmentVariables	C	A	B	Mean	Error	StdDev	Ratio	RatioSD	Code Size	Allocated	Alloc Ratio
AddMod_UInt256	Empty	(619(...)658) [156]	(619(...)658) [156]	(619(...)658) [156]	46.182 ns	0.8056 ns	0.7142 ns	1.00	0.02	4,483 B	-	NA
AddMod_UInt256	DOTNET_EnableHWIntrinsic=0	(619(...)658) [156]	(619(...)658) [156]	(619(...)658) [156]	81.911 ns	0.4257 ns	0.3774 ns	1.77	0.03	6,280 B	-	NA

AddMod_UInt256	Empty	(619(...)658) [156]	(619(...)658) [156]	(115(...)935) [160]	56.944 ns	0.6850 ns	0.6408 ns	1.00	0.02	4,057 B	-	NA
AddMod_UInt256	DOTNET_EnableHWIntrinsic=0	(619(...)658) [156]	(619(...)658) [156]	(115(...)935) [160]	97.714 ns	0.6452 ns	0.5388 ns	1.72	0.02	5,124 B	-	NA

AddMod_UInt256	Empty	(619(...)658) [156]	(115(...)935) [160]	(619(...)658) [156]	57.312 ns	0.4676 ns	0.4145 ns	1.00	0.01	4,057 B	-	NA
AddMod_UInt256	DOTNET_EnableHWIntrinsic=0	(619(...)658) [156]	(115(...)935) [160]	(619(...)658) [156]	97.625 ns	0.7901 ns	0.7390 ns	1.70	0.02	5,124 B	-	NA

AddMod_UInt256	Empty	(619(...)658) [156]	(115(...)935) [160]	(115(...)935) [160]	57.197 ns	0.5008 ns	0.4439 ns	1.00	0.01	4,057 B	-	NA
AddMod_UInt256	DOTNET_EnableHWIntrinsic=0	(619(...)658) [156]	(115(...)935) [160]	(115(...)935) [160]	97.634 ns	0.5893 ns	0.5512 ns	1.71	0.02	5,124 B	-	NA

AddMod_UInt256	Empty	(115(...)935) [160]	(619(...)658) [156]	(619(...)658) [156]	6.200 ns	0.1172 ns	0.1096 ns	1.00	0.02	986 B	-	NA
AddMod_UInt256	DOTNET_EnableHWIntrinsic=0	(115(...)935) [160]	(619(...)658) [156]	(619(...)658) [156]	19.617 ns	0.3267 ns	0.3056 ns	3.17	0.07	2,280 B	-	NA

AddMod_UInt256	Empty	(115(...)935) [160]	(619(...)658) [156]	(115(...)935) [160]	56.108 ns	0.4235 ns	0.3754 ns	1.00	0.01	4,057 B	-	NA
AddMod_UInt256	DOTNET_EnableHWIntrinsic=0	(115(...)935) [160]	(619(...)658) [156]	(115(...)935) [160]	98.180 ns	0.7490 ns	0.6255 ns	1.75	0.02	5,124 B	-	NA

AddMod_UInt256	Empty	(115(...)935) [160]	(115(...)935) [160]	(619(...)658) [156]	55.214 ns	0.1613 ns	0.1347 ns	1.00	0.00	4,057 B	-	NA
AddMod_UInt256	DOTNET_EnableHWIntrinsic=0	(115(...)935) [160]	(115(...)935) [160]	(619(...)658) [156]	98.969 ns	0.7010 ns	0.6214 ns	1.79	0.01	5,124 B	-	NA

AddMod_UInt256	Empty	(115(...)935) [160]	(115(...)935) [160]	(115(...)935) [160]	55.759 ns	0.2273 ns	0.1775 ns	1.00	0.00	4,057 B	-	NA
AddMod_UInt256	DOTNET_EnableHWIntrinsic=0	(115(...)935) [160]	(115(...)935) [160]	(115(...)935) [160]	98.028 ns	0.7539 ns	0.7052 ns	1.76	0.01	5,124 B	-	NA

after

| Method         | EnvironmentVariables       | C                   | A                   | B                   | Mean       | Error     | StdDev    | Ratio | RatioSD | Code Size | Allocated | Alloc Ratio |
|--------------- |--------------------------- |-------------------- |-------------------- |-------------------- |-----------:|----------:|----------:|------:|--------:|----------:|----------:|------------:|
| **AddMod_UInt256** | **Empty**                      | **(619(...)658) [156]** | **(619(...)658) [156]** | **(619(...)658) [156]** |  **45.197 ns** | **0.5057 ns** | **0.4731 ns** |  **1.00** |    **0.01** |   **4,430 B** |         **-** |          **NA** |
| AddMod_UInt256 | DOTNET_EnableHWIntrinsic=0 | (619(...)658) [156] | (619(...)658) [156] | (619(...)658) [156] |  81.147 ns | 0.6087 ns | 0.5694 ns |  1.80 |    0.02 |   6,259 B |         - |          NA |
|                |                            |                     |                     |                     |            |           |           |       |         |           |           |             |
| **AddMod_UInt256** | **Empty**                      | **(619(...)658) [156]** | **(619(...)658) [156]** | **(115(...)935) [160]** |  **55.656 ns** | **0.1960 ns** | **0.1637 ns** |  **1.00** |    **0.00** |   **4,004 B** |         **-** |          **NA** |
| AddMod_UInt256 | DOTNET_EnableHWIntrinsic=0 | (619(...)658) [156] | (619(...)658) [156] | (115(...)935) [160] |  98.319 ns | 0.8007 ns | 0.7490 ns |  1.77 |    0.01 |   5,103 B |         - |          NA |
|                |                            |                     |                     |                     |            |           |           |       |         |           |           |             |
| **AddMod_UInt256** | **Empty**                      | **(619(...)658) [156]** | **(115(...)935) [160]** | **(619(...)658) [156]** |  **55.658 ns** | **0.3238 ns** | **0.2704 ns** |  **1.00** |    **0.01** |   **4,004 B** |         **-** |          **NA** |
| AddMod_UInt256 | DOTNET_EnableHWIntrinsic=0 | (619(...)658) [156] | (115(...)935) [160] | (619(...)658) [156] |  97.076 ns | 0.3054 ns | 0.2707 ns |  1.74 |    0.01 |   5,103 B |         - |          NA |
|                |                            |                     |                     |                     |            |           |           |       |         |           |           |             |
| **AddMod_UInt256** | **Empty**                      | **(619(...)658) [156]** | **(115(...)935) [160]** | **(115(...)935) [160]** |  **56.254 ns** | **0.3116 ns** | **0.2914 ns** |  **1.00** |    **0.01** |   **4,004 B** |         **-** |          **NA** |
| AddMod_UInt256 | DOTNET_EnableHWIntrinsic=0 | (619(...)658) [156] | (115(...)935) [160] | (115(...)935) [160] |  97.067 ns | 0.3199 ns | 0.2836 ns |  1.73 |    0.01 |   5,103 B |         - |          NA |
|                |                            |                     |                     |                     |            |           |           |       |         |           |           |             |
| **AddMod_UInt256** | **Empty**                      | **(115(...)935) [160]** | **(619(...)658) [156]** | **(619(...)658) [156]** |   **5.869 ns** | **0.0581 ns** | **0.0515 ns** |  **1.00** |    **0.01** |     **986 B** |         **-** |          **NA** |
| AddMod_UInt256 | DOTNET_EnableHWIntrinsic=0 | (115(...)935) [160] | (619(...)658) [156] | (619(...)658) [156] |  19.572 ns | 0.1608 ns | 0.1426 ns |  3.34 |    0.04 |   2,280 B |         - |          NA |
|                |                            |                     |                     |                     |            |           |           |       |         |           |           |             |
| **AddMod_UInt256** | **Empty**                      | **(115(...)935) [160]** | **(619(...)658) [156]** | **(115(...)935) [160]** |  **55.467 ns** | **0.4505 ns** | **0.3993 ns** |  **1.00** |    **0.01** |   **4,004 B** |         **-** |          **NA** |
| AddMod_UInt256 | DOTNET_EnableHWIntrinsic=0 | (115(...)935) [160] | (619(...)658) [156] | (115(...)935) [160] | 100.209 ns | 0.6543 ns | 0.5801 ns |  1.81 |    0.02 |   5,103 B |         - |          NA |
|                |                            |                     |                     |                     |            |           |           |       |         |           |           |             |
| **AddMod_UInt256** | **Empty**                      | **(115(...)935) [160]** | **(115(...)935) [160]** | **(619(...)658) [156]** |  **55.399 ns** | **0.3073 ns** | **0.2875 ns** |  **1.00** |    **0.01** |   **4,004 B** |         **-** |          **NA** |
| AddMod_UInt256 | DOTNET_EnableHWIntrinsic=0 | (115(...)935) [160] | (115(...)935) [160] | (619(...)658) [156] |  98.620 ns | 0.6510 ns | 0.6090 ns |  1.78 |    0.01 |   5,103 B |         - |          NA |
|                |                            |                     |                     |                     |            |           |           |       |         |           |           |             |
| **AddMod_UInt256** | **Empty**                      | **(115(...)935) [160]** | **(115(...)935) [160]** | **(115(...)935) [160]** |  **55.067 ns** | **0.2233 ns** | **0.1864 ns** |  **1.00** |    **0.00** |   **4,004 B** |         **-** |          **NA** |
| AddMod_UInt256 | DOTNET_EnableHWIntrinsic=0 | (115(...)935) [160] | (115(...)935) [160] | (115(...)935) [160] |  97.706 ns | 0.7141 ns | 0.6680 ns |  1.77 |    0.01 |   5,103 B |         - |          NA |

</details>

Copilot

Pull Request Overview

This PR optimizes the UInt256 division method by replacing a series of conditional branches with a vectorized approach, reducing the code size and execution branches while achieving a slight performance improvement.

Replaces initial if-statements with vectorized code when hardware acceleration is available.
Uses Unsafe.SkipInit and vector operations to compute dLen and shift values.

Copilot · 2025-05-12T10:42:04Z

src/Nethermind.Int256/UInt256.cs

+                // Check which is zero
+                var isZero = Vector256.IsZero(v);
+
+                // Use most significant bits, negation and masking with 4 bits to find the most significant set


The computation for dLen is quite dense due to the bitwise operations; please add an inline comment explaining the rationale behind using the complement, the extraction of the most significant bits, and the masking with 0b1111.

Suggested change

// Use most significant bits, negation and masking with 4 bits to find the most significant set

// Compute dLen by finding the position of the most significant set bit in the vector.

// ~isZero: Invert the bits of isZero to identify the most significant set bit.

// ExtractMostSignificantBits: Extracts the most significant bits of the vector.

// & 0b1111: Mask with 0b1111 to focus on the lower 4 bits, which represent the relevant portion.

Copilot · 2025-05-12T10:42:05Z

src/Nethermind.Int256/UInt256.cs

+                var isZero = Vector256.IsZero(v);
+
+                // Use most significant bits, negation and masking with 4 bits to find the most significant set
+                dLen = 32 - BitOperations.LeadingZeroCount(~isZero.ExtractMostSignificantBits() & 0b1111);


[nitpick] Consider adding a comment to clarify the pointer arithmetic using dLen - 1 and how it correctly targets the intended element in the vector.

Suggested change

dLen = 32 - BitOperations.LeadingZeroCount(~isZero.ExtractMostSignificantBits() & 0b1111);

dLen = 32 - BitOperations.LeadingZeroCount(~isZero.ExtractMostSignificantBits() & 0b1111);

// `dLen` represents the number of non-zero elements in the vector `d`.

// Subtracting 1 targets the most significant non-zero element for the calculation.

LukaszRozmej

@benaadams WDYT?

benaadams · 2025-05-12T15:15:45Z

src/Nethermind.Int256/UInt256.cs

+                var isZero = Vector256.IsZero(v);
+
+                // Use most significant bits, negation and masking with 4 bits to find the most significant set
+                dLen = 32 - BitOperations.LeadingZeroCount(~isZero.ExtractMostSignificantBits() & 0b1111);


This is a bit of a wild line; maybe split into multiple with some nice named locals?

Did my best to address this @benaadams . Let me know if it's ok now and merge if ok.

branch less Udivrem

2efb7f7

Scooletz marked this pull request as draft May 12, 2025 10:29

Scooletz marked this pull request as ready for review May 12, 2025 10:36

some comments

0e90639

Scooletz requested review from LukaszRozmej, Copilot and benaadams May 12, 2025 10:41

Copilot AI reviewed May 12, 2025

View reviewed changes

LukaszRozmej approved these changes May 12, 2025

View reviewed changes

benaadams reviewed May 12, 2025

View reviewed changes

benaadams approved these changes May 12, 2025

View reviewed changes

readability

8b754eb

Scooletz requested a review from benaadams May 13, 2025 07:53

benaadams approved these changes May 14, 2025

View reviewed changes

Scooletz merged commit 3b88940 into main May 14, 2025
4 checks passed

Scooletz deleted the udivrem-vectored branch May 14, 2025 07:55

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Less branchy Udivrem #51

Less branchy Udivrem #51

Uh oh!

Scooletz commented May 12, 2025 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI May 12, 2025

Uh oh!

Copilot AI May 12, 2025

Uh oh!

LukaszRozmej left a comment

Uh oh!

benaadams May 12, 2025

Uh oh!

Scooletz May 13, 2025

Uh oh!

Uh oh!

Uh oh!

-                // Use most significant bits, negation and masking with 4 bits to find the most significant set
+                // Compute dLen by finding the position of the most significant set bit in the vector.
+                // ~isZero: Invert the bits of isZero to identify the most significant set bit.
+                // ExtractMostSignificantBits: Extracts the most significant bits of the vector.
+                // & 0b1111: Mask with 0b1111 to focus on the lower 4 bits, which represent the relevant portion.

Less branchy Udivrem #51

Less branchy Udivrem #51

Uh oh!

Conversation

Scooletz commented May 12, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Benchmarks

before

after

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Uh oh!

Copilot AI May 12, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI May 12, 2025

Choose a reason for hiding this comment

Uh oh!

LukaszRozmej left a comment

Choose a reason for hiding this comment

Uh oh!

benaadams May 12, 2025

Choose a reason for hiding this comment

Uh oh!

Scooletz May 13, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Scooletz commented May 12, 2025 •

edited

Loading