
add qualcomm 8cx gen 3 benchmark + minor tuning #48


Merged: 3 commits merged into main on Jun 25, 2024

Conversation

@lemire lemire (Member) commented Jun 24, 2024

Should be self-explanatory. I am focusing on ARM performance.

@Nick-Nuon Nick-Nuon (Collaborator) left a comment


Looks good to me!

Vector128<byte> block1 = AdvSimd.LoadVector128(pInputBuffer + processedLength + localasciirun);
Vector128<byte> block2 = AdvSimd.LoadVector128(pInputBuffer + processedLength + localasciirun + 16);
Vector128<byte> block3 = AdvSimd.LoadVector128(pInputBuffer + processedLength + localasciirun + 32);
Vector128<byte> block4 = AdvSimd.LoadVector128(pInputBuffer + processedLength + localasciirun + 48);
Vector128<byte> or = AdvSimd.Or(AdvSimd.Or(block1, block2), AdvSimd.Or(block3, block4));
@neon-sunset neon-sunset commented Jun 24, 2024


It appears that CSE does not perform as well as it could here, blocking the peephole optimization that would coalesce the loads. Note that in this specific case, using the platform intrinsic makes no difference compared to plain Vector128.Load:

byte* pAsciiBuffer = pInputBuffer + asciirun;
Vector128<byte> block1 = Vector128.Load(pAsciiBuffer);
Vector128<byte> block2 = Vector128.Load(pAsciiBuffer + 16);
Vector128<byte> block3 = Vector128.Load(pAsciiBuffer + 32);
Vector128<byte> block4 = Vector128.Load(pAsciiBuffer + 48);

Before:

G_M146_IG06:  ;; offset=0x0048
            sxtw    x5, w4
            ldr     q16, [x0, x5]
            add     x6, x0, x5
            ldr     q17, [x6, #0x10]
            orr     v16.16b, v16.16b, v17.16b
            add     x6, x0, x5
            ldr     q17, [x6, #0x20]
            add     x5, x0, x5
            ldr     q18, [x5, #0x30]
            orr     v17.16b, v17.16b, v18.16b
            orr     v16.16b, v16.16b, v17.16b
            umaxv   b16, v16.16b
            umov    w5, v16.b[0]
            cmp     w5, #127
            bgt     G_M146_IG07
            add     w4, w4, #64
            add     w5, w4, #64
            cmp     w5, w1
            ble     G_M146_IG06

After:

G_M146_IG06:  ;; offset=0x0048
            add     x5, x0, w4, SXTW
            ldp     q16, q17, [x5]
            orr     v16.16b, v16.16b, v17.16b
            ldp     q17, q18, [x5, #0x20]
            orr     v17.16b, v17.16b, v18.16b
            orr     v16.16b, v16.16b, v17.16b
            umaxv   b16, v16.16b
            umov    w5, v16.b[0]
            cmp     w5, #127
            bgt     G_M146_IG07
            add     w4, w4, #64
            add     w5, w4, #64
            cmp     w5, w1
            ble     G_M146_IG06

Probably does not matter much, but it is one step away from perfect codegen :)
cc @EgorBo
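
To tie the C# to the codegen above, here is a self-contained sketch of the 64-byte ASCII check being discussed. The class and method names are hypothetical, the variable names follow the snippets above, and the real SimdUnicode loop carries more bookkeeping than shown:

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;

// Hypothetical wrapper, not the library's API.
static unsafe class AsciiFastPath
{
    // Returns true if all 64 bytes starting at pAsciiBuffer are ASCII.
    public static bool IsAscii64(byte* pAsciiBuffer)
    {
        // A single base pointer lets the JIT coalesce the four 16-byte loads
        // into two ldp instructions, as in the "After" listing below.
        Vector128<byte> block1 = Vector128.Load(pAsciiBuffer);
        Vector128<byte> block2 = Vector128.Load(pAsciiBuffer + 16);
        Vector128<byte> block3 = Vector128.Load(pAsciiBuffer + 32);
        Vector128<byte> block4 = Vector128.Load(pAsciiBuffer + 48);

        // OR the blocks together: any byte >= 0x80 survives into `combined`.
        Vector128<byte> combined = AdvSimd.Or(AdvSimd.Or(block1, block2),
                                              AdvSimd.Or(block3, block4));

        // This corresponds to the umaxv + cmp #127 pair in the listings: if
        // the maximum byte exceeds 0x7F, the chunk contains a non-ASCII byte.
        return AdvSimd.Arm64.MaxAcross(combined).ToScalar() <= 0x7F;
    }
}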

@lemire lemire (Member, Author):

@neon-sunset Oddly enough, on my Apple M2 I am consistently getting better results with the code as it is... If you have ARM hardware, can you try running:

 dotnet run --configuration Release --filter "*Latin*"

I am getting rock-solid better results with the current code.

I have not looked at the generated code.
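
(For anyone reproducing this: a minimal, hypothetical shape of such a BenchmarkDotNet benchmark is sketched below. The class, method, and parameter names are assumptions, and the BCL's Utf8.IsValid (.NET 8+) stands in for the project's own validator, which is what the real suite measures.)

using System;
using System.IO;
using System.Text.Unicode;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class Utf8ValidationBenchmark
{
    // In this sketch, --filter "*Latin*" matches via the "Latin" in this
    // file-name parameter, since the glob is applied to the full benchmark name.
    [Params("data/Latin-Lipsum.utf8.txt")]
    public string FileName = string.Empty;

    private byte[] _data = Array.Empty<byte>();

    [GlobalSetup]
    public void Setup() => _data = File.ReadAllBytes(FileName);

    // Stand-in for the project's validator.
    [Benchmark]
    public bool Validate() => Utf8.IsValid(_data);
}

public static class Program
{
    public static void Main(string[] args) =>
        BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);
}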

@neon-sunset neon-sunset commented Jun 25, 2024


> Probably does not matter much

Turns out this assumption was completely wrong: with the change, it now hits the 100 GB/s mark as measured by the benchmark suite.

BenchmarkDotNet v0.13.12, macOS 15.0 (24A5264n) [Darwin 24.0.0]
Apple M1 Pro, 1 CPU, 8 logical and 8 physical cores
.NET SDK 9.0.100-preview.6.24314.10
  [Host]     : .NET 9.0.0 (9.0.24.30702), Arm64 RyuJIT AdvSIMD
  Job-WPCVMR : .NET 9.0.0 (9.0.24.30702), Arm64 RyuJIT AdvSIMD

MaxIterationCount=10  MinIterationCount=2  WarmupCount=1 
| Method                              | FileName                   | Mean     | Error     | StdDev    | Speed (GB/s) |
|------------------------------------ |--------------------------- |---------:|----------:|----------:|-------------:|
| SIMDUtf8ValidationRealData-Baseline | data/Latin-Lipsum.utf8.txt | 1.026 us | 0.0076 us | 0.0012 us |        84.75 |
| SIMDUtf8ValidationRealData-Change   | data/Latin-Lipsum.utf8.txt | 868.0 ns |   5.39 ns |   0.83 ns |       100.16 |
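
Sanity check on the table: throughput is input size divided by mean time, so the ratio of the means (1.026 µs / 868 ns ≈ 1.18) matches the ratio of the reported speeds (100.16 / 84.75 ≈ 1.18), i.e. roughly an 18% improvement on this file.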

@lemire lemire (Member, Author):

Are you game to do a PR?

I see you are using .NET 9; could this be the difference?

@neon-sunset neon-sunset:

> Are you game to do a PR?

Sure, thanks! I will look through the implementation to see if there is anything else that stands out.

For the Latin scenario, .NET 9 shouldn't make a significant difference. There were changes to loop handling, but here the win comes purely from merging the loads and removing the redundant adds (which might also introduce a false dependency?).

@lemire lemire (Member, Author):

> Sure, thanks! I will look through the implementation to see if there is anything else that stands out.

Please do!!!

@lemire lemire (Member, Author):

> For the Latin scenario, .NET 9 shouldn't make a significant difference. There were changes to loop handling, but here the win comes purely from merging the loads and removing the redundant adds (which might also introduce a false dependency?).

I installed .NET 9 and I see no difference. Your proposal is still a net negative. But I am going to toy a bit more with it.

@neon-sunset neon-sunset commented Jun 25, 2024


> I installed .NET 9 and I see no difference. Your proposal is still a net negative. But I am going to toy a bit more with it.

Is it regressing performance on the 8cx Gen 3 and/or Neoverse N1? I assume it should be a similar improvement on M2...

Edit: I managed to reproduce the Latin regression when running the full suite. The numbers are unstable overall; I will tune, fix, and submit the findings in a PR.

@lemire lemire (Member, Author):

> I assume it should be a similar improvement on M2...

I would assume so, but it seems that is not the case.

@lemire lemire (Member, Author) commented Jun 25, 2024

I am going to merge this, but PRs are invited (@neon-sunset).

@lemire lemire merged commit bd4a55a into main Jun 25, 2024
6 checks passed
3 participants