add qualcomm 8cx gen 3 benchmark + minor tuning #48
Looks good to me!
```csharp
Vector128<byte> block2 = AdvSimd.LoadVector128(pInputBuffer + processedLength + localasciirun + 16);
Vector128<byte> block3 = AdvSimd.LoadVector128(pInputBuffer + processedLength + localasciirun + 32);
Vector128<byte> block4 = AdvSimd.LoadVector128(pInputBuffer + processedLength + localasciirun + 48);
Vector128<byte> or = AdvSimd.Or(AdvSimd.Or(block1, block2), AdvSimd.Or(block3, block4));
```
It appears that CSE does not perform as well as it could here, blocking the peephole that would coalesce the loads. Note that in this specific case the platform intrinsic makes no difference compared to plain Vector128.Load:
```csharp
byte* pAsciiBuffer = pInputBuffer + asciirun;
Vector128<byte> block1 = Vector128.Load(pAsciiBuffer);
Vector128<byte> block2 = Vector128.Load(pAsciiBuffer + 16);
Vector128<byte> block3 = Vector128.Load(pAsciiBuffer + 32);
Vector128<byte> block4 = Vector128.Load(pAsciiBuffer + 48);
```
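For context, the OR-reduced block is then tested for a non-ASCII byte, which is what the `umaxv`/`cmp #127` pair in the codegen dumps corresponds to. A minimal sketch of that check, assuming the variable names from the quoted snippet (not the exact code in this repository):

```csharp
// Hedged sketch: OR the four 16-byte blocks together, then reduce to the
// maximum byte. Any value above 0x7F means this 64-byte chunk contains a
// non-ASCII byte and needs the full UTF-8 validation path.
Vector128<byte> combined = AdvSimd.Or(AdvSimd.Or(block1, block2),
                                      AdvSimd.Or(block3, block4));
if (AdvSimd.Arm64.MaxAcross(combined).ToScalar() > 0x7F)
{
    // leave the ASCII fast path and run the full UTF-8 state machine
}
```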
Before:
```asm
G_M146_IG06: ;; offset=0x0048
        sxtw    x5, w4
        ldr     q16, [x0, x5]
        add     x6, x0, x5
        ldr     q17, [x6, #0x10]
        orr     v16.16b, v16.16b, v17.16b
        add     x6, x0, x5
        ldr     q17, [x6, #0x20]
        add     x5, x0, x5
        ldr     q18, [x5, #0x30]
        orr     v17.16b, v17.16b, v18.16b
        orr     v16.16b, v16.16b, v17.16b
        umaxv   b16, v16.16b
        umov    w5, v16.b[0]
        cmp     w5, #127
        bgt     G_M146_IG07
        add     w4, w4, #64
        add     w5, w4, #64
        cmp     w5, w1
        ble     G_M146_IG06
```
After:
```asm
G_M146_IG06: ;; offset=0x0048
        add     x5, x0, w4, SXTW
        ldp     q16, q17, [x5]
        orr     v16.16b, v16.16b, v17.16b
        ldp     q17, q18, [x5, #0x20]
        orr     v17.16b, v17.16b, v18.16b
        orr     v16.16b, v16.16b, v17.16b
        umaxv   b16, v16.16b
        umov    w5, v16.b[0]
        cmp     w5, #127
        bgt     G_M146_IG07
        add     w4, w4, #64
        add     w5, w4, #64
        cmp     w5, w1
        ble     G_M146_IG06
```
It probably does not matter much, but it is one step away from perfect codegen :)
cc @EgorBo
@neon-sunset Oddly enough, on my Apple M2, I am consistently getting better results with the code as it is... If you have ARM hardware, can you try running...

```
dotnet run --configuration Release --filter "*Latin*"
```
I am getting rock solid better results with the current code.
I have not looked at the generated code.
> Does not matter much probably

Turns out this assumption was completely wrong: with the change, it now hits the 100 GiB/s mark as measured by the benchmark suite.
```
BenchmarkDotNet v0.13.12, macOS 15.0 (24A5264n) [Darwin 24.0.0]
Apple M1 Pro, 1 CPU, 8 logical and 8 physical cores
.NET SDK 9.0.100-preview.6.24314.10
  [Host]     : .NET 9.0.0 (9.0.24.30702), Arm64 RyuJIT AdvSIMD
  Job-WPCVMR : .NET 9.0.0 (9.0.24.30702), Arm64 RyuJIT AdvSIMD

MaxIterationCount=10  MinIterationCount=2  WarmupCount=1
```
| Method | FileName | Mean | Error | StdDev | Speed (GB/s) |
|---|---|---|---|---|---|
| SIMDUtf8ValidationRealData-Baseline | data/Latin-Lipsum.utf8.txt | 1.026 us | 0.0076 us | 0.0012 us | 84.75 |
| SIMDUtf8ValidationRealData-Change | data/Latin-Lipsum.utf8.txt | 868.0 ns | 5.39 ns | 0.83 ns | 100.16 |
Are you game to do a PR?
I see you are using .NET 9; could that be the difference?
> Are you game to do a PR?
Sure, thanks! I will look through the implementation to see if anything else stands out.

For the Latin scenario, .NET 9 shouldn't make a significant difference here. There were changes around loop handling, but the win here comes purely from merging the loads and removing the redundant adds (which might also introduce a false dependency?).
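To make the addressing difference concrete, here is a hedged before/after sketch in C# (variable names assumed from the quoted snippets, not the exact PR diff):

```csharp
// Before: every load re-derives base + index, and incomplete CSE leaves
// the redundant adds behind, so the peephole cannot pair the loads:
Vector128<byte> block2 = AdvSimd.LoadVector128(
    pInputBuffer + processedLength + localasciirun + 16);

// After: the base address is computed once; adjacent loads then share a
// base register with small immediate offsets, which the JIT can fuse
// into ldp pairs:
byte* pAsciiBuffer = pInputBuffer + processedLength + localasciirun;
Vector128<byte> block2 = Vector128.Load(pAsciiBuffer + 16);
```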
> Sure, thanks! I will look through the implementation to see if anything else stands out.
Please do!!!
> For the Latin scenario, .NET 9 shouldn't make a significant difference here. There were changes around loop handling, but the win here comes purely from merging the loads and removing the redundant adds (which might also introduce a false dependency?).
I installed .NET 9 and I see no difference. Your proposal is still a net negative. But I am going to toy a bit more with it.
> I installed .NET 9 and I see no difference. Your proposal is still a net negative. But I am going to toy a bit more with it.
Is it regressing performance on the 8cx Gen 3 and/or Neoverse N1? I assume it should be a similar improvement on M2...

Edit: I managed to reproduce the Latin regression when running the full suite. The numbers are unstable overall; I will tune, fix, and submit the findings in a PR.
> I assume it should be a similar improvement on M2...
I would assume so but it seems that it is not the case.
I am going to merge this, but PRs are invited (@neon-sunset).
Should be self-explanatory. I am focusing on ARM performance.