
add qualcomm 8cx gen 3 benchmark + minor tuning #48


Merged: 3 commits merged into main on Jun 25, 2024

Conversation

@lemire lemire (Member) commented Jun 24, 2024

Should be self-explanatory. I am focusing on ARM performance.

@Nick-Nuon Nick-Nuon (Collaborator) left a comment


Looks good to me!

Vector128<byte> block1 = AdvSimd.LoadVector128(pInputBuffer + processedLength + localasciirun);
Vector128<byte> block2 = AdvSimd.LoadVector128(pInputBuffer + processedLength + localasciirun + 16);
Vector128<byte> block3 = AdvSimd.LoadVector128(pInputBuffer + processedLength + localasciirun + 32);
Vector128<byte> block4 = AdvSimd.LoadVector128(pInputBuffer + processedLength + localasciirun + 48);
Vector128<byte> or = AdvSimd.Or(AdvSimd.Or(block1, block2), AdvSimd.Or(block3, block4));
@neon-sunset neon-sunset commented Jun 24, 2024


It appears that CSE does not perform as well as it could here, blocking the peephole optimization that would coalesce the loads. Note that in this specific case, using the platform intrinsic makes no difference compared to plain Vector128.Load:

byte* pAsciiBuffer = pInputBuffer + asciirun;
Vector128<byte> block1 = Vector128.Load(pAsciiBuffer);
Vector128<byte> block2 = Vector128.Load(pAsciiBuffer + 16);
Vector128<byte> block3 = Vector128.Load(pAsciiBuffer + 32);
Vector128<byte> block4 = Vector128.Load(pAsciiBuffer + 48);

Before:

G_M146_IG06:  ;; offset=0x0048
            sxtw    x5, w4
            ldr     q16, [x0, x5]
            add     x6, x0, x5
            ldr     q17, [x6, #0x10]
            orr     v16.16b, v16.16b, v17.16b
            add     x6, x0, x5
            ldr     q17, [x6, #0x20]
            add     x5, x0, x5
            ldr     q18, [x5, #0x30]
            orr     v17.16b, v17.16b, v18.16b
            orr     v16.16b, v16.16b, v17.16b
            umaxv   b16, v16.16b
            umov    w5, v16.b[0]
            cmp     w5, #127
            bgt     G_M146_IG07
            add     w4, w4, #64
            add     w5, w4, #64
            cmp     w5, w1
            ble     G_M146_IG06

After:

G_M146_IG06:  ;; offset=0x0048
            add     x5, x0, w4, SXTW
            ldp     q16, q17, [x5]
            orr     v16.16b, v16.16b, v17.16b
            ldp     q17, q18, [x5, #0x20]
            orr     v17.16b, v17.16b, v18.16b
            orr     v16.16b, v16.16b, v17.16b
            umaxv   b16, v16.16b
            umov    w5, v16.b[0]
            cmp     w5, #127
            bgt     G_M146_IG07
            add     w4, w4, #64
            add     w5, w4, #64
            cmp     w5, w1
            ble     G_M146_IG06

Probably does not matter much, but it is one step away from perfect codegen :)
cc @EgorBo
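
To tie the C# to the codegen above, here is a self-contained sketch of the 64-byte ASCII check being discussed. The class and method names are hypothetical, the variable names follow the snippets above, and the real SimdUnicode loop carries more bookkeeping than shown:

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;

// Hypothetical wrapper, not the library's API.
static unsafe class AsciiFastPath
{
    // Returns true if all 64 bytes starting at pAsciiBuffer are ASCII.
    public static bool IsAscii64(byte* pAsciiBuffer)
    {
        // A single base pointer lets the JIT coalesce the four 16-byte loads
        // into two ldp instructions, as in the "After" listing below.
        Vector128<byte> block1 = Vector128.Load(pAsciiBuffer);
        Vector128<byte> block2 = Vector128.Load(pAsciiBuffer + 16);
        Vector128<byte> block3 = Vector128.Load(pAsciiBuffer + 32);
        Vector128<byte> block4 = Vector128.Load(pAsciiBuffer + 48);

        // OR the blocks together: any byte >= 0x80 survives into `combined`.
        Vector128<byte> combined = AdvSimd.Or(AdvSimd.Or(block1, block2),
                                              AdvSimd.Or(block3, block4));

        // This corresponds to the umaxv + cmp #127 pair in the listings: if
        // the maximum byte exceeds 0x7F, the chunk contains a non-ASCII byte.
        return AdvSimd.Arm64.MaxAcross(combined).ToScalar() <= 0x7F;
    }
}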

@lemire lemire (Member, Author):

@neon-sunset Oddly enough, on my Apple M2 I am consistently getting better results with the code as it is... If you have ARM hardware, can you try running:

 dotnet run --configuration Release --filter "*Latin*"

I am getting rock-solid better results with the current code.

I have not looked at the generated code.
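
(For anyone reproducing this: a minimal, hypothetical shape of such a BenchmarkDotNet benchmark is sketched below. The class, method, and parameter names are assumptions, and the BCL's Utf8.IsValid (.NET 8+) stands in for the project's own validator, which is what the real suite measures.)

using System;
using System.IO;
using System.Text.Unicode;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class Utf8ValidationBenchmark
{
    // In this sketch, --filter "*Latin*" matches via the "Latin" in this
    // file-name parameter, since the glob is applied to the full benchmark name.
    [Params("data/Latin-Lipsum.utf8.txt")]
    public string FileName = string.Empty;

    private byte[] _data = Array.Empty<byte>();

    [GlobalSetup]
    public void Setup() => _data = File.ReadAllBytes(FileName);

    // Stand-in for the project's validator.
    [Benchmark]
    public bool Validate() => Utf8.IsValid(_data);
}

public static class Program
{
    public static void Main(string[] args) =>
        BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);
}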

@neon-sunset neon-sunset commented Jun 25, 2024


> Probably does not matter much

Turns out this assumption was completely wrong: with the change, it now hits the 100 GB/s mark as measured by the benchmark suite.

BenchmarkDotNet v0.13.12, macOS 15.0 (24A5264n) [Darwin 24.0.0]
Apple M1 Pro, 1 CPU, 8 logical and 8 physical cores
.NET SDK 9.0.100-preview.6.24314.10
  [Host]     : .NET 9.0.0 (9.0.24.30702), Arm64 RyuJIT AdvSIMD
  Job-WPCVMR : .NET 9.0.0 (9.0.24.30702), Arm64 RyuJIT AdvSIMD

MaxIterationCount=10  MinIterationCount=2  WarmupCount=1 
| Method                              | FileName                   | Mean     | Error     | StdDev    | Speed (GB/s) |
|------------------------------------ |--------------------------- |---------:|----------:|----------:|-------------:|
| SIMDUtf8ValidationRealData-Baseline | data/Latin-Lipsum.utf8.txt | 1.026 us | 0.0076 us | 0.0012 us |        84.75 |
| SIMDUtf8ValidationRealData-Change   | data/Latin-Lipsum.utf8.txt | 868.0 ns |   5.39 ns |   0.83 ns |       100.16 |
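
Sanity check on the table: throughput is input size divided by mean time, so the ratio of the means (1.026 µs / 868 ns ≈ 1.18) matches the ratio of the reported speeds (100.16 / 84.75 ≈ 1.18), i.e. roughly an 18% improvement on this file.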

@lemire lemire (Member, Author):

Are you game to do a PR?

I see you are using .NET 9; could this be the difference?

@neon-sunset neon-sunset:

> Are you game to do a PR?

Sure, thanks! I will look through the implementation to see if there is anything else that stands out.

For the Latin scenario, .NET 9 shouldn't make a significant difference. There were changes to loop handling, but here the win comes purely from merging the loads and removing the redundant adds (which might also introduce a false dependency?).

@lemire lemire (Member, Author):

> Sure, thanks! I will look through the implementation to see if there is anything else that stands out.

Please do!!!

@lemire lemire (Member, Author):

> For the Latin scenario, .NET 9 shouldn't make a significant difference. There were changes to loop handling, but here the win comes purely from merging the loads and removing the redundant adds (which might also introduce a false dependency?).

I installed .NET 9 and I see no difference. Your proposal is still a net negative. But I am going to toy a bit more with it.

@neon-sunset neon-sunset commented Jun 25, 2024


> I installed .NET 9 and I see no difference. Your proposal is still a net negative. But I am going to toy a bit more with it.

Is it regressing performance on the 8cx Gen 3 and/or Neoverse N1? I assume it should be a similar improvement on M2...

Edit: I managed to reproduce the Latin regression when running the full suite. The numbers are unstable overall; I will tune, fix, and submit the findings in a PR.

@lemire lemire (Member, Author):

> I assume it should be a similar improvement on M2...

I would assume so, but it seems that is not the case.

@lemire lemire (Member, Author) commented Jun 25, 2024

I am going to merge this, but PRs are invited (@neon-sunset).

@lemire lemire merged commit bd4a55a into main Jun 25, 2024
6 checks passed
3 participants