add qualcomm 8cx gen 3 benchmark + minor tuning #48

Status: merged (3 commits) on Jun 25, 2024
16 changes: 16 additions & 0 deletions README.md
@@ -162,6 +162,22 @@ faster than the standard library.
| Russian-Lipsum | 3.3 | 0.95 | 3.5 x |


On a Qualcomm 8cx gen3 (Windows Dev Kit 2023), we get roughly the same relative performance
boost as the Neoverse V1.

| data set | SimdUnicode speed (GB/s) | .NET speed (GB/s) | speed up |
|:----------------|:-----------|:--------------------------|:-------------------|
| Twitter.json | 15 | 10 | 1.5 x |
| Arabic-Lipsum | 4.0 | 2.3 | 1.7 x |
| Chinese-Lipsum | 4.0 | 2.9 | 1.4 x |
| Emoji-Lipsum | 4.0 | 0.9 | 4.4 x |
| Hebrew-Lipsum | 4.0 | 2.3 | 1.7 x |
| Hindi-Lipsum | 4.0 | 1.9 | 2.1 x |
| Japanese-Lipsum | 4.0 | 2.7 | 1.5 x |
| Korean-Lipsum | 4.0 | 1.5 | 2.7 x |
| Latin-Lipsum | 50 | 20 | 2.5 x |
| Russian-Lipsum | 4.0 | 1.2 | 3.3 x |

One difficulty with ARM processors is that their SIMD/NEON performance varies widely. For example, Neoverse N1 processors (not to be confused with the Neoverse V1 design used by AWS Graviton 3) have weak SIMD performance. Of course, one can pick and choose which approach is best, and it is not necessary to apply SimdUnicode in all cases. We expect good performance on recent ARM-based Qualcomm processors.

## Building the library
26 changes: 17 additions & 9 deletions src/UTF8.cs
@@ -1388,20 +1388,28 @@ private unsafe static (int utfadjust, int scalaradjust) calculateErrorPathadjust
  prevIncomplete = Vector128<byte>.Zero;
  // Often, we have a lot of ASCII characters in a row.
  int localasciirun = 16;
- if (processedLength + localasciirun + 64 <= inputLength)
+ if (processedLength + localasciirun + 16 <= inputLength)
  {
-     for (; processedLength + localasciirun + 64 <= inputLength; localasciirun += 64)
-     {
-         Vector128<byte> block1 = AdvSimd.LoadVector128(pInputBuffer + processedLength + localasciirun);
-         Vector128<byte> block2 = AdvSimd.LoadVector128(pInputBuffer + processedLength + localasciirun + 16);
-         Vector128<byte> block3 = AdvSimd.LoadVector128(pInputBuffer + processedLength + localasciirun + 32);
-         Vector128<byte> block4 = AdvSimd.LoadVector128(pInputBuffer + processedLength + localasciirun + 48);
-         Vector128<byte> or = AdvSimd.Or(AdvSimd.Or(block1, block2), AdvSimd.Or(block3, block4));
-         if (AdvSimd.Arm64.MaxAcross(or).ToScalar() > 127)
-         {
-             break;
-         }
-     }
+     Vector128<byte> block = AdvSimd.LoadVector128(pInputBuffer + processedLength + localasciirun);
+     if (AdvSimd.Arm64.MaxAcross(Vector128.AsUInt32(AdvSimd.And(block, v80))).ToScalar() == 0)
+     {
+         localasciirun += 16;
+         for (; processedLength + localasciirun + 64 <= inputLength; localasciirun += 64)
+         {
+             Vector128<byte> block1 = AdvSimd.LoadVector128(pInputBuffer + processedLength + localasciirun);
+             Vector128<byte> block2 = AdvSimd.LoadVector128(pInputBuffer + processedLength + localasciirun + 16);
+             Vector128<byte> block3 = AdvSimd.LoadVector128(pInputBuffer + processedLength + localasciirun + 32);
+             Vector128<byte> block4 = AdvSimd.LoadVector128(pInputBuffer + processedLength + localasciirun + 48);
+             Vector128<byte> or = AdvSimd.Or(AdvSimd.Or(block1, block2), AdvSimd.Or(block3, block4));
@neon-sunset commented on Jun 24, 2024:

It appears that CSE does not perform as well as it could here, blocking the peephole optimization that would coalesce the loads. Note that in this specific case using the platform intrinsic makes no difference compared with plain Vector128.Load:

byte* pAsciiBuffer = pInputBuffer + asciirun;
Vector128<byte> block1 = Vector128.Load(pAsciiBuffer);
Vector128<byte> block2 = Vector128.Load(pAsciiBuffer + 16);
Vector128<byte> block3 = Vector128.Load(pAsciiBuffer + 32);
Vector128<byte> block4 = Vector128.Load(pAsciiBuffer + 48);

Before:

G_M146_IG06:  ;; offset=0x0048
            sxtw    x5, w4
            ldr     q16, [x0, x5]
            add     x6, x0, x5
            ldr     q17, [x6, #0x10]
            orr     v16.16b, v16.16b, v17.16b
            add     x6, x0, x5
            ldr     q17, [x6, #0x20]
            add     x5, x0, x5
            ldr     q18, [x5, #0x30]
            orr     v17.16b, v17.16b, v18.16b
            orr     v16.16b, v16.16b, v17.16b
            umaxv   b16, v16.16b
            umov    w5, v16.b[0]
            cmp     w5, #127
            bgt     G_M146_IG07
            add     w4, w4, #64
            add     w5, w4, #64
            cmp     w5, w1
            ble     G_M146_IG06

After:

G_M146_IG06:  ;; offset=0x0048
            add     x5, x0, w4, SXTW
            ldp     q16, q17, [x5]
            orr     v16.16b, v16.16b, v17.16b
            ldp     q17, q18, [x5, #0x20]
            orr     v17.16b, v17.16b, v18.16b
            orr     v16.16b, v16.16b, v17.16b
            umaxv   b16, v16.16b
            umov    w5, v16.b[0]
            cmp     w5, #127
            bgt     G_M146_IG07
            add     w4, w4, #64
            add     w5, w4, #64
            cmp     w5, w1
            ble     G_M146_IG06

It probably does not matter much, but it is one step away from perfect codegen :)
cc @EgorBo

Member Author replied:
@neon-sunset Oddly enough, on my Apple M2, I am getting better results (consistently) with the code as it is... If you have ARM hardware, can you try running...

 dotnet run --configuration Release --filter "*Latin*"

I am getting rock solid better results with the current code.

I have not looked at the generated code.

@neon-sunset commented on Jun 25, 2024:

> Does not matter much probably

Turns out this assumption was completely wrong: with the change, it now hits the 100 GiB/s mark as measured by the benchmark suite.

BenchmarkDotNet v0.13.12, macOS 15.0 (24A5264n) [Darwin 24.0.0]
Apple M1 Pro, 1 CPU, 8 logical and 8 physical cores
.NET SDK 9.0.100-preview.6.24314.10
  [Host]     : .NET 9.0.0 (9.0.24.30702), Arm64 RyuJIT AdvSIMD
  Job-WPCVMR : .NET 9.0.0 (9.0.24.30702), Arm64 RyuJIT AdvSIMD

MaxIterationCount=10  MinIterationCount=2  WarmupCount=1 
| Method | FileName | Mean | Error | StdDev | Speed (GB/s) |
|:--------------------------------------|:---------------------------|---------:|----------:|----------:|-------:|
| SIMDUtf8ValidationRealData (baseline) | data/Latin-Lipsum.utf8.txt | 1.026 us | 0.0076 us | 0.0012 us | 84.75 |
| SIMDUtf8ValidationRealData (change) | data/Latin-Lipsum.utf8.txt | 868.0 ns | 5.39 ns | 0.83 ns | 100.16 |
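As a quick consistency check on the table above (illustrative arithmetic only, not part of the benchmark suite): mean time multiplied by throughput should recover the same input size for both rows, since both runs validate the same Latin-Lipsum file.

```python
# Sanity-check the benchmark rows: mean_time * throughput should give the
# same input size for both runs, because the input file is identical.
baseline_bytes = 1.026e-6 * 84.75e9   # 1.026 us at 84.75 GB/s
change_bytes = 868.0e-9 * 100.16e9    # 868.0 ns at 100.16 GB/s

# The speedup ratio computed from times should match the throughput ratio.
speedup_from_time = 1.026e-6 / 868.0e-9
speedup_from_throughput = 100.16 / 84.75
```

Both rows imply an input of roughly 87 KB, and the two speedup ratios agree to within a fraction of a percent, so the table is internally consistent.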

Member Author replied:

Are you game to do a PR?

I see you are using .NET 9; could this be the difference?

@neon-sunset replied:

> Are you game to do a PR?

Sure, thanks! Will look through the implementation to see if there is anything else that stands out.

For the Latin scenario, .NET 9 shouldn't make a significant difference here; there were changes regarding loop handling, but the win here is purely from merging the loads and removing redundant adds (which might also introduce a false dependency?).

Member Author replied:

> Sure, thanks! Will look through implementation to see if there is anything else that stands out.

Please do!!!

Member Author replied:

> For the Latin scenario, .NET 9 here shouldn't make significant difference

I installed .NET 9 and I see no difference. Your proposal is still a net negative. But I am going to toy a bit more with it.

@neon-sunset replied on Jun 25, 2024:

> I installed .NET 9 and I see no difference. Your proposal is still a net negative.

Is it regressing performance on the 8cx gen3 and/or Neoverse N1? I assume it should be a similar improvement on M2...

Edit: managed to reproduce the Latin regression when running the full suite. The numbers are overall unstable; I will tune, fix, and submit findings in a PR.

Member Author replied:

> I assume it should be a similar improvement on M2...

I would assume so, but it seems that is not the case.


+             if (AdvSimd.Arm64.MaxAcross(Vector128.AsUInt32(AdvSimd.And(or, v80))).ToScalar() != 0)
+             {
+                 break;
+             }
+         }
+     }
  }
  processedLength += localasciirun - 16;
  }
  }
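The new fast path above first tests a single 16-byte block, then scans 64 bytes at a time with an OR-reduction, bailing out at the first block that contains a non-ASCII byte. Here is a rough, self-contained sketch of the same idea in portable C (illustrative only: the function name `ascii_run` and the scalar 8-byte OR-reduction are assumptions, not the library's code):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Length of the leading all-ASCII prefix of p[0..len), scanning 64 bytes
 * at a time, mirroring the 4x16-byte NEON loop in the diff above. */
static size_t ascii_run(const uint8_t *p, size_t len)
{
    size_t i = 0;
    while (i + 64 <= len) {
        uint64_t acc = 0;
        /* OR eight 8-byte words together; any non-ASCII byte leaves a
         * high bit set somewhere in the accumulator. */
        for (size_t j = 0; j < 64; j += 8) {
            uint64_t w;
            memcpy(&w, p + i + j, sizeof w); /* unaligned-safe load */
            acc |= w;
        }
        if (acc & UINT64_C(0x8080808080808080))
            break; /* a non-ASCII byte is somewhere in this 64-byte block */
        i += 64;
    }
    /* Finish the tail (and locate the offending byte) scalar-wise. */
    while (i < len && p[i] < 0x80)
        i++;
    return i;
}
```

Like the NEON version, the bulk loop only reports "some byte in this block is non-ASCII"; the scalar tail then pins down the exact offset.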