Commit e2e9f74

fixes arm
1 parent 89d6bec commit e2e9f74

2 files changed: +26 -10 lines changed

README.md

Lines changed: 21 additions & 5 deletions
@@ -6,15 +6,31 @@ This is a fast C# library to validate UTF-8 strings.

 ## Motivation

-We seek to speed up the `Utf8Utility.GetPointerToFirstInvalidByte` function. Using the algorithm used by Node.js, Oracle GraalVM and other important systems.
-
-- John Keiser, Daniel Lemire, [Validating UTF-8 In Less Than One Instruction Per Byte](https://arxiv.org/abs/2010.03090), Software: Practice and Experience 51 (5), 2021
+We seek to speed up the `Utf8Utility.GetPointerToFirstInvalidByte` function from the C# runtime library.
+[The function is private in the Microsoft Runtime](https://github.com/dotnet/runtime/blob/4d709cd12269fcbb3d0fccfb2515541944475954/src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf8Utility.Validation.cs), but we can expose it manually.

-The algorithm in question is part of popular JavaScript runtimes such as Node.js and Bun, [by PHP](https://github.com/php/php-src/blob/90e0ce7f0db99767c58dc21e4213c0f8763f657a/ext/mbstring/mbstring.c#L5270), by Oracle GraalVM and many important systems.
+Specifically, we provide the function `SimdUnicode.UTF8.GetPointerToFirstInvalidByte`, which is a faster
+drop-in replacement:
+```cs
+// Returns &inputBuffer[inputLength] if the input buffer is valid.
+/// <summary>
+/// Given an input buffer <paramref name="pInputBuffer"/> of byte length <paramref name="inputLength"/>,
+/// returns a pointer to where the first invalid data appears in <paramref name="pInputBuffer"/>.
+/// The parameter <paramref name="Utf16CodeUnitCountAdjustment"/> is set according to the content of the valid UTF-8 characters encountered, counting -1 for each 2-byte character and -2 for each 3-byte or 4-byte character.
+/// The parameter <paramref name="ScalarCodeUnitCountAdjustment"/> is set according to the content of the valid UTF-8 characters encountered, counting -1 for each 4-byte character.
+/// </summary>
+/// <remarks>
+/// Returns a pointer to the end of <paramref name="pInputBuffer"/> if the buffer is well-formed.
+/// </remarks>
+public unsafe static byte* GetPointerToFirstInvalidByte(byte* pInputBuffer, int inputLength, out int Utf16CodeUnitCountAdjustment, out int ScalarCodeUnitCountAdjustment);
+```

-[The function is private in the Microsoft Runtime](https://github.com/dotnet/runtime/blob/4d709cd12269fcbb3d0fccfb2515541944475954/src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf8Utility.Validation.cs), but we can expose it manually.
+The function uses advanced instructions (SIMD) on 64-bit ARM and x64 processors, but falls back on a
+conventional implementation on other systems. We provide extensive tests and benchmarks.

+We apply the algorithm used by Node.js, Bun, Oracle GraalVM, the PHP interpreter and other important systems. The algorithm has been described in the following article:

+- John Keiser, Daniel Lemire, [Validating UTF-8 In Less Than One Instruction Per Byte](https://arxiv.org/abs/2010.03090), Software: Practice and Experience 51 (5), 2021


 ## Requirements
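
Note on the two `out` parameters documented in the README hunk above: the sketch below is a scalar illustration, not the library's code, of how such adjustments could be derived from lead bytes. It assumes the input is already known to be valid UTF-8 and that the adjustments follow the .NET runtime convention (byte length plus the UTF-16 adjustment gives the UTF-16 code unit count, and that count plus the scalar adjustment gives the scalar count). Under that convention a 4-byte sequence becomes a surrogate pair, so it contributes -2 to the UTF-16 adjustment. The class and method names are hypothetical.

```cs
using System;

public static class AdjustmentSketch
{
    // Hypothetical helper: assumes `utf8` has already been validated.
    public static (int utf16Adjustment, int scalarAdjustment) Compute(ReadOnlySpan<byte> utf8)
    {
        int utf16Adjustment = 0;
        int scalarAdjustment = 0;
        foreach (byte b in utf8)
        {
            if ((b & 0xE0) == 0xC0)
            {
                utf16Adjustment -= 1;   // 2-byte lead: 2 bytes -> 1 UTF-16 code unit
            }
            else if ((b & 0xF0) == 0xE0)
            {
                utf16Adjustment -= 2;   // 3-byte lead: 3 bytes -> 1 UTF-16 code unit
            }
            else if ((b & 0xF8) == 0xF0)
            {
                utf16Adjustment -= 2;   // 4-byte lead: 4 bytes -> 2 UTF-16 code units
                scalarAdjustment -= 1;  // 2 UTF-16 code units -> 1 scalar value
            }
            // ASCII and continuation bytes contribute nothing.
        }
        return (utf16Adjustment, scalarAdjustment);
    }
}
```

For the 9-byte input `C3 A9 E2 82 AC F0 9F 98 80` (one 2-byte, one 3-byte and one 4-byte character), this gives a UTF-16 adjustment of -5 and a scalar adjustment of -1, that is, 4 UTF-16 code units and 3 scalar values.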

src/UTF8.cs

Lines changed: 5 additions & 5 deletions
@@ -1037,13 +1037,13 @@ private unsafe static (int utfadjust, int scalaradjust) calculateErrorPathadjust
 prevIncomplete = AdvSimd.SubtractSaturate(currentBlock, maxValue);
 Vector128<sbyte> largestcont = Vector128.Create((sbyte)-65); // -65 => 0b10111111
 contbytes += -AdvSimd.Arm64.AddAcross(AdvSimd.CompareLessThanOrEqual(Vector128.AsSByte(currentBlock), largestcont)).ToScalar();
-Vector128<byte> fourthByteMinusOne = Vector128.Create((byte)(0b11110000u - 1));

 // computing n4 is more expensive than we would like:
-var largerthan0f = AdvSimd.CompareGreaterThan(currentBlock, fourthByteMinusOne);
-var largerthan0fones = AdvSimd.And(largerthan0f, Vector128.Create((byte)1));
-var largerthan0fonescount = AdvSimd.Arm64.AddAcross(largerthan0fones).ToScalar();
-n4 += largerthan0fonescount;
+Vector128<byte> fourthByteMinusOne = Vector128.Create((byte)(0b11110000u - 1));
+Vector128<byte> largerthan0f = AdvSimd.CompareGreaterThan(currentBlock, fourthByteMinusOne);
+byte n4add = (byte)AdvSimd.Arm64.AddAcross(largerthan0f).ToScalar();
+int negn4add = (int)(byte)-n4add;
+n4 += negn4add;
 }
 asciibytes -= (sbyte)AdvSimd.Arm64.AddAcross(AdvSimd.CompareLessThan(currentBlock, v80)).ToScalar();
 }
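
Note on the `n4` change in the hunk above: the removed lines masked each comparison result down to 1 before summing, while the added lines sum the raw comparison lanes and negate the 8-bit result. The scalar sketch below models why that appears to work, assuming that the comparison yields `0xFF` in every matching lane and that the horizontal add wraps modulo 256. It is an illustration with hypothetical names, not the library's code.

```cs
using System;

public static class MaskCountSketch
{
    // Hypothetical scalar model of the compare / add-across / negate sequence.
    public static int CountAtLeast0xF0(ReadOnlySpan<byte> block)
    {
        byte sum = 0; // models the single 8-bit lane produced by the horizontal add
        foreach (byte b in block)
        {
            // The comparison produces 0xFF for a match and 0x00 otherwise.
            byte mask = b > 0xEF ? (byte)0xFF : (byte)0x00;
            sum = unchecked((byte)(sum + mask)); // wraps modulo 256
        }
        // Each match added -1 modulo 256, so negating the 8-bit sum recovers
        // the count for any block shorter than 256 bytes.
        return unchecked((byte)-sum);
    }
}
```

With a 16-byte block the count is at most 16, so the 8-bit wraparound cannot lose information. For example, `MaskCountSketch.CountAtLeast0xF0(new byte[] { 0xF0, 0x9F, 0x98, 0x80, 0x41 })` returns 1.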
