README.md: 28 additions & 7 deletions
@@ -6,15 +6,31 @@ This is a fast C# library to validate UTF-8 strings.
6
6
7
7
## Motivation
8
8
9
-
We seek to speed up the `Utf8Utility.GetPointerToFirstInvalidByte` function. Using the algorithm used by Node.js, Oracle GraalVM and other important systems.
10
-
11
-
- John Keiser, Daniel Lemire, [Validating UTF-8 In Less Than One Instruction Per Byte](https://arxiv.org/abs/2010.03090), Software: Practice and Experience 51 (5), 2021
9
+
We seek to speed up the `Utf8Utility.GetPointerToFirstInvalidByte` function from the C# runtime library.
10
+
[The function is private in the Microsoft Runtime](https://github.com/dotnet/runtime/blob/4d709cd12269fcbb3d0fccfb2515541944475954/src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf8Utility.Validation.cs), but we can expose it manually.
12
11
13
-
The algorithm in question is part of popular JavaScript runtimes such as Node.js and Bun, [by PHP](https://github.com/php/php-src/blob/90e0ce7f0db99767c58dc21e4213c0f8763f657a/ext/mbstring/mbstring.c#L5270), by Oracle GraalVM and many important systems.
12
+
Specifically, we provide the function `SimdUnicode.UTF8.GetPointerToFirstInvalidByte` which is a faster
13
+
drop-in replacement:
14
+
```cs
15
+
// Returns &inputBuffer[inputLength] if the input buffer is valid.
16
+
/// <summary>
17
+
/// Given an input buffer <paramref name="pInputBuffer"/> of byte length <paramref name="inputLength"/>,
18
+
/// returns a pointer to where the first invalid data appears in <paramref name="pInputBuffer"/>.
19
+
/// The parameter <paramref name="Utf16CodeUnitCountAdjustment"/> is set according to the content of the valid UTF-8 characters encountered, counting -1 for each 2-byte character and -2 for each 3-byte or 4-byte character.
20
+
/// The parameter <paramref name="ScalarCodeUnitCountAdjustment"/> is set according to the content of the valid UTF-8 characters encountered, counting -1 for each 4-byte character.
21
+
/// </summary>
22
+
/// <remarks>
23
+
/// Returns a pointer to the end of <paramref name="pInputBuffer"/> if the buffer is well-formed.
28
+
The function uses advanced instructions (SIMD) on 64-bit ARM and x64 processors, but falls back on a
29
+
conventional implementation on other systems. We provide extensive tests and benchmarks.
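To make the contract described above concrete, here is a minimal scalar sketch. It is not the SimdUnicode implementation and not the runtime's code: `Utf8Reference.IndexOfFirstInvalidByte` is a hypothetical helper that returns an index instead of a pointer, and it assumes that an invalid sequence is reported at its first byte. The adjustments follow the rules stated in the documentation comments: -1 per 2-byte character, -2 per 3-byte or 4-byte character, and a further -1 in the scalar adjustment per 4-byte character.

```cs
using System;

// Hypothetical scalar reference (an illustration, not part of SimdUnicode).
public static class Utf8Reference
{
    // Returns the index of the first invalid byte, or data.Length if the span is valid UTF-8.
    public static int IndexOfFirstInvalidByte(ReadOnlySpan<byte> data,
        out int utf16CodeUnitCountAdjustment, out int scalarCountAdjustment)
    {
        utf16CodeUnitCountAdjustment = 0;
        scalarCountAdjustment = 0;
        int pos = 0;
        while (pos < data.Length)
        {
            byte b = data[pos];
            if (b < 0x80) { pos++; continue; } // ASCII: one byte, no adjustment

            int length; uint min; uint cp;
            if ((b & 0xE0) == 0xC0) { length = 2; min = 0x80; cp = (uint)(b & 0x1F); }
            else if ((b & 0xF0) == 0xE0) { length = 3; min = 0x800; cp = (uint)(b & 0x0F); }
            else if ((b & 0xF8) == 0xF0) { length = 4; min = 0x10000; cp = (uint)(b & 0x07); }
            else return pos; // stray continuation byte or invalid lead byte

            if (pos + length > data.Length) return pos; // truncated sequence
            for (int i = 1; i < length; i++)
            {
                byte c = data[pos + i];
                if ((c & 0xC0) != 0x80) return pos; // expected a continuation byte
                cp = (cp << 6) | (uint)(c & 0x3F);
            }
            if (cp < min) return pos;                      // overlong encoding
            if (cp >= 0xD800 && cp <= 0xDFFF) return pos;  // surrogate code point
            if (cp > 0x10FFFF) return pos;                 // beyond the Unicode range

            if (length == 2) { utf16CodeUnitCountAdjustment -= 1; }
            else if (length == 3) { utf16CodeUnitCountAdjustment -= 2; }
            else { utf16CodeUnitCountAdjustment -= 2; scalarCountAdjustment -= 1; }
            pos += length;
        }
        return pos;
    }

    public static void Main()
    {
        // "a", U+00E9, U+20AC, U+1F600: sequences of 1, 2, 3 and 4 bytes.
        byte[] input = { 0x61, 0xC3, 0xA9, 0xE2, 0x82, 0xAC, 0xF0, 0x9F, 0x98, 0x80 };
        int end = IndexOfFirstInvalidByte(input, out int utf16Adj, out int scalarAdj);
        Console.WriteLine($"first invalid index: {end} (input length {input.Length})");
        Console.WriteLine($"UTF-16 code units: {input.Length + utf16Adj}");
        Console.WriteLine($"scalar values: {input.Length + utf16Adj + scalarAdj}");
    }
}
```

Under this reading, adding the first adjustment to the byte length gives the UTF-16 length of the valid prefix, and adding the second adjustment as well gives the number of Unicode scalar values.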
16
30
31
+
We apply the algorithm used by Node.js, Bun, Oracle GraalVM, the PHP interpreter, and other important systems. The algorithm is described in the following article:
17
32
33
+
- John Keiser, Daniel Lemire, [Validating UTF-8 In Less Than One Instruction Per Byte](https://arxiv.org/abs/2010.03090), Software: Practice and Experience 51 (5), 2021
18
34
19
35
20
36
## Requirements
@@ -30,6 +46,11 @@ dotnet test
30
46
31
47
To see which tests are running, we recommend setting the verbosity level:
32
48
49
+
```
50
+
dotnet test -v=normal
51
+
```
52
+
53
+
For more detail, use the detailed verbosity level:
33
54
```
34
55
dotnet test -v d
35
56
```
@@ -44,7 +65,7 @@ To run specific tests, it is helpful to use the filter parameter:
44
65
45
66
46
67
```
47
-
dotnet test --filter TooShortErrorAVX
68
+
dotnet test --filter TooShortErrorAvx2
48
69
```
49
70
50
71
Or to target specific categories:
@@ -89,7 +110,6 @@ dotnet build
89
110
We recommend you use `dotnet format`. E.g.,
90
111
91
112
```
92
-
cd test
93
113
dotnet format
94
114
```
95
115
@@ -115,6 +135,7 @@ You can print the content of a vector register like so:
115
135
## Performance tips
116
136
117
137
- Be careful: `Vector128.Shuffle` is not the same as `Ssse3.Shuffle`, nor is it the same as `Avx2.Shuffle`. Prefer the hardware-specific `Ssse3.Shuffle` and `Avx2.Shuffle`.
138
+
- Similarly, `Vector128.Shuffle` is not the same as `AdvSimd.Arm64.VectorTableLookup`; use the latter.
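
The difference between the shuffles shows up with out-of-range indices. The sketch below is an illustration under stated assumptions, not code from this repository: the `ShuffleDemo` class and its table and index values are made up for the example, and it assumes .NET 7 or later, where `Vector128.Shuffle` is available. To our understanding, `Vector128.Shuffle` yields zero for any index of 16 or more, whereas `Ssse3.Shuffle` (the `pshufb` instruction) yields zero only when the high bit of the index is set and otherwise uses the low four bits. The portable version therefore needs extra work to normalize indices, which is one reason the hardware-specific call can be faster.

```cs
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Hypothetical demo class, for illustration only.
public static class ShuffleDemo
{
    public static readonly Vector128<byte> Table = Vector128.Create(
        (byte)100, 101, 102, 103, 104, 105, 106, 107,
        108, 109, 110, 111, 112, 113, 114, 115);

    // Indices 16 and 31 are out of range with the high bit clear;
    // 0x80 and 0xFF have the high bit set.
    public static readonly Vector128<byte> Indices = Vector128.Create(
        (byte)0, 15, 16, 31, 0x80, 0xFF, 1, 2,
        3, 4, 5, 6, 7, 8, 9, 10);

    public static void Main()
    {
        // Portable shuffle: every index >= 16 selects zero.
        Vector128<byte> portable = Vector128.Shuffle(Table, Indices);
        Console.WriteLine($"Vector128.Shuffle: {portable}");

        if (Ssse3.IsSupported)
        {
            // pshufb: indices 16 and 31 wrap to lanes 0 and 15;
            // 0x80 and 0xFF select zero.
            Vector128<byte> native = Ssse3.Shuffle(Table, Indices);
            Console.WriteLine($"Ssse3.Shuffle:     {native}");
        }
        else
        {
            Console.WriteLine("SSSE3 is not supported on this machine.");
        }
    }
}
```

When all indices are known to lie in the range 0 to 15, the two calls agree, so switching to the hardware-specific intrinsic does not change the result in that case.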