Commit ee9ee16

Merge pull request #13 from simdutf/AVX2_UTF8_validation_disposable
Simplified version of AVX2 UTF-8 validation
2 parents 719b85f + 1a5e552 commit ee9ee16

42 files changed, with 56810 additions and 249 deletions

README.md

Lines changed: 34 additions & 14 deletions
@@ -3,29 +3,21 @@
 
 This is a fast C# library to process unicode strings.
 
-*It is currently not meant to be usable.*
 
 ## Motivation
 
-The most important immediate goal would be to speed up the
-`Utf8Utility.GetPointerToFirstInvalidByte` function.
+We seek to speed up the `Utf8Utility.GetPointerToFirstInvalidByte` function, using the algorithm used by Node.js, Oracle GraalVM and other important systems.
 
-https://github.com/dotnet/runtime/blob/4d709cd12269fcbb3d0fccfb2515541944475954/src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf8Utility.Validation.cs
-
-
-(We may need to speed up `Ascii.GetIndexOfFirstNonAsciiByte` first, see issue https://github.com/simdutf/SimdUnicode/issues/1.)
+- John Keiser, Daniel Lemire, [Validating UTF-8 In Less Than One Instruction Per Byte](https://arxiv.org/abs/2010.03090), Software: Practice and Experience 51 (5), 2021
 
-The question is whether we could do it using this routine:
+The function is private in the Runtime, but we can expose it manually.
 
-* John Keiser, Daniel Lemire, [Validating UTF-8 In Less Than One Instruction Per Byte](https://arxiv.org/abs/2010.03090), Software: Practice and Experience 51 (5), 2021
-
-Our generic implementation is available there: https://github.com/simdutf/simdutf/blob/master/src/generic/utf8_validation/utf8_lookup4_algorithm.h
+https://github.com/dotnet/runtime/blob/4d709cd12269fcbb3d0fccfb2515541944475954/src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf8Utility.Validation.cs
 
-Porting it to C# is no joke, but doable.
 
 ## Requirements
 
-We recommend you install .NET 7: https://dotnet.microsoft.com/en-us/download/dotnet/7.0
+We recommend you install .NET 8: https://dotnet.microsoft.com/en-us/download/dotnet/8.0
 
 
 ## Running tests
@@ -35,8 +27,21 @@ cd test
 dotnet test
 ```
 
+To get a list of available tests, enter the command:
+
+```
+dotnet test --list-tests
+```
+
+To run specific tests, it is helpful to use the filter parameter:
+
+```
+dotnet test -c Release --filter Ascii
+```
+
 ## Running Benchmarks
 
+To run the benchmarks, run the following command:
 ```
 cd benchmark
 dotnet run -c Release
@@ -49,7 +54,6 @@ cd benchmark
 sudo dotnet run -c Release
 ```
 
-
 ## Building the library
 
 ```
@@ -66,10 +70,26 @@ cd test
 dotnet format
 ```
 
+## Programming tips
+
+You can print the content of a vector register like so:
+
+```C#
+public static void ToString(Vector256<byte> v)
+{
+    Span<byte> b = stackalloc byte[32];
+    v.CopyTo(b);
+    Console.WriteLine(Convert.ToHexString(b));
+}
+```
 
 ## More reading
 
 
 https://github.com/dotnet/coreclr/pull/21948/files#diff-2a22774bd6bff8e217ecbb3a41afad033ce0ca0f33645e9d8f5bdf7c9e3ac248
 
 https://github.com/dotnet/runtime/issues/41699
+
+https://learn.microsoft.com/en-us/dotnet/standard/design-guidelines/
+
+https://learn.microsoft.com/en-us/dotnet/csharp/fundamentals/coding-style/coding-conventions
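
The `ToString` helper added under "Programming tips" writes straight to the console, which is awkward inside test assertions. Below is a minimal sketch of a variant that returns the hex text instead. The names `VectorDump` and `ToHex` are hypothetical and not part of this commit, and the sketch assumes .NET 7 or later, where the `CopyTo` extension for `Vector256<T>` is available.

```C#
using System;
using System.Runtime.Intrinsics;

// Hypothetical helper, not part of the commit: same idea as the README
// snippet, but it returns the string so callers decide where it goes.
public static class VectorDump
{
    public static string ToHex(Vector256<byte> v)
    {
        // A Vector256<byte> holds 32 bytes; Count avoids hard-coding that.
        Span<byte> b = stackalloc byte[Vector256<byte>.Count];
        v.CopyTo(b);
        return Convert.ToHexString(b);
    }

    public static void Main()
    {
        // Broadcast one byte value to every lane, then dump the register.
        Vector256<byte> v = Vector256.Create((byte)0xAB);
        Console.WriteLine(ToHex(v));
    }
}
```

Returning a string keeps the helper usable from both `Console.WriteLine` and a test failure message.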

benchmark/CS_runtime.cs renamed to benchmark/ASCII_runtime.cs

Lines changed: 5 additions & 1 deletion
@@ -8,15 +8,18 @@
 using System.Runtime.Intrinsics.Arm;
 using System.Runtime.Intrinsics.X86;
 
+// This is from the Runtime. Copy/pasted as I found no other way to benchmark it.
+
 //Changes from original:
 //copy pasted CompExactlyDependsOnAttribute : Attribute into System.Text namespace
 //copy/pasted StoreLowerUnsafe into ascii class
 //The various Vector.Size likely refer to size in bytes so
 //Replaced all instances of Vector512.Size by 64 (see:https://source.dot.net/#System.Private.CoreLib/src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/Vector512.cs,77df495766d5de9c)
 //Vector256's by 32 (see:https://source.dot.net/#System.Private.CoreLib/src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/Vector256.cs,877aa6254c4e4d00)
 //Vector128's by 16 (see:https://source.dot.net/#System.Private.CoreLib/src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/Vector128.cs,eb1e72a6f843c5a5)
+// GetIndexOfFirstNonAsciiByte is no longer internal
 
-namespace Competition
+namespace DotnetRuntime
 {
     // I copy pasted this from: https://source.dot.net/#System.Private.CoreLib/src/libraries/System.Private.CoreLib/src/System/Runtime/CompilerServices/CompExactlyDependsOnAttribute.cs
     // Use this attribute to indicate that a function should only be compiled into a Ready2Run
@@ -213,6 +216,7 @@ private static bool FirstCharInUInt32IsAscii(uint value)
         /// <returns>An ASCII byte is defined as 0x00 - 0x7F, inclusive.</returns>
         [MethodImpl(MethodImplOptions.AggressiveInlining)]
         internal static unsafe nuint GetIndexOfFirstNonAsciiByte(byte* pBuffer, nuint bufferLength)
+        // internal static unsafe nuint GetIndexOfFirstNonAsciiByte(byte* pBuffer, nuint bufferLength)
         {
             // If 256/512-bit aren't supported but SSE2 is supported, use those specific intrinsics instead of
             // the generic vectorized code. This has two benefits: (a) we can take advantage of specific instructions
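
When benchmarking or testing the copied `GetIndexOfFirstNonAsciiByte`, a scalar reference can help cross-check results. The following is a minimal sketch, not code from this commit. The names `ScalarReference` and `IndexOfFirstNonAscii` are hypothetical, and it takes a `ReadOnlySpan<byte>` rather than the pointer and length pair shown above. It also assumes the convention of returning the buffer length when every byte is ASCII, which should be checked against the runtime's documentation.

```C#
using System;

// Hypothetical scalar reference, not part of the commit.
public static class ScalarReference
{
    // Returns the index of the first byte outside 0x00 - 0x7F,
    // or buffer.Length if every byte is ASCII (assumed convention).
    public static int IndexOfFirstNonAscii(ReadOnlySpan<byte> buffer)
    {
        for (int i = 0; i < buffer.Length; i++)
        {
            if (buffer[i] > 0x7F)
            {
                return i;
            }
        }
        return buffer.Length;
    }
}
```

For example, `ScalarReference.IndexOfFirstNonAscii(new byte[] { 0x61, 0x62, 0xC3, 0xA9 })` returns 2, since 0xC3 is the first byte above 0x7F.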
