Commit 17c224d

Author: Daniel Lemire

Not bad.

1 parent cb9bee8, commit 17c224d

10 files changed, with 353 additions and 1125 deletions.

README.md

Lines changed: 17 additions & 40 deletions
@@ -3,29 +3,21 @@
 
 This is a fast C# library to process unicode strings.
 
-*It is currently not meant to be usable.*
 
 ## Motivation
 
-The most important immediate goal would be to speed up the
-`Utf8Utility.GetPointerToFirstInvalidByte` function.
+We seek to speed up the `Utf8Utility.GetPointerToFirstInvalidByte` function. Using the algorithm used by Node.js, Oracle GraalVM and other important systems.
 
-https://github.com/dotnet/runtime/blob/4d709cd12269fcbb3d0fccfb2515541944475954/src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf8Utility.Validation.cs
-
-
-(We may need to speed up `Ascii.GetIndexOfFirstNonAsciiByte` first, see issue https://github.com/simdutf/SimdUnicode/issues/1.)
-
-The question is whether we could do it using this routine:
+- John Keiser, Daniel Lemire, [Validating UTF-8 In Less Than One Instruction Per Byte](https://arxiv.org/abs/2010.03090), Software: Practice and Experience 51 (5), 2021
 
-* John Keiser, Daniel Lemire, [Validating UTF-8 In Less Than One Instruction Per Byte](https://arxiv.org/abs/2010.03090), Software: Practice and Experience 51 (5), 2021
+The function is private in the Runtime, but we can expose it manually.
 
-Our generic implementation is available there: https://github.com/simdutf/simdutf/blob/master/src/generic/utf8_validation/utf8_lookup4_algorithm.h
+https://github.com/dotnet/runtime/blob/4d709cd12269fcbb3d0fccfb2515541944475954/src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf8Utility.Validation.cs
 
-Porting it to C# is no joke, but doable.
 
 ## Requirements
 
-We recommend you install .NET 7: https://dotnet.microsoft.com/en-us/download/dotnet/7.0
+We recommend you install .NET 8: https://dotnet.microsoft.com/en-us/download/dotnet/8.0
 
 
 ## Running tests
@@ -62,33 +54,6 @@ cd benchmark
 sudo dotnet run -c Release
 ```
 
-Still under macOS or Linux, you can change the filter parameter to narrow down the benchmarks you'd like to run:
-
-```
-cd benchmark
-sudo dotnet run -c Release --filter *RealData*
-```
-
-To get a list of all available tests you may enter:
-
-```
-cd benchmark
-sudo dotnet run -c Release --list tree
-```
-
-To get a prettier list in tree format, you may enter:
-
-```
-cd benchmark
-sudo dotnet run -c Release --list tree
-```
-
-To run all benchmarks, you may enter:
-
-```
-sudo dotnet run -c Release runall
-```
-
 ## Building the library
 
 ```
@@ -105,6 +70,18 @@ cd test
 dotnet format
 ```
 
+## Programming tips
+
+You can print the content of a vector register like so:
+
+```C#
+public static void ToString(Vector256<byte> v)
+{
+    Span<byte> b = stackalloc byte[32];
+    v.CopyTo(b);
+    Console.WriteLine(Convert.ToHexString(b));
+}
+```
 
 ## More reading
 

benchmark/ASCII_runtime.cs

Lines changed: 1 addition & 1 deletion
@@ -19,7 +19,7 @@
 //Vector128's by 16 (see:https://source.dot.net/#System.Private.CoreLib/src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/Vector128.cs,eb1e72a6f843c5a5)
 // GetIndexofFirstNonAsciiByte is no longer internal
 
-namespace Competition
+namespace DotnetRuntime
 {
 // I copy pasted this from: https://source.dot.net/#System.Private.CoreLib/src/libraries/System.Private.CoreLib/src/System/Runtime/CompilerServices/CompExactlyDependsOnAttribute.cs
 // Use this attribute to indicate that a function should only be compiled into a Ready2Run
