Commit 5025830

Update readmes for Tokenizers and Microsoft.ML (#7070)
* Make docs changes skip validation builds
* Apply package readme templates
* Fill in content for package readmes
* Address feedback
1 parent acced97 commit 5025830

7 files changed: +136 -5 lines changed

.vsts-dotnet-ci.yml

Lines changed: 9 additions & 0 deletions
@@ -15,6 +15,15 @@ trigger:
     - main
     - feature/*
     - release/*
+  paths:
+    include:
+    - '*'
+    exclude:
+    - '**.md'
+    - .github/*
+    - docs/*
+    - LICENSE
+    - THIRD-PARTY-NOTICES.TXT
 
 resources:
   containers:

build/codecoverage-ci.yml

Lines changed: 9 additions & 0 deletions
@@ -15,6 +15,15 @@ trigger:
     - main
     - feature/*
     - release/*
+  paths:
+    include:
+    - '*'
+    exclude:
+    - '**.md'
+    - .github/*
+    - docs/*
+    - LICENSE
+    - THIRD-PARTY-NOTICES.TXT
 
 jobs:
 - template: /build/ci/job-template.yml

eng/Packaging.targets

Lines changed: 7 additions & 0 deletions
@@ -1,9 +1,16 @@
 <Project>
+  <PropertyGroup>
+    <PackageReadmeFile Condition="'$(PackageReadmeFile)' == '' and Exists('PACKAGE.md')">PACKAGE.md</PackageReadmeFile>
+  </PropertyGroup>
 
   <ItemGroup>
     <Content Include="$(RepositoryEngineeringDir)pkg\mlnetlogo.png" Pack="true" PackagePath="" />
   </ItemGroup>
 
+  <ItemGroup Condition="'$(PackageReadmeFile)' != ''">
+    <None Include="$(PackageReadmeFile)" Pack="true" PackagePath="\" />
+  </ItemGroup>
+
   <ItemGroup Condition="'$(IncludeMLNetNotices)' != 'false'">
     <Content Include="$(RepoRoot)THIRD-PARTY-NOTICES.TXT" Pack="true" PackagePath="" />
     <Content Include="$(RepoRoot)LICENSE" Pack="true" PackagePath=""/>
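
Note on the change above: with `PackagePath="\"`, the readme is expected to land at the root of the produced `.nupkg`. One way to check that after packing is to open the package as a zip archive and look for the entry. The sketch below is only an illustration: the class name `PackageReadmeCheck` and the method `ContainsRootEntry` are hypothetical helpers, not part of this repository, and it assumes the entry is stored under the plain name `PACKAGE.md`.

```csharp
using System;
using System.IO.Compression;
using System.Linq;

// Hypothetical helper, not part of the repository.
public static class PackageReadmeCheck
{
    // Returns true when the archive holds an entry whose full name equals
    // entryName, i.e. a file stored at the package root rather than in a folder.
    public static bool ContainsRootEntry(string nupkgPath, string entryName)
    {
        using ZipArchive archive = ZipFile.OpenRead(nupkgPath);
        return archive.Entries.Any(entry =>
            string.Equals(entry.FullName, entryName, StringComparison.OrdinalIgnoreCase));
    }
}
```

Calling `PackageReadmeCheck.ContainsRootEntry(pathToNupkg, "PACKAGE.md")` on a freshly packed project would then indicate whether the conditional `ItemGroup` took effect for that project.
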
Lines changed: 61 additions & 0 deletions
@@ -0,0 +1,61 @@
+## About
+
+Microsoft.ML.Tokenizers provides implementations of the tokenization algorithms used in NLP transforms.
+
+## Key Features
+
+* Extensible tokenizer architecture that allows for specialization of Normalizer, PreTokenizer, Model/Encoder, and Decoder
+* BPE - Byte pair encoding model
+* English Roberta model
+* Tiktoken model
+
+## How to Use
+
+```c#
+using Microsoft.ML.Tokenizers;
+
+// Initialize the tokenizer for the `gpt-4` model, downloading its data files
+Tokenizer tokenizer = await Tiktoken.CreateByModelNameAsync("gpt-4");
+
+string source = "Text tokenization is the process of splitting a string into a list of tokens.";
+
+Console.WriteLine($"Tokens: {tokenizer.CountTokens(source)}");
+// prints: Tokens: 16
+
+var trimIndex = tokenizer.LastIndexOfTokenCount(source, 5, out string processedText, out _);
+Console.WriteLine($"5 tokens from end: {processedText.Substring(trimIndex)}");
+// prints: 5 tokens from end: a list of tokens.
+
+trimIndex = tokenizer.IndexOfTokenCount(source, 5, out processedText, out _);
+Console.WriteLine($"5 tokens from start: {processedText.Substring(0, trimIndex)}");
+// prints: 5 tokens from start: Text tokenization is the
+
+IReadOnlyList<int> ids = tokenizer.EncodeToIds(source);
+Console.WriteLine(string.Join(", ", ids));
+// prints: 1199, 4037, 2065, 374, 279, 1920, 315, 45473, 264, 925, 1139, 264, 1160, 315, 11460, 13
+```
+
+## Main Types
+
+The main types provided by this library are:
+
+* `Microsoft.ML.Tokenizers.Tokenizer`
+* `Microsoft.ML.Tokenizers.Bpe`
+* `Microsoft.ML.Tokenizers.EnglishRoberta`
+* `Microsoft.ML.Tokenizers.Tiktoken`
+* `Microsoft.ML.Tokenizers.TokenizerDecoder`
+* `Microsoft.ML.Tokenizers.Normalizer`
+* `Microsoft.ML.Tokenizers.PreTokenizer`
+
+## Additional Documentation
+
+* [Conceptual documentation](TODO)
+* [API documentation](https://learn.microsoft.com/en-us/dotnet/api/microsoft.ml.tokenizers)
+
+## Related Packages
+
+<!-- The related packages associated with this package -->
+
+## Feedback & Contributing
+
+Microsoft.ML.Tokenizers is released as open source under the [MIT license](https://licenses.nuget.org/MIT). Bug reports and contributions are welcome at [the GitHub repository](https://github.com/dotnet/machinelearning).
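
Note on the readme above: the `IndexOfTokenCount` call in "How to Use" lends itself to a small helper that trims text to a token budget, for example before sending it to a model with a context limit. The sketch below assumes the API surface exactly as the readme shows it at this commit (`IndexOfTokenCount` with two `out` parameters); later package versions may differ. The names `TokenBudget` and `TrimToTokenBudget` are hypothetical.

```csharp
using System;
using Microsoft.ML.Tokenizers;

// Hypothetical helper built only on the calls shown in the readme.
public static class TokenBudget
{
    // Returns the longest prefix of the text that the tokenizer reports
    // as fitting within maxTokens tokens.
    public static string TrimToTokenBudget(Tokenizer tokenizer, string text, int maxTokens)
    {
        // processedText is the text after any normalization the tokenizer
        // applies; the returned index refers to processedText, not to text.
        int end = tokenizer.IndexOfTokenCount(text, maxTokens, out string processedText, out _);
        return processedText.Substring(0, end);
    }
}
```

With the `gpt-4` tokenizer and the sample sentence from the readme, `TokenBudget.TrimToTokenBudget(tokenizer, source, 5)` should correspond to the readme's "5 tokens from start" output.
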

src/Microsoft.ML/Microsoft.ML.csproj

Lines changed: 0 additions & 2 deletions
@@ -14,7 +14,6 @@
     <NoWarn>$(NoWarn);NU5127;NU5128</NoWarn>
     <IsPackable>true</IsPackable>
     <PackageDescription>ML.NET is a cross-platform open-source machine learning framework which makes machine learning accessible to .NET developers.</PackageDescription>
-    <PackageReadmeFile>README.md</PackageReadmeFile>
   </PropertyGroup>
   <ItemGroup>
     <ProjectReference Include="../Microsoft.ML.DataView/Microsoft.ML.DataView.csproj" />
@@ -36,7 +35,6 @@
   <ItemGroup>
     <Content Include="$(RepoRoot)eng\pkg\CommonPackage.props" Pack="true" PackagePath="build\netstandard2.0\$(MSBuildProjectName).props" />
     <Content Include="build\**\*" Pack="true" PackagePath="build" />
-    <None Include="README.md" Pack="true" PackagePath="\"/> <!--NuGet PackageReadmeFile-->
   </ItemGroup>
 
 </Project>

src/Microsoft.ML/PACKAGE.md

Lines changed: 50 additions & 0 deletions
@@ -0,0 +1,50 @@
+## About
+
+ML.NET is a cross-platform open-source machine learning framework which makes machine learning accessible to .NET developers.
+
+For more information, see the [ML.NET documentation](https://docs.microsoft.com/dotnet/machine-learning/).
+
+## Key Features
+
+* Classification/Categorization - Automatically divide customer feedback into positive and negative categories
+* Regression/Predict continuous values - Predict the price of houses based on size and location
+* Anomaly Detection - Detect fraudulent banking transactions
+* Recommendations - Suggest products that online shoppers may want to buy, based on their previous purchases
+* Time series/sequential data - Forecast the weather/product sales
+* Image classification - Categorize pathologies in medical images
+* Text classification - Categorize documents based on their content
+* Sentence similarity - Measure how similar two sentences are
+
+## How to Use
+
+See [Machine Learning Samples](https://github.com/dotnet/machinelearning-samples) for an assortment of samples that show how to get started using ML.NET.
+
+## Main Types
+
+Some of the types provided by this library are:
+
+* `Microsoft.ML.MLContext`
+* `Microsoft.ML.ITransformer`
+* `Microsoft.ML.IEstimator<TTransformer>`
+
+## Additional Documentation
+
+* [Conceptual documentation](https://learn.microsoft.com/en-us/dotnet/machine-learning/)
+* [API documentation](https://learn.microsoft.com/en-us/dotnet/api/microsoft.ml)
+
+## Related Packages
+
+* Core data abstraction: [Microsoft.ML.DataView](https://www.nuget.org/packages/Microsoft.ML.DataView)
+* LightGBM model support: [Microsoft.ML.LightGbm](https://www.nuget.org/packages/Microsoft.ML.LightGbm)
+* Fast Tree: [Microsoft.ML.FastTree](https://www.nuget.org/packages/Microsoft.ML.FastTree)
+* Image analytics: [Microsoft.ML.ImageAnalytics](https://www.nuget.org/packages/Microsoft.ML.ImageAnalytics)
+* Recommender: [Microsoft.ML.Recommender](https://www.nuget.org/packages/Microsoft.ML.Recommender)
+* Time series: [Microsoft.ML.TimeSeries](https://www.nuget.org/packages/Microsoft.ML.TimeSeries)
+* Automatic model selection / tuning: [Microsoft.ML.AutoML](https://www.nuget.org/packages/Microsoft.ML.AutoML)
+* Exporting ONNX models: [Microsoft.ML.OnnxConverter](https://www.nuget.org/packages/Microsoft.ML.OnnxConverter)
+* Loading ONNX models: [Microsoft.ML.OnnxTransformer](https://www.nuget.org/packages/Microsoft.ML.OnnxTransformer)
+* Text tokenizers: [Microsoft.ML.Tokenizers](https://www.nuget.org/packages/Microsoft.ML.Tokenizers)
+
+## Feedback & Contributing
+
+Microsoft.ML is released as open source under the [MIT license](https://licenses.nuget.org/MIT). Bug reports and contributions are welcome at [the GitHub repository](https://github.com/dotnet/machinelearning).
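
Note on the readme above: it lists `MLContext`, `IEstimator<TTransformer>`, and `ITransformer` as main types but links out for usage. A minimal sketch of how the three fit together, based on the house-price regression scenario named under "Key Features", might look like the following. The data classes, the toy values, and the `HousePriceSketch` wrapper are assumptions made for illustration; they are not taken from this commit.

```csharp
using System;
using Microsoft.ML;
using Microsoft.ML.Data;

// Hypothetical input row: one feature (Size) and the label (Price).
public class HouseData
{
    public float Size { get; set; }
    public float Price { get; set; }
}

// Hypothetical output row: the regression trainer writes its result to "Score".
public class PricePrediction
{
    [ColumnName("Score")]
    public float Price { get; set; }
}

public static class HousePriceSketch
{
    public static float TrainAndPredict(float size)
    {
        // MLContext is the entry point for all ML.NET operations.
        var mlContext = new MLContext(seed: 0);

        // Toy training data, made up for this sketch.
        HouseData[] samples =
        {
            new HouseData { Size = 1.1f, Price = 1.2f },
            new HouseData { Size = 1.9f, Price = 2.3f },
            new HouseData { Size = 2.8f, Price = 3.0f },
            new HouseData { Size = 3.4f, Price = 3.7f },
        };
        IDataView trainingData = mlContext.Data.LoadFromEnumerable(samples);

        // IEstimator describes the untrained pipeline: build the feature
        // vector, then append a linear regression trainer.
        IEstimator<ITransformer> pipeline = mlContext.Transforms
            .Concatenate("Features", nameof(HouseData.Size))
            .Append(mlContext.Regression.Trainers.Sdca(
                labelColumnName: nameof(HouseData.Price),
                maximumNumberOfIterations: 100));

        // Fit returns the trained model as an ITransformer.
        ITransformer model = pipeline.Fit(trainingData);

        // A prediction engine wraps the model for single-row predictions.
        using var engine = mlContext.Model.CreatePredictionEngine<HouseData, PricePrediction>(model);
        return engine.Predict(new HouseData { Size = size }).Price;
    }
}
```

The split between `IEstimator` (a recipe) and `ITransformer` (the fitted result) is why `Fit` is the step that consumes training data; the same `ITransformer` can then be saved, reloaded, or applied to new data.
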

src/Microsoft.ML/README.md

Lines changed: 0 additions & 3 deletions
This file was deleted.
