Promote TorchSharp as One-Stop for .NET tokenization library need #404
GeorgeS2019
started this conversation in
Ideas
Replies: 1 comment
-
Finally Microsoft.ML.Tokenizers
-
The go-to .NET tokenization library has been discussed many times.
Many users have suggested a promising solution: BlingFire, provided by Microsoft.
It is vital to learn from the vast Python community how best to create such a solution.
Recently, the existing TorchSharp has been reorganized into TorchSharp.Core.
I see this as a timely point to explore how to package the tokenization solution under TorchSharp.Text (like PyTorch.Text or TensorFlow.NET.Text).
A quick preliminary evaluation of PyTorch.Text functionality suggests that BlingFire covers many PyTorch.Text features WELL (with superior performance compared to Hugging Face Tokenizers).
This is the current state of TorchSharp.Text

By integrating BlingFire into TorchSharp.Text, more of the essential functionality commonly needed by the .NET community will be available as a ONE-STOP solution.
Most importantly, going beyond PyTorch.Text, integrating BlingFire will let TorchSharp.Text provide many of the needed Hugging Face tokenizers, as requested here.
By integrating and implementing TorchSharp.Text as suggested above, the community can move towards the end-to-end, real-world deep NLP examples requested in the April 2021 survey.
Instead of the existing example based on the Transformer architecture, the community could contribute an end-to-end example that includes a better-performing Hugging Face-style tokenizer (BlingFire) provided through TorchSharp.Text.