
Releases: huggingface/tokenizers

Python v0.11.1

28 Dec 13:06

[#860]: Add TruncationSide to TruncationParams.
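
A minimal sketch of using the new truncation side from Python, assuming it is exposed through the `direction` argument of `enable_truncation` (the file name below is only an example):

```python
from tokenizers import Tokenizer

# Any serialized tokenizer works here; the file name is just a placeholder.
tokenizer = Tokenizer.from_file("tokenizer.json")

# Truncate from the left instead of the default right side
# (assumes the new TruncationSide is surfaced as the `direction` argument).
tokenizer.enable_truncation(max_length=128, direction="left")
```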

Python v0.11.0

24 Dec 09:15

Fixed

  • [#585]: Conda version should now work on old CentOS
  • [#844]: Fix the interaction between is_pretokenized and trim_offsets
  • [#851]: Fix documentation links

Added

  • [#657]: Add SplitDelimiterBehavior customization to the Punctuation constructor (example below)
  • [#845]: Documentation for Decoders.
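
A short sketch of [#657]'s behavior customization; "isolated" is one of the SplitDelimiterBehavior variants exposed as strings in the Python bindings:

```python
from tokenizers import pre_tokenizers

# Keep each punctuation character as its own piece ("isolated");
# "removed", "merged_with_previous", "merged_with_next" and "contiguous" also work.
pre_tok = pre_tokenizers.Punctuation(behavior="isolated")
print(pre_tok.pre_tokenize_str("Hello, world!"))
```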

Changed

  • [#850]: Add a feature gate that allows disabling HTTP features
  • [#718]: Fix WordLevel tokenizer determinism during training
  • [#762]: Add a way to specify the unknown token in SentencePieceUnigramTokenizer
  • [#770]: Improved documentation for UnigramTrainer
  • [#780]: Add Tokenizer.from_pretrained to load tokenizers from the Hugging Face Hub (example below)
  • [#793]: Saving a pretty JSON file by default when saving a tokenizer
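
A sketch of loading a tokenizer from the Hub via [#780]; the identifier is only an example and the call downloads files over the network:

```python
from tokenizers import Tokenizer

# Fetches the tokenizer.json published for this model on the Hugging Face Hub.
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.encode("Hello, world!").tokens)
```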

Node v0.8.0

02 Sep 18:12

BREAKING CHANGES

  • Many improvements to the Trainer (#519).
    The files must now be provided first when calling tokenizer.train(files, trainer).

Features

  • Add the TemplateProcessing post-processor
  • Add WordLevel and Unigram models (#490)
  • Add nmtNormalizer and precompiledNormalizer normalizers (#490)
  • Add templateProcessing post-processor (#490)
  • Add digitsPreTokenizer pre-tokenizer (#490)
  • Add support for mapping to sequences (#506)
  • Add splitPreTokenizer pre-tokenizer (#542)
  • Add behavior option to the punctuationPreTokenizer (#657)
  • Add the ability to load tokenizers from the Hugging Face Hub using fromPretrained (#780)

Fixes

  • Fix a bug where long tokenizer.json files would be incorrectly deserialized (#459)
  • Fix RobertaProcessing deserialization in PostProcessorWrapper (#464)

Python v0.10.3

24 May 21:31

Fixed

  • [#686]: Fix SPM conversion process for whitespace deduplication
  • [#707]: Fix stripping strings containing Unicode characters

Added

  • [#693]: Add a CTC Decoder for Wav2Vec2 models
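
A minimal sketch of plugging in the new CTC decoder from [#693]; the file path is just a placeholder for an existing Wav2Vec2-style tokenizer:

```python
from tokenizers import Tokenizer, decoders

tokenizer = Tokenizer.from_file("wav2vec2-tokenizer.json")  # placeholder path

# Collapse repeated CTC tokens and clean up padding/word-delimiter tokens
# when turning model output ids back into text.
tokenizer.decoder = decoders.CTC()
```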

Removed

  • [#714]: Removed support for Python 3.5

Python v0.10.2

05 Apr 20:48

Fixed

  • [#652]: Fix offsets for Precompiled corner case
  • [#656]: Fix BPE continuing_subword_prefix
  • [#674]: Fix Metaspace serialization problems

Python v0.10.1

04 Feb 15:38

Fixed

  • [#616]: Fix SentencePiece tokenizers conversion
  • [#617]: Fix offsets produced by Precompiled Normalizer (used by tokenizers converted from SPM)
  • [#618]: Fix Normalizer.normalize with PyNormalizedStringRefMut
  • [#620]: Fix serialization/deserialization for overlapping models
  • [#621]: Fix ByteLevel instantiation from a previously saved state (using __getstate__())

Python v0.10.0

12 Jan 21:36

Added

  • [#508]: Add a Visualizer for notebooks to help understand how the tokenizers work
  • [#519]: Add a WordLevelTrainer used to train a WordLevel model
  • [#533]: Add support for conda builds
  • [#542]: Add Split pre-tokenizer to easily split using a pattern
  • [#544]: Ability to train from memory. This also improves the integration with datasets (example below)
  • [#590]: Add getters/setters for components on BaseTokenizer
  • [#574]: Add fuse_unk option to SentencePieceBPETokenizer
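
A sketch of [#544]'s in-memory training as exposed in the Python bindings through train_from_iterator:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
trainer = BpeTrainer(vocab_size=1000, special_tokens=["[UNK]"])

# Any iterator of strings works: a generator, a list, or a datasets column.
corpus = (line for line in ["hello world", "training straight from memory"])
tokenizer.train_from_iterator(corpus, trainer=trainer)
```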

Changed

  • [#509]: Automatically stubbing the .pyi files
  • [#519]: Each Model can return its associated Trainer with get_trainer() (example below)
  • [#530]: The various attributes on each component can be get/set (e.g.
    tokenizer.model.dropout = 0.1)
  • [#538]: The API Reference has been improved and is now up-to-date.
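
A small sketch combining [#519] and [#530], assuming get_trainer() is exposed on the Python model objects as described above:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))

# Component attributes can now be read and written in place ([#530]).
tokenizer.model.dropout = 0.1

# Each model hands back a trainer pre-configured for it ([#519]).
trainer = tokenizer.model.get_trainer()
```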

Fixed

  • [#519]: During training, the Model is now trained in-place. This fixes several bugs
    that previously required reloading the Model after training.
  • [#539]: Fix BaseTokenizer enable_truncation docstring

Python v0.10.0rc1

08 Dec 18:32
Pre-release

Added

  • [#508]: Add a Visualizer for notebooks to help understand how the tokenizers work
  • [#519]: Add a WordLevelTrainer used to train a WordLevel model
  • [#533]: Add support for conda builds
  • [#542]: Add Split pre-tokenizer to easily split using a pattern
  • [#544]: Ability to train from memory. This also improves the integration with datasets

Changed

  • [#509]: Automatically stubbing the .pyi files
  • [#519]: Each Model can return its associated Trainer with get_trainer()
  • [#530]: The various attributes on each component can be get/set (e.g.
    tokenizer.model.dropout = 0.1)
  • [#538]: The API Reference has been improved and is now up-to-date.

Fixed

  • [#519]: During training, the Model is now trained in-place. This fixes several bugs
    that previously required reloading the Model after training.
  • [#539]: Fix BaseTokenizer enable_truncation docstring

Python v0.9.4

10 Nov 04:23

Fixed

  • [#492]: Fix from_file on BertWordPieceTokenizer
  • [#498]: Fix the link to download sentencepiece_model_pb2.py
  • [#500]: Fix a typo in the docs quicktour

Changed

  • [#506]: Improve Encoding mappings for pairs of sequences
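
A sketch of the improved pair mappings from [#506], assuming the new mappings are exposed as sequence_ids and token_to_sequence on Encoding (the file path is a placeholder):

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")  # placeholder path

# Encode a pair; each token can now be mapped back to the sequence it came from.
encoding = tokenizer.encode("What is a tokenizer?", "It splits text into tokens.")
print(encoding.sequence_ids)          # 0, 1, or None (special tokens) per token
print(encoding.token_to_sequence(5))  # index of the sequence owning token 5
```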

Python v0.9.3

26 Oct 20:41

Fixed

  • [#470]: Fix hanging error when training with a custom component
  • [#476]: TemplateProcessing serialization is now deterministic
  • [#481]: Fix SentencePieceBPETokenizer.from_files

Added

  • [#477]: Add a UnicodeScripts PreTokenizer to avoid merges between various scripts
  • [#480]: Unigram now accepts an initial_alphabet and handles special_tokens correctly
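
A sketch of [#477] and [#480] together, assuming the new Unigram options are exposed on UnigramTrainer in the Python bindings:

```python
from tokenizers import Tokenizer, pre_tokenizers, trainers
from tokenizers.models import Unigram

tokenizer = Tokenizer(Unigram())
# Split so that pieces never mix different Unicode scripts (e.g. Latin + CJK).
tokenizer.pre_tokenizer = pre_tokenizers.UnicodeScripts()

# Seed the training with an initial alphabet and special tokens ([#480]).
trainer = trainers.UnigramTrainer(
    vocab_size=8000,
    initial_alphabet=["a", "b", "c"],
    special_tokens=["<unk>"],
)
# ...then train the tokenizer with this trainer as usual.
```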