Releases · huggingface/tokenizers
Python v0.11.1
- [#860]: Adding `TruncationSide` to `TruncationParams` (see the sketch below).
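
In the Python bindings this lands as the `direction` argument of `enable_truncation`. A minimal sketch, assuming a serialized `tokenizer.json` is available locally (the file name here is illustrative):

```python
from tokenizers import Tokenizer

# Hypothetical local file; any serialized tokenizer would do.
tokenizer = Tokenizer.from_file("tokenizer.json")

# Truncate from the left instead of the default right side.
tokenizer.enable_truncation(max_length=128, direction="left")
```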
Python v0.11.0
Fixed
- [#585]: Conda version should now work on old CentOS
- [#844]: Fixing interaction between `is_pretokenized` and `trim_offsets`
- [#851]: Fix documentation links
Added
- [#657]: Add `SplitDelimiterBehavior` customization to the `Punctuation` constructor (see the sketch after this list)
- [#845]: Documentation for `Decoders`
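
A minimal sketch of the `Punctuation` customization from #657, assuming the Python bindings accept the behavior as a string mirroring Rust's `SplitDelimiterBehavior` variants:

```python
from tokenizers.pre_tokenizers import Punctuation

# "isolated" keeps each punctuation character as its own piece;
# other behaviors include "removed" and "merged_with_previous".
pre_tok = Punctuation(behavior="isolated")
print(pre_tok.pre_tokenize_str("Hello, world!"))
# [('Hello', (0, 5)), (',', (5, 6)), ('world', (7, 12)), ('!', (12, 13))]
```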
Changed
- [#850]: Added a feature gate to enable disabling `http` features
- [#718]: Fix `WordLevel` tokenizer determinism during training
- [#762]: Add a way to specify the unknown token in `SentencePieceUnigramTokenizer`
- [#770]: Improved documentation for `UnigramTrainer`
- [#780]: Add `Tokenizer.from_pretrained` to load tokenizers from the Hugging Face Hub (see the sketch after this list)
- [#793]: Saving a pretty JSON file by default when saving a tokenizer
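
A minimal sketch of #780's Hub loading; the model identifier is just a public example:

```python
from tokenizers import Tokenizer

# Downloads the tokenizer.json published with the model on the Hub.
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.encode("Hello, world!").tokens)
```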
Node v0.8.0
BREAKING CHANGES
- Many improvements on the Trainer (#519). The files must now be provided first when calling `tokenizer.train(files, trainer)`.
Features
- Adding the `TemplateProcessing`
- Add `WordLevel` and `Unigram` models (#490)
- Add `nmtNormalizer` and `precompiledNormalizer` normalizers (#490)
- Add `templateProcessing` post-processor (#490)
- Add `digitsPreTokenizer` pre-tokenizer (#490)
- Add support for mapping to sequences (#506)
- Add `splitPreTokenizer` pre-tokenizer (#542)
- Add `behavior` option to the `punctuationPreTokenizer` (#657)
- Add the ability to load tokenizers from the Hugging Face Hub using `fromPretrained` (#780)
Fixes
Python v0.10.3
Python v0.10.2
Python v0.10.1
Fixed
- [#616]: Fix SentencePiece tokenizers conversion
- [#617]: Fix offsets produced by Precompiled Normalizer (used by tokenizers converted from SPM)
- [#618]: Fix Normalizer.normalize with `PyNormalizedStringRefMut`
- [#620]: Fix serialization/deserialization for overlapping models
- [#621]: Fix `ByteLevel` instantiation from a previously saved state (using `__getstate__()`; see the sketch after this list)
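
A minimal sketch of the scenario behind #621: round-tripping a `ByteLevel` component through pickle, which goes through `__getstate__()`:

```python
import pickle
from tokenizers.pre_tokenizers import ByteLevel

byte_level = ByteLevel(add_prefix_space=False)
# Serialize and restore; this is the path the fix addresses.
restored = pickle.loads(pickle.dumps(byte_level))
```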
Python v0.10.0
Added
- [#508]: Add a Visualizer for notebooks to help understand how the tokenizers work
- [#519]: Add a `WordLevelTrainer` used to train a `WordLevel` model
- [#533]: Add support for conda builds
- [#542]: Add Split pre-tokenizer to easily split using a pattern
- [#544]: Ability to train from memory. This also improves the integration with `datasets` (see the sketch after this list)
- [#590]: Add getters/setters for components on BaseTokenizer
- [#574]: Add `fuse_unk` option to SentencePieceBPETokenizer
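
A minimal sketch of #544's in-memory training; the BPE model, trainer settings, and toy corpus below are illustrative choices, not the only ones supported:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
trainer = BpeTrainer(special_tokens=["[UNK]"], vocab_size=1000)

# Any iterator of strings works, e.g. a column from a `datasets` dataset.
corpus = ["a first example sentence", "and a second one"]
tokenizer.train_from_iterator(corpus, trainer=trainer)
```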
Changed
- [#509]: Automatically stubbing the `.pyi` files
- [#519]: Each `Model` can return its associated `Trainer` with `get_trainer()` (see the sketch after this list)
- [#530]: The various attributes on each component can be get/set (i.e. `tokenizer.model.dropout = 0.1`)
- [#538]: The API Reference has been improved and is now up-to-date.
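
A minimal sketch of #519 and #530 together, on a toy BPE model:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
trainer = tokenizer.model.get_trainer()  # returns a trainer matching the model
tokenizer.model.dropout = 0.1            # component attributes are directly settable
```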
Fixed
Python v0.10.0rc1
Added
- [#508]: Add a Visualizer for notebooks to help understand how the tokenizers work
- [#519]: Add a `WordLevelTrainer` used to train a `WordLevel` model
- [#533]: Add support for conda builds
- [#542]: Add Split pre-tokenizer to easily split using a pattern
- [#544]: Ability to train from memory. This also improves the integration with `datasets`
Changed
- [#509]: Automatically stubbing the `.pyi` files
- [#519]: Each `Model` can return its associated `Trainer` with `get_trainer()`
- [#530]: The various attributes on each component can be get/set (i.e. `tokenizer.model.dropout = 0.1`)
- [#538]: The API Reference has been improved and is now up-to-date.