
Releases: huggingface/tokenizers

Python v0.11.1

28 Dec 13:06

[#860]: Add TruncationSide to TruncationParams.
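
A minimal sketch of using the new truncation side from Python, assuming it is exposed through the `direction` argument of `enable_truncation` (the file name below is only an example):

```python
from tokenizers import Tokenizer

# Any serialized tokenizer works here; the file name is just a placeholder.
tokenizer = Tokenizer.from_file("tokenizer.json")

# Truncate from the left instead of the default right side
# (assumes the new TruncationSide is surfaced as the `direction` argument).
tokenizer.enable_truncation(max_length=128, direction="left")
```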

Python v0.11.0

24 Dec 09:15

Fixed

  • [#585]: Conda version should now work on old CentOS
  • [#844]: Fix the interaction between is_pretokenized and trim_offsets
  • [#851]: Fix documentation links

Added

  • [#657]: Add SplitDelimiterBehavior customization to the Punctuation constructor (example below)
  • [#845]: Documentation for Decoders.
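
A short sketch of [#657]'s behavior customization; "isolated" is one of the SplitDelimiterBehavior variants exposed as strings in the Python bindings:

```python
from tokenizers import pre_tokenizers

# Keep each punctuation character as its own piece ("isolated");
# "removed", "merged_with_previous", "merged_with_next" and "contiguous" also work.
pre_tok = pre_tokenizers.Punctuation(behavior="isolated")
print(pre_tok.pre_tokenize_str("Hello, world!"))
```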

Changed

  • [#850]: Add a feature gate that allows disabling HTTP features
  • [#718]: Fix WordLevel tokenizer determinism during training
  • [#762]: Add a way to specify the unknown token in SentencePieceUnigramTokenizer
  • [#770]: Improved documentation for UnigramTrainer
  • [#780]: Add Tokenizer.from_pretrained to load tokenizers from the Hugging Face Hub (example below)
  • [#793]: Saving a pretty JSON file by default when saving a tokenizer
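
A sketch of loading a tokenizer from the Hub via [#780]; the identifier is only an example and the call downloads files over the network:

```python
from tokenizers import Tokenizer

# Fetches the tokenizer.json published for this model on the Hugging Face Hub.
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.encode("Hello, world!").tokens)
```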

Node v0.8.0

02 Sep 18:12

BREAKING CHANGES

  • Many improvements to the Trainer (#519).
    The files must now be provided first when calling tokenizer.train(files, trainer).

Features

  • Add the TemplateProcessing post-processor
  • Add WordLevel and Unigram models (#490)
  • Add nmtNormalizer and precompiledNormalizer normalizers (#490)
  • Add templateProcessing post-processor (#490)
  • Add digitsPreTokenizer pre-tokenizer (#490)
  • Add support for mapping to sequences (#506)
  • Add splitPreTokenizer pre-tokenizer (#542)
  • Add behavior option to the punctuationPreTokenizer (#657)
  • Add the ability to load tokenizers from the Hugging Face Hub using fromPretrained (#780)

Fixes

  • Fix a bug where long tokenizer.json files would be incorrectly deserialized (#459)
  • Fix RobertaProcessing deserialization in PostProcessorWrapper (#464)

Python v0.10.3

24 May 21:31

Fixed

  • [#686]: Fix SPM conversion process for whitespace deduplication
  • [#707]: Fix stripping strings containing Unicode characters

Added

  • [#693]: Add a CTC Decoder for Wav2Vec2 models
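
A minimal sketch of plugging in the new CTC decoder from [#693]; the file path is just a placeholder for an existing Wav2Vec2-style tokenizer:

```python
from tokenizers import Tokenizer, decoders

tokenizer = Tokenizer.from_file("wav2vec2-tokenizer.json")  # placeholder path

# Collapse repeated CTC tokens and clean up padding/word-delimiter tokens
# when turning model output ids back into text.
tokenizer.decoder = decoders.CTC()
```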

Removed

  • [#714]: Removed support for Python 3.5

Python v0.10.2

05 Apr 20:48

Fixed

  • [#652]: Fix offsets for Precompiled corner case
  • [#656]: Fix BPE continuing_subword_prefix
  • [#674]: Fix Metaspace serialization problems

Python v0.10.1

04 Feb 15:38

Fixed

  • [#616]: Fix SentencePiece tokenizers conversion
  • [#617]: Fix offsets produced by Precompiled Normalizer (used by tokenizers converted from SPM)
  • [#618]: Fix Normalizer.normalize with PyNormalizedStringRefMut
  • [#620]: Fix serialization/deserialization for overlapping models
  • [#621]: Fix ByteLevel instantiation from a previously saved state (using __getstate__())

Python v0.10.0

12 Jan 21:36

Added

  • [#508]: Add a Visualizer for notebooks to help understand how the tokenizers work
  • [#519]: Add a WordLevelTrainer used to train a WordLevel model
  • [#533]: Add support for conda builds
  • [#542]: Add Split pre-tokenizer to easily split using a pattern
  • [#544]: Ability to train from memory. This also improves the integration with datasets (example below)
  • [#590]: Add getters/setters for components on BaseTokenizer
  • [#574]: Add fuse_unk option to SentencePieceBPETokenizer
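
A sketch of [#544]'s in-memory training as exposed in the Python bindings through train_from_iterator:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
trainer = BpeTrainer(vocab_size=1000, special_tokens=["[UNK]"])

# Any iterator of strings works: a generator, a list, or a datasets column.
corpus = (line for line in ["hello world", "training straight from memory"])
tokenizer.train_from_iterator(corpus, trainer=trainer)
```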

Changed

  • [#509]: Automatically stubbing the .pyi files
  • [#519]: Each Model can return its associated Trainer with get_trainer() (example below)
  • [#530]: The various attributes on each component can be get/set (e.g.
    tokenizer.model.dropout = 0.1)
  • [#538]: The API Reference has been improved and is now up-to-date.
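
A small sketch combining [#519] and [#530], assuming get_trainer() is exposed on the Python model objects as described above:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))

# Component attributes can now be read and written in place ([#530]).
tokenizer.model.dropout = 0.1

# Each model hands back a trainer pre-configured for it ([#519]).
trainer = tokenizer.model.get_trainer()
```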

Fixed

  • [#519]: During training, the Model is now trained in-place. This fixes several bugs
    that previously required reloading the Model after training.
  • [#539]: Fix BaseTokenizer enable_truncation docstring

Python v0.10.0rc1

08 Dec 18:32
Pre-release

Added

  • [#508]: Add a Visualizer for notebooks to help understand how the tokenizers work
  • [#519]: Add a WordLevelTrainer used to train a WordLevel model
  • [#533]: Add support for conda builds
  • [#542]: Add Split pre-tokenizer to easily split using a pattern
  • [#544]: Ability to train from memory. This also improves the integration with datasets

Changed

  • [#509]: Automatically stubbing the .pyi files
  • [#519]: Each Model can return its associated Trainer with get_trainer()
  • [#530]: The various attributes on each component can be get/set (e.g.
    tokenizer.model.dropout = 0.1)
  • [#538]: The API Reference has been improved and is now up-to-date.

Fixed

  • [#519]: During training, the Model is now trained in-place. This fixes several bugs
    that previously required reloading the Model after training.
  • [#539]: Fix BaseTokenizer enable_truncation docstring

Python v0.9.4

10 Nov 04:23

Fixed

  • [#492]: Fix from_file on BertWordPieceTokenizer
  • [#498]: Fix the link to download sentencepiece_model_pb2.py
  • [#500]: Fix a typo in the docs quicktour

Changed

  • [#506]: Improve Encoding mappings for pairs of sequences
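
A sketch of the improved pair mappings from [#506], assuming the new mappings are exposed as sequence_ids and token_to_sequence on Encoding (the file path is a placeholder):

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")  # placeholder path

# Encode a pair; each token can now be mapped back to the sequence it came from.
encoding = tokenizer.encode("What is a tokenizer?", "It splits text into tokens.")
print(encoding.sequence_ids)          # 0, 1, or None (special tokens) per token
print(encoding.token_to_sequence(5))  # index of the sequence owning token 5
```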

Python v0.9.3

26 Oct 20:41

Fixed

  • [#470]: Fix hanging error when training with a custom component
  • [#476]: TemplateProcessing serialization is now deterministic
  • [#481]: Fix SentencePieceBPETokenizer.from_files

Added

  • [#477]: Add a UnicodeScripts PreTokenizer to avoid merges between various scripts
  • [#480]: Unigram now accepts an initial_alphabet and handles special_tokens correctly
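
A sketch of [#477] and [#480] together, assuming the new Unigram options are exposed on UnigramTrainer in the Python bindings:

```python
from tokenizers import Tokenizer, pre_tokenizers, trainers
from tokenizers.models import Unigram

tokenizer = Tokenizer(Unigram())
# Split so that pieces never mix different Unicode scripts (e.g. Latin + CJK).
tokenizer.pre_tokenizer = pre_tokenizers.UnicodeScripts()

# Seed the training with an initial alphabet and special tokens ([#480]).
trainer = trainers.UnigramTrainer(
    vocab_size=8000,
    initial_alphabet=["a", "b", "c"],
    special_tokens=["<unk>"],
)
# ...then train the tokenizer with this trainer as usual.
```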