Releases: huggingface/tokenizers

Rust v0.8.0

02 Mar 19:53

Changes:

  • Big improvements in speed for BPE (both training and tokenization) (#165)

Fixes:

  • Do not open all files directly while training (#163)
  • There was a bug in the ByteLevel PreTokenizer that caused offsets to be wrong when a character was split into multiple bytes (cf #156)
  • The LongestFirst truncation strategy had a bug (#174)

Python v0.6.0

02 Mar 20:03

Changes:

  • Big improvements in speed for BPE (both training and tokenization) (#165)

Fixes:

  • Some default tokens were missing from BertWordPieceTokenizer (cf #160)
  • There was a bug in the ByteLevel PreTokenizer that caused offsets to be wrong when a character was split into multiple bytes (cf #156)
  • The longest_first truncation strategy had a bug (#174)

Python v0.5.2

24 Feb 21:10

Fixes:

  • We introduced a bug in 0.5.1 related to the saving of the WordPiece model: the vocab.txt file was named vocab.json. This is now fixed.
  • The WordLevel model was also saving its vocabulary in the wrong format.

Python v0.5.1

24 Feb 15:16

Changes:

  • The name argument is now optional when saving a Model's vocabulary. When the name is not specified, the saved files get more generic names, like vocab.json or merges.txt, as sketched below.
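
A minimal sketch of the difference, assuming the Python API of that era (the training file, output directory, and my-model name are all placeholders):

    from tokenizers import ByteLevelBPETokenizer

    # Train a small tokenizer; "data.txt" is a placeholder corpus.
    tokenizer = ByteLevelBPETokenizer()
    tokenizer.train(["data.txt"], vocab_size=1000)

    # With an explicit name, the files are prefixed: my-model-vocab.json / my-model-merges.txt
    tokenizer.save("output_dir", "my-model")

    # Without a name (new in 0.5.1), generic names are used: vocab.json / merges.txt
    tokenizer.save("output_dir")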

Python v0.5.0

18 Feb 23:59

Changes:

  • BertWordPieceTokenizer now cleans up some tokenization artifacts while decoding (cf #145)
  • ByteLevelBPETokenizer now has dropout (thanks @colinclement with #149)
  • Added a new Strip normalizer
  • The do_lowercase argument has been renamed to lowercase for consistency between the different tokenizers (especially ByteLevelBPETokenizer and CharBPETokenizer); see the sketch after this list
  • Expose __len__ on Encoding (cf #139)
  • Improved padding performance.
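
A hedged sketch of a few of these additions, assuming a previously trained ByteLevel BPE model (vocab.json and merges.txt are placeholders):

    from tokenizers import ByteLevelBPETokenizer
    from tokenizers.normalizers import Strip

    # BPE-dropout (cf #149) and the renamed lowercase argument (formerly do_lowercase).
    tokenizer = ByteLevelBPETokenizer("vocab.json", "merges.txt", dropout=0.1, lowercase=True)

    encoding = tokenizer.encode("Hello world!")
    print(len(encoding))   # Encoding now exposes __len__ (cf #139)

    normalizer = Strip()   # the new Strip normalizer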

Fixes:

  • #145: Decoding was buggy on BertWordPieceTokenizer.
  • #152: Some documentation and examples were still using the old BPETokenizer

Python v0.4.2

11 Feb 13:24

Fixes:

  • Fix a bug in the WordPieceTrainer class that prevented BertWordPieceTokenizer from being trained (cf #137)

Python v0.4.1

11 Feb 04:34

Fixes:

  • Fix a bug related to punctuation handling in BertWordPieceTokenizer (thanks to @Mansterteddy with #134)

Python v0.4.0

10 Feb 21:12

Changes:

  • Replaced all .new() class methods with a proper __new__ implementation (huge thanks to @ljos with #131); see the sketch after this list
  • Improved typings
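
A small sketch of the change, taking the ByteLevel pre-tokenizer as an assumed example (the same pattern applies to the other bound classes):

    from tokenizers import pre_tokenizers

    # Before 0.4.0 (no longer valid):
    # pre_tok = pre_tokenizers.ByteLevel.new(add_prefix_space=True)

    # From 0.4.0 on, classes are constructed directly:
    pre_tok = pre_tokenizers.ByteLevel(add_prefix_space=True)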

Python v0.3.0

05 Feb 19:03

Changes:

  • BPETokenizer has been renamed to CharBPETokenizer for clarity.
  • Added CharDelimiterSplit: a new PreTokenizer that allows splitting sequences on a given delimiter (works like .split(delimiter))
  • Added WordLevel: a new model that simply maps tokens to their ids.
  • Improved truncation/padding and the handling of overflowing tokens. When a sequence gets truncated, we now provide a list of overflowing Encodings that are ready to be processed by a language model, just like the main Encoding.
  • Provide mapping to the original string offsets using:
    output = tokenizer.encode(...)
    print(output.original_str.offsets(output.offsets[3]))
  • Exposed the vocabulary size on all tokenizers: #99 by @kdexd (see the sketch after this list)
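
A hedged sketch tying a few of these additions together, assuming a BERT vocabulary file is available (vocab.txt is a placeholder) and using the direct constructor style introduced in 0.4.0 for the pre-tokenizer:

    from tokenizers import BertWordPieceTokenizer
    from tokenizers.pre_tokenizers import CharDelimiterSplit

    # "vocab.txt" is a placeholder for an existing BERT vocabulary file.
    tokenizer = BertWordPieceTokenizer("vocab.txt")
    print(tokenizer.get_vocab_size())   # vocabulary size exposed on all tokenizers (#99)

    # Truncated sequences now carry a list of overflowing Encodings, ready for a model.
    tokenizer.enable_truncation(max_length=8)
    output = tokenizer.encode("a fairly long sentence that will overflow the limit")
    print(len(output.overflowing))

    # The new CharDelimiterSplit pre-tokenizer splits on a single character, like .split("|").
    splitter = CharDelimiterSplit("|")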

Bug fixes:

  • Fix a bug with IndexableString
  • Fix a bug with truncation

Python v0.2.1

22 Jan 21:13

Fixes:

  • Fix a bug with the IDs associated with added tokens.
  • Fix a bug that was causing crashes in Python 3.5