Releases: huggingface/tokenizers

Rust v0.8.0

02 Mar 19:53

Changes:

  • Big improvements in speed for BPE (both training and tokenization) (#165)

Fixes:

  • Do not open all files directly while training (#163)
  • There was a bug in the ByteLevel PreTokenizer that caused offsets to be wrong when a character was split into multiple bytes (cf #156)
  • The LongestFirst truncation strategy had a bug (#174)

Python v0.6.0

02 Mar 20:03

Changes:

  • Big improvements in speed for BPE (both training and tokenization) (#165)

Fixes:

  • Some default tokens were missing from BertWordPieceTokenizer (cf #160)
  • There was a bug in the ByteLevel PreTokenizer that caused offsets to be wrong when a character was split into multiple bytes (cf #156)
  • The longest_first truncation strategy had a bug (#174)

Python v0.5.2

24 Feb 21:10

Fixes:

  • We introduced a bug in 0.5.1 related to the saving of the WordPiece model: the vocab.txt file was named vocab.json. This is now fixed.
  • The WordLevel model was also saving its vocabulary in the wrong format.

Python v0.5.1

24 Feb 15:16

Changes:

  • The name argument is now optional when saving a Model's vocabulary. When the name is not specified, the saved files get more generic names, like vocab.json or merges.txt, as sketched below.
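
A minimal sketch of the difference, assuming the Python API of that era (the training file, output directory, and my-model name are all placeholders):

    from tokenizers import ByteLevelBPETokenizer

    # Train a small tokenizer; "data.txt" is a placeholder corpus.
    tokenizer = ByteLevelBPETokenizer()
    tokenizer.train(["data.txt"], vocab_size=1000)

    # With an explicit name, the files are prefixed: my-model-vocab.json / my-model-merges.txt
    tokenizer.save("output_dir", "my-model")

    # Without a name (new in 0.5.1), generic names are used: vocab.json / merges.txt
    tokenizer.save("output_dir")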

Python v0.5.0

18 Feb 23:59

Changes:

  • BertWordPieceTokenizer now cleans up some tokenization artifacts while decoding (cf #145)
  • ByteLevelBPETokenizer now has dropout (thanks @colinclement with #149)
  • Added a new Strip normalizer
  • The do_lowercase argument has been renamed to lowercase for consistency between the different tokenizers (especially ByteLevelBPETokenizer and CharBPETokenizer); see the sketch after this list
  • Expose __len__ on Encoding (cf #139)
  • Improved padding performance.
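
A hedged sketch of a few of these additions, assuming a previously trained ByteLevel BPE model (vocab.json and merges.txt are placeholders):

    from tokenizers import ByteLevelBPETokenizer
    from tokenizers.normalizers import Strip

    # BPE-dropout (cf #149) and the renamed lowercase argument (formerly do_lowercase).
    tokenizer = ByteLevelBPETokenizer("vocab.json", "merges.txt", dropout=0.1, lowercase=True)

    encoding = tokenizer.encode("Hello world!")
    print(len(encoding))   # Encoding now exposes __len__ (cf #139)

    normalizer = Strip()   # the new Strip normalizer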

Fixes:

  • #145: Decoding was buggy on BertWordPieceTokenizer.
  • #152: Some documentation and examples were still using the old BPETokenizer

Python v0.4.2

11 Feb 13:24

Fixes:

  • Fix a bug in the WordPieceTrainer class that prevented BertWordPieceTokenizer from being trained (cf #137)

Python v0.4.1

11 Feb 04:34

Fixes:

  • Fix a bug related to punctuation handling in BertWordPieceTokenizer (thanks to @Mansterteddy with #134)

Python v0.4.0

10 Feb 21:12

Changes:

  • Replaced all .new() class methods with a proper __new__ implementation (huge thanks to @ljos with #131); see the sketch after this list
  • Improved typings
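
A small sketch of the change, taking the ByteLevel pre-tokenizer as an assumed example (the same pattern applies to the other bound classes):

    from tokenizers import pre_tokenizers

    # Before 0.4.0 (no longer valid):
    # pre_tok = pre_tokenizers.ByteLevel.new(add_prefix_space=True)

    # From 0.4.0 on, classes are constructed directly:
    pre_tok = pre_tokenizers.ByteLevel(add_prefix_space=True)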

Python v0.3.0

05 Feb 19:03

Changes:

  • BPETokenizer has been renamed to CharBPETokenizer for clarity.
  • Added CharDelimiterSplit: a new PreTokenizer that allows splitting sequences on a given delimiter (works like .split(delimiter))
  • Added WordLevel: a new model that simply maps tokens to their ids.
  • Improved truncation/padding and the handling of overflowing tokens. When a sequence gets truncated, we now provide a list of overflowing Encodings that are ready to be processed by a language model, just like the main Encoding.
  • Provide mapping to the original string offsets using:
    output = tokenizer.encode(...)
    print(output.original_str.offsets(output.offsets[3]))
  • Exposed the vocabulary size on all tokenizers: #99 by @kdexd (see the sketch after this list)
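
A hedged sketch tying a few of these additions together, assuming a BERT vocabulary file is available (vocab.txt is a placeholder) and using the direct constructor style introduced in 0.4.0 for the pre-tokenizer:

    from tokenizers import BertWordPieceTokenizer
    from tokenizers.pre_tokenizers import CharDelimiterSplit

    # "vocab.txt" is a placeholder for an existing BERT vocabulary file.
    tokenizer = BertWordPieceTokenizer("vocab.txt")
    print(tokenizer.get_vocab_size())   # vocabulary size exposed on all tokenizers (#99)

    # Truncated sequences now carry a list of overflowing Encodings, ready for a model.
    tokenizer.enable_truncation(max_length=8)
    output = tokenizer.encode("a fairly long sentence that will overflow the limit")
    print(len(output.overflowing))

    # The new CharDelimiterSplit pre-tokenizer splits on a single character, like .split("|").
    splitter = CharDelimiterSplit("|")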

Bug fixes:

  • Fix a bug with IndexableString
  • Fix a bug with truncation

Python v0.2.1

22 Jan 21:13

Fixes:

  • Fix a bug with the IDs associated with added tokens.
  • Fix a bug that was causing crashes in Python 3.5