Releases: huggingface/tokenizers
Rust v0.8.0
Python v0.6.0
Python v0.5.2
Fixes:
- We introduced a bug related to the saving of the WordPiece model in 0.5.2: the `vocab.txt` file was named `vocab.json`. This is now fixed.
- The `WordLevel` model was also saving its vocabulary in the wrong format.
Python v0.5.1
Changes:
- The `name` argument is now optional when saving a `Model`'s vocabulary. When the name is not specified, the files get more generic names, like `vocab.json` or `merges.txt`.
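As a rough illustration of the naming rule described above, here is a small pure-Python sketch. The helper `vocab_file_names` is hypothetical and not part of the library, and the `<name>-` prefix used when a name is given is an assumption about the pre-existing behavior; only the generic `vocab.json` / `merges.txt` names come from the note above.

```python
def vocab_file_names(name=None, base_names=("vocab.json", "merges.txt")):
    """Return the file names a model's vocabulary might be saved under.

    Hypothetical helper: the "<name>-" prefix is an assumption, the
    generic names are the ones mentioned in the release note.
    """
    prefix = f"{name}-" if name else ""
    return [prefix + base for base in base_names]


print(vocab_file_names())          # generic names when no name is given
print(vocab_file_names("my-bpe"))  # prefixed names when a name is given
```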
Python v0.5.0
Changes:
- `BertWordPieceTokenizer` now cleans up some tokenization artifacts while decoding (cf #145)
- `ByteLevelBPETokenizer` now has `dropout` (thanks @colinclement with #149)
- Added a new `Strip` normalizer
- `do_lowercase` has been changed to `lowercase` for consistency between the different tokenizers. (Especially `ByteLevelBPETokenizer` and `CharBPETokenizer`)
- Expose `__len__` on `Encoding` (cf #139)
- Improved padding performance.
Fixes:
Python v0.4.2
Fixes:
- Fix a bug in the class `WordPieceTrainer` that prevented `BertWordPieceTokenizer` from being trained. (cf #137)
Python v0.4.1
Fixes:
- Fix a bug related to the punctuation in BertWordPieceTokenizer (Thanks to @Mansterteddy with #134)
Python v0.4.0
Python v0.3.0
Changes:
- BPETokenizer has been renamed to CharBPETokenizer for clarity.
- Added `CharDelimiterSplit`: a new `PreTokenizer` that allows splitting sequences on the given delimiter (Works like `.split(delimiter)`)
- Added `WordLevel`: a new model that simply maps `tokens` to their `ids`.
- Improve truncation/padding and the handling of overflowing tokens. Now when a sequence gets truncated, we provide a list of overflowing `Encoding` that are ready to be processed by a language model, just as the main `Encoding`.
- Provide mapping to the original string offsets using:
output = tokenizer.encode(...)
print(output.original_str.offsets(output.offsets[3]))
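To make the `CharDelimiterSplit`, `WordLevel`, and overflowing-tokens items above more concrete, here is a pure-Python sketch of the three ideas chained together. It does not use the library: the function names, the `<unk>` fallback token, and the non-overlapping chunking of overflowing ids are all assumptions chosen for illustration.

```python
def split_with_offsets(text, delimiter):
    """Split like text.split(delimiter), keeping (start, end) offsets
    of each piece in the original string."""
    pieces = []
    start = 0
    for piece in text.split(delimiter):
        end = start + len(piece)
        pieces.append((piece, (start, end)))
        start = end + len(delimiter)
    return pieces


def word_level_ids(tokens, vocab, unk_token="<unk>"):
    """Map each token to its id; unknown tokens fall back to unk_token
    (the fallback token name is an assumption)."""
    unk_id = vocab[unk_token]
    return [vocab.get(token, unk_id) for token in tokens]


def truncate_with_overflow(ids, max_length):
    """Return the first max_length ids plus a list of overflowing chunks,
    each at most max_length long (no overlap between chunks)."""
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    chunks = [ids[i:i + max_length] for i in range(0, len(ids), max_length)]
    if not chunks:
        return [], []
    return chunks[0], chunks[1:]


vocab = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3}  # toy vocabulary
pieces = split_with_offsets("the cat sat down", " ")
ids = word_level_ids([token for token, _ in pieces], vocab)
main, overflowing = truncate_with_overflow(ids, max_length=3)
print(pieces)
print(main, overflowing)
```

Keeping the offsets next to each piece is what makes it possible to map a token back to its span in the original string.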
Bug fixes:
- Fix a bug with IndexableString
- Fix a bug with truncation
Python v0.2.1
- Fix a bug with the IDs associated with added tokens.
- Fix a bug that was causing crashes in Python 3.5