
Commit 0eb7455

Preparing 0.12 release. (#967)
* Preparing `0.12` release.
* Fix click version: psf/black#2964
1 parent: 28cd3dc


8 files changed (+64, -5 lines)

.github/workflows/python.yml

Lines changed: 1 addition & 1 deletion
@@ -107,7 +107,7 @@ jobs:
        working-directory: ./bindings/python
        run: |
          source .env/bin/activate
-         pip install black==20.8b1
+         pip install black==20.8b1 click==8.0.4
          make check-style

      - name: Run tests
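
For context: `black==20.8b1` imports a private `click` module that `click` 8.1 removed, so unpinned CI runs started failing (psf/black#2964). A minimal sketch of the failure mode, assuming `click>=8.1` is installed:

```python
# Sketch of the incompatibility behind the pin (psf/black#2964).
# black 20.8b1 effectively does `from click import _unicodefun`, which
# raises ImportError (not ModuleNotFoundError) on click >= 8.1, where
# that private module was removed.
try:
    from click import _unicodefun  # noqa: F401
except ImportError:
    print("click >= 8.1 detected: black 20.8b1 would crash; pin click==8.0.4")
```

Pinning `click==8.0.4` alongside `black==20.8b1` keeps the style check green without upgrading black.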

bindings/node/CHANGELOG.md

Lines changed: 21 additions & 0 deletions
@@ -1,3 +1,15 @@
+## [0.12.0]
+
+Bumped the minor version because of a breaking change.
+Using `0.12` to match the other bindings.
+
+- [#938] **Breaking change**. The `Decoder` trait is modified to be composable. This is only breaking if you use decoders on their own; `Tokenizer` usage should remain error free.
+- [#939] Made the regex in the `ByteLevel` pre_tokenizer optional (necessary for BigScience)
+
+- [#952] Fixed the vocabulary size of `UnigramTrainer` output (to respect added tokens)
+- [#954] Fixed not being able to save vocabularies with holes in the vocab (ConvBert): a warning is now emitted instead of a panic
+- [#961] Added a link to the Ruby port of `tokenizers`
+
 # [0.8.0](https://github.com/huggingface/tokenizers/compare/node-v0.7.0...node-v0.8.0) (2021-09-02)

 ### BREAKING CHANGES
@@ -142,3 +154,12 @@ The files must now be provided first when calling `tokenizer.train(files, traine
 - Fix default special tokens in `BertWordPieceTokenizer` ([10e2d28](https://github.com/huggingface/tokenizers/commit/10e2d286caf517f0977c04cf8e1924aed90403c9))
 - Fix return type of `getSpecialTokensMask` on `Encoding` ([9770be5](https://github.com/huggingface/tokenizers/commit/9770be566175dc9c44dd7dcaa00a57d0e4ca632b))
 - Actually add special tokens in tokenizers implementations ([acef252](https://github.com/huggingface/tokenizers/commit/acef252dacc43adc414175cfc325668ad1488753))
+
+
+[#938]: https://github.com/huggingface/tokenizers/pull/938
+[#939]: https://github.com/huggingface/tokenizers/pull/939
+[#952]: https://github.com/huggingface/tokenizers/pull/952
+[#954]: https://github.com/huggingface/tokenizers/pull/954
+[#962]: https://github.com/huggingface/tokenizers/pull/962
+[#961]: https://github.com/huggingface/tokenizers/pull/961
+[#960]: https://github.com/huggingface/tokenizers/pull/960
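
For context on [#938]: only standalone decoder usage is affected. A rough sketch with the Python bindings (the Node bindings expose the same decoders), assuming tokenizers 0.12:

```python
from tokenizers import decoders

# Standalone decoder usage: the only pattern #938 can break.
# Decoding through Tokenizer.decode() is unaffected.
byte_level = decoders.ByteLevel()
print(byte_level.decode(["Hello", "Ġworld"]))  # -> "Hello world"
```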

bindings/node/package.json

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 {
   "name": "tokenizers",
-  "version": "0.8.3",
+  "version": "0.12.0",
   "description": "",
   "main": "./dist/index.js",
   "types": "./dist/index.d.ts",

bindings/python/CHANGELOG.md

Lines changed: 19 additions & 0 deletions
@@ -4,6 +4,18 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

+## [0.12.0]
+
+Bumped the minor version because of a breaking change.
+
+- [#938] **Breaking change**. The `Decoder` trait is modified to be composable. This is only breaking if you use decoders on their own; `Tokenizer` usage should remain error free.
+- [#939] Made the regex in the `ByteLevel` pre_tokenizer optional (necessary for BigScience)
+
+- [#952] Fixed the vocabulary size of `UnigramTrainer` output (to respect added tokens)
+- [#954] Fixed not being able to save vocabularies with holes in the vocab (ConvBert): a warning is now emitted instead of a panic
+- [#962] Fixed tests for Python 3.10
+- [#961] Added a link to the Ruby port of `tokenizers`
+
 ## [0.11.6]

 - [#919] Fixing single_word AddedToken. (regression from 0.11.2)
@@ -360,6 +372,13 @@ delimiter (Works like `.split(delimiter)`)
 - Fix a bug that was causing crashes in Python 3.5


+[#938]: https://github.com/huggingface/tokenizers/pull/938
+[#939]: https://github.com/huggingface/tokenizers/pull/939
+[#952]: https://github.com/huggingface/tokenizers/pull/952
+[#954]: https://github.com/huggingface/tokenizers/pull/954
+[#962]: https://github.com/huggingface/tokenizers/pull/962
+[#961]: https://github.com/huggingface/tokenizers/pull/961
+[#960]: https://github.com/huggingface/tokenizers/pull/960
 [#919]: https://github.com/huggingface/tokenizers/pull/919
 [#916]: https://github.com/huggingface/tokenizers/pull/916
 [#895]: https://github.com/huggingface/tokenizers/pull/895
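
To illustrate [#939], a short sketch, assuming the option is exposed to Python as a `use_regex` flag on the `ByteLevel` pre-tokenizer:

```python
from tokenizers import pre_tokenizers

# Default ByteLevel: split with the GPT-2 regex first, then byte-map.
with_regex = pre_tokenizers.ByteLevel(add_prefix_space=False)
print(with_regex.pre_tokenize_str("Hello there"))
# -> [('Hello', (0, 5)), ('Ġthere', (5, 11))]

# use_regex=False skips the regex split and byte-maps the raw string
# as a single piece: the behavior BigScience needed.
no_regex = pre_tokenizers.ByteLevel(add_prefix_space=False, use_regex=False)
print(no_regex.pre_tokenize_str("Hello there"))
# -> [('HelloĠthere', (0, 11))]
```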

bindings/python/py_src/tokenizers/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
-__version__ = "0.11.6"
+__version__ = "0.12.0"

 from typing import Tuple, Union, Tuple, List
 from enum import Enum

bindings/python/setup.py

Lines changed: 1 addition & 1 deletion
@@ -7,7 +7,7 @@

 setup(
     name="tokenizers",
-    version="0.11.6",
+    version="0.12.0",
     description="Fast and Customizable Tokenizers",
     long_description=open("README.md", "r", encoding="utf-8").read(),
     long_description_content_type="text/markdown",

tokenizers/CHANGELOG.md

Lines changed: 19 additions & 0 deletions
@@ -4,6 +4,18 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

+## [0.12.0]
+
+Bumped the minor version because of a breaking change.
+
+- [#938] **Breaking change**. The `Decoder` trait is modified to be composable. This is only breaking if you use decoders on their own; `Tokenizer` usage should remain error free.
+- [#939] Made the regex in the `ByteLevel` pre_tokenizer optional (necessary for BigScience)
+
+- [#952] Fixed the vocabulary size of `UnigramTrainer` output (to respect added tokens)
+- [#954] Fixed not being able to save vocabularies with holes in the vocab (ConvBert): a warning is now emitted instead of a panic
+- [#961] Added a link to the Ruby port of `tokenizers`
+- [#960] Added a feature gate for `cli` and its `clap` dependency
+
 ## [0.11.3]

 - [#919] Fixing single_word AddedToken. (regression from 0.11.2)
@@ -140,6 +152,13 @@ advised, but that's not the question)
 split up in multiple bytes
 - [#174]: The `LongestFirst` truncation strategy had a bug

+
+[#938]: https://github.com/huggingface/tokenizers/pull/938
+[#939]: https://github.com/huggingface/tokenizers/pull/939
+[#952]: https://github.com/huggingface/tokenizers/pull/952
+[#954]: https://github.com/huggingface/tokenizers/pull/954
+[#961]: https://github.com/huggingface/tokenizers/pull/961
+[#960]: https://github.com/huggingface/tokenizers/pull/960
 [#919]: https://github.com/huggingface/tokenizers/pull/919
 [#916]: https://github.com/huggingface/tokenizers/pull/916
 [#884]: https://github.com/huggingface/tokenizers/pull/884
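
A quick way to check the [#952] fix from the Python bindings (a sketch; the corpus and sizes are arbitrary):

```python
from tokenizers import Tokenizer, models, trainers

# After #952, the trained Unigram vocabulary respects vocab_size
# *including* the special/added tokens.
tokenizer = Tokenizer(models.Unigram())
trainer = trainers.UnigramTrainer(vocab_size=100, special_tokens=["<unk>", "<pad>"])
tokenizer.train_from_iterator(
    ["some training text", "more training text, repeated enough to train on"],
    trainer=trainer,
)
print(tokenizer.get_vocab_size())  # expected: at most 100
```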

tokenizers/Cargo.toml

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@
 authors = ["Anthony MOI <m.anthony.moi@gmail.com>"]
 edition = "2018"
 name = "tokenizers"
-version = "0.11.3"
+version = "0.12.0"
 homepage = "https://github.com/huggingface/tokenizers"
 repository = "https://github.com/huggingface/tokenizers"
 documentation = "https://docs.rs/tokenizers/"
