
Commit 681df50

Refactor Tokenizer -> BaseTokenizer (#1333)
This causes breaking changes: users will need to re-download the tokenizer files (`python scripts/download_tokenizer.py ...`).

- Remove the `tiktoken` dependency, remove `tiktoken.py`
- Refactor the `Tokenizer` base class
- Update config files to point to the tokenizer directory instead of `tokenizer.model`
- Raise an exception if `tokenizer.model` is used as `tokenizer_path`
1 parent 8518306 commit 681df50

40 files changed: +2257 −2454 lines
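
For users, the practical impact of the breaking change above is that `tokenizer_path` now names a directory of downloaded tokenizer assets rather than a single `tokenizer.model` file. Below is a minimal sketch of that validation; the function name and error message are illustrative, not taken from the commit.

```python
# Minimal sketch (assumed behavior, not the exact torchtitan code): the config's
# tokenizer_path must now be a directory produced by download_tokenizer.py, and the
# old single-file tokenizer.model path is rejected explicitly.
import os


def validate_tokenizer_path(tokenizer_path: str) -> str:
    if tokenizer_path.endswith("tokenizer.model"):
        raise ValueError(
            "tokenizer_path now points to a directory; re-download the tokenizer "
            "files with `python scripts/download_tokenizer.py ...` and pass that "
            "directory instead of tokenizer.model."
        )
    if not os.path.isdir(tokenizer_path):
        raise FileNotFoundError(f"tokenizer directory not found: {tokenizer_path}")
    return tokenizer_path
```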

.ci/docker/requirements-dev.txt

Lines changed: 1 addition & 0 deletions
@@ -3,3 +3,4 @@ pytest==7.3.2
 pytest-cov
 pre-commit
 tomli-w >= 1.1.0
+transformers

.ci/docker/requirements.txt

Lines changed: 0 additions & 2 deletions
@@ -2,8 +2,6 @@ torchdata >= 0.8.0
 datasets >= 3.6.0
 tomli >= 1.1.0 ; python_version < "3.11"
 tensorboard
-tiktoken
-blobfile
 tabulate
 wandb
 fsspec

CONTRIBUTING.md

Lines changed: 1 addition & 1 deletion
@@ -14,7 +14,7 @@ We actively welcome your pull requests.
 2. If you've added code that should be tested, add tests.
 3. If you've changed APIs, update the documentation.
 4. Ensure the test suite passes.
-5. Make sure your code lints (`pre-commit run --all-files`).
+5. Make sure your code lints (`pre-commit run --files $(git diff --name-only HEAD~1)`).
 6. If you haven't already, complete the Contributor License Agreement ("CLA").

 ### Contributor License Agreement ("CLA")

pyproject.toml

Lines changed: 1 addition & 2 deletions
@@ -17,8 +17,7 @@ dependencies = [
     "datasets>=2.21.0",

     # Tokenization
-    "blobfile",
-    "tiktoken",
+    "tokenizers",

     # Miscellaneous
     "tomli>=1.1.0",

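With `tiktoken` and `blobfile` dropped in favor of the Hugging Face `tokenizers` package, the refactored tokenizer presumably loads the downloaded `tokenizer.json` directly. A rough sketch of such a wrapper, exposing the `add_bos`/`add_eos` flags seen in the `test_generate.py` change below; the class name, special-token strings, and internals are assumptions, not code from the commit:

```python
# Illustrative tokenizers-backed wrapper; the real BaseTokenizer in this commit may
# differ in structure and special-token handling.
import os

from tokenizers import Tokenizer  # the new dependency replacing tiktoken


class HFTokenizerWrapper:
    def __init__(self, tokenizer_dir: str):
        # Load the tokenizer.json downloaded by scripts/download_tokenizer.py.
        self.tokenizer = Tokenizer.from_file(
            os.path.join(tokenizer_dir, "tokenizer.json")
        )
        # Special-token names are model-specific; these are placeholders.
        self.bos_id = self.tokenizer.token_to_id("<|begin_of_text|>")
        self.eos_id = self.tokenizer.token_to_id("<|end_of_text|>")

    def encode(self, text: str, add_bos: bool = False, add_eos: bool = False) -> list[int]:
        ids = self.tokenizer.encode(text, add_special_tokens=False).ids
        if add_bos and self.bos_id is not None:
            ids = [self.bos_id, *ids]
        if add_eos and self.eos_id is not None:
            ids = [*ids, self.eos_id]
        return ids
```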
scripts/download_tokenizer.py

Lines changed: 2 additions & 2 deletions
@@ -108,7 +108,7 @@ def is_tokenizer_file(filename: str) -> bool:
                 print(f"Successfully downloaded {filename} to {file_path}")
                 downloaded_files.append(filename)
             except HTTPError as e:
-                if e.response.status_code == 404:
+                if e.response and e.response.status_code == 404:
                     print(f"File {filename} not found, skipping...")
                     continue
                 else:
@@ -122,7 +122,7 @@ def is_tokenizer_file(filename: str) -> bool:
             print(f"Warning: No tokenizer files could be downloaded from {repo_id}")

     except HTTPError as e:
-        if e.response.status_code == 401:
+        if e.response and e.response.status_code == 401:
             print(
                 "You need to pass a valid `--hf_token=...` to download private checkpoints."
             )
scripts/generate/test_generate.py

Lines changed: 1 addition & 1 deletion
@@ -165,7 +165,7 @@ def test_generate(
     input_ids = (
         (
             torch.tensor(
-                tokenizer.encode(prompt, bos=True, eos=False), dtype=torch.long
+                tokenizer.encode(prompt, add_bos=True, add_eos=False), dtype=torch.long
             )
             .view(1, -1)
             .repeat(batch_size, 1)