
Refactor Tokenizer -> BaseTokenizer #1333

Merged: 1 commit into pytorch:main, Jul 10, 2025

Conversation

@H-Huang (Member) commented Jun 24, 2025

This is a breaking change: users will need to re-download the tokenizer files (python scripts/download_tokenizer.py ...)

  • Remove the tiktoken dependency and delete tiktoken.py
  • Refactor the Tokenizer base class
  • Update config files to point to the tokenizer directory instead of tokenizer.model
  • Raise an exception if tokenizer_path still points at tokenizer.model (see the sketch below)
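
Below is a rough, hypothetical sketch of what the new directory-based loading could look like. It assumes the refactored class wraps the Hugging Face tokenizers library; the class and method names are illustrative, not taken from this PR.

```python
# Illustrative sketch only -- names and structure are assumptions, not the PR's code.
import os

from tokenizers import Tokenizer as HFTokenizer


class BaseTokenizer:
    """Loads tokenizer.json from a directory instead of a tiktoken .model file."""

    def __init__(self, tokenizer_path: str):
        if tokenizer_path.endswith(".model"):
            # Old tiktoken-style paths are rejected outright.
            raise ValueError(
                "tokenizer_path must be a directory containing tokenizer.json; "
                "re-download with scripts/download_tokenizer.py"
            )
        self._tok = HFTokenizer.from_file(
            os.path.join(tokenizer_path, "tokenizer.json")
        )

    def encode(self, text: str) -> list[int]:
        return self._tok.encode(text).ids

    def decode(self, token_ids: list[int]) -> str:
        return self._tok.decode(token_ids)
```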

@facebook-github-bot added the CLA Signed label Jun 24, 2025
@H-Huang force-pushed the tokenizer_changes branch 3 times, most recently from 76476bc to 8876a97, July 3, 2025 15:18
@H-Huang changed the title from [WIP] Refactor Tokenizer -> BaseTokenizer to Refactor Tokenizer -> BaseTokenizer, Jul 3, 2025
@H-Huang marked this pull request as ready for review July 3, 2025 16:20
@tianyu-l (Contributor) left a comment:
Looks great in general. Left some comments.

CONTRIBUTING.md (Outdated)
@@ -14,7 +14,7 @@ We actively welcome your pull requests.
 2. If you've added code that should be tested, add tests.
 3. If you've changed APIs, update the documentation.
 4. Ensure the test suite passes.
-5. Make sure your code lints (`pre-commit run --all-files`).
+5. Make sure your code lints (`pre-commit run --from-ref origin/main --to-ref HEAD`).
Contributor:

IIUC this restricts the linting to the changes between current main and the latest commit. Can I ask why?

@H-Huang (Member, Author) Jul 8, 2025:

I needed to update this because `pre-commit run --all-files` includes files outside my pending changes, so I was getting a lot of errors for files under torchtitan/experiments/kernels/. I updated it to `pre-commit run --files $(git diff --name-only HEAD~1)`, which is more general.

Contributor:

> I was getting a lot of errors for files under torchtitan/experiments/kernels/

That sounds very bad. I'm not sure how it escaped the check in https://github.com/pytorch/torchtitan/blob/main/.github/workflows/lint.yaml#L37

I don't think we should change the official CONTRIBUTING.md to accommodate bad behavior. We should either refactor the code so lint passes in all conditions, or just delete that code. We can do this in separate PRs.

Member (Author):

> That sounds very bad. I'm not sure how it escaped the check in https://github.com/pytorch/torchtitan/blob/main/.github/workflows/lint.yaml#L37

That check only lints the changed files, but yeah, not sure how it was checked in in the first place. Will do that in a follow-up; it seems doable to fix.

Contributor:

What's the source of the files under tests/assets/tokenizer? Asking because I'm not sure about the legal side of things.

Member (Author):

The old test_tiktoken.model had a vocabulary truncated to 2000 entries, based on tokenizer.model. Similarly, I used ChatGPT to write a Python script that takes tokenizer.json and reduces it to 2000 tokens.

So this isn't pulled from anywhere, but it is derived from Llama-3.1-8B like before, which I assume is okay.
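
For reference, a minimal sketch of the kind of truncation script described above, assuming the standard Hugging Face tokenizer.json layout with a BPE "model" section (this is not the actual script used):

```python
import json

KEEP = 2000  # target vocabulary size

with open("tokenizer.json") as f:
    data = json.load(f)

# Keep only tokens whose ids fall below the cutoff (vocab maps token -> id).
vocab = data["model"]["vocab"]
kept = {tok: idx for tok, idx in vocab.items() if idx < KEEP}
data["model"]["vocab"] = kept

# Drop merge rules whose parts or merged result fall outside the kept vocab.
def merge_ok(merge):
    left, right = merge.split(" ", 1) if isinstance(merge, str) else merge
    return left in kept and right in kept and (left + right) in kept

data["model"]["merges"] = [m for m in data["model"].get("merges", []) if merge_ok(m)]

with open("tokenizer_truncated.json", "w") as f:
    json.dump(data, f, ensure_ascii=False)
```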

Contributor:

Previously the tokenizer.model was a fake one taken from some other repo. What matters is not the vocab size but the legal issue: we can't use the Llama tokenizer directly for legal reasons. If your new tokenizer is capable of handling .model tokenizers, could we keep the old one?

@wwwjn (Contributor) Jul 9, 2025:

So we want to keep test_tiktoken.model because of the CI tests? I remember CI does have network access. Can we let the integration test download the tokenizer from the Llama repo and then run? Does that sound all right?

Member (Author):

This PR adds tests/assets/tokenizer/tokenizer.json, which is functionally the same as test_tiktoken.model, and I updated the .toml files and tests to reference it instead.

@H-Huang force-pushed the tokenizer_changes branch from 8876a97 to 4883438, July 8, 2025 19:00
@H-Huang force-pushed the tokenizer_changes branch 2 times, most recently from 9b2bb40 to bbdca16, July 10, 2025 15:14
@tianyu-l (Contributor) left a comment:

LGTM, let's fix the linting thing in follow-up PRs.

Also please make sure CI passes, or that you understand any failure is unrelated.

@H-Huang (Member, Author) commented Jul 10, 2025:

After #1343, these changes are causing this test:

torchrun --nproc_per_node=4 --rdzv_backend c10d --rdzv_endpoint=localhost:0 --local-ranks-filter 0,1,2,3 --role rank --tee 3 -m torchtitan.train --job.config_file ./torchtitan/models/llama3/train_configs/debug_model.toml --model.name llama3_simple_fsdp --training.compile --job.dump_folder artifacts-to-be-uploaded/2d --parallelism.tensor_parallel_degree 2

to fail. The hypothesis is that the embedding layer doesn't have padding, so uneven sharding may be triggering this bug.
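
As a rough illustration of the divisibility issue behind that hypothesis (an assumption about how the vocab dimension gets split, not code from this PR): a vocab of 2002 cannot be split evenly across 4 ranks, while 2000 can.

```python
def shard_sizes(vocab_size: int, num_shards: int) -> list[int]:
    # Distribute the remainder across the first ranks, as uneven sharding does.
    base, rem = divmod(vocab_size, num_shards)
    return [base + (1 if rank < rem else 0) for rank in range(num_shards)]

print(shard_sizes(2002, 4))  # [501, 501, 500, 500] -> uneven shards
print(shard_sizes(2000, 4))  # [500, 500, 500, 500] -> even
```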

@H-Huang force-pushed the tokenizer_changes branch from bbdca16 to 69dab01, July 10, 2025 19:34
@H-Huang (Member, Author) commented Jul 10, 2025:

Update: after changing the test vocab_size from 2002 to 2000, the test above now passes.

@H-Huang merged commit 681df50 into pytorch:main Jul 10, 2025
11 checks passed
@@ -18,7 +18,7 @@ save_tb_folder = "tb"
 [model]
 name = "llama3"
 flavor = "405B"
-tokenizer_path = "./assets/tokenizer/original/tokenizer.model"
+tokenizer_path = "./assets/tokenizer/Llama-3.1-8B"
Contributor:

The path is inconsistent with the download path here:

python scripts/download_tokenizer.py --repo_id meta-llama/Meta-Llama-3.1-8B --hf_token=...

@H-Huang (Member, Author) Jul 11, 2025:

My bad. Surprisingly, both of these work:
python scripts/download_tokenizer.py --repo_id meta-llama/Llama-3.1-8B
python scripts/download_tokenizer.py --repo_id meta-llama/Meta-Llama-3.1-8B

but the README and the .toml use different folder names. I will update the README to specify meta-llama/Llama-3.1-8B, so please use:
python scripts/download_tokenizer.py --repo_id meta-llama/Llama-3.1-8B
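
A hypothetical illustration of why both repo_ids download fine yet land in different local folders, assuming the script derives the folder name from the part of the repo_id after the slash (the actual scripts/download_tokenizer.py logic may differ):

```python
import os

def local_tokenizer_dir(repo_id: str, root: str = "assets/tokenizer") -> str:
    # e.g. "meta-llama/Llama-3.1-8B" -> "assets/tokenizer/Llama-3.1-8B"
    return os.path.join(root, repo_id.split("/")[-1])

print(local_tokenizer_dir("meta-llama/Llama-3.1-8B"))       # assets/tokenizer/Llama-3.1-8B
print(local_tokenizer_dir("meta-llama/Meta-Llama-3.1-8B"))  # assets/tokenizer/Meta-Llama-3.1-8B
```

Only the first matches the tokenizer_path in the .toml, which is why the README and config need to agree on a single repo_id.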

Member (Author):

Fix: #1381

H-Huang added a commit that referenced this pull request Jul 11, 2025:

Fix #1333 (comment)

Downloading from both `Meta-Llama-3.1-8B` and `Llama-3.1-8B` works, but the README instructions and the .toml file differ. This change makes the README match.
Labels: CLA Signed (managed by the Meta Open Source bot)