
Refactor Tokenizer -> BaseTokenizer #1333

Merged: 1 commit into pytorch:main, Jul 10, 2025

Conversation

@H-Huang (Member) commented Jun 24, 2025

This is a breaking change: users will need to re-download the tokenizer files (python scripts/download_tokenizer.py ...)

  • Remove the tiktoken dependency and delete tiktoken.py
  • Refactor the Tokenizer base class
  • Update config files to point to the tokenizer directory instead of tokenizer.model
  • Raise an exception if tokenizer_path still points at tokenizer.model (see the sketch below)
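
Below is a rough, hypothetical sketch of what the new directory-based loading could look like. It assumes the refactored class wraps the Hugging Face tokenizers library; the class and method names are illustrative, not taken from this PR.

```python
# Illustrative sketch only -- names and structure are assumptions, not the PR's code.
import os

from tokenizers import Tokenizer as HFTokenizer


class BaseTokenizer:
    """Loads tokenizer.json from a directory instead of a tiktoken .model file."""

    def __init__(self, tokenizer_path: str):
        if tokenizer_path.endswith(".model"):
            # Old tiktoken-style paths are rejected outright.
            raise ValueError(
                "tokenizer_path must be a directory containing tokenizer.json; "
                "re-download with scripts/download_tokenizer.py"
            )
        self._tok = HFTokenizer.from_file(
            os.path.join(tokenizer_path, "tokenizer.json")
        )

    def encode(self, text: str) -> list[int]:
        return self._tok.encode(text).ids

    def decode(self, token_ids: list[int]) -> str:
        return self._tok.decode(token_ids)
```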

@facebook-github-bot added the CLA Signed label Jun 24, 2025
@H-Huang force-pushed the tokenizer_changes branch 3 times, most recently from 76476bc to 8876a97, July 3, 2025 15:18
@H-Huang changed the title from [WIP] Refactor Tokenizer -> BaseTokenizer to Refactor Tokenizer -> BaseTokenizer, Jul 3, 2025
@H-Huang marked this pull request as ready for review July 3, 2025 16:20
@tianyu-l (Contributor) left a comment:
Looks great in general. Left some comments.

CONTRIBUTING.md (Outdated)
@@ -14,7 +14,7 @@ We actively welcome your pull requests.
 2. If you've added code that should be tested, add tests.
 3. If you've changed APIs, update the documentation.
 4. Ensure the test suite passes.
-5. Make sure your code lints (`pre-commit run --all-files`).
+5. Make sure your code lints (`pre-commit run --from-ref origin/main --to-ref HEAD`).
Contributor:

IIUC this restricts the linting to the changes between current main and the latest commit. Can I ask why?

@H-Huang (Member, Author) Jul 8, 2025:

I needed to update this because `pre-commit run --all-files` includes files outside my pending changes, so I was getting a lot of errors for files under torchtitan/experiments/kernels/. I updated it to `pre-commit run --files $(git diff --name-only HEAD~1)`, which is more general.

Contributor:

> I was getting a lot of errors for files under torchtitan/experiments/kernels/

That sounds very bad. I'm not sure how it escaped the check in https://github.com/pytorch/torchtitan/blob/main/.github/workflows/lint.yaml#L37

I don't think we should change the official CONTRIBUTING.md to accommodate bad behavior. We should either refactor the code so lint passes in all conditions, or just delete that code. We can do this in separate PRs.

Member (Author):

> That sounds very bad. I'm not sure how it escaped the check in https://github.com/pytorch/torchtitan/blob/main/.github/workflows/lint.yaml#L37

That check only lints the changed files, but yeah, not sure how it was checked in in the first place. Will do that in a follow-up; it seems doable to fix.

Contributor:

What's the source of the files under tests/assets/tokenizer? Asking because I'm not sure about the legal side of things.

Member (Author):

The old test_tiktoken.model had a vocabulary truncated to 2000 entries, based on tokenizer.model. Similarly, I used ChatGPT to write a Python script that takes tokenizer.json and reduces it to 2000 tokens.

So this isn't pulled from anywhere, but it is derived from Llama-3.1-8B like before, which I assume is okay.
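
For reference, a minimal sketch of the kind of truncation script described above, assuming the standard Hugging Face tokenizer.json layout with a BPE "model" section (this is not the actual script used):

```python
import json

KEEP = 2000  # target vocabulary size

with open("tokenizer.json") as f:
    data = json.load(f)

# Keep only tokens whose ids fall below the cutoff (vocab maps token -> id).
vocab = data["model"]["vocab"]
kept = {tok: idx for tok, idx in vocab.items() if idx < KEEP}
data["model"]["vocab"] = kept

# Drop merge rules whose parts or merged result fall outside the kept vocab.
def merge_ok(merge):
    left, right = merge.split(" ", 1) if isinstance(merge, str) else merge
    return left in kept and right in kept and (left + right) in kept

data["model"]["merges"] = [m for m in data["model"].get("merges", []) if merge_ok(m)]

with open("tokenizer_truncated.json", "w") as f:
    json.dump(data, f, ensure_ascii=False)
```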

Contributor:

Previously the tokenizer.model was a fake one taken from some other repo. What matters is not the vocab size but the legal issue: we can't use the Llama tokenizer directly for legal reasons. If your new tokenizer is capable of handling .model tokenizers, could we keep the old one?

@wwwjn (Contributor) Jul 9, 2025:

So we want to keep test_tiktoken.model because of the CI tests? I remember CI does have network access. Can we let the integration test download the tokenizer from the Llama repo and then run? Does that sound all right?

Member (Author):

This PR adds tests/assets/tokenizer/tokenizer.json, which is functionally the same as test_tiktoken.model, and I updated the .toml files and tests to reference it instead.

@H-Huang force-pushed the tokenizer_changes branch from 8876a97 to 4883438, July 8, 2025 19:00
@H-Huang force-pushed the tokenizer_changes branch 2 times, most recently from 9b2bb40 to bbdca16, July 10, 2025 15:14
@tianyu-l (Contributor) left a comment:

LGTM, let's fix the linting thing in follow-up PRs.

Also please make sure CI passes, or that you understand any failure is unrelated.

@H-Huang (Member, Author) commented Jul 10, 2025:

After #1343, these changes are causing this test:

torchrun --nproc_per_node=4 --rdzv_backend c10d --rdzv_endpoint=localhost:0 --local-ranks-filter 0,1,2,3 --role rank --tee 3 -m torchtitan.train --job.config_file ./torchtitan/models/llama3/train_configs/debug_model.toml --model.name llama3_simple_fsdp --training.compile --job.dump_folder artifacts-to-be-uploaded/2d --parallelism.tensor_parallel_degree 2

to fail. The hypothesis is that the embedding layer doesn't have padding, so uneven sharding may be triggering this bug.
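
As a rough illustration of the divisibility issue behind that hypothesis (an assumption about how the vocab dimension gets split, not code from this PR): a vocab of 2002 cannot be split evenly across 4 ranks, while 2000 can.

```python
def shard_sizes(vocab_size: int, num_shards: int) -> list[int]:
    # Distribute the remainder across the first ranks, as uneven sharding does.
    base, rem = divmod(vocab_size, num_shards)
    return [base + (1 if rank < rem else 0) for rank in range(num_shards)]

print(shard_sizes(2002, 4))  # [501, 501, 500, 500] -> uneven shards
print(shard_sizes(2000, 4))  # [500, 500, 500, 500] -> even
```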

@H-Huang force-pushed the tokenizer_changes branch from bbdca16 to 69dab01, July 10, 2025 19:34
@H-Huang (Member, Author) commented Jul 10, 2025:

Update: after changing the test vocab_size from 2002 to 2000, the test above now passes.

@H-Huang merged commit 681df50 into pytorch:main Jul 10, 2025
11 checks passed
@@ -18,7 +18,7 @@ save_tb_folder = "tb"
 [model]
 name = "llama3"
 flavor = "405B"
-tokenizer_path = "./assets/tokenizer/original/tokenizer.model"
+tokenizer_path = "./assets/tokenizer/Llama-3.1-8B"
Contributor:

The path is inconsistent with the download path here:

python scripts/download_tokenizer.py --repo_id meta-llama/Meta-Llama-3.1-8B --hf_token=...

@H-Huang (Member, Author) Jul 11, 2025:

My bad. Surprisingly, both of these work:
python scripts/download_tokenizer.py --repo_id meta-llama/Llama-3.1-8B
python scripts/download_tokenizer.py --repo_id meta-llama/Meta-Llama-3.1-8B

but the README and the .toml use different folder names. I will update the README to specify meta-llama/Llama-3.1-8B, so please use:
python scripts/download_tokenizer.py --repo_id meta-llama/Llama-3.1-8B
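
A hypothetical illustration of why both repo_ids download fine yet land in different local folders, assuming the script derives the folder name from the part of the repo_id after the slash (the actual scripts/download_tokenizer.py logic may differ):

```python
import os

def local_tokenizer_dir(repo_id: str, root: str = "assets/tokenizer") -> str:
    # e.g. "meta-llama/Llama-3.1-8B" -> "assets/tokenizer/Llama-3.1-8B"
    return os.path.join(root, repo_id.split("/")[-1])

print(local_tokenizer_dir("meta-llama/Llama-3.1-8B"))       # assets/tokenizer/Llama-3.1-8B
print(local_tokenizer_dir("meta-llama/Meta-Llama-3.1-8B"))  # assets/tokenizer/Meta-Llama-3.1-8B
```

Only the first matches the tokenizer_path in the .toml, which is why the README and config need to agree on a single repo_id.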

Member (Author):

Fix: #1381

H-Huang added a commit that referenced this pull request Jul 11, 2025:

Fix #1333 (comment)

Downloading from both `Meta-Llama-3.1-8B` and `Llama-3.1-8B` works, but the README instructions and the .toml file differ. This change makes the README match.
Labels: CLA Signed (managed by the Meta Open Source bot)