Refactor Tokenizer -> BaseTokenizer #1333

Conversation
76476bc → 8876a97 (compare)
Looks great in general. Left some comments.
CONTRIBUTING.md (Outdated)

@@ -14,7 +14,7 @@ We actively welcome your pull requests.
 2. If you've added code that should be tested, add tests.
 3. If you've changed APIs, update the documentation.
 4. Ensure the test suite passes.
-5. Make sure your code lints (`pre-commit run --all-files`).
+5. Make sure your code lints (`pre-commit run --from-ref origin/main --to-ref HEAD`).
IIUC this is restricting the linting to the changes between the current main and the latest commit. Can I ask why?
I needed to update this because when I was running `pre-commit run --all-files`, it includes files not in my pending changes, so I was getting a lot of errors for files under torchtitan/experiments/kernels/. I updated it to `pre-commit run --files $(git diff --name-only HEAD~1)`, which is more general.
> I was getting a lot of errors for files under torchtitan/experiments/kernels/

That sounds very bad. I'm not sure how it escaped the check in https://github.com/pytorch/torchtitan/blob/main/.github/workflows/lint.yaml#L37. I don't think we should change the official CONTRIBUTING.md to accommodate bad behavior. We should either refactor the code to make sure lint passes in all conditions, or just delete that code. We can do this in separate PRs.
> That sounds very bad. I'm not sure how it escaped the check in https://github.com/pytorch/torchtitan/blob/main/.github/workflows/lint.yaml#L37

This only lints the changed files, but yeah, not sure how it was checked in in the first place. Will do that in a follow-up; it seems doable to update.
What's the source of the files under tests/assets/tokenizer? Asking because I'm not sure about the legal side of things.
The old test_tiktoken.model had a truncated vocabulary of 2000 tokens based on tokenizer.model. Likewise, I chatGPT-ed a Python script that takes tokenizer.json and reduces it to 2000 tokens. So this isn't pulled from anywhere, but it is derived from llama-3.1-8B like before, which I assume is okay.
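For reference, a minimal sketch of what such a truncation script could look like, assuming a BPE-style tokenizer.json with a `model.vocab` map and string-format `model.merges`; the paths and the 2000-token cutoff mirror the description above, everything else is illustrative and not the actual script used:

```python
import json

TARGET_VOCAB = 2000  # matches the truncated test vocabulary described above

# Load the full Hugging Face tokenizer.json (BPE layout assumed).
with open("tokenizer.json", encoding="utf-8") as f:
    tok = json.load(f)

# Keep only the tokens whose ids fall below the cutoff.
vocab = tok["model"]["vocab"]
kept = {t: i for t, i in sorted(vocab.items(), key=lambda kv: kv[1]) if i < TARGET_VOCAB}
tok["model"]["vocab"] = kept

# Drop BPE merges that reference or produce tokens outside the truncated vocab.
# Assumes the older "left right" string format for merges; newer files may use pairs.
def merge_ok(merge: str) -> bool:
    left, right = merge.split(" ", 1)
    return left in kept and right in kept and (left + right) in kept

tok["model"]["merges"] = [m for m in tok["model"]["merges"] if merge_ok(m)]

# Write the reduced tokenizer to the test asset location added by this PR.
with open("tests/assets/tokenizer/tokenizer.json", "w", encoding="utf-8") as f:
    json.dump(tok, f, ensure_ascii=False)
```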
Previously the tokenizer.model was a fake one taken from some other repo. What matters is not vocab size but the legal issue -- we can't use the llama tokenizer directly due to legal reasons. If your new tokenizer is capable of dealing with a .model tokenizer, could we keep the old one?
So we want to keep test_tiktoken.model because of CI tests? I remember CI does have access to the network. Can we let the integration test download the tokenizer from the Llama repo and then run the integration test, does this sound all right?
This PR adds tests/assets/tokenizer/tokenizer.json, which is functionally the same as test_tiktoken.model, and I updated the .toml and tests to reference it instead.
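For context, a tokenizer.json asset like this can be loaded directly with the Hugging Face tokenizers library; a quick sanity check might look like the following (the asset path comes from this PR, the rest is just an illustration):

```python
from tokenizers import Tokenizer

# Load the truncated test asset added by this PR and run a quick round trip.
tok = Tokenizer.from_file("tests/assets/tokenizer/tokenizer.json")
print(tok.get_vocab_size())          # expected to be small (~2000) for the test asset
ids = tok.encode("hello world").ids  # encode returns an Encoding; .ids is the token id list
print(ids, tok.decode(ids))
```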
torchtitan/experiments/llama4/train_configs/llama4_17bx128e.toml (Outdated, resolved)
torchtitan/experiments/llama4/train_configs/llama4_17bx16e.toml (Outdated, resolved)
8876a97 → 4883438 (compare)
9b2bb40 → bbdca16 (compare)
LGTM, let's fix the linting thing in follow-up PRs. Also please make sure CI passes / understand that the error is not related.
After #1343, these changes are causing this test to fail. The hypothesis is that the embedding layer doesn't have padding and uneven sharding might result in this bug.
bbdca16 → 69dab01 (compare)
Update: after changing the test vocab_size from 2002 -> 2000, the test above is now passing.
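As a rough illustration of the uneven-sharding hypothesis (the tensor-parallel degree below is assumed purely for the arithmetic, not taken from the failing test):

```python
# Why 2002 embedding rows are awkward to shard evenly while 2000 are not,
# assuming the embedding table is row-sharded across a TP group.
tp_degree = 8  # assumed degree, for illustration only

for vocab_size in (2002, 2000):
    rows, remainder = divmod(vocab_size, tp_degree)
    print(f"vocab_size={vocab_size}: {rows} rows per rank, remainder={remainder}")
# A nonzero remainder means some ranks hold one extra row (uneven shards),
# which matches the hypothesis about the failure above.
```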
@@ -18,7 +18,7 @@ save_tb_folder = "tb"
 [model]
 name = "llama3"
 flavor = "405B"
-tokenizer_path = "./assets/tokenizer/original/tokenizer.model"
+tokenizer_path = "./assets/tokenizer/Llama-3.1-8B"
Path is inconsistent with the download path here (line 107 in 681df50):
python scripts/download_tokenizer.py --repo_id meta-llama/Meta-Llama-3.1-8B --hf_token=...
My bad, surprisingly both of these will work:
python scripts/download_tokenizer.py --repo_id meta-llama/Llama-3.1-8B
python scripts/download_tokenizer.py --repo_id meta-llama/Meta-Llama-3.1-8B
but the README and the toml are using different folder names. I will update the README to just specify meta-llama/Llama-3.1-8B; please run
python scripts/download_tokenizer.py --repo_id meta-llama/Llama-3.1-8B
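For anyone curious what the download step amounts to, here is a rough sketch using huggingface_hub; this is not the repo's scripts/download_tokenizer.py, just an illustration of fetching the tokenizer files into the folder layout discussed above:

```python
from huggingface_hub import snapshot_download

# Illustrative only: fetch just the tokenizer files from the (gated) Llama repo.
repo_id = "meta-llama/Llama-3.1-8B"
local_dir = "./assets/tokenizer/Llama-3.1-8B"  # matches the tokenizer_path in the .toml above

snapshot_download(
    repo_id=repo_id,
    allow_patterns=["tokenizer.json", "tokenizer_config.json"],  # skip model weights
    local_dir=local_dir,
    token="hf_...",  # your Hugging Face token, same role as --hf_token on the CLI
)
```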
Fix: #1381

Fix #1333 (comment). Downloading from both `Meta-Llama-3.1-8B` and `Llama-3.1-8B` works, but the README instructions and the .toml file are different. This changes the README to match.
This causes breaking changes and users will need to redownload the tokenizer files (`python scripts/download_tokenizer.py ...`):
- remove the `tiktoken` dependency, remove `tiktoken.py`
- `Tokenizer` base class
- `tokenizer.model`
- `tokenizer.model` for `tokenizer_path`
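To make the refactor in the PR title concrete, here is a rough sketch of the kind of interface a BaseTokenizer base class implies; the method names and property below are assumptions, not the PR's actual API:

```python
from abc import ABC, abstractmethod


class BaseTokenizer(ABC):
    """Minimal abstract interface that concrete tokenizers (e.g. one backed by a
    Hugging Face tokenizer.json) would implement. Illustrative only."""

    @abstractmethod
    def encode(self, text: str) -> list[int]:
        """Convert a string into a list of token ids."""

    @abstractmethod
    def decode(self, token_ids: list[int]) -> str:
        """Convert a list of token ids back into a string."""

    @property
    @abstractmethod
    def vocab_size(self) -> int:
        """Vocabulary size, e.g. used to size the model's embedding table."""
```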