[llama3] add configurations for Llama 3 1B and 3B models #1376
Conversation
The parameter counts for each of the configs:
1B: INFO - Model llama3 1B size: 1,498,482,688 total parameters
3B: INFO - Model llama3 3B size: 3,606,752,256 total parameters
I know that the parameter counts are slightly larger, but this is due to the vocabulary embeddings, which are typically not counted in the total parameter count for smaller models.
I'm not sure if I follow your argument. With the official tokenizer, the model sizes match perfectly with the document here.
https://github.com/pytorch/torchtitan/pull/1040/files#r2043580247
I'm not sure why you are not seeing the same numbers -- is it because you are using an even larger vocab size?
Also per my statement, without weight-tying support in torchtitan, the models won't be exactly the same as Llama 3.2. Do you think this is OK for your use case? (would appreciate if you share more about your use cases)
I don't think these files are particularly interesting, and there's also no evidence that these settings (e.g. batch size, learning rate) would provide stable training and good throughput.
It's true that the batch size and learning rate might need to change for optimal settings; however, the existing ones should be a sufficient starting point.
As for the weight-tying, I think it is fine, as this is just a more relaxed configuration with higher expressivity, so if it is used for inference or fine-tuning, the starting point would be identical to Llama 1B and 3B.
Finally, I think there is a lot of value in adding these configurations: labs, even if they have the funds, have at most one chance to train a large 70B model, whereas most ablations and testing are conducted on smaller sizes such as 1B or 3B.
Finally, I think there is a lot of value in adding these configurations: labs, even if they have the funds, have at most one chance to train a large 70B model, whereas most ablations and testing are conducted on smaller sizes such as 1B or 3B.
I'm still not 100% comfortable adding .toml configs that are not verified. Given that the file is almost identical to llama3_8b.toml, users can just run
CONFIG_FILE="./torchtitan/models/llama3/train_configs/llama3_8b.toml" ./run_train.sh --model.flavor 1B
What do you think?
Oh nice, I didn't think of that option. It depends on whether we want to optimize the hyper-parameters for an H100 setting, since with a 3B model we can fit a much larger batch size per GPU. I'm comfortable either way.
n_layers=16,
n_heads=32,
n_kv_heads=8,
ffn_dim_multiplier=1.5,
Does this give the same result as when it is 1.4?
Asking because I've verified that it works with 1.4:
https://github.com/pytorch/torchtitan/pull/1040/files#r2043164137
Yes, 1.4 results in the same configuration, but 1.5 is more accurate in this case. The ffn_dim_multiplier to intermediate_size conversion is a bit tricky, but it comes down to these lines of code:
https://github.com/pytorch/torchtitan/blob/main/torchtitan/models/llama3/model/model.py#L272
https://github.com/pytorch/torchtitan/blob/main/torchtitan/models/llama3/model/model.py#L226
https://github.com/pytorch/torchtitan/blob/main/torchtitan/models/llama3/model/model.py#L230
So basically you take the dim size, multiply it by 4, then by 2/3, then by ffn_dim_multiplier, and finally round it up to the nearest multiple_of. In the case of 1.4 you get 7645.86, which, thanks to the rounding, becomes 8192, the correct answer. For 1.5 it is exactly 8192, which IMO is safer in case someone decides to change multiple_of.
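To make that concrete, here is a small sketch of the sizing rule as described above (a paraphrase, not a copy of the linked model.py lines; dim=2048 and multiple_of=1024 are values inferred from the numbers in this thread):

def ffn_hidden_dim(dim: int, ffn_dim_multiplier: float, multiple_of: int) -> int:
    # Start from 4 * dim, scale by 2/3 (SwiGLU), apply ffn_dim_multiplier,
    # then round up to the nearest multiple of `multiple_of`.
    hidden_dim = 4 * dim
    hidden_dim = int(2 * hidden_dim / 3)
    hidden_dim = int(ffn_dim_multiplier * hidden_dim)
    return multiple_of * ((hidden_dim + multiple_of - 1) // multiple_of)

# Assumed values for the 1B case: dim=2048, multiple_of=1024.
print(ffn_hidden_dim(2048, 1.4, 1024))  # 8192 (7645 before the round-up)
print(ffn_hidden_dim(2048, 1.5, 1024))  # 8192 (8191 before the round-up)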
cool, thanks for explaining
@@ -39,6 +39,24 @@
use_flex_attn=True,
attn_mask_type="block_causal",
),
"1B": TransformerModelArgs(
As for the weight-tying, I think it is fine, as this is just a more relaxed configuration with higher expressivity, so if it is used for inference or fine-tuning, the starting point would be identical to Llama 1B and 3B.
I'm OK with this, as long as you could add some comments here for 1B and 3B:
"1B": TransformerModelArgs( | |
# NOTE: The original model checkpoints of Llama 3.2 1B and 3B are provided | |
# with weight-tying between the embedding layer and the output layer, | |
# which is not supported in torchtitan. | |
"1B": TransformerModelArgs( |
Also, we plan to provide ways to load HF checkpoints into torchtitan for training.
A note for us: the mapping designed for Llama 3.1 may not work for Llama 3.2, for reasons like https://www.reddit.com/r/LocalLLaMA/comments/1fzn8c9/where_did_llama_32s_language_modeling_head_go/
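To illustrate the issue: tied Llama 3.2 HF checkpoints ship without a separate lm_head.weight, so a converter targeting an untied model would have to materialize it from the embedding. A hypothetical sketch (not torchtitan code; the key names assume the standard HF Llama layout):

import torch

def untie_hf_state_dict(hf_sd: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
    # Hypothetical helper: fill in the missing output-projection weight when
    # loading a weight-tied HF checkpoint into a model with untied weights.
    sd = dict(hf_sd)
    if "lm_head.weight" not in sd:
        # Llama 3.2 1B/3B omit lm_head.weight because it is tied to the
        # embedding; clone the embedding as the starting point.
        sd["lm_head.weight"] = sd["model.embed_tokens.weight"].clone()
    return sd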
@tianyu-l @idoh I was just lurking on this PR, and my two cents is that this PR should not be landed as is. If the intent is to provide canonical examples of pretraining Llama-family models, then without the tied weights this is not actually equivalent to any of the Llama 3.2 family of models.
I don't think this point about "higher expressivity" is quite true. If the set of tied-weights models is a proper subset of what's enabled by this PR (e.g. if the PR included a bool …)
@ebsmothers
However, my question for the users is whether this is OK for them, in the sense that the diverged weights should still be "usable". A not-so-good analogy may be LoRA, where finetuning is not training the original model. After all, I agree this approach is not principled and there is no evidence of how well it works. But the question is whether we should unblock users from doing it. Maybe we should just work hard and support weight-tying.
@tianyu-l thanks, your points make sense to me. I do agree that it could be reasonable to untie the weights for fine-tuning; in general there is no single "correct" way to do it (to give another example for your LoRA analogy, there are also multimodal models where only some of the parameters are updated during finetuning). But if we do treat these as somewhat canonical model definitions, then I think we should not allow people to import the model definition without the tied weights (this becomes even more relevant for some of the scaled-models work). Anyway, obviously up to you how to proceed, just wanted to give my unsolicited opinion here 😃
Based on the official Llama 3.2 1B and 3B HF configurations, I added configuration support to torchtitan. Getting the ffn_dim_multiplier right is tricky, as I saw in a previous attempt at getting the numbers right. After looking at the FF implementation and doing some analysis, the numbers perfectly match the official architecture. For reference, here is the 8B configuration.
The parameter counts for each of the configs:
1B: INFO - Model llama3 1B size: 1,498,482,688 total parameters
3B: INFO - Model llama3 3B size: 3,606,752,256 total parameters
I know that the parameter counts are slightly larger, but this is due to the vocabulary embeddings, which are typically not counted in the total parameter count for smaller models.
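As a rough sanity check on those numbers (my own arithmetic; it assumes the Llama 3 vocab size of 128,256 and hidden dims of 2048 for 1B and 3072 for 3B), subtracting one extra vocab_size * dim matrix for the untied output projection brings the counts in line with the sizes usually quoted for Llama 3.2:

# Back-of-the-envelope check. Assumptions: vocab_size=128256,
# dim=2048 (1B) and dim=3072 (3B); the untied output projection
# contributes an extra vocab_size * dim parameters.
vocab_size = 128_256
for name, total, dim in [("1B", 1_498_482_688, 2048), ("3B", 3_606_752_256, 3072)]:
    tied_equivalent = total - vocab_size * dim
    print(f"{name}: {tied_equivalent:,}")  # 1B: 1,235,814,400  3B: 3,212,749,824

Those adjusted figures are consistent with the roughly 1.24B and 3.21B parameters commonly reported for Llama 3.2 1B and 3B.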