[llama3] add configurations for Llama 3 1B and 3B models #1376
Conversation
The parameter counts for each of the configs:
1B: INFO - Model llama3 1B size: 1,498,482,688 total parameters
3B: INFO - Model llama3 3B size: 3,606,752,256 total parameters
I know that the parameter counts are slightly larger, but this is due to the vocabulary embeddings, which are typically not counted in the total parameter count for smaller models.
I'm not sure if I follow your argument. With the official tokenizer, the model sizes match perfectly with the document here.
https://github.com/pytorch/torchtitan/pull/1040/files#r2043580247
I'm not sure why you are not seeing the same numbers -- is it because you are using an even larger vocab size?
Also per my statement, without weight-tying support in torchtitan, the models won't be exactly the same as Llama 3.2. Do you think this is OK for your use case? (would appreciate if you share more about your use cases)
I don't think these files are particularly interesting, and there's also no evidence that these settings (e.g. batch size, learning rate) would provide stable training and good throughput.
It's true that the batch size and learning rate might need to change for optimal settings; however, the existing ones should be a sufficient starting point.
As for the weight-tying, I think it is fine, as this is just a more relaxed configuration with higher expressivity, so if it is used for inference or fine-tuning, the starting point would be identical to Llama 1B and 3B.
Finally, I think there is a lot of value in adding these configurations: labs, even if they have the funds, have at most one chance to train a large 70B model, whereas most ablations and testing are conducted on smaller sizes such as 1B or 3B.
Finally, I think there is a lot of value in adding these configurations: labs, even if they have the funds, have at most one chance to train a large 70B model, whereas most ablations and testing are conducted on smaller sizes such as 1B or 3B.
I'm still not 100% comfortable adding .toml configs that are not verified. Given that the file is almost identical to llama3_8b.toml, users can just run
CONFIG_FILE="./torchtitan/models/llama3/train_configs/llama3_8b.toml" ./run_train.sh --model.flavor 1B
What do you think?
Oh nice, I didn't think of that option. It depends on whether we want to optimize the hyper-parameters for an H100 setting, since with a 3B model we can fit a much larger batch size per GPU. I'm comfortable either way.
n_layers=16,
n_heads=32,
n_kv_heads=8,
ffn_dim_multiplier=1.5,
Does this give the same result as when it is 1.4?
Asking because I've verified that it works with 1.4:
https://github.com/pytorch/torchtitan/pull/1040/files#r2043164137
Yes, 1.4 results in the same configuration, but 1.5 is more accurate in this case. The ffn_dim_multiplier to intermediate_size conversion is a bit tricky, but it comes down to these lines of code:
https://github.com/pytorch/torchtitan/blob/main/torchtitan/models/llama3/model/model.py#L272
https://github.com/pytorch/torchtitan/blob/main/torchtitan/models/llama3/model/model.py#L226
https://github.com/pytorch/torchtitan/blob/main/torchtitan/models/llama3/model/model.py#L230
So basically you take the dim size, multiply it by 4, then by 2/3, then by ffn_dim_multiplier, and finally round it up to the nearest multiple_of. In the case of 1.4 you get 7645.86, which, thanks to the rounding, becomes 8192, the correct answer. For 1.5 it is exactly 8192, which IMO is safer in case someone decides to change multiple_of.
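To make that concrete, here is a small sketch of the sizing rule as described above (a paraphrase, not a copy of the linked model.py lines; dim=2048 and multiple_of=1024 are values inferred from the numbers in this thread):

def ffn_hidden_dim(dim: int, ffn_dim_multiplier: float, multiple_of: int) -> int:
    # Start from 4 * dim, scale by 2/3 (SwiGLU), apply ffn_dim_multiplier,
    # then round up to the nearest multiple of `multiple_of`.
    hidden_dim = 4 * dim
    hidden_dim = int(2 * hidden_dim / 3)
    hidden_dim = int(ffn_dim_multiplier * hidden_dim)
    return multiple_of * ((hidden_dim + multiple_of - 1) // multiple_of)

# Assumed values for the 1B case: dim=2048, multiple_of=1024.
print(ffn_hidden_dim(2048, 1.4, 1024))  # 8192 (7645 before the round-up)
print(ffn_hidden_dim(2048, 1.5, 1024))  # 8192 (8191 before the round-up)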
cool, thanks for explaining
@@ -39,6 +39,24 @@
use_flex_attn=True,
attn_mask_type="block_causal",
),
"1B": TransformerModelArgs(
As for the weight-tying, I think it is fine, as this is just a more relaxed configuration with higher expressivity, so if it is used for inference or fine-tuning, the starting point would be identical to Llama 1B and 3B.
I'm OK with this, as long as you could add some comments here for 1B and 3B:
"1B": TransformerModelArgs( | |
# NOTE: The original model checkpoints of Llama 3.2 1B and 3B are provided | |
# with weight-tying between the embedding layer and the output layer, | |
# which is not supported in torchtitan. | |
"1B": TransformerModelArgs( |
Also, we plan to provide ways to load HF checkpoints into torchtitan for training.
A note for us: the mapping designed for Llama 3.1 may not work for Llama 3.2, for reasons like https://www.reddit.com/r/LocalLLaMA/comments/1fzn8c9/where_did_llama_32s_language_modeling_head_go/
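To illustrate the issue: tied Llama 3.2 HF checkpoints ship without a separate lm_head.weight, so a converter targeting an untied model would have to materialize it from the embedding. A hypothetical sketch (not torchtitan code; the key names assume the standard HF Llama layout):

import torch

def untie_hf_state_dict(hf_sd: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
    # Hypothetical helper: fill in the missing output-projection weight when
    # loading a weight-tied HF checkpoint into a model with untied weights.
    sd = dict(hf_sd)
    if "lm_head.weight" not in sd:
        # Llama 3.2 1B/3B omit lm_head.weight because it is tied to the
        # embedding; clone the embedding as the starting point.
        sd["lm_head.weight"] = sd["model.embed_tokens.weight"].clone()
    return sd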
@tianyu-l @idoh I was just lurking on this PR, and my two cents is that this PR should not be landed as is. If the intent is to provide canonical examples of pretraining Llama-family models, then without the tied weights this is not actually equivalent to any of the Llama 3.2 family of models.
I don't think this point about "higher expressivity" is quite true. If the set of tied-weights models is a proper subset of what's enabled by this PR (e.g. if the PR included a bool …)
@ebsmothers
However, my question for the users is whether this is OK for them, in the sense that the diverged weights should still be "usable". A not-so-good analogy may be LoRA, where finetuning is not training the original model. After all, I agree this approach is not principled and there is no evidence of how well it works. But the question is whether we should unblock users from doing it. Maybe we should just work hard and support weight-tying.
@tianyu-l thanks, your points make sense to me. I do agree that it could be reasonable to untie the weights for fine-tuning; in general there is no single "correct" way to do it (to give another example for your LoRA analogy, there are also multimodal models where only some of the parameters are updated during finetuning). But if we do treat these as somewhat canonical model definitions, then I think we should not allow people to import the model definition without the tied weights (this becomes even more relevant for some of the scaled-models work). Anyway, obviously up to you how to proceed, just wanted to give my unsolicited opinion here 😃
Based on the official Llama 3.2 1B and 3B HF configurations, I added configuration support to torchtitan. Getting the ffn_dim_multiplier right is tricky, as I saw in a previous attempt at getting the numbers right. After looking at the FF implementation and doing some analysis, the numbers perfectly match the official architecture. For reference, here is the 8B configuration.
The parameter counts for each of the configs:
1B: INFO - Model llama3 1B size: 1,498,482,688 total parameters
3B: INFO - Model llama3 3B size: 3,606,752,256 total parameters
I know that the parameter counts are slightly larger, but this is due to the vocabulary embeddings, which are typically not counted in the total parameter count for smaller models.
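As a rough sanity check on those numbers (my own arithmetic; it assumes the Llama 3 vocab size of 128,256 and hidden dims of 2048 for 1B and 3072 for 3B), subtracting one extra vocab_size * dim matrix for the untied output projection brings the counts in line with the sizes usually quoted for Llama 3.2:

# Back-of-the-envelope check. Assumptions: vocab_size=128256,
# dim=2048 (1B) and dim=3072 (3B); the untied output projection
# contributes an extra vocab_size * dim parameters.
vocab_size = 128_256
for name, total, dim in [("1B", 1_498_482_688, 2048), ("3B", 3_606_752_256, 3072)]:
    tied_equivalent = total - vocab_size * dim
    print(f"{name}: {tied_equivalent:,}")  # 1B: 1,235,814,400  3B: 3,212,749,824

Those adjusted figures are consistent with the roughly 1.24B and 3.21B parameters commonly reported for Llama 3.2 1B and 3B.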