femtoGPT does not apply layer normalization after adding the FFN output to the attention output.
femtoGPT implementation (Line 282 in f0afe9e):
curr_inp = g.call(Add::new(), &[add_atten_norm, lin2_bias_result])?;
code:
for l in 0..num_layers {
    // ... multi-head attention and the first feed-forward layer ...
    let bias2_params = g.alloc(
        Tensor::<f32>::zeros(&[embedding_degree]),
        true,
        format!("feedforward2_{}_bias", l),
    )?;
    let lin2_result = g.call(MatMul::new(), &[lin1_act, lin2_params])?;
    let lin2_bias_result = g.call(Add::new(), &[lin2_result, bias2_params])?;
    // Why not normalize this result?
    curr_inp = g.call(Add::new(), &[add_atten_norm, lin2_bias_result])?;
}
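
For reference, this is roughly what a post-norm ("Add & Norm") version of that last line could look like in the same graph-building style. It is only a sketch: the LayerNorm::new() op, its one-input call signature, and the ffn_add name are my assumptions for illustration, not necessarily femtoGPT's actual API.

// Hypothetical post-norm variant of the residual above; LayerNorm::new() and
// its single-input signature are assumed here for illustration only.
let ffn_add = g.call(Add::new(), &[add_atten_norm, lin2_bias_result])?;
curr_inp = g.call(LayerNorm::new(), &[ffn_add])?; // normalize the residual sum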
paper

In the paper, this part is "Add & Norm", not just "Add".
Is it intentional or is it just a mistake? Or maybe it's my misunderstanding... Please correct me if I'm wrong.
EDIT: It seems nanoGPT's attention block also doesn't normalize the result of the addition. Maybe I've misread the paper...
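
To make the two orderings concrete, here is a tiny self-contained Rust sketch of post-norm (the "Add & Norm" drawn in the original Transformer paper, x = norm(x + sublayer(x))) versus pre-norm (the GPT-2 / nanoGPT style, x = x + sublayer(norm(x)), with a single final norm after the last block). The sublayer function below is just a stand-in for attention/FFN, not femtoGPT code.

// Minimal sketch of post-norm vs. pre-norm residual connections.
// `sublayer` is a placeholder for attention or the FFN; it is NOT femtoGPT code.

fn layer_norm(x: &[f32]) -> Vec<f32> {
    let n = x.len() as f32;
    let mean = x.iter().sum::<f32>() / n;
    let var = x.iter().map(|v| (v - mean).powi(2)).sum::<f32>() / n;
    let eps = 1e-5;
    x.iter().map(|v| (v - mean) / (var + eps).sqrt()).collect()
}

fn sublayer(x: &[f32]) -> Vec<f32> {
    // Stand-in for attention or the feed-forward network.
    x.iter().map(|v| v * 0.5).collect()
}

fn add(a: &[f32], b: &[f32]) -> Vec<f32> {
    a.iter().zip(b).map(|(x, y)| x + y).collect()
}

fn main() {
    let x = vec![1.0, 2.0, 3.0, 4.0];

    // Post-norm ("Add & Norm", as in the original Transformer figure):
    // the residual sum itself is normalized before entering the next block.
    let post_norm = layer_norm(&add(&x, &sublayer(&x)));

    // Pre-norm (GPT-2 / nanoGPT style): the input is normalized before the
    // sublayer, the residual sum is left un-normalized, and one final layer
    // norm is applied after the last block instead.
    let pre_norm = add(&x, &sublayer(&layer_norm(&x)));

    println!("post-norm: {:?}", post_norm);
    println!("pre-norm:  {:?}", pre_norm);
}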