femtoGPT does not apply layer normalization after adding the FFN output to the attention output.
femtoGPT implementation (Line 282 in f0afe9e):
curr_inp = g.call(Add::new(), &[add_atten_norm, lin2_bias_result])?;
code:
for l in 0..num_layers {
    // ... multi-head attention and the first feed-forward layer ...
    let bias2_params = g.alloc(
        Tensor::<f32>::zeros(&[embedding_degree]),
        true,
        format!("feedforward2_{}_bias", l),
    )?;
    let lin2_result = g.call(MatMul::new(), &[lin1_act, lin2_params])?;
    let lin2_bias_result = g.call(Add::new(), &[lin2_result, bias2_params])?;
    // Why not normalize this result?
    curr_inp = g.call(Add::new(), &[add_atten_norm, lin2_bias_result])?;
}
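
For reference, this is roughly what a post-norm ("Add & Norm") version of that last line could look like in the same graph-building style. It is only a sketch: the LayerNorm::new() op, its one-input call signature, and the ffn_add name are my assumptions for illustration, not necessarily femtoGPT's actual API.

// Hypothetical post-norm variant of the residual above; LayerNorm::new() and
// its single-input signature are assumed here for illustration only.
let ffn_add = g.call(Add::new(), &[add_atten_norm, lin2_bias_result])?;
curr_inp = g.call(LayerNorm::new(), &[ffn_add])?; // normalize the residual sum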
paper

In the paper, this part is "Add & Norm", not just "Add".
Is it intentional or is it just a mistake? Or maybe it's my misunderstanding... Please correct me if I'm wrong.
EDIT: It seems nanoGPT's attention block also doesn't normalize the result of the addition. Maybe I've misread the paper...
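
To make the two orderings concrete, here is a tiny self-contained Rust sketch of post-norm (the "Add & Norm" drawn in the original Transformer paper, x = norm(x + sublayer(x))) versus pre-norm (the GPT-2 / nanoGPT style, x = x + sublayer(norm(x)), with a single final norm after the last block). The sublayer function below is just a stand-in for attention/FFN, not femtoGPT code.

// Minimal sketch of post-norm vs. pre-norm residual connections.
// `sublayer` is a placeholder for attention or the FFN; it is NOT femtoGPT code.

fn layer_norm(x: &[f32]) -> Vec<f32> {
    let n = x.len() as f32;
    let mean = x.iter().sum::<f32>() / n;
    let var = x.iter().map(|v| (v - mean).powi(2)).sum::<f32>() / n;
    let eps = 1e-5;
    x.iter().map(|v| (v - mean) / (var + eps).sqrt()).collect()
}

fn sublayer(x: &[f32]) -> Vec<f32> {
    // Stand-in for attention or the feed-forward network.
    x.iter().map(|v| v * 0.5).collect()
}

fn add(a: &[f32], b: &[f32]) -> Vec<f32> {
    a.iter().zip(b).map(|(x, y)| x + y).collect()
}

fn main() {
    let x = vec![1.0, 2.0, 3.0, 4.0];

    // Post-norm ("Add & Norm", as in the original Transformer figure):
    // the residual sum itself is normalized before entering the next block.
    let post_norm = layer_norm(&add(&x, &sublayer(&x)));

    // Pre-norm (GPT-2 / nanoGPT style): the input is normalized before the
    // sublayer, the residual sum is left un-normalized, and one final layer
    // norm is applied after the last block instead.
    let pre_norm = add(&x, &sublayer(&layer_norm(&x)));

    println!("post-norm: {:?}", post_norm);
    println!("pre-norm:  {:?}", pre_norm);
}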