Paper | Project Page | Models
This repository contains the official implementation of "Transformers Don't Need LayerNorm at Inference Time: Scaling LayerNorm Removal to GPT-2 XL and the Implications for Mechanistic Interpretability" which demonstrates that all layer normalization (LN) layers can be removed from GPT-2 models with only a small increase in validation loss.
Layer-wise normalization (LN) is an essential component of virtually all transformer-based large language models. While its effects on training stability are well documented, its role at inference time is poorly understood. Additionally, LN layers hinder mechanistic interpretability by introducing additional nonlinearities and increasing the interconnectedness of individual model components.
We show that all LN layers can be removed from every GPT-2 model with only a small increase in validation loss (e.g. +0.03 cross-entropy loss for GPT-2 XL). Thus LN cannot play a substantial role in language modeling at inference time. We find that the amount of fine-tuning data needed for LN removal grows sublinearly with model parameters, suggesting that scaling to larger models is feasible.
We release a suite of LN-free GPT-2 models on Hugging Face. Our work clarifies the role of LN layers in language modeling, showing that GPT-2-class models can function without LN layers. We hope that our LN-free analogues of the GPT-2 family of models will enable more precise interpretability research and improve our understanding of language models.
- LN is not essential for inference: GPT-2 models can function without LN layers with minimal performance degradation
- Sublinear scaling: The data needed for LN removal grows sublinearly with model parameters
- Interpretability benefits: LN-free models enable more precise mechanistic interpretability research
- Released models: We provide a suite of LN-free GPT-2 models for the research community
We release a suite of LN-free GPT-2 models on Hugging Face that can be used for more precise interpretability research. You can load them with transformers:
from transformers import GPT2LMHeadModel
model = GPT2LMHeadModel.from_pretrained("schaeff/gpt2-small_ln-free300")
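For a quick sanity check you can run greedy generation. This is a minimal sketch; the use of the stock `gpt2` tokenizer is an assumption based on the models being drop-in GPT-2 replacements:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Assumption: the LN-free checkpoints reuse the standard GPT-2 tokenizer.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("schaeff/gpt2-small_ln-free300")
model.eval()

inputs = tokenizer("Layer normalization is", return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tokenizer.decode(output_ids[0]))
```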
Reported values are mean cross-entropy (CE) losses, computed over 10.2M tokens for The Pile and The Pile-filtered and over 4.5M tokens for the OpenWebText (OWT) validation set.
Model | FT steps | OWT (val) | The Pile | The Pile-filtered |
---|---|---|---|---|
OpenAI GPT-2 Small original | 0 | 3.1006 | 2.8450 | 2.7899 |
schaeff GPT-2 Small vanilla | 300 | 3.0126 | 2.8511 | 2.8112 |
schaeff GPT-2 Small LN-free | 300 | 3.0797 [+0.0671] | 2.8852 [+0.0402] | 2.8757 [+0.0858] |
OpenAI GPT-2 Medium original | 0 | 2.8145 | 2.5163 | 2.5390 |
schaeff GPT-2 Medium vanilla | 500 | 2.7390 | 2.5752 | 2.5724 |
schaeff GPT-2 Medium LN-free | 500 | 2.7642 [+0.0252] | 2.6579 [+0.1416] | 2.6352 [+0.0962] |
OpenAI GPT-2 Large original | 0 | 2.6623 | 2.5320 | 2.4347 |
schaeff GPT-2 Large vanilla | 600 | 2.6240 | 2.6233 | 2.5074 |
schaeff GPT-2 Large LN-free | 600 | 2.6384 [+0.0144] | 2.7504 [+0.2184] | 2.5159 [+0.0812] |
OpenAI GPT-2 XL original | 0 | 2.5567 | 2.4436¹ | 2.3739 |
schaeff GPT-2 XL vanilla | 800 | 2.4799 | 2.4673 | 2.3821 |
schaeff GPT-2 XL LN-free | 800 | 2.5052 [+0.0253] | 130.2197² | 2.3992 [+0.0253] |
Footnotes:
- ¹ GPT-2 XL original on The Pile: median 1.0103, 95% percentile range [0.0005, 10.6193], 99.9% percentile range [≈0.0000, 43.0064]
- ² GPT-2 XL LN-free on The Pile: median 1.0937, 95% percentile range [0.0004, 10.7548], 99.9% percentile range [≈0.0000, 48.6459]
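The large Pile mean for the LN-free XL model next to a median of about 1.09 suggests the mean is pulled up by a small number of extreme losses, which is why the footnotes report medians and percentile ranges. Below is a minimal sketch of how statistics of this kind can be computed, assuming the footnote values describe per-token losses; it is illustrative only (the repository's `eval_pile.py` is the script actually used, and the input text here is a placeholder):

```python
import torch
import torch.nn.functional as F
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("schaeff/gpt2-small_ln-free300")
model.eval()

text = "Placeholder text; a real evaluation would stream validation documents."
ids = tokenizer(text, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(ids).logits  # shape: [1, seq_len, vocab]

# Per-token cross-entropy: position t predicts token t+1.
token_losses = F.cross_entropy(logits[0, :-1], ids[0, 1:], reduction="none")

print("mean  :", token_losses.mean().item())
print("median:", token_losses.median().item())
print("95% range:", torch.quantile(token_losses, torch.tensor([0.025, 0.975])).tolist())
```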
You will need 80GB+ of GPU memory for most schedules, typically an A100.
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
The training script implements the fine-tuning procedure for removing layer-wise normalization from GPT-2 models. It is driven by schedules defined in config.py. Training can be resumed from a checkpoint and will respect any partially removed layer norm. (A conceptual sketch of the LN-free end state follows the commands below.)
# List the available fine-tuning schedules and inspect one of them:
python config.py list
python config.py show {config_name}
# Run LN-removal fine-tuning with the chosen schedule:
python train.py --mode without_ln --config {config_name}
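Roughly speaking, LN removal replaces each normalization with a fixed rescaling that can be folded into the surrounding weights, so a removed LayerNorm acts like a plain affine map. The sketch below illustrates that end state on a stock GPT-2; it is not the repository's schedule-driven procedure, and `AffineOnly` is a name made up for this illustration:

```python
import torch.nn as nn
from transformers import GPT2LMHeadModel

class AffineOnly(nn.Module):
    """Keep LayerNorm's learned scale and shift, but skip the normalization."""
    def __init__(self, ln: nn.LayerNorm):
        super().__init__()
        self.weight = nn.Parameter(ln.weight.detach().clone())
        self.bias = nn.Parameter(ln.bias.detach().clone())

    def forward(self, x):
        return x * self.weight + self.bias

model = GPT2LMHeadModel.from_pretrained("gpt2")
for block in model.transformer.h:
    block.ln_1 = AffineOnly(block.ln_1)
    block.ln_2 = AffineOnly(block.ln_2)
model.transformer.ln_f = AffineOnly(model.transformer.ln_f)

# Naively swapping the modules like this degrades the model noticeably;
# the fine-tuning schedule in train.py/config.py is what recovers the loss.
```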
The evaluation script calculates CE loss for several datasets to evaluate the output model. The script supports three formats:
- Original Hugging Face transformers model
- LN-free checkpoint
- LN-free transformers model whose eps and weights have been rescaled so that layer norm is effectively disabled (see the sketch below)
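The third format exploits a simple identity: with a very large `eps`, LayerNorm's denominator `sqrt(var + eps)` is approximately the constant `sqrt(eps)`, so rescaling the LN weight by `sqrt(eps)` cancels the normalization while the module remains a standard `nn.LayerNorm`. Here is a rough numerical check of that idea (an illustration under that assumption, not the repository's conversion code):

```python
import torch
import torch.nn as nn

d = 8
x = torch.randn(2, d)

ln = nn.LayerNorm(d)                        # standard LayerNorm (eps=1e-5)
w, b = ln.weight.detach(), ln.bias.detach()

big_eps = 1e12
ln_disabled = nn.LayerNorm(d, eps=big_eps)
with torch.no_grad():
    ln_disabled.weight.copy_(w * big_eps ** 0.5)  # rescale weight by sqrt(eps)
    ln_disabled.bias.copy_(b)

# With eps huge, sqrt(var + eps) ~= sqrt(eps), which the rescaled weight
# cancels, so only the centering and the affine transform remain.
expected = (x - x.mean(-1, keepdim=True)) * w + b
print(torch.allclose(ln_disabled(x), expected, atol=1e-4))  # True
```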
# Example: evaluate the stock Hugging Face gpt2 model (transformers format) on the pile-apollo dataset:
python eval_pile.py -m gpt2 -f transformers -b 8 -d pile-apollo
This task is compute-intensive and takes about 4 hours on an A100.
chmod +x eval_all.sh
./eval_all.sh
This project is licensed under the MIT License - see the LICENSE file for details.
If you have found our work useful, please cite it as:
@misc{gpt2layernorm2025,
  author = {Baroni, Luca and Khara, Galvin and Schaeffer, Joachim and Subkhankulov, Marat and Heimersheim, Stefan},
  title = {Transformers Don't Need LayerNorm at Inference Time: Scaling LayerNorm Removal to GPT-2 XL and the Implications for Mechanistic Interpretability},
  year = {2025},
  eprint = {2507.02559},
  archivePrefix = {arXiv},
  primaryClass = {cs.LG},
  url = {https://arxiv.org/abs/2507.02559v1}
}