this repo is a massive overhaul of Modded-NanoGPT with the goal of being a base for amateurs to do cheap & easy LLM experiments at a scale large enough to be worthy of an arxiv preprint. the idea is that repos like Modded-NanoGPT, NanoGPT, TinyLlama, and Meta's Lingua are either too old of an architecture, too purpose-specific, not from-scratch enough, too expensive to run, too overly-complicated, not well set up for quickly iterating on research ideas, etc, and we plan to occupy a unique balance of those trade-offs
this repo is currently in alpha, meaning that I think it's somewhat workable but have not utilized it on enough of my own experiments to guarantee that. before taking it out of alpha I will:
- implement the further improvements defined in the todo section below and
- go and implement a few experiment ideas and use what I learn from the difficulties I run into to add more things to the todo list
check out the video I made about it:
the input arguments in these instructions are comically small values designed to get you up and running on the tiniest GPU(s) for demonstration purposes; in practice you'll have to tune them to properly utilize the available VRAM of your setup
- either have one or more GPUs or hook up to a cloud GPU. for the latter see this tutorial; i recommend vast.ai since they're always at or near the cheapest
- either fork or create a template of this repo
pip install -r requirements.txt
- train your tokenizer on fineweb. samples is the number of text characters to train on (split up evenly across all GPUs). vocabulary size should exclude any special tokens you plan on using later. for a tutorial on how Byte-Pair Encoding (BPE) tokenizers work, see andrej karpathy's video for a simple & slow CPU implementation
single GPU:
python train_tokenizer.py --samples 100000 --vocabsize 1000 --name readmetokenizer --demo
multiple GPUs (replace `G` with the number of GPUs you have):
torchrun --nproc_per_node=G train_tokenizer.py --samples 100000 --vocabsize 1000 --name readmetokenizer --demo
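if you want a feel for what tokenizer training is doing before running the script, here's a deliberately slow, CPU-only BPE sketch in the spirit of karpathy's tutorial; it is illustrative only and not this repo's implementation (function and variable names are made up):

```python
from collections import Counter

# Toy BPE trainer: repeatedly merge the most frequent adjacent pair of token ids.
# Starts from raw bytes, so the first 256 ids are reserved for byte values.
def train_bpe(text, vocab_size):
    ids = list(text.encode("utf-8"))
    merges = {}                      # (id_a, id_b) -> new merged id
    next_id = 256
    while next_id < vocab_size and len(ids) > 1:
        (a, b), _ = Counter(zip(ids, ids[1:])).most_common(1)[0]
        merges[(a, b)] = next_id
        new_ids, i = [], 0
        while i < len(ids):          # replace every (a, b) occurrence with next_id
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == (a, b):
                new_ids.append(next_id); i += 2
            else:
                new_ids.append(ids[i]); i += 1
        ids, next_id = new_ids, next_id + 1
    return merges

print(train_bpe("aaabdaaabac", 258))  # first merge is the byte pair ('a', 'a')
```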
- download the fineweb dataset and convert all the raw text into tokens. dataset options are 10B, 100B, 10Bedu (default), or 100Bedu. tune shard_size (default 100 million) and num_shards to the quantity of data for your desired training run length. the script will only create one shard for the validation set which is not included in the count of num_shards
python download_fineweb.py --num_shards 1 --version 10B --shard_size 10000000 --tokenizer readmetokenizer_v1000_n100000.pkl
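as a quick sanity check on those two flags (the numbers below are hypothetical and just illustrate the arithmetic): the total training tokens downloaded is roughly `num_shards * shard_size`, plus the single validation shard the script creates on top

```python
# Back-of-the-envelope shard math for a target token budget (hypothetical values).
target_tokens = 1_000_000_000                    # e.g. aiming for a ~1B-token training run
shard_size    = 100_000_000                      # the script's default shard size
num_shards    = -(-target_tokens // shard_size)  # ceiling division -> 10 training shards
print(num_shards)                                # the validation shard is created on top of this
```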
- download the hellaswag benchmark:
python download_hellaswag.py
- train your language model. vocabulary size must be equal to your tokenizer size PLUS any special tokens defined in this script (1 for '<|endoftext|>', so 1000 + 1 = 1001). WARNING: if you include `--save_model` that will create a `.pt` file of the model weights, but by default the `.gitignore` will not allow this file to be pushed to github with the rest of the repo. this is done because the filesize is too large for github, and it means you have to find a way to download the model weights manually if you're on a cloud GPU and want to keep them
single GPU:
python train_gpt.py --model_name ReadmeGPT --tokenizer readmetokenizer_v1000_n100000.pkl --vocab_size 1001 --model_dim 128 --num_heads 4 --num_layers 6
multiple GPUs (replace `G` with the number of GPUs you have):
torchrun --nproc_per_node=G train_gpt.py --model_name ReadmeGPT --tokenizer readmetokenizer_v1000_n100000.pkl --vocab_size 1001 --model_dim 128 --num_heads 4 --num_layers 6
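a tiny worked example of that vocabulary-size rule, using the numbers from these commands (the special-token list is whatever `train_gpt.py` defines; here only '<|endoftext|>'):

```python
# vocab_size passed to train_gpt.py = tokenizer vocabulary + special tokens defined in the script
tokenizer_vocab = 1000               # --vocabsize used for readmetokenizer above
special_tokens  = ["<|endoftext|>"]  # defined in train_gpt.py
vocab_size = tokenizer_vocab + len(special_tokens)
assert vocab_size == 1001            # matches --vocab_size 1001 in the commands above
```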
- look in `experiments/` for your model. you should see 1) a `.txt` backup of all the `.py` files we just ran at the time of training (except `train_tokenizer.py`, which is backed up inside the tokenizer `.pkl` file and therefore not readable from a file browser), 2) a `.csv` containing the training time & loss, 3) a log file containing important information such as the hellaswag benchmark score and the maximum memory allocated during training, and 4) maybe a `.pt` file if you elected to run with `--save_model`
- great, now that all that is confirmed to be up & working you can start editing the code and running your own experiments by building off the baselines below!
we've trained some baselines for your experiments to compare against. For now (while the repo is in alpha/beta), they are absurdly sh*tty and really only here for demonstration purposes. As the repo improves, we will push new improved baselines of larger sizes, with better tuned hyperparameters, trained on more tokens, etc. The goal is to eventually closely resemble the GPT2 series of models in parameter count (maybe even larger) and train on as many tokens as possible while still keeping costs realistic for dedicated amateurs
Baseline | XS | S | M |
---|---|---|---|
Parameters (millions) | 57.8 | 117.7 | 342.5 |
Tokens Trained On (billions) | 0.1 | 0.4 | 1.0 |
GPU | RTX 3070 | RTX 4060 Ti | A40 |
VRAM Per GPU | 8GB | 16GB | 45GB |
GPU Count | 1 | 2 | 4 |
GPU Cost Per Hour (US Dollars) | $0.113 | $0.257 | $1.761 |
Training Time (minutes) | 12.02 | 51.59 | 94.93 |
Estimated Total Cost (US Dollars) | $0.14 | $0.48 | $4.55 |
NOTES:
- Total cost is estimated as `Total Cost = ((Training Time + 60 minutes) / 60) * (GPU Cost Per Hour)` to reflect the overhead of starting up your cloud GPU instance, testing which hyperparameters best utilize VRAM, running validation data and benchmarks, pushing changes, and closing down your instance.
- All costs reflect GPUs rented from vast.ai on Apr 18, 2025
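for example, plugging the XS baseline's numbers into that estimate (dollar figures are the ones from the table above):

```python
# Worked example of the cost estimate for the XS baseline.
training_minutes = 12.02
overhead_minutes = 60        # cloud-instance startup, hyperparameter tuning, benchmarks, teardown
cost_per_hour    = 0.113     # $/hour for the rented GPU(s)
total_cost = (training_minutes + overhead_minutes) / 60 * cost_per_hour
print(f"${total_cost:.2f}")  # ~$0.14, matching the table
```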
- meta
    - write a `contributing.md` to detail best practices for potential non-model code contributions (bug fixes, minor obvious improvements)
    - write a `how_to_experiment.md` to detail best practices for people looking to conduct scientifically robust experiments
        - improve formatting, specific applicability to this repo, example calculations & scenarios, etc
        - delineate the difference between an experiment and a contribution. You may do dozens of experiments before hitting on one that's worthy of a contribution, and even then you don't want to just naively merge that fork but rather start a new clean branch & make minimal edits to get to your improvement. Experiments should use the 'new template' button while contributions should use the 'fork' button
    - build some kind of (semi-)automated testing framework to check each PR for bugs
    - excessively comment and explain everything that's happening in each file
        - tensor shapes for every operation
        - ensure consistency in comment style (eg. choose between (B,N,D) and (batch_size, seq_len, model_dim))
- implement more of my ideas using the code as it stands as a baseline to test & learn more about how this repo should work
train_tokenizer.py
- make default dataset size auto-estimate GPU vram that'll be taken up & set to fill it up
- change dataloader to load small chunks of dataset at a time from SSD to CPU-RAM to VRAM; just now I trained `tokenizers/gpt4regex_v50256_n1000000000.pkl` on 8x4070Ti's but only filled up 6-7GB of their available 16GB since, for any larger dataset, the CPU would run out of memory before I could even get data onto the GPU
- switch the backup/logging to be more `train_gpt.py` style. keep the `.pkl` for actual use (storage of regex pattern & merges) but save it in a folder alongside a `.txt` backup of `train_tokenizer.py` and another `.txt` file to list out & visualize all the merges
    - make corresponding changes inside `train_gpt.py`
- fix peak memory printout
- confirm code still works on datacenter GPUs
    - single
    - DDP
- fix fp8 on hopper bug
- add optional control of parameter initialization via a seed in the hyperparameters
- planned architecture edits (if they speed up / improve performance)
- adjust value embeddings to dynamically account for any number of layers to be either a function of model size, learnable, or something else that makes more sense
- change values originally over-optimized for GPT2-124m
- attention head scaling factor
- are there any more?
- re-implement Modded-NanoGPT's original attention masks (see `def create_blockmasks()`)
    - alternate between full-causal and sliding-window attention
    - make the full/sliding pattern dynamically account for different numbers of model layers (similar to the description of value embeddings above)
    - gradually increase window size as a function of training steps
- use Liger Kernel's fused CE loss (or are we only using pytorch for this file and splitting off a separate version that's allowed to use custom kernels? idk), which would require either making our own custom version or getting rid of the scaling in-between the logits & the CE loss
- implement mu-parameterization
- potential architecture edits (if they speed up / improve performance)
- go back and rapidly test a bunch of boring architecture edits (eg. MLP activation functions) to see whether those chosen by Modded-NanoGPT were really just over-fitting their dataset
- MLA or deepseek's new sparse attention?
download_fineweb.py
- add options for shuffling & a seed
- add more fineweb samples (eg. 350BT, whole thing?)
- beginner friendly versions
- build `train_nanogpt.py` as a version for those who want to work off of a more well-known architecture as a base (this would be more expensive due to slower training times)
    - should use as much of our code as possible but stay true to the architecture, optimizer, etc of NanoGPT
    - should either download GPT2's tokenizer from huggingface or re-create it using the same regex pattern but on our own data (GPT2's regex is already in `train_tokenizer.py` but commented out)
- build `train_llama3.py` as a version for those who want to work off of a more well-known architecture as a base (this would be more expensive due to slower training times)
    - there are a lot of missing details that were never released about how Llama3 was trained (such as dropout locations, optimizer, learning rates, etc) that we should fill in with a compromise between efficient methods from `train_gpt.py` and methods that are likely to be easily understandable for someone looking to work with a simpler repo (eg. don't use Muon)
    - going to need sentencepiece or whatever Llama3 used for a tokenizer, not sure
- continually update `train_gpt.py` & the tokenizer to fit best methods and bring down costs while leaving the nanoGPT and llama3 versions stagnant
- train models on 1x8GB VRAM, 2x16GB, 4x32GB, and 8x80GB (for how much data each??) and record how much $ each one cost to run so that people have an estimate before doing their experiments
- use chinchilla-optimal model size & data quantity
- set hyperparameter defaults to that of 1x8GB version
- more/improving benchmarks:
- add batched inference support and then use it to speed up hellaswag benchmark
- figure out what additional benchmarks make sense for models of this scale
- api calls to a smarter LLM judge for mass comparisons of generated outputs?
- create list of prompts (preferably from some pre-existing well vetted benchmark)
- run model on said prompts right after the hellaswag benchmark
- save outputs in some easily parseable format
- write a script to stay in the root directory that
    - takes in the file names of two different models as input (one baseline and one experiment)
    - uploads outputs to some smarter LLM API (OpenAI, Anthropic, etc.) and has them pick which of the two outputs is better
    - returns a win-rate (50% is random-chance) & confidence interval and saves it to the experiment model's directory (a rough sketch of the win-rate & confidence-interval math appears after this to-do list)
- preferably after we do this once or twice i'd like to record an estimate of how much $ it takes each time so that people know
- optionally, instead of the outputs of this prompt being gathered & saved immediately after training, we could have this only work for runs that called `--save_model` and therefore have an available `.pt` file to use, but that'd require storing the `.pt` files of the baselines somewhere
- write latex preprint skeleton for others to begin theirs from
- specifics of the Modded-NanoGPT architecture in an appendix
- write a few scripts that take experiment directories as input and output loss plot, benchmark tables, etc
- implement some sort of SFT/RLHF/DPO/RL as an option for the largest scale of model once the field settles on a technique? I don't think this makes sense right now, given that even if our model size is big enough, the odds that anybody is training on enough data for the model to be usable are pretty low for the time being
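here's a rough sketch of the win-rate & confidence-interval computation mentioned in the benchmarks to-do above; it assumes the judge's verdicts have already been collected into a list of booleans, uses a plain normal-approximation interval, and all names are hypothetical:

```python
import math

# Hypothetical win-rate summary for an experiment model vs. a baseline.
# judge_prefers_experiment[i] is True if the LLM judge picked the experiment's output on prompt i.
def win_rate_with_ci(judge_prefers_experiment, z=1.96):
    n = len(judge_prefers_experiment)
    p = sum(judge_prefers_experiment) / n          # observed win rate (0.5 = random chance)
    half_width = z * math.sqrt(p * (1 - p) / n)    # normal-approximation 95% interval
    return p, (max(0.0, p - half_width), min(1.0, p + half_width))

rate, ci = win_rate_with_ci([True] * 58 + [False] * 42)
print(f"win rate {rate:.0%}, 95% CI {ci[0]:.0%}-{ci[1]:.0%}")  # 58%, roughly 48%-68%
```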
This repository hosts the NanoGPT speedrun, in which we (collaboratively|competitively) search for the fastest algorithm to use 8 NVIDIA H100 GPUs to train a language model that attains 3.28 cross-entropy loss on the FineWeb validation set.
The target (3.28 validation loss on FineWeb) follows Andrej Karpathy's GPT-2 replication in llm.c, which attains that loss after running for 45 minutes. The speedrun code also descends from llm.c's PyTorch trainer, which itself descends from NanoGPT, hence the name of the repo. Thanks to the efforts of many contributors, this repo now contains a training algorithm which attains the target performance in:
- 3 minutes on 8xH100 (the llm.c GPT-2 replication needed 45)
- 0.73B tokens (the llm.c GPT-2 replication needed 10B)
This improvement in training speed has been brought about by the following techniques:
- Modernized architecture: Rotary embeddings, QK-Norm, and ReLU²
- The Muon optimizer [writeup] [repo]
- Untie head from embedding, use FP8 matmul for head, and softcap logits (the latter following Gemma 2; sketched after this list)
- Initialization of projection and classification layers to zero (muP-like)
- Skip connections from embedding to every block as well as between blocks in U-net pattern
- Extra embeddings which are mixed into the values in attention layers (inspired by Zhou et al. 2024)
- FlexAttention with long-short sliding window attention pattern (inspired by Gemma 2) and window size warmup
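As a concrete illustration of one item in that list, logit softcapping just squashes the head's outputs through a scaled tanh before the loss. This is a minimal sketch (not the trainer's exact code), with the cap value of 15 taken from the record history below:

```python
import torch

def softcap_logits(logits, cap=15.0):
    # Smoothly bound logits to (-cap, cap); gradients stay well-behaved for large raw logits.
    return cap * torch.tanh(logits / cap)

logits = torch.randn(4, 50304) * 40        # hypothetical raw head outputs
print(softcap_logits(logits).abs().max())  # always < 15
```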
Contributors list (growing with each new record): @bozavlado, @brendanh0gan, @fernbear.bsky.social, @Grad62304977, @jxbz, @kellerjordan0, @KoszarskyB, @leloykun, @YouJiacheng
To run the current record, run the following commands.
git clone https://github.com/KellerJordan/modded-nanogpt.git && cd modded-nanogpt
pip install -r requirements.txt
pip install --pre torch==2.7.0.dev20250110+cu126 --index-url https://download.pytorch.org/whl/nightly/cu126 --upgrade
python data/cached_fineweb10B.py 8 # downloads only the first 800M training tokens to save time
./run.sh
Note: torch.compile will take around 5 minutes the first time you run the code.
For cases where CUDA or NCCL versions aren't compatible with your current system setup, Docker can be a helpful alternative. This approach standardizes versions for CUDA, NCCL, CUDNN, and Python, reducing dependency issues and simplifying setup. Note: an NVIDIA driver must already be installed on the system (useful if only the NVIDIA driver and Docker are available).
git clone https://github.com/KellerJordan/modded-nanogpt.git && cd modded-nanogpt
sudo docker build -t modded-nanogpt .
sudo docker run -it --rm --gpus all -v $(pwd):/modded-nanogpt modded-nanogpt python data/cached_fineweb10B.py 8
sudo docker run -it --rm --gpus all -v $(pwd):/modded-nanogpt modded-nanogpt sh run.sh
To get an interactive docker, you can use
sudo docker run -it --rm --gpus all -v $(pwd):/modded-nanogpt modded-nanogpt bash
The following is the historical progression of world speed records for the following competitive task:
Train a neural network to ≤3.28 validation loss on FineWeb using 8x NVIDIA H100s.
Note: The 3.28 target was selected to match Andrej Karpathy's GPT-2 (small) reproduction.
# | Record time | Description | Date | Log | Contributors |
---|---|---|---|---|---|
1 | 45 minutes | llm.c baseline | 05/28/24 | log | @karpathy, llm.c contributors |
2 | 31.4 minutes | Tuned learning rate & rotary embeddings | 06/06/24 | log | @kellerjordan0 |
3 | 24.9 minutes | Introduced the Muon optimizer | 10/04/24 | none | @kellerjordan0, @jxbz |
4 | 22.3 minutes | Muon improvements | 10/11/24 | log | @kellerjordan0, @bozavlado |
5 | 15.2 minutes | Pad embeddings, ReLU², zero-init projections, QK-norm | 10/14/24 | log | @Grad62304977, @kellerjordan0 |
6 | 13.1 minutes | Distributed the overhead of Muon | 10/18/24 | log | @kellerjordan0 |
7 | 12.0 minutes | Upgraded to PyTorch 2.5.0 | 10/18/24 | log | @kellerjordan0 |
8 | 10.8 minutes | Untied embedding and head | 11/03/24 | log | @Grad62304977, @kellerjordan0 |
9 | 8.2 minutes | Value and embedding skip connections, momentum warmup, logit softcap | 11/06/24 | log | @Grad62304977, @kellerjordan0 |
10 | 7.8 minutes | Bfloat16 activations | 11/08/24 | log | @kellerjordan0 |
11 | 7.2 minutes | U-net pattern skip connections & double lr | 11/10/24 | log | @brendanh0gan |
12 | 5.03 minutes | 1024-ctx dense causal attention → 64K-ctx FlexAttention | 11/19/24 | log | @KoszarskyB |
13 | 4.66 minutes | Attention window warmup | 11/24/24 | log | @fernbear.bsky.social |
14 | 4.41 minutes | Value Embeddings | 12/04/24 | log | @KoszarskyB |
15 | 3.95 minutes | U-net pattern value embeddings, assorted code optimizations | 12/08/24 | log | @leloykun, @YouJiacheng |
16 | 3.80 minutes | Split value embeddings, block sliding window, separate block mask | 12/10/24 | log | @YouJiacheng |
17 | 3.57 minutes | Sparsify value embeddings, improve rotary embeddings, drop an attn layer | 12/17/24 | log | @YouJiacheng |
18 | 3.4 minutes | Lower logit softcap from 30 to 15 | 01/04/25 | log | @KoszarskyB |
19 | 3.142 minutes | FP8 head, offset logits, lr decay to 0.1 instead of 0.0 | 01/13/25 | log | @YouJiacheng |
20 | 2.992 minutes | Merged QKV weights, long-short attention, attention scale, lower Adam epsilon, batched Muon | 01/16/25 | log | @leloykun, @fernbear.bsky.social, @YouJiacheng, @brendanh0gan, @scottjmaddox, @Grad62304977 |
21 | 2.933 minutes | Reduced batch size | 01/26/25 | log | @leloykun |
21 | 2.997 minutes | 21st record with new timing | 02/01/25 | log | not a new record, just re-timing #21 with the updated rules |
The only rules are that new records must:
- Not modify the train or validation data pipelines. (You can change the batch size, sequence length, attention structure etc.; just don't change the underlying streams of tokens.)
- Attain ≤3.28 mean val loss. (Due to inter-run variance, submissions must provide enough run logs to attain a statistical significance level of p<0.01 that their mean val loss is ≤3.28. Example code to compute the p-value can be found here; a rough sketch also follows this list.)
- Not use any extra `torch._inductor.config` or `torch.compile` flags. (These can save a few seconds, but they can also make compilation take >30min. This rule was introduced after the 21st record.)
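A rough sketch of the significance check in rule 2 (a one-sided one-sample t-test; the run losses below are made up, and the linked example code is the authoritative version):

```python
from scipy import stats

# Hypothetical final val losses from repeated runs of a candidate record.
losses = [3.2778, 3.2791, 3.2786, 3.2779, 3.2795, 3.2783]

# One-sided test of H0: mean val loss >= 3.28 against H1: mean val loss < 3.28.
result = stats.ttest_1samp(losses, popmean=3.28, alternative="less")
print(result.pvalue)  # must come out below 0.01 to qualify
```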
Other than that, anything and everything is fair game!
The target metric is cross-entropy loss on the FineWeb val set. To speak mathematically, the goal of the speedrun is to obtain a probability model of language which assigns a probability of at least `math.exp(-3.28 * 10485760)` to the first 10,485,760 tokens of the FineWeb val set. Hence, e.g., we allow evaluation at any sequence length, so long as we still have a valid probability model of language.
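In other words (this restatement is just arithmetic on the definition above): requiring a mean cross-entropy of at most 3.28 nats over the $N = 10{,}485{,}760$ validation tokens is the same as requiring

$$
\prod_{i=1}^{N} p(x_i \mid x_{<i}) \;=\; \exp\Big(\sum_{i=1}^{N} \log p(x_i \mid x_{<i})\Big) \;\ge\; e^{-3.28\,N},
$$

i.e. a per-token perplexity of at most $e^{3.28} \approx 26.6$.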
After the 21st record, we made two changes to the timing. First, there used to be an initial "grace period" of 10 untimed steps to allow kernel warmup. We replaced this with an explicit kernel-warmup section which is untimed and uses dummy data. This results in an extra runtime of 850ms from the 10 extra timed steps.
Second, we banned the use of `torch._inductor.config.coordinate_descent_tuning`. This saves ~25min of untimed pre-run compilation, but results in an extra runtime of ~3s.
Notable runs:
- @alexjc's 01/20/2025 2.77-minute TokenMonster-based record. This record is technically outside the rules of the speedrun, since we specified that the train/val tokens must be kept fixed. However, it's very interesting, and worth including. The run is not more data-efficient; rather, the speedup comes from the improved tokenizer allowing the vocabulary size to be reduced (nearly halved!) while preserving the same bytes-per-token, which saves lots of parameters and FLOPs in the head and embeddings.
Notable forks:
The target loss for this track is lowered from 3.28 to 2.92, as per Andrej Karpathy's 350M-parameter llm.c baseline. This baseline generates a model with performance similar to the original GPT-2 Medium, whereas the first track's baseline generates a model on par with GPT-2 Small. All other rules remain the same.
# | Record time | Description | Date | Log | Contributors |
---|---|---|---|---|---|
1 | 5.8 hours | llm.c baseline (350M parameters) | 05/28/24 | log | @karpathy, llm.c contributors |
2 | 29.3 minutes | Initial record based on scaling up the GPT-2 small track speedrun | 01/18/25 | log | @kellerjordan0 |
3 | 28.1 minutes | Added standard weight decay | 02/08/25 | log | @kellerjordan0 |
4 | 27.7 minutes | Tuned Muon Newton-Schulz coefficients | 02/14/25 | log | @leloykun |
5 | 27.2 minutes | Increased learning rate cooldown phase duration | 03/06/25 | log | @YouJiacheng |
A: The officially stated goal of NanoGPT speedrunning is as follows: `gotta go fast`. But for something a little more verbose involving an argument for good benchmarking, here's some kind of manifesto, adorned with a blessing from the master: https://x.com/karpathy/status/1846790537262571739
A: Because it is a competitive benchmark. In particular, if you attain a new speed record (using whatever method you want), there is an open invitation for you to post that record (on arXiv or X) and thereby vacuum up all the clout for yourself. I will even help you do it by reposting you as much as I can.
Q: NanoGPT speedrunning is cool and all, but meh it probably won't scale and is just overfitting to val loss
A: This is hard to refute, since "at scale" is an infinite category (what if the methods stop working only for >100T models?), making it impossible to fully prove. Also, I would agree that some of the methods used in the speedrun are unlikely to scale, particularly those which impose additional structure on the network, such as logit softcapping. But if the reader cares about 1.5B models, they might be convinced by this result:
Straightforwardly scaling up the speedrun (10/18/24 version) to 1.5B parameters yields a model with GPT-2 (1.5B)-level HellaSwag performance 2.5x more cheaply than @karpathy's baseline ($233 instead of $576):
Muon is defined as follows:
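A compact way to write the update, as a sketch reconstructed from the notes below (treat the exact momentum/Nesterov placement and the symbols as assumptions rather than the repo's literal code): for gradient $G_t$, momentum buffer $B_t$, momentum $\mu$, and learning rate $\eta$,

$$
\begin{aligned}
B_t &= \mu B_{t-1} + G_t,\\
O_t &= \mathrm{NewtonSchulz5}\left(G_t + \mu B_t\right),\\
W_t &= W_{t-1} - \eta\, O_t.
\end{aligned}
$$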
Where `NewtonSchulz5` is the following Newton-Schulz iteration [2, 3], which approximately replaces `G` with `U @ V.T` where `U, S, V = G.svd()`.
```python
import torch

@torch.compile
def zeroth_power_via_newtonschulz5(G, steps=5, eps=1e-7):
    # Orthogonalize a 2D matrix G via a quintic Newton-Schulz iteration run in bfloat16.
    assert len(G.shape) == 2
    a, b, c = (3.4445, -4.7750, 2.0315)  # non-convergent coefficients chosen to maximize slope at zero
    X = G.bfloat16() / (G.norm() + eps)  # the Frobenius norm bounds the spectral norm, so singular values start in [0, 1]
    if G.size(0) > G.size(1):
        X = X.T                          # work in the wide orientation so X @ X.T is the smaller Gram matrix
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X                # X <- a*X + b*(X X^T)X + c*(X X^T)^2 X
    if G.size(0) > G.size(1):
        X = X.T
    return X.to(G.dtype)
```
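A quick, hypothetical sanity check of the claim above (continuing from the block just shown), comparing the iteration's output against the exact polar factor `U @ V.T` from an SVD; expect a modest mismatch, since the singular values only land in roughly the (0.68, 1.13) band discussed below rather than at exactly 1:

```python
G = torch.randn(256, 128)
X = zeroth_power_via_newtonschulz5(G)                    # defined above
U, S, Vh = torch.linalg.svd(G, full_matrices=False)
rel_err = (X.float() - U @ Vh).norm() / (U @ Vh).norm()  # approximate, not exact, orthogonalization
print(rel_err)
```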
For this training scenario, Muon has the following favorable properties:
- Lower memory usage than Adam
- ~1.5x better sample-efficiency
- <2% wallclock overhead
Many of the choices made to generate this optimizer were obtained experimentally by our pursuit of CIFAR-10 speedrunning. In particular, we experimentally obtained the following practices:
- Using Nesterov momentum inside the update, with orthogonalization applied after momentum.
- Using a specifically quintic Newton-Schulz iteration as the method of orthogonalization.
- Using non-convergent coefficients for the quintic polynomial in order to maximize slope at zero, and thereby minimize the number of necessary Newton-Schulz iterations. It turns out that the variance doesn't actually matter that much, so we end up with a quintic that rapidly converges to the range (0.68, 1.13) upon repeated application, rather than converging more slowly to 1.
- Running the Newton-Schulz iteration in bfloat16 (whereas Shampoo implementations often depend on inverse-pth-roots run in fp32 or fp64).
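Putting those practices together, here is a minimal, hypothetical single-matrix Muon-style step (the function name, default hyperparameters, and exact Nesterov form are illustrative assumptions, not the repo's optimizer class):

```python
def muon_step_(W, G, momentum_buffer, lr=0.02, momentum=0.95):
    # Nesterov-style momentum inside the update, then orthogonalize the result with Newton-Schulz.
    momentum_buffer.mul_(momentum).add_(G)                   # buf <- mu * buf + G
    update = zeroth_power_via_newtonschulz5(G.add(momentum_buffer, alpha=momentum))
    W.add_(update, alpha=-lr)                                # W <- W - lr * orthogonalized update
```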
Our use of a Newton-Schulz iteration for orthogonalization traces to Bernstein & Newhouse (2024), who suggested it as a way to compute Shampoo [5, 6] preconditioners, and theoretically explored Shampoo without preconditioner accumulation. In particular, Jeremy Bernstein @jxbz sent us the draft, which caused us to experiment with various Newton-Schulz iterations as the orthogonalization method for this optimizer. If we had used SVD instead of a Newton-Schulz iteration, this optimizer would have been too slow to be useful. Bernstein & Newhouse also pointed out that Shampoo without preconditioner accumulation is equivalent to steepest descent in the spectral norm, and therefore Shampoo can be thought of as a way to smooth out spectral steepest descent. The proposed optimizer can be thought of as a second way of smoothing spectral steepest descent, with a different set of memory and runtime tradeoffs compared to Shampoo.
- To run experiments on fewer GPUs, simply modify `run.sh` to have a different `--nproc_per_node`. This should not change the behavior of the training.
- If you're running out of memory, you may need to reduce the sequence length for FlexAttention (which does change the training; see here for a guide)
- Guilherme Penedo et al. "The fineweb datasets: Decanting the web for the finest text data at scale." arXiv preprint arXiv:2406.17557 (2024).
- Nicholas J. Higham. Functions of Matrices. Society for Industrial and Applied Mathematics (2008). Equation 5.22.
- Günther Schulz. Iterative Berechnung der reziproken Matrix. Z. Angew. Math. Mech., 13:57–59 (1933).
- Jeremy Bernstein and Laker Newhouse. "Old Optimizer, New Norm: An Anthology." arxiv preprint arXiv:2409.20325 (2024).
- Vineet Gupta, Tomer Koren, and Yoram Singer. "Shampoo: Preconditioned stochastic tensor optimization." International Conference on Machine Learning. PMLR, 2018.
- Rohan Anil et al. "Scalable second order optimization for deep learning." arXiv preprint arXiv:2002.09018 (2020).
- Alexander Hägele et al. "Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations." arXiv preprint arXiv:2405.18392 (2024).
- Zhanchao Zhou et al. "Value Residual Learning For Alleviating Attention Concentration In Transformers." arXiv preprint arXiv:2410.17897 (2024).
- Gemma Team et al. "Gemma 2: Improving open language models at a practical size." arXiv preprint arXiv:2408.00118 (2024).
- Alec Radford et al. "Language models are unsupervised multitask learners." OpenAI blog 1.8 (2019).
@misc{modded_nanogpt_2024,
author = {Keller Jordan and Jeremy Bernstein and Brendan Rappazzo and
@fernbear.bsky.social and Boza Vlado and You Jiacheng and
Franz Cesista and Braden Koszarsky and @Grad62304977},
title = {modded-nanogpt: Speedrunning the NanoGPT baseline},
year = {2024},
url = {https://github.com/KellerJordan/modded-nanogpt}
}