GPT from scratch. Just NumPy and Python.
Understanding comes from building. This repo implements the core pieces of neural networks - modules, tokenizers, optimizers, backpropagation - using only NumPy. No autograd, no tensor abstractions. Every gradient computation is explicit.
pip install numpy
# Tokenize data (char-level, word-level, or subword with BPE)
./datagen.py
# Train a GPT model
./train.py
# Generate text from trained model
./sample.py
# Plot training curves
./plot.py
# Test the implementation
./test.py
core modules:

Component | Description |
---|---|
nn modules | Linear, Embedding, LayerNorm, Softmax, ReLU, MultiHeadAttention, FeedForward |
Adam | first-order optimizer |
Tokenizers | char-level, word-level, BPE |
GPT | model (transformer decoder) |
educational resources:
- BACKPROP.md - what it is and how to implement it from scratch
- OPTIMIZERS.md - understand the difference between Adam and SGD
- TOKENIZERS.md - understand the difference between character-level, word-level, and BPE tokenization
# Every layer follows this pattern
import numpy as np

class Linear:
    def __init__(self, in_features, out_features):
        self.W = np.random.randn(in_features, out_features) * 0.02
        self.b = np.zeros(out_features)

    def forward(self, X):
        self.X = X  # cache the input for the backward pass
        return X @ self.W + self.b

    def backward(self, dY):
        # dY: gradient flowing back from the next layer
        self.dW = self.X.T @ dY        # gradient w.r.t. weights
        self.db = np.sum(dY, axis=0)   # gradient w.r.t. bias
        dX = dY @ self.W.T             # gradient w.r.t. input
        return dX
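A quick way to sanity-check any backward method is to compare it against finite differences. Here is a minimal sketch using the Linear layer above (the shapes, seed, and tolerance are arbitrary choices for illustration, not the repo's actual test suite):

```python
import numpy as np

np.random.seed(0)
layer = Linear(in_features=4, out_features=3)
X = np.random.randn(5, 4)

# analytic gradient of the scalar loss L = sum(Y) with respect to X
Y = layer.forward(X)
dX = layer.backward(np.ones_like(Y))

# numerical gradient via central differences
eps = 1e-5
dX_num = np.zeros_like(X)
for i in range(X.size):
    Xp, Xm = X.copy(), X.copy()
    Xp.flat[i] += eps
    Xm.flat[i] -= eps
    dX_num.flat[i] = (layer.forward(Xp).sum() - layer.forward(Xm).sum()) / (2 * eps)

assert np.allclose(dX, dX_num, atol=1e-6)
```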
numpyGPT/
├── nn/
│ ├── modules/ # Linear, Embedding, LayerNorm, etc.
│ └── functional.py # cross_entropy, softmax, etc.
├── optim/ # Adam optimizer + LR scheduling
├── utils/data/ # DataLoader, Dataset
├── tokenizer/ # Character, word-level & BPE tokenizers
└── models/GPT.py # Transformer implementation
datagen.py # Data preprocessing
train.py # Training script
sample.py # Text generation
plot.py # Training curves (requires matplotlib)
test.py # Test suite
- Explicit gradients - see exactly how backprop works (a small end-to-end sketch follows this list)
- PyTorch-like API - familiar interface
- Complete transformer - multi-head attention, feedforward, layer norm
- Flexible tokenization - character, word-level, or BPE preprocessing
- Extensive testing - test correctness of forward and backward for every layer
- Minimal dependencies - just NumPy and the standard library
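To illustrate what explicit gradients look like end to end, here is a small sketch of a training loop for a toy two-layer network, written in the same forward/backward style as the Linear example above (the composition and the plain SGD update are illustrative choices, not the repo's exact API):

```python
import numpy as np

np.random.seed(0)
l1, l2 = Linear(8, 16), Linear(16, 1)
X = np.random.randn(32, 8)
y = np.random.randn(32, 1)

for step in range(200):
    # forward: Linear -> ReLU -> Linear -> MSE loss
    h = l1.forward(X)
    a = np.maximum(h, 0)
    pred = l2.forward(a)
    loss = np.mean((pred - y) ** 2)

    # backward: push the gradient through each piece by hand
    dpred = 2 * (pred - y) / y.size   # dL/dpred
    da = l2.backward(dpred)           # fills l2.dW, l2.db; returns dL/da
    dh = da * (h > 0)                 # ReLU gradient
    l1.backward(dh)                   # fills l1.dW, l1.db

    # plain SGD update (the repo uses Adam; SGD keeps the sketch short)
    for layer in (l1, l2):
        layer.W -= 0.1 * layer.dW
        layer.b -= 0.1 * layer.db

print(f"final MSE: {loss:.4f}")
```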
Perfect for understanding how modern language models actually work.
- pytorch's repo – architecture and API inspiration
- building micrograd [YT] - backprop from scratch, explained simply
- micrograd - A tiny scalar-valued autograd engine
- CNN in Numpy for MNIST - a convolutional network for MNIST classification in NumPy
- layerNorm implementation in llm.c (Karpathy's again <3) - layernorm fwd-bwd implementation with torch
- kaggle's L-layer neural network using numpy - cats/dogs classifier using numpy
- forward and Backpropagation in Neural Networks using Python - forward + backward pass walkthrough
Three ways to represent text, three different models, same Shakespeare. Let's see what happens.
Trained three transformer models with identical architecture on Shakespeare; the only difference is how the text is tokenized.
Parameter | Value |
---|---|
batch_size | 16 |
block_size | 128 |
max_iters | 8,000 |
lr | 3e-4 |
min_lr | 3e-5 |
n_layer | 4 |
n_head | 4 |
n_embd | 256 |
warmup_iters | 800 |
grad_clip | 1.0 |
"Hello" → ['H', 'e', 'l', 'l', 'o']
One character = one token.
"Hello world!" → ['hello', 'world', '!']
One word = one token.
Split on whitespace and punctuation, and lowercase everything to limit out-of-vocabulary tokens (OOV, i.e. UNK).
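A rough sketch of what these two schemes boil down to (the regex is an illustrative choice, not necessarily the repo's exact splitting rule):

```python
import re

text = "Hello world!"

# character-level: one character = one token
char_tokens = list(text)                                 # ['H', 'e', 'l', 'l', 'o', ' ', ...]

# word-level: lowercase, then split into words and punctuation
word_tokens = re.findall(r"\w+|[^\w\s]", text.lower())   # ['hello', 'world', '!']
```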
"Hello" → ['H', 'ell', 'o'] # learned subwords
BPE learns frequent character pairs and merges them, building subwords bottom-up.
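The idea behind BPE training can be sketched in a few lines: count adjacent symbol pairs, merge the most frequent pair into a new symbol, and repeat. This toy version runs on a single word purely for illustration; a real tokenizer learns merges over a whole corpus and replays them in order at encoding time:

```python
from collections import Counter

def bpe_merges(word, num_merges):
    """Toy BPE: greedily merge the most frequent adjacent pair."""
    symbols = list(word)                  # start from characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(symbols, symbols[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append((a, b))
        # rewrite the symbol sequence with every (a, b) fused into one token
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == (a, b):
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols, merges

print(bpe_merges("banana", 1))   # (['b', 'an', 'an', 'a'], [('a', 'n')])
```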
Metric | Character | Word | BPE |
---|---|---|---|
Final Loss | 1.5 ⭐ | 3.0 | 3.0 |
Output Readability | ❌ (broken words) | ✅ | ✅ ⭐ |
OOV Handling | ✅ | ❌ | ✅ |
Semantic Coherence | ❌ | ✅ | ✅ |
Character Names | ❌ | ✅ | ✅ |
Natural Phrases | ❌ | ✅ | ✅ |
Training Speed | Fast → Unstable | Steady | Slow but Stable |
Characters generated (from 500 tokens) | 490 | 1602 ⭐ | 1505 |
Number of parameters | 3.23M ⭐ | 6.55M | 6.55M |
Embedding-related parameters | 68k (2.11%)⭐ | 3.4M (52%) | 3.4M (52%) |
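The parameter gap in the table is almost entirely the embedding tables. A back-of-the-envelope check, assuming learned positional embeddings and an untied output projection (both assumptions; the exact breakdown isn't shown here):

```python
n_embd, block_size = 256, 128
for name, vocab_size in [("char", 69), ("word/BPE", 6551)]:
    # token embedding + output projection (assumed untied) + positional embedding
    emb_params = 2 * vocab_size * n_embd + block_size * n_embd
    print(f"{name}: ~{emb_params / 1e6:.2f}M embedding-related parameters")
# char:     ~0.07M  (matches the ~68k in the table)
# word/BPE: ~3.39M  (matches the ~3.4M in the table)
```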
Each model comes with a 2×2 panel of plots to track training:
- Top Left: Training and validation loss over time
- Top Right: Gradient norm (spikes indicate instability)
- Bottom Left: Learning rate schedule (warmup + cosine decay; sketched after this list)
- Bottom Right: Validation loss improvement per eval window
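For reference, a warmup + cosine decay schedule with the values from the config table (warmup_iters=800, lr=3e-4, min_lr=3e-5, max_iters=8,000) typically looks like this sketch; the repo's scheduler may differ in detail:

```python
import math

def get_lr(it, max_lr=3e-4, min_lr=3e-5, warmup_iters=800, max_iters=8000):
    # linear warmup from ~0 to max_lr, then cosine decay down to min_lr
    if it < warmup_iters:
        return max_lr * (it + 1) / warmup_iters
    progress = (it - warmup_iters) / (max_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))   # goes 1 -> 0
    return min_lr + coeff * (max_lr - min_lr)
```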
Asked each model to generate 500 tokens of Shakespeare:
Complete output: bpe.out
KING HENRY PERCY:
And kill me Queen Margaret, and tell us not, as I have obstition?
NORTHUMBERLAND:
Why, then the king's son,
And send him that he were,
Complete output: word.out
(Yes, I know I still have some tokenization issues...)
king to uncrown him as to the afternoon of aboard.
lady anne:
on a day - gone; and, for
should romeo be executed in the victory!
Complete output: char.out
KINGAll, and seven dost I,
And will beset no specommed a geles, and cond upon
you with speaks, but ther so ent the vength
- Lower loss ≠ better output: The character model had the lowest loss but the worst readability. Its loss is lower because it predicts 1 of 69 characters, a much easier task than predicting 1 of 6,551 words/subwords (see the quick check after this list).
- Number of parameters: Despite sharing the same architecture configuration, the BPE and word-level models carry a much larger embedding matrix, which raises their parameter count from 3.23M to 6.55M (52% of it embedding-related).
- Token efficiency matters: The same 500 output tokens yield very different text lengths: ~500 characters with the character tokenizer, ~1500 with BPE, and ~1600 with word-level.
- Stability matters: BPE's consistent training beats fast but unstable learning
- The curse of granularity: Finer tokens (char) = easier prediction but harder composition. Coarser tokens (word) = harder prediction but natural composition.
- There's no free lunch: Each approach trades off different aspects
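A quick check of the first point: per-token cross-entropy is only comparable within the same vocabulary, because even a uniform guess scores differently depending on vocabulary size:

```python
import math

# cross-entropy (in nats) of a uniform guess over each vocabulary
print(math.log(69))     # ~4.23 for the 69-character vocabulary
print(math.log(6551))   # ~8.79 for the 6,551-entry word/BPE vocabulary
# losses of 1.5 over 69 symbols and 3.0 over 6,551 symbols start from very different baselines
```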
TODOs:
- bpe with byte-fallback