GPT from scratch. Just NumPy and Python.
Understanding comes from building. This repo implements the core pieces of neural networks - modules, tokenizers, optimizers, backpropagation - using only NumPy. No autograd, no tensor abstractions. Every gradient computation is explicit.
pip install numpy
# Tokenize data (char-level, word-level, or subword with BPE)
./datagen.py
# Train a GPT model
./train.py
# Generate text from trained model
./sample.py
# Plot training curves
./plot.py
# Test the implementation
./test.py
core modules:

Component | Description |
---|---|
nn modules | Linear, Embedding, LayerNorm, Softmax, ReLU, MultiHeadAttention, FeedForward |
Adam | first-order optimizer |
Tokenizers | char-level, word-level, BPE |
GPT | model (transformer decoder) |
educational resources:
- BACKPROP.md - what it is and how to implement it from scratch
- OPTIMIZERS.md - understand the difference between Adam and SGD
- TOKENIZERS.md - understand the difference between character-level, word-level, and BPE tokenization
# Every layer follows this pattern
import numpy as np

class Linear:
    def __init__(self, in_features, out_features):
        self.W = np.random.randn(in_features, out_features) * 0.02
        self.b = np.zeros(out_features)

    def forward(self, X):
        self.X = X  # cache the input for the backward pass
        return X @ self.W + self.b

    def backward(self, dY):
        # dY: gradient flowing back from the next layer
        self.dW = self.X.T @ dY        # gradient w.r.t. weights
        self.db = np.sum(dY, axis=0)   # gradient w.r.t. bias
        dX = dY @ self.W.T             # gradient w.r.t. input
        return dX
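A quick way to sanity-check any backward method is to compare it against finite differences. Here is a minimal sketch using the Linear layer above (the shapes, seed, and tolerance are arbitrary choices for illustration, not the repo's actual test suite):

```python
import numpy as np

np.random.seed(0)
layer = Linear(in_features=4, out_features=3)
X = np.random.randn(5, 4)

# analytic gradient of the scalar loss L = sum(Y) with respect to X
Y = layer.forward(X)
dX = layer.backward(np.ones_like(Y))

# numerical gradient via central differences
eps = 1e-5
dX_num = np.zeros_like(X)
for i in range(X.size):
    Xp, Xm = X.copy(), X.copy()
    Xp.flat[i] += eps
    Xm.flat[i] -= eps
    dX_num.flat[i] = (layer.forward(Xp).sum() - layer.forward(Xm).sum()) / (2 * eps)

assert np.allclose(dX, dX_num, atol=1e-6)
```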
numpyGPT/
├── nn/
│ ├── modules/ # Linear, Embedding, LayerNorm, etc.
│ └── functional.py # cross_entropy, softmax, etc.
├── optim/ # Adam optimizer + LR scheduling
├── utils/data/ # DataLoader, Dataset
├── tokenizer/ # Character, word-level & BPE tokenizers
└── models/GPT.py # Transformer implementation
datagen.py # Data preprocessing
train.py # Training script
sample.py # Text generation
plot.py # Training curves (requires matplotlib)
test.py # Test suite
- Explicit gradients - see exactly how backprop works (a small end-to-end sketch follows this list)
- PyTorch-like API - familiar interface
- Complete transformer - multi-head attention, feedforward, layer norm
- Flexible tokenization - character, word-level, or BPE preprocessing
- Extensive testing - test correctness of forward and backward for every layer
- Minimal dependencies - just NumPy and the standard library
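To illustrate what explicit gradients look like end to end, here is a small sketch of a training loop for a toy two-layer network, written in the same forward/backward style as the Linear example above (the composition and the plain SGD update are illustrative choices, not the repo's exact API):

```python
import numpy as np

np.random.seed(0)
l1, l2 = Linear(8, 16), Linear(16, 1)
X = np.random.randn(32, 8)
y = np.random.randn(32, 1)

for step in range(200):
    # forward: Linear -> ReLU -> Linear -> MSE loss
    h = l1.forward(X)
    a = np.maximum(h, 0)
    pred = l2.forward(a)
    loss = np.mean((pred - y) ** 2)

    # backward: push the gradient through each piece by hand
    dpred = 2 * (pred - y) / y.size   # dL/dpred
    da = l2.backward(dpred)           # fills l2.dW, l2.db; returns dL/da
    dh = da * (h > 0)                 # ReLU gradient
    l1.backward(dh)                   # fills l1.dW, l1.db

    # plain SGD update (the repo uses Adam; SGD keeps the sketch short)
    for layer in (l1, l2):
        layer.W -= 0.1 * layer.dW
        layer.b -= 0.1 * layer.db

print(f"final MSE: {loss:.4f}")
```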
Perfect for understanding how modern language models actually work.
- pytorch's repo – architecture and API inspiration
- building micrograd [YT] - backprop from scratch, explained simply
- micrograd - A tiny scalar-valued autograd engine
- CNN in Numpy for MNIST - a convolutional network for MNIST classification in NumPy
- layerNorm implementation in llm.c (Karpathy's again <3) - layernorm fwd-bwd implementation with torch
- kaggle's L-layer neural network using numpy - cats/dogs classifier using numpy
- forward and Backpropagation in Neural Networks using Python - forward + backward pass walkthrough
Three ways to represent text, three different models, same Shakespeare. Let's see what happens.
Trained three transformer models with identical architecture on Shakespeare; the only difference is how the text is tokenized.
Parameter | Value |
---|---|
batch_size | 16 |
block_size | 128 |
max_iters | 8,000 |
lr | 3e-4 |
min_lr | 3e-5 |
n_layer | 4 |
n_head | 4 |
n_embd | 256 |
warmup_iters | 800 |
grad_clip | 1.0 |
"Hello" → ['H', 'e', 'l', 'l', 'o']
One character = one token.
"Hello world!" → ['hello', 'world', '!']
One word = one token.
Split on whitespace and punctuation, and lowercase everything to limit out-of-vocabulary tokens (OOV, i.e. UNK).
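A rough sketch of what these two schemes boil down to (the regex is an illustrative choice, not necessarily the repo's exact splitting rule):

```python
import re

text = "Hello world!"

# character-level: one character = one token
char_tokens = list(text)                                 # ['H', 'e', 'l', 'l', 'o', ' ', ...]

# word-level: lowercase, then split into words and punctuation
word_tokens = re.findall(r"\w+|[^\w\s]", text.lower())   # ['hello', 'world', '!']
```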
"Hello" → ['H', 'ell', 'o'] # learned subwords
BPE learns frequent character pairs and merges them, building subwords bottom-up.
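The idea behind BPE training can be sketched in a few lines: count adjacent symbol pairs, merge the most frequent pair into a new symbol, and repeat. This toy version runs on a single word purely for illustration; a real tokenizer learns merges over a whole corpus and replays them in order at encoding time:

```python
from collections import Counter

def bpe_merges(word, num_merges):
    """Toy BPE: greedily merge the most frequent adjacent pair."""
    symbols = list(word)                  # start from characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(symbols, symbols[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append((a, b))
        # rewrite the symbol sequence with every (a, b) fused into one token
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == (a, b):
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols, merges

print(bpe_merges("banana", 1))   # (['b', 'an', 'an', 'a'], [('a', 'n')])
```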
Metric | Character | Word | BPE |
---|---|---|---|
Final Loss | 1.5 ⭐ | 3.0 | 3.0 |
Output Readability | ❌ (broken words) | ✅ | ✅ ⭐ |
OOV Handling | ✅ | ❌ | ✅ |
Semantic Coherence | ❌ | ✅ | ✅ |
Character Names | ❌ | ✅ | ✅ |
Natural Phrases | ❌ | ✅ | ✅ |
Training Speed | Fast → Unstable | Steady | Slow but Stable |
Characters generated (from 500 tokens) | 490 | 1602 ⭐ | 1505 |
Number of parameters | 3.23M ⭐ | 6.55M | 6.55M |
Embedding-related parameters | 68k (2.11%)⭐ | 3.4M (52%) | 3.4M (52%) |
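The parameter gap in the table is almost entirely the embedding tables. A back-of-the-envelope check, assuming learned positional embeddings and an untied output projection (both assumptions; the exact breakdown isn't shown here):

```python
n_embd, block_size = 256, 128
for name, vocab_size in [("char", 69), ("word/BPE", 6551)]:
    # token embedding + output projection (assumed untied) + positional embedding
    emb_params = 2 * vocab_size * n_embd + block_size * n_embd
    print(f"{name}: ~{emb_params / 1e6:.2f}M embedding-related parameters")
# char:     ~0.07M  (matches the ~68k in the table)
# word/BPE: ~3.39M  (matches the ~3.4M in the table)
```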
Each model comes with a 2×2 panel of plots to track training:
- Top Left: Training and validation loss over time
- Top Right: Gradient norm (spikes indicate instability)
- Bottom Left: Learning rate schedule (warmup + cosine decay; sketched after this list)
- Bottom Right: Validation loss improvement per eval window
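For reference, a warmup + cosine decay schedule with the values from the config table (warmup_iters=800, lr=3e-4, min_lr=3e-5, max_iters=8,000) typically looks like this sketch; the repo's scheduler may differ in detail:

```python
import math

def get_lr(it, max_lr=3e-4, min_lr=3e-5, warmup_iters=800, max_iters=8000):
    # linear warmup from ~0 to max_lr, then cosine decay down to min_lr
    if it < warmup_iters:
        return max_lr * (it + 1) / warmup_iters
    progress = (it - warmup_iters) / (max_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))   # goes 1 -> 0
    return min_lr + coeff * (max_lr - min_lr)
```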
Asked each model to generate 500 tokens of Shakespeare:
Complete output: bpe.out
KING HENRY PERCY:
And kill me Queen Margaret, and tell us not, as I have obstition?
NORTHUMBERLAND:
Why, then the king's son,
And send him that he were,
Complete output: word.out
(Yes, I know I still have some tokenization issues...)
king to uncrown him as to the afternoon of aboard.
lady anne:
on a day - gone; and, for
should romeo be executed in the victory!
Complete output: char.out
KINGAll, and seven dost I,
And will beset no specommed a geles, and cond upon
you with speaks, but ther so ent the vength
- Lower loss ≠ better output: The character model had the lowest loss but the worst readability. Its loss is lower because it predicts 1 of 69 characters, a much easier task than predicting 1 of 6,551 words/subwords (see the quick check after this list).
- Number of parameters: Despite sharing the same architecture configuration, the BPE and word-level models carry a much larger embedding matrix, which raises their parameter count from 3.23M to 6.55M (52% of it embedding-related).
- Token efficiency matters: The same 500 output tokens yield very different text lengths: ~500 characters with the character tokenizer, ~1500 with BPE, and ~1600 with word-level.
- Stability matters: BPE's consistent training beats fast but unstable learning
- The curse of granularity: Finer tokens (char) = easier prediction but harder composition. Coarser tokens (word) = harder prediction but natural composition.
- There's no free lunch: Each approach trades off different aspects
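A quick check of the first point: per-token cross-entropy is only comparable within the same vocabulary, because even a uniform guess scores differently depending on vocabulary size:

```python
import math

# cross-entropy (in nats) of a uniform guess over each vocabulary
print(math.log(69))     # ~4.23 for the 69-character vocabulary
print(math.log(6551))   # ~8.79 for the 6,551-entry word/BPE vocabulary
# losses of 1.5 over 69 symbols and 3.0 over 6,551 symbols start from very different baselines
```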
TODOs:
- bpe with byte-fallback