A from-scratch GPT built with NumPy and Python’s standard library. No autograd, no frameworks: every layer is re-implemented with its own forward and backward pass. Gradients are computed manually, updates are transparent, and every operation is spelled out.


NumpyGPT

GPT from scratch. Just NumPy and Python.


why?

Understanding comes from building. This repo implements the core pieces of neural networks - modules, tokenizers, optimizers, backpropagation - using only NumPy. No autograd, no tensor abstractions. Every gradient computation is explicit.

quick start

pip install numpy 

# Tokenize data (char-level, word-level, or subword BPE)
./datagen.py

# Train a GPT model
./train.py

# Generate text from trained model
./sample.py

# Plot training curves
./plot.py

# Test the implementation
./test.py

core modules:

  • Linear, Embedding, LayerNorm, Softmax, ReLU, MultiHeadAttention, FeedForward
  • Adam optimizer (first-order; a minimal sketch follows this list)
  • Tokenizers: char-level, word-level, and BPE
  • GPT model (transformer decoder)
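
For reference, the core of the Adam update fits in a few lines. This is a generic sketch of the standard algorithm (exponential moving averages of the gradient and its square, with bias correction), not the exact class that lives in optim/:

import numpy as np

class Adam:
    def __init__(self, params, lr=3e-4, betas=(0.9, 0.999), eps=1e-8):
        self.params = params                          # list of parameter arrays, updated in place
        self.lr, self.eps = lr, eps
        self.beta1, self.beta2 = betas
        self.m = [np.zeros_like(p) for p in params]   # first moment (mean of gradients)
        self.v = [np.zeros_like(p) for p in params]   # second moment (mean of squared gradients)
        self.t = 0

    def step(self, grads):
        self.t += 1
        for i, (p, g) in enumerate(zip(self.params, grads)):
            self.m[i] = self.beta1 * self.m[i] + (1 - self.beta1) * g
            self.v[i] = self.beta2 * self.v[i] + (1 - self.beta2) * g * g
            m_hat = self.m[i] / (1 - self.beta1 ** self.t)   # bias correction
            v_hat = self.v[i] / (1 - self.beta2 ** self.t)
            p -= self.lr * m_hat / (np.sqrt(v_hat) + self.eps)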

educational resources:

  • BACKPROP.md - what it is and how to implement it from scratch
  • OPTIMIZERS.md - understand the difference between Adam and SGD
  • TOKENIZERS.md - understand the difference between character-level, word-level, and BPE tokenization

implementation

# Every layer follows this pattern
import numpy as np

class Linear:
    def __init__(self, in_features, out_features):
        self.W = np.random.randn(in_features, out_features) * 0.02  # small random init
        self.b = np.zeros(out_features)
    
    def forward(self, X):
        self.X = X  # cache for backward
        return X @ self.W + self.b
    
    def backward(self, dY):
        # dY: gradient flowing back from next layer
        self.dW = self.X.T @ dY        # gradient w.r.t weights
        self.db = np.sum(dY, axis=0)   # gradient w.r.t bias  
        dX = dY @ self.W.T             # gradient w.r.t input
        return dX
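
A quick way to convince yourself that a hand-written backward pass is correct is to compare it against finite differences. The snippet below is a standalone sanity check using the Linear class above, not the repo's test.py:

import numpy as np

np.random.seed(0)
layer = Linear(in_features=4, out_features=3)
X = np.random.randn(5, 4)

# use the scalar loss L = sum(Y), so the gradient flowing back is all ones
Y = layer.forward(X)
layer.backward(np.ones_like(Y))
analytic = layer.dW[0, 0]                  # analytic gradient w.r.t. one weight entry

# central finite difference on the same weight entry
eps = 1e-5
w0 = layer.W[0, 0]
layer.W[0, 0] = w0 + eps
loss_plus = layer.forward(X).sum()
layer.W[0, 0] = w0 - eps
loss_minus = layer.forward(X).sum()
layer.W[0, 0] = w0
numeric = (loss_plus - loss_minus) / (2 * eps)

print(abs(numeric - analytic))             # should be on the order of 1e-9 or smaller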

project structure

numpyGPT/
├── nn/
│   ├── modules/         # Linear, Embedding, LayerNorm, etc.
│   └── functional.py    # cross_entropy, softmax, etc.
├── optim/               # Adam optimizer + LR scheduling
├── utils/data/          # DataLoader, Dataset
├── tokenizer/           # Character, word-level & BPE tokenizers
└── models/GPT.py        # Transformer implementation

datagen.py              # Data preprocessing
train.py                # Training script
sample.py               # Text generation
plot.py                 # Training curves (requires matplotlib)
test.py                 # Test suite

features

  • Explicit gradients - see exactly how backprop works
  • PyTorch-like API - familiar interface
  • Complete transformer - multi-head attention, feedforward, layer norm (a standalone attention sketch follows this list)
  • Flexible tokenization - character, word-level, or BPE preprocessing
  • Extensive testing - forward and backward correctness tests for every layer
  • Minimal dependencies - just NumPy and the standard library
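
For a feel of what the transformer internals look like, here is a forward-only sketch of single-head causal scaled dot-product attention in plain NumPy. It illustrates the standard algorithm and is not the repo's MultiHeadAttention module:

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)      # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_attention(Q, K, V):
    # Q, K, V: (T, d) for a single head; the mask hides future positions
    T, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores[mask] = -np.inf
    return softmax(scores, axis=-1) @ V

T, d = 8, 16
Q, K, V = (np.random.randn(T, d) for _ in range(3))
out = causal_attention(Q, K, V)                  # shape (8, 16)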

Perfect for understanding how modern language models actually work.

resources that I found helpful


EXPERIMENTS: BPE vs Word vs Character

Three ways to represent text, three different models, same Shakespeare. Let's see what happens.

the setup

Trained three identical transformer models on Shakespeare; the only difference is how the text is tokenized.

hyperparameters

| Parameter | Value |
|---|---|
| batch_size | 16 |
| block_size | 128 |
| max_iters | 8,000 |
| lr | 3e-4 |
| min_lr | 3e-5 |
| n_layer | 4 |
| n_head | 4 |
| n_embd | 256 |
| warmup_iters | 800 |
| grad_clip | 1.0 |

the tokenizers

character-level

"Hello" → ['H', 'e', 'l', 'l', 'o']

One character = one token.
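
A character-level tokenizer is little more than a lookup table over the characters seen in the training text. A minimal sketch (the short string here stands in for the full corpus):

text = "First Citizen: Before we proceed any further, hear me speak."  # stand-in corpus
chars = sorted(set(text))                       # ~69 unique characters on full Shakespeare
stoi = {ch: i for i, ch in enumerate(chars)}    # char -> token id
itos = {i: ch for ch, i in stoi.items()}        # token id -> char

encode = lambda s: [stoi[c] for c in s]
decode = lambda ids: ''.join(itos[i] for i in ids)

assert decode(encode("hear me speak")) == "hear me speak"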

word-level

"Hello world!" → ['hello', 'world', '!']

One word = one token.

Splits on spaces and punctuation and lowercases everything to limit out-of-vocabulary tokens (i.e., UNK).
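
A minimal sketch of that splitting and lowercasing step; the regex here is illustrative, and the repo's word tokenizer may differ in detail:

import re

def word_tokenize(text):
    # lowercase, then pull out runs of letters and individual punctuation marks
    return re.findall(r"[a-z]+|[^\sa-z]", text.lower())

print(word_tokenize("Hello world!"))   # ['hello', 'world', '!']
# words that never appear in the training text are later mapped to UNK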

bpe (Byte Pair Encoding)

"Hello" → ['H', 'ell', 'o']  # learned subwords

Learns frequent character pairs bottom-up: the most common adjacent pair is repeatedly merged into a new subword token.
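
The BPE training loop itself is short: count adjacent pairs, merge the most frequent one, repeat. A minimal character-level sketch of the idea (not the repo's exact implementation):

from collections import Counter

def bpe_train(text, num_merges):
    tokens = list(text)                              # start from individual characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))     # counts of adjacent token pairs
        if not pairs:
            break
        best = max(pairs, key=pairs.get)             # most frequent adjacent pair
        merges.append(best)
        merged, i = [], 0
        while i < len(tokens):                       # replace every occurrence of the pair
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                merged.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return merges, tokens

merges, tokens = bpe_train("hello hello hello", num_merges=3)
print(merges)   # [('h', 'e'), ('he', 'l'), ('hel', 'l')]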

training results

| Metric | Character | Word | BPE |
|---|---|---|---|
| Final Loss | 1.5 ⭐ | 3.0 | 3.0 |
| Output Readability | ❌ (broken words) | ✅ | ⭐ |
| OOV Handling | | | |
| Semantic Coherence | | | |
| Character Names | | | |
| Natural Phrases | | | |
| Training Speed | Fast → Unstable | Steady | Slow but Stable |
| Number of chars (500 tokens) | 490 | 1602 ⭐ | 1505 |
| Number of parameters | 3.23M ⭐ | 6.55M | 6.55M |
| Embedding-related parameters | 68k (2.11%) ⭐ | 3.4M (52%) | 3.4M (52%) |

training curves

Each model comes with a 2×2 panel of plots to track training:

  • Top Left: Training and validation loss over time
  • Top Right: Gradient norm (watch for spikes = instability)
  • Bottom Left: Learning rate schedule (warmup + cosine decay; sketched right after this list)
  • Bottom Right: Validation loss improvement per eval window
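
That warmup-plus-cosine shape is a small function of the iteration number. The sketch below uses the hyperparameters from the table above; the repo's scheduler may differ in the exact formula:

import math

def get_lr(it, lr=3e-4, min_lr=3e-5, warmup_iters=800, max_iters=8000):
    if it < warmup_iters:                                         # linear warmup from 0 to lr
        return lr * (it + 1) / warmup_iters
    progress = (it - warmup_iters) / (max_iters - warmup_iters)
    coeff = 0.5 * (1 + math.cos(math.pi * min(progress, 1.0)))    # decays from 1 to 0
    return min_lr + coeff * (lr - min_lr)                         # cosine decay from lr to min_lr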


character-level

[figure: character-level training curves]

word-level

[figure: word-level training curves]

bpe

[figure: BPE training curves]

output quality

Asked each model to generate 500 tokens of Shakespeare:

bpe output

Complete output: bpe.out

KING HENRY PERCY:
And kill me Queen Margaret, and tell us not, as I have obstition?

NORTHUMBERLAND:
Why, then the king's son,
And send him that he were,

word output

Complete output: word.out

(Yes, I know I still have some tokenization issues...)

king to uncrown him as to the afternoon of aboard.
lady anne:
on a day - gone; and, for
should romeo be executed in the victory!

character output

Complete output: char.out

KINGAll, and seven dost I,
And will beset no specommed a geles, and cond upon
you with speaks, but ther so ent the vength

key insights

  1. Lower loss ≠ better output: the character model had the lowest loss but the worst readability. Its loss is lower because it predicts 1 of 69 characters, which is much easier than predicting 1 of 6,551 words/subwords (see the back-of-the-envelope check after this list).
  2. Number of parameters: despite the identical architecture configuration, the BPE and word-level models carry a much larger embedding matrix, which raises their parameter count from 3.23M to 6.55M (52% of it embedding-related).
  3. Token efficiency matters: the same 500 output tokens yield very different text lengths: ~490 characters at character level, ~1,505 with BPE, and ~1,602 at word level.
  4. Stability matters: BPE's consistent training beats fast but unstable learning.
  5. The curse of granularity: finer tokens (char) = easier prediction but harder composition; coarser tokens (word) = harder prediction but natural composition.
  6. There's no free lunch: each approach trades off different aspects.
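
A quick back-of-the-envelope check of insights 1 and 2, assuming a vocabulary of 69 characters vs. 6,551 words/subwords, n_embd=256, block_size=128, and untied input/output embedding matrices:

import numpy as np

# insight 1: the loss of a uniform random guesser depends only on vocabulary size,
# so the final losses start from very different baselines and are not directly comparable
print(np.log(69))     # ~4.23 nats baseline for the character model
print(np.log(6551))   # ~8.79 nats baseline for the word/BPE models

# insight 2: embedding-related parameters = token embedding + output projection + position embedding
n_embd, block_size = 256, 128
print(2 * 69 * n_embd + block_size * n_embd)     # 68,096    (~68k, character model)
print(2 * 6551 * n_embd + block_size * n_embd)   # 3,386,880 (~3.4M, word/BPE models)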

TODOs:

  • bpe with byte-fallback
