This repository contains a clean PyTorch implementation of a GPT-2 style transformer model. The implementation is based on the architecture described in the "Attention Is All You Need" paper, with modifications specific to GPT-2.
The codebase demonstrates how to build a transformer-based language model from scratch, including:
- Multi-head self-attention
- Layer normalization
- Feed-forward (MLP) layers
- Token and positional embeddings
- Autoregressive text generation
The repository is organized as follows:

- `solution.py`: Main implementation of the GPT-2 model components and training functions
- `gpt2_small/`: Python package that wraps the implementation for easy import
- `tests/`: Test suite to validate model components
- `demos/`: Demonstration scripts for training and text generation
- `pyproject.toml`: Project configuration and dependencies (used for installation)
- `setup_env.sh`: Script to set up the virtual environment and install the package
To use this repository you need:

- Python 3.8 or higher
- pip (package installer for Python)

To set up a virtual environment and install the package, use the provided script:

```bash
# Make the script executable
chmod +x setup_env.sh

# Run the setup script and activate the environment
./setup_env.sh
```
The project is set up as a Python package that can be installed with pip:
```bash
# Install in development mode (changes to the code will be reflected immediately)
pip install -e .

# Install with development dependencies
pip install -e ".[dev]"
```
After installation, you can import the components from the package:
```python
from gpt2_small import Transformer, TransformerConfig
```
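As a quick orientation, a model can be configured and run along the lines below. This is a minimal sketch: the exact `TransformerConfig` field names (`d_model`, `n_heads`, `n_layers`, `d_vocab`, `n_ctx`) are assumptions here, so check the class definitions in `solution.py` for the real signatures.

```python
import torch
from gpt2_small import Transformer, TransformerConfig

# Hypothetical hyperparameter names -- consult TransformerConfig for the real ones.
config = TransformerConfig(
    d_model=256,    # residual stream width
    n_heads=8,      # attention heads per layer
    n_layers=4,     # number of transformer blocks
    d_vocab=50257,  # GPT-2 BPE vocabulary size
    n_ctx=128,      # maximum context length
)

model = Transformer(config)

# Forward pass on a dummy batch of token ids (assumes the model returns logits).
tokens = torch.randint(0, config.d_vocab, (2, 16))  # (batch, sequence)
logits = model(tokens)                               # expected: (batch, sequence, d_vocab)
print(logits.shape)
```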
The implementation includes these key components (their composition is sketched after the list):

- `TransformerConfig`: Configuration class for model hyperparameters
- `MultiHeadAttention`: Implementation of masked multi-head self-attention
- `LayerNorm`: Layer normalization implementation
- `MLP`: Feed-forward network used in transformer blocks
- `TransformerBlock`: Full transformer block (attention + MLP)
- `Embedding`: Token embedding lookup
- `PositionalEmbedding`: Position encoding
- `Unembedding`: Linear projection from the model dimension to the vocabulary
- `Transformer`: Top-level model class that combines all components
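These pieces compose in the standard GPT-2 pattern: pre-LayerNorm, causally masked attention, and residual connections around both sub-layers. The sketch below illustrates that wiring with stock `torch.nn` modules rather than the repository's own `LayerNorm`, `MultiHeadAttention`, and `MLP` classes, so treat it as a reference for the structure, not a copy of `TransformerBlock`:

```python
import torch
import torch.nn as nn

class PreLNBlock(nn.Module):
    """Illustrative pre-LayerNorm transformer block in the GPT-2 style."""

    def __init__(self, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),  # GPT-2 expands the hidden size 4x
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Causal mask: position t may attend only to positions <= t.
        seq_len = x.shape[1]
        mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device), diagonal=1
        )
        normed = self.ln1(x)
        attn_out, _ = self.attn(normed, normed, normed, attn_mask=mask, need_weights=False)
        x = x + attn_out                # residual connection around attention
        x = x + self.mlp(self.ln2(x))   # residual connection around the MLP
        return x

# Shapes are preserved: (batch, seq, d_model) in, (batch, seq, d_model) out.
print(PreLNBlock()(torch.randn(2, 10, 256)).shape)
```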
Run the test suite to verify all model components:
```bash
python test_model.py
```
The implementation includes functions for training a transformer model. You can call them directly from your own code, or run the training demo (a sketch of a comparable training loop appears after the feature list below):

```bash
python ./demos/train_demo.py
```
The training demo:
- Creates a small model suitable for quick training
- Shows generation samples before training begins
- Trains the model on a simple dataset
- Displays generation samples throughout training to show progress
- Tracks loss metrics and generation quality improvement
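For reference, the kind of next-token-prediction loop such a demo runs typically looks like the sketch below. It is written against plain PyTorch and is not the repository's exact training function: `model` is assumed to map `(batch, seq)` token ids to `(batch, seq, vocab)` logits, and the real demo's data loading, sampling callbacks, and logging are more involved.

```python
import torch
import torch.nn.functional as F

def train(model, batches, lr=3e-4, device="cpu"):
    """Minimal next-token-prediction training loop.

    `batches` is any iterable of LongTensors of shape (batch, seq).
    """
    model.to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    for step, tokens in enumerate(batches):
        tokens = tokens.to(device)
        logits = model(tokens)  # (batch, seq, vocab)
        # Logits at position t are trained to predict the token at position t + 1.
        loss = F.cross_entropy(
            logits[:, :-1].reshape(-1, logits.shape[-1]),
            tokens[:, 1:].reshape(-1),
        )
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if step % 100 == 0:
            print(f"step {step}: loss {loss.item():.4f}")
```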
The implementation includes functions for generating text from the model (a minimal sampling sketch follows the feature list below):

```bash
# Run the comprehensive generation demo
python ./demos/generate_demo.py
```
The generation demo showcases:
- Generation with a randomly initialized model
- Generation with different temperature settings (controlling randomness)
- Generation with top-k sampling (limiting the token selection pool)
- Generation with pre-trained weights when available
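Temperature and top-k sampling work roughly as in the sketch below. This is an illustrative implementation rather than the demo's exact generation code; `model` is again assumed to return `(batch, seq, vocab)` logits, and the growing sequence is assumed to stay within the model's context window.

```python
import torch

@torch.no_grad()
def generate(model, tokens, max_new_tokens=50, temperature=1.0, top_k=None):
    """Autoregressive sampling with temperature and optional top-k filtering.

    `tokens` is a LongTensor of shape (batch, seq); temperature must be > 0.
    """
    for _ in range(max_new_tokens):
        logits = model(tokens)[:, -1]   # logits for the next token only: (batch, vocab)
        logits = logits / temperature   # higher temperature -> flatter, more random
        if top_k is not None:
            # Keep only the top_k most likely tokens; mask out everything else.
            kth = torch.topk(logits, top_k, dim=-1).values[:, -1, None]
            logits = logits.masked_fill(logits < kth, float("-inf"))
        probs = torch.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        tokens = torch.cat([tokens, next_token], dim=-1)
    return tokens
```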
The demos are designed as end-to-end demonstrations: they show all components working together correctly and produce human-readable outputs that can be evaluated visually.

The project relies on the following dependencies:

- PyTorch
- transformer_lens
- einops
- datasets
- matplotlib (for visualization)
- jaxtyping (for type annotations; see the sketch after this list)
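einops and jaxtyping are typically used together to keep tensor shapes explicit in both the code and the type annotations. The helper below is an illustrative example of that style (a hypothetical function, not code taken from `solution.py`):

```python
import torch
from einops import rearrange
from jaxtyping import Float

def split_heads(
    x: Float[torch.Tensor, "batch seq d_model"], n_heads: int
) -> Float[torch.Tensor, "batch n_heads seq d_head"]:
    # Split the model dimension into per-head slices for multi-head attention.
    return rearrange(
        x, "batch seq (n_heads d_head) -> batch n_heads seq d_head", n_heads=n_heads
    )
```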
References:

- Attention Is All You Need - the original transformer paper
- TinyStories: How Small Can Language Models Be and Still Speak Coherent English? - the TinyStories dataset
- GPT-2: Language Models are Unsupervised Multitask Learners - the GPT-2 paper