A complete implementation of the Transformer architecture based on the "Attention Is All You Need" paper by Vaswani et al. (2017), trained on The Adventures of Sherlock Holmes to generate Victorian-era prose in the style of Conan Doyle.
This project implements a GPT-style Transformer model from scratch using PyTorch, demonstrating:
- Multi-Head Self-Attention: The core mechanism that allows the model to attend to different parts of the input sequence (a minimal sketch follows this list)
- Positional Encoding: Learned position embeddings to understand sequence order
- Layer Normalization & Residual Connections: Pre-LN design for better gradient flow
- Feedforward Networks: Three-layer MLP with GELU activation
- Subword Tokenization: Using tiktoken (GPT-2's tokenizer) for efficient text processing
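The multi-head self-attention component mentioned above can be sketched in a few lines of PyTorch. The class and argument names below (`CausalSelfAttention`, `d_model`, `n_heads`, `block_size`) are illustrative and may differ from the notebook's code; this is a minimal sketch, not the project's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    """Minimal multi-head self-attention with a causal mask (illustrative sketch)."""
    def __init__(self, d_model: int = 256, n_heads: int = 8, block_size: int = 8):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads           # 256 / 8 = 32 dims per head
        self.qkv = nn.Linear(d_model, 3 * d_model)   # joint projection to queries, keys, values
        self.proj = nn.Linear(d_model, d_model)      # output projection
        # lower-triangular mask: each position may attend only to itself and earlier positions
        self.register_buffer("mask", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=2)
        # reshape to (B, n_heads, T, head_dim) so each head attends independently
        q = q.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        att = (q @ k.transpose(-2, -1)) / (self.head_dim ** 0.5)   # scaled dot-product scores
        att = att.masked_fill(self.mask[:T, :T] == 0, float("-inf"))
        att = F.softmax(att, dim=-1)
        out = (att @ v).transpose(1, 2).contiguous().view(B, T, C)  # re-merge the heads
        return self.proj(out)
```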
The model follows the decoder-only Transformer architecture with the following configuration (collected into a config sketch after this list):
- Embedding Dimension: 256 (32 dimensions per head)
- Attention Heads: 8
- Layers: 2 Transformer blocks
- Context Length: 8 tokens (configurable)
- Parameters: 27.4M total
- Tokenizer: GPT-2 tokenizer via tiktoken (50,257-token vocabulary)
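For reference, these hyperparameters map onto a small configuration object such as the hypothetical `GPTConfig` below; the field names are illustrative and the notebook may organize them differently.

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    vocab_size: int = 50257   # GPT-2 tokenizer vocabulary
    block_size: int = 8       # context length in tokens
    d_model: int = 256        # embedding dimension (32 dims per head)
    n_heads: int = 8          # attention heads
    n_layers: int = 2         # Transformer blocks

config = GPTConfig()
```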
Our analysis reveals key insights about model scaling (a parameter-count helper is sketched after this list):
- Embedding Dimension: most significant impact on model size
  - Linear scaling from ~15M to ~125M parameters (200d to 1000d)
  - The current 256d provides a good balance
- Number of Heads: minimal impact on parameter count
  - Essentially flat, staying around 27.4M parameters across head counts
  - 8 heads is optimal for the current architecture
- Number of Layers: linear parameter scaling
  - Each additional layer adds ~4M parameters
  - 2 layers are sufficient for basic language modeling
- Block Size: minimal parameter impact
  - Affects memory usage more than parameter count
  - The current 8-token context balances efficiency and capability
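A simple way to reproduce this kind of scaling study is to count trainable parameters on an instantiated model and sweep one hyperparameter at a time. The helper below is a generic sketch; the `build_model` constructor in the comment is a placeholder, not a function from the notebook.

```python
import torch.nn as nn

def count_parameters(model: nn.Module) -> int:
    """Total number of trainable parameters in a model."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Example sweep over the embedding dimension (build_model is a placeholder
# for whatever model constructor the notebook defines):
# for d_model in (200, 256, 512, 1000):
#     model = build_model(d_model=d_model)
#     print(d_model, count_parameters(model) / 1e6, "M params")
```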
- Dataset: The Adventures of Sherlock Holmes (~145K tokens)
- Training Duration: 3 epochs
- Loss Convergence: 9.0 → 3.5
- Learning Rate: 1e-4 with AdamW optimizer
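A minimal training loop consistent with these settings might look like the sketch below; `model`, `get_batch`, and `num_steps` are placeholders rather than names taken from the notebook.

```python
import torch

# `model` is the Transformer defined earlier; `get_batch` is a placeholder for
# the notebook's data loader returning (input, target) token-id tensors of shape (B, T).
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = torch.nn.CrossEntropyLoss()
num_steps = 1000  # illustrative; the notebook trains for 3 epochs over ~145K tokens

for step in range(num_steps):
    xb, yb = get_batch("train")
    logits = model(xb)                                   # (B, T, vocab_size)
    loss = loss_fn(logits.view(-1, logits.size(-1)), yb.view(-1))
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
```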
Here are some sample generations from the trained model:
Prompt: "Holmes examined the evidence..."
"Holmes examined the evidence of those mysteries and improbabilities the whole thing was this wayward, and have been a girl. Turner had an only daughter of the late Ezekiah Hopkins, who was very gentleman in the habit of seeing him, and as his actions was a"
Prompt: "The mystery of..."
"The mystery of the door, which was quite a strong other ones. On the fourth day there came in." "Pray do so." "I did not help you how many singular that I had never seen upon the seat and was"
Prompt: "It was a dark and stormy night when..."
"It was a dark and stormy night when a task of repute in the country-town of Ross. A glass was, in a The Times every morning, with full justice in my cock-and-evident a thing as a dénouement of the little"
Prompt: "Dr. Watson said..."
"Dr. Watson said that we do not talk about day. He was in only ominous words with this wealth for which he ended with the last generation. I had begun to take its name a‚Colonel Lysander Stark's door and attacked it."
- Install dependencies:
pip install -r requirements.txt
- Open and run the Jupyter notebook:
jupyter notebook transformer_language_model.ipynb
- Architecture Enhancements:
  - Increase the context length for better coherence
  - Add more layers for improved capacity
  - Implement advanced sampling methods (top-k, top-p)
- Training Optimizations (the first two are sketched below):
  - Learning rate scheduling
  - Gradient clipping
  - Larger batch sizes
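Learning rate scheduling and gradient clipping can be bolted onto the existing loop with a few lines. The cosine schedule and the clipping threshold of 1.0 below are illustrative choices, not settings taken from the notebook; `model`, `get_batch`, and `num_steps` are the same placeholders as in the training sketch above.

```python
import torch
import torch.nn.functional as F

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_steps)

for step in range(num_steps):
    xb, yb = get_batch("train")
    logits = model(xb)
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), yb.view(-1))
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
    optimizer.step()
    scheduler.step()  # learning rate scheduling (cosine decay as an example)
```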
- Generation Features (a temperature/top-k sampling sketch follows this list):
  - Beam search decoding
  - Temperature control
  - Conditional generation
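Temperature control and top-k filtering both act on the logits just before sampling. The helper below sketches how such a step could slot into the generation loop; the function name and default values are illustrative.

```python
import torch
import torch.nn.functional as F

def sample_next_token(logits: torch.Tensor, temperature: float = 0.8, top_k: int = 40):
    """Sample one token id from final-position logits of shape (1, vocab_size)."""
    logits = logits / temperature                      # <1.0 sharpens, >1.0 flattens the distribution
    if top_k is not None:
        topk_vals, _ = torch.topk(logits, top_k)
        # mask out everything below the k-th largest logit
        logits[logits < topk_vals[:, [-1]]] = float("-inf")
    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)
```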
This project is licensed under the MIT License - see the LICENSE file for details.