Transformer Language Model from Scratch


A complete implementation of the Transformer architecture based on the "Attention Is All You Need" paper by Vaswani et al. (2017), trained on The Adventures of Sherlock Holmes to generate coherent, Victorian-era prose.

Overview

This project implements a GPT-style Transformer model from scratch using PyTorch, demonstrating the following components (a minimal sketch ties them together after the list):

  • Multi-Head Self-Attention: The core mechanism that allows the model to attend to different parts of the input sequence
  • Positional Encoding: Learned position embeddings to understand sequence order
  • Layer Normalization & Residual Connections: Pre-LN design for better gradient flow
  • Feedforward Networks: Three-layer MLP with GELU activation
  • Subword Tokenization: Using tiktoken (GPT-2's tokenizer) for efficient text processing
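
To make these components concrete, here is a minimal sketch of a pre-LN decoder block with causal multi-head self-attention and a GELU feedforward network. The class names, the 4x hidden size, and the omission of dropout are simplifying assumptions; the notebook's implementation may differ in detail.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CausalSelfAttention(nn.Module):
    """Multi-head self-attention with a causal (lower-triangular) mask."""

    def __init__(self, embed_dim: int, num_heads: int, block_size: int):
        super().__init__()
        assert embed_dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.qkv = nn.Linear(embed_dim, 3 * embed_dim)
        self.proj = nn.Linear(embed_dim, embed_dim)
        # Each position may only attend to itself and earlier positions.
        self.register_buffer("mask", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=2)
        # Split the embedding into num_heads independent heads: (B, heads, T, head_dim).
        q = q.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        att = (q @ k.transpose(-2, -1)) / (self.head_dim ** 0.5)
        att = att.masked_fill(self.mask[:T, :T] == 0, float("-inf"))
        att = F.softmax(att, dim=-1)
        out = (att @ v).transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(out)


class TransformerBlock(nn.Module):
    """Pre-LN block: normalize before each sub-layer, add the residual after it."""

    def __init__(self, embed_dim: int, num_heads: int, block_size: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(embed_dim)
        self.attn = CausalSelfAttention(embed_dim, num_heads, block_size)
        self.ln2 = nn.LayerNorm(embed_dim)
        # Feedforward network (input -> hidden -> output) with GELU activation.
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, 4 * embed_dim),
            nn.GELU(),
            nn.Linear(4 * embed_dim, embed_dim),
        )

    def forward(self, x):
        x = x + self.attn(self.ln1(x))  # residual around attention
        x = x + self.mlp(self.ln2(x))   # residual around the feedforward network
        return x
```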

Model Architecture

The model follows the decoder-only Transformer architecture with the following configuration (assembled in the sketch after this list):

  • Embedding Dimension: 256 (32 dimensions per head)
  • Attention Heads: 8
  • Layers: 2 Transformer blocks
  • Context Length: 8 tokens (configurable)
  • Parameters: 27.4M total parameters
  • Tokenizer: GPT-2 tokenizer (50,257 tokens)
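
Continuing the sketch above, the full model can be assembled with exactly these hyperparameters; the class name, bias terms, and untied output head are assumptions rather than the notebook's code.

```python
class TinyGPT(nn.Module):
    """Decoder-only language model with the hyperparameters listed above."""

    def __init__(self, vocab_size=50257, embed_dim=256, num_heads=8,
                 num_layers=2, block_size=8):
        super().__init__()
        self.block_size = block_size
        self.token_emb = nn.Embedding(vocab_size, embed_dim)
        self.pos_emb = nn.Embedding(block_size, embed_dim)  # learned position embeddings
        self.blocks = nn.ModuleList(
            [TransformerBlock(embed_dim, num_heads, block_size) for _ in range(num_layers)]
        )
        self.ln_f = nn.LayerNorm(embed_dim)
        self.lm_head = nn.Linear(embed_dim, vocab_size)

    def forward(self, idx):
        B, T = idx.shape
        pos = torch.arange(T, device=idx.device)
        x = self.token_emb(idx) + self.pos_emb(pos)  # (B, T, embed_dim)
        for block in self.blocks:
            x = block(x)
        return self.lm_head(self.ln_f(x))  # (B, T, vocab_size) logits


model = TinyGPT()
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters")  # roughly 27.4M under these assumptions
```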

Parameter Scaling Analysis

Our analysis highlights several points about model scaling (a parameter-count sweep sketch follows the list):

  1. Embedding Dimension: Most significant impact on model size

    • Linear scaling from ~15M to ~125M params (200d to 1000d)
    • Current 256d provides good balance
  2. Number of Heads: Minimal impact on parameter count

    • Flat relationship around 27.4M parameters
    • 8 heads optimal for current architecture
  3. Number of Layers: Linear parameter scaling

    • Each layer adds ~4M parameters
    • 2 layers sufficient for basic language modeling
  4. Block Size: Minimal parameter impact

    • Affects memory usage more than parameter count
    • Current 8-token context balances efficiency and capability
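
The scaling behaviour described above can be checked with a small sweep over configurations using the TinyGPT sketch; the exact figures depend on the MLP shape and whether the input and output embeddings are tied, so treat the printed numbers as indicative only.

```python
# Parameter count as a function of embedding dimension and depth,
# using the TinyGPT sketch above.
def count_params(**kwargs):
    return sum(p.numel() for p in TinyGPT(**kwargs).parameters())

for embed_dim in (256, 512, 768, 1024):
    print(f"embed_dim={embed_dim:4d}: {count_params(embed_dim=embed_dim) / 1e6:.1f}M")

for num_layers in (1, 2, 4, 8):
    print(f"num_layers={num_layers}: {count_params(num_layers=num_layers) / 1e6:.1f}M")
```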

Training Results

  • Dataset: The Adventures of Sherlock Holmes (~145K tokens)
  • Training: 3 epochs
  • Loss Convergence: 9.0 → 3.5
  • Learning Rate: 1e-4 with the AdamW optimizer (a minimal training-loop sketch follows this list)
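
A minimal training loop under these settings might look like the sketch below; the corpus file name, batch size, and number of steps are illustrative assumptions, not the notebook's values.

```python
import tiktoken

# Tokenize the corpus with GPT-2's BPE tokenizer; the file name is an assumption.
enc = tiktoken.get_encoding("gpt2")
text = open("sherlock_holmes.txt", encoding="utf-8").read()
data = torch.tensor(enc.encode(text), dtype=torch.long)

block_size, batch_size = 8, 64
model = TinyGPT(block_size=block_size)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def get_batch():
    # Random windows of block_size tokens; targets are the same windows shifted by one.
    ix = torch.randint(len(data) - block_size - 1, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])
    return x, y

for step in range(1000):
    xb, yb = get_batch()
    logits = model(xb)  # (B, T, vocab_size)
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), yb.view(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % 100 == 0:
        print(step, round(loss.item(), 3))
```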

Example Outputs

Here are some sample generations from the trained model:

Prompt: "Holmes examined the evidence..."
"Holmes examined the evidence of those mysteries and improbabilities the whole thing was this wayward, and have been a girl. Turner had an only daughter of the late Ezekiah Hopkins, who was very gentleman in the habit of seeing him, and as his actions was a"

Prompt: "The mystery of..."
"The mystery of the door, which was quite a strong other ones. On the fourth day there came in." "Pray do so." "I did not help you how many singular that I had never seen upon the seat and was"

Prompt: "It was a dark and stormy night when..."
"It was a dark and stormy night when a task of repute in the country-town of Ross. A glass was, in a The Times every morning, with full justice in my cock-and-evident a thing as a dénouement of the little"

Prompt: "Dr. Watson said..."
"Dr. Watson said that we do not talk about day. He was in only ominous words with this wealth for which he ended with the last generation. I had begun to take its name a‚Colonel Lysander Stark's door and attacked it."

Usage

  1. Install dependencies:
pip install -r requirements.txt
  2. Open and run the Jupyter notebook:
jupyter notebook transformer_language_model.ipynb

Future Improvements

  1. Architecture Enhancements:

    • Increase context length for better coherence
    • Add more layers for improved capacity
    • Implement advanced sampling methods such as top-k and top-p (sketched after this list)
  2. Training Optimizations:

    • Learning rate scheduling
    • Gradient clipping
    • Larger batch sizes
  3. Generation Features:

    • Beam search decoding
    • Temperature control
    • Conditional generation
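
As one example of the sampling improvements listed above, the sketch below adds temperature scaling and top-k filtering on top of the TinyGPT model from earlier sections; the function name and default values are illustrative.

```python
@torch.no_grad()
def generate(model, enc, prompt, max_new_tokens=50, temperature=1.0, top_k=40):
    """Autoregressive sampling with temperature scaling and top-k filtering."""
    model.eval()
    idx = torch.tensor([enc.encode(prompt)], dtype=torch.long)
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -model.block_size:]             # crop to the context window
        logits = model(idx_cond)[:, -1, :] / temperature  # logits for the next token
        if top_k is not None:
            v, _ = torch.topk(logits, top_k)
            logits[logits < v[:, [-1]]] = float("-inf")   # keep only the k most likely tokens
        probs = F.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_token], dim=1)
    return enc.decode(idx[0].tolist())

# Example: generate(model, enc, "Holmes examined the evidence", temperature=0.8, top_k=40)
```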

License

This project is licensed under the MIT License - see the LICENSE file for details.
