Transformer Language Model from Scratch


A complete implementation of the Transformer architecture based on the "Attention Is All You Need" paper by Vaswani et al. (2017), trained on The Adventures of Sherlock Holmes to generate coherent, Victorian-era prose.

Overview

This project implements a GPT-style Transformer model from scratch using PyTorch, demonstrating the following components (a minimal sketch ties them together after the list):

  • Multi-Head Self-Attention: The core mechanism that allows the model to attend to different parts of the input sequence
  • Positional Encoding: Learned position embeddings to understand sequence order
  • Layer Normalization & Residual Connections: Pre-LN design for better gradient flow
  • Feedforward Networks: Three-layer MLP with GELU activation
  • Subword Tokenization: Using tiktoken (GPT-2's tokenizer) for efficient text processing
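
To make these components concrete, here is a minimal sketch of a pre-LN decoder block with causal multi-head self-attention and a GELU feedforward network. The class names, the 4x hidden size, and the omission of dropout are simplifying assumptions; the notebook's implementation may differ in detail.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CausalSelfAttention(nn.Module):
    """Multi-head self-attention with a causal (lower-triangular) mask."""

    def __init__(self, embed_dim: int, num_heads: int, block_size: int):
        super().__init__()
        assert embed_dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.qkv = nn.Linear(embed_dim, 3 * embed_dim)
        self.proj = nn.Linear(embed_dim, embed_dim)
        # Each position may only attend to itself and earlier positions.
        self.register_buffer("mask", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=2)
        # Split the embedding into num_heads independent heads: (B, heads, T, head_dim).
        q = q.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        att = (q @ k.transpose(-2, -1)) / (self.head_dim ** 0.5)
        att = att.masked_fill(self.mask[:T, :T] == 0, float("-inf"))
        att = F.softmax(att, dim=-1)
        out = (att @ v).transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(out)


class TransformerBlock(nn.Module):
    """Pre-LN block: normalize before each sub-layer, add the residual after it."""

    def __init__(self, embed_dim: int, num_heads: int, block_size: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(embed_dim)
        self.attn = CausalSelfAttention(embed_dim, num_heads, block_size)
        self.ln2 = nn.LayerNorm(embed_dim)
        # Feedforward network (input -> hidden -> output) with GELU activation.
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, 4 * embed_dim),
            nn.GELU(),
            nn.Linear(4 * embed_dim, embed_dim),
        )

    def forward(self, x):
        x = x + self.attn(self.ln1(x))  # residual around attention
        x = x + self.mlp(self.ln2(x))   # residual around the feedforward network
        return x
```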

Model Architecture

The model follows the decoder-only Transformer architecture with the following configuration (assembled in the sketch after this list):

  • Embedding Dimension: 256 (32 dimensions per head)
  • Attention Heads: 8
  • Layers: 2 Transformer blocks
  • Context Length: 8 tokens (configurable)
  • Parameters: 27.4M total parameters
  • Tokenizer: GPT-2 tokenizer (50,257 tokens)
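
Continuing the sketch above, the full model can be assembled with exactly these hyperparameters; the class name, bias terms, and untied output head are assumptions rather than the notebook's code.

```python
class TinyGPT(nn.Module):
    """Decoder-only language model with the hyperparameters listed above."""

    def __init__(self, vocab_size=50257, embed_dim=256, num_heads=8,
                 num_layers=2, block_size=8):
        super().__init__()
        self.block_size = block_size
        self.token_emb = nn.Embedding(vocab_size, embed_dim)
        self.pos_emb = nn.Embedding(block_size, embed_dim)  # learned position embeddings
        self.blocks = nn.ModuleList(
            [TransformerBlock(embed_dim, num_heads, block_size) for _ in range(num_layers)]
        )
        self.ln_f = nn.LayerNorm(embed_dim)
        self.lm_head = nn.Linear(embed_dim, vocab_size)

    def forward(self, idx):
        B, T = idx.shape
        pos = torch.arange(T, device=idx.device)
        x = self.token_emb(idx) + self.pos_emb(pos)  # (B, T, embed_dim)
        for block in self.blocks:
            x = block(x)
        return self.lm_head(self.ln_f(x))  # (B, T, vocab_size) logits


model = TinyGPT()
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters")  # roughly 27.4M under these assumptions
```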

Parameter Scaling Analysis

Our analysis highlights several points about model scaling (a parameter-count sweep sketch follows the list):

  1. Embedding Dimension: Most significant impact on model size

    • Linear scaling from ~15M to ~125M params (200d to 1000d)
    • Current 256d provides good balance
  2. Number of Heads: Minimal impact on parameter count

    • Flat relationship around 27.4M parameters
    • 8 heads optimal for current architecture
  3. Number of Layers: Linear parameter scaling

    • Each layer adds ~4M parameters
    • 2 layers sufficient for basic language modeling
  4. Block Size: Minimal parameter impact

    • Affects memory usage more than parameter count
    • Current 8-token context balances efficiency and capability
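
The scaling behaviour described above can be checked with a small sweep over configurations using the TinyGPT sketch; the exact figures depend on the MLP shape and whether the input and output embeddings are tied, so treat the printed numbers as indicative only.

```python
# Parameter count as a function of embedding dimension and depth,
# using the TinyGPT sketch above.
def count_params(**kwargs):
    return sum(p.numel() for p in TinyGPT(**kwargs).parameters())

for embed_dim in (256, 512, 768, 1024):
    print(f"embed_dim={embed_dim:4d}: {count_params(embed_dim=embed_dim) / 1e6:.1f}M")

for num_layers in (1, 2, 4, 8):
    print(f"num_layers={num_layers}: {count_params(num_layers=num_layers) / 1e6:.1f}M")
```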

Training Results

  • Dataset: The Adventures of Sherlock Holmes (~145K tokens)
  • Training: 3 epochs
  • Loss Convergence: 9.0 → 3.5
  • Learning Rate: 1e-4 with the AdamW optimizer (a minimal training-loop sketch follows this list)
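
A minimal training loop under these settings might look like the sketch below; the corpus file name, batch size, and number of steps are illustrative assumptions, not the notebook's values.

```python
import tiktoken

# Tokenize the corpus with GPT-2's BPE tokenizer; the file name is an assumption.
enc = tiktoken.get_encoding("gpt2")
text = open("sherlock_holmes.txt", encoding="utf-8").read()
data = torch.tensor(enc.encode(text), dtype=torch.long)

block_size, batch_size = 8, 64
model = TinyGPT(block_size=block_size)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def get_batch():
    # Random windows of block_size tokens; targets are the same windows shifted by one.
    ix = torch.randint(len(data) - block_size - 1, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])
    return x, y

for step in range(1000):
    xb, yb = get_batch()
    logits = model(xb)  # (B, T, vocab_size)
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), yb.view(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % 100 == 0:
        print(step, round(loss.item(), 3))
```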

Example Outputs

Here are some sample generations from the trained model:

Prompt: "Holmes examined the evidence..."
"Holmes examined the evidence of those mysteries and improbabilities the whole thing was this wayward, and have been a girl. Turner had an only daughter of the late Ezekiah Hopkins, who was very gentleman in the habit of seeing him, and as his actions was a"

Prompt: "The mystery of..."
"The mystery of the door, which was quite a strong other ones. On the fourth day there came in." "Pray do so." "I did not help you how many singular that I had never seen upon the seat and was"

Prompt: "It was a dark and stormy night when..."
"It was a dark and stormy night when a task of repute in the country-town of Ross. A glass was, in a The Times every morning, with full justice in my cock-and-evident a thing as a dénouement of the little"

Prompt: "Dr. Watson said..."
"Dr. Watson said that we do not talk about day. He was in only ominous words with this wealth for which he ended with the last generation. I had begun to take its name a‚Colonel Lysander Stark's door and attacked it."

Usage

  1. Install dependencies:
pip install -r requirements.txt
  2. Open and run the Jupyter notebook:
jupyter notebook transformer_language_model.ipynb

Future Improvements

  1. Architecture Enhancements:

    • Increase context length for better coherence
    • Add more layers for improved capacity
    • Implement advanced sampling methods such as top-k and top-p (sketched after this list)
  2. Training Optimizations:

    • Learning rate scheduling
    • Gradient clipping
    • Larger batch sizes
  3. Generation Features:

    • Beam search decoding
    • Temperature control
    • Conditional generation
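
As one example of the sampling improvements listed above, the sketch below adds temperature scaling and top-k filtering on top of the TinyGPT model from earlier sections; the function name and default values are illustrative.

```python
@torch.no_grad()
def generate(model, enc, prompt, max_new_tokens=50, temperature=1.0, top_k=40):
    """Autoregressive sampling with temperature scaling and top-k filtering."""
    model.eval()
    idx = torch.tensor([enc.encode(prompt)], dtype=torch.long)
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -model.block_size:]             # crop to the context window
        logits = model(idx_cond)[:, -1, :] / temperature  # logits for the next token
        if top_k is not None:
            v, _ = torch.topk(logits, top_k)
            logits[logits < v[:, [-1]]] = float("-inf")   # keep only the k most likely tokens
        probs = F.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_token], dim=1)
    return enc.decode(idx[0].tolist())

# Example: generate(model, enc, "Holmes examined the evidence", temperature=0.8, top_k=40)
```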

License

This project is licensed under the MIT License - see the LICENSE file for details.
