This project is a complete, from-scratch implementation of a GPT-like transformer model, trained to generate text in the fictional High Valyrian language (from the Game of Thrones universe). It was created as a learning project to deeply understand:
- The inner workings of Large Language Models (LLMs)
- Every important component: tokenizer, dataloader, transformer, training tricks, inference techniques
- How GPT models are really built under the hood, without relying on high-level libraries like HuggingFace
✅ Goal: Build everything by hand, and learn by building.
✅ Target Audience: Anyone curious about LLM internals, researchers, students, and engineers.
| Tool/Library | Why we used it |
|---|---|
| Python | For clean, flexible model prototyping |
| PyTorch | To manually implement the Transformer architecture with full control over layers |
| Weights & Biases (wandb) | For tracking training metrics and loss curves and debugging models easily |
| TorchScript | To export the final trained model for optimized production/inference use |
| Matplotlib (optional) | For visualization (loss curves, attention maps in future extensions) |
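As an illustration of how wandb fits into the workflow, here is a minimal logging sketch. The project name, run name, and metric keys are placeholders, not necessarily what this repository uses:

```python
import math

import wandb

# Hypothetical sketch: project name, run name, and metric keys are placeholders.
run = wandb.init(project="high-valyrian-minigpt", name="readme-example")

for step in range(100):
    fake_loss = 5.0 * math.exp(-step / 30)  # stand-in for a real training loss
    wandb.log({"train/loss": fake_loss}, step=step)

run.finish()
```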
- Byte Pair Encoding (BPE) tokenizer (custom)
- Causal Self-Attention layer
- Multi-Head Attention with masking
- Transformer Block with pre-layernorm and MLP
- Positional Embeddings
- GPT Decoder Head (Linear layer to predict next token)
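To make the components above concrete, here is a minimal sketch of a pre-layernorm Transformer block with causal multi-head attention, in the spirit of this project. Class and variable names are illustrative, not the repository's exact code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CausalSelfAttention(nn.Module):
    """Multi-head self-attention with a causal mask (no peeking at future tokens)."""

    def __init__(self, n_embd: int, n_head: int, block_size: int, dropout: float):
        super().__init__()
        assert n_embd % n_head == 0
        self.n_head = n_head
        self.qkv = nn.Linear(n_embd, 3 * n_embd)   # project to queries, keys, values
        self.proj = nn.Linear(n_embd, n_embd)      # output projection
        self.drop = nn.Dropout(dropout)
        mask = torch.tril(torch.ones(block_size, block_size))
        self.register_buffer("mask", mask.view(1, 1, block_size, block_size))

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=2)
        # reshape to (batch, heads, time, head_dim)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        att = (q @ k.transpose(-2, -1)) / (k.size(-1) ** 0.5)
        att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float("-inf"))
        att = self.drop(F.softmax(att, dim=-1))
        y = (att @ v).transpose(1, 2).contiguous().view(B, T, C)
        return self.drop(self.proj(y))


class Block(nn.Module):
    """Pre-layernorm block: LN -> attention -> residual, then LN -> MLP -> residual."""

    def __init__(self, n_embd: int, n_head: int, block_size: int, dropout: float):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.attn = CausalSelfAttention(n_embd, n_head, block_size, dropout)
        self.ln2 = nn.LayerNorm(n_embd)
        self.mlp = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd), nn.GELU(),
            nn.Linear(4 * n_embd, n_embd), nn.Dropout(dropout),
        )

    def forward(self, x):
        x = x + self.attn(self.ln1(x))  # normalize before attention (pre-LN)
        x = x + self.mlp(self.ln2(x))   # normalize before the MLP (pre-LN)
        return x
```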
| Hyperparameter | Value |
|---|---|
| Embedding dimension | 128 |
| Number of heads | 4 |
| Number of layers (Transformer blocks) | 4 |
| Dropout | 0.2 |
| Vocabulary size | ~500 tokens (custom BPE tokenizer) |
| Context length (block size) | 64 tokens |
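These hyperparameters map naturally onto a small configuration object; the sketch below shows one plausible way to group them (field names are illustrative, not the repository's exact ones):

```python
from dataclasses import dataclass


@dataclass
class GPTConfig:
    # Values taken from the table above; field names are illustrative.
    n_embd: int = 128       # embedding dimension
    n_head: int = 4         # attention heads
    n_layer: int = 4        # transformer blocks
    dropout: float = 0.2
    vocab_size: int = 500   # approximate size of the custom BPE vocabulary
    block_size: int = 64    # context length in tokens
```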
| Feature | GPT-2 (original) | High Valyrian MiniGPT (this project) |
|---|---|---|
| Model size | 117M+ parameters | ~1.5M parameters |
| Tokenizer | Trained on WebText (massive) | Trained on a small High Valyrian corpus |
| Layers/Heads | 12 layers, 12 heads | 4 layers, 4 heads |
| Dropout | Yes | Yes |
| Weight initialization | Gaussian normal | Same (manual) |
| LayerNorm | Post-attention | Pre-attention (better for small models) |
| Training tricks | AdamW, cosine decay | Adam, weight decay, early stopping |
| Scaling | Billion-token corpus | Small educational corpus |
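For a sense of what "Adam, weight decay, early stopping" looks like in practice, here is an illustrative PyTorch training loop. It assumes a `model(x, y)` that returns `(logits, loss)` and a `get_batch(split)` helper returning `(inputs, targets)`; neither is necessarily this repo's exact API:

```python
import torch


def train(model, get_batch, max_steps=5000, eval_every=250, patience=4, lr=3e-4):
    """Illustrative loop: Adam with weight decay plus early stopping on validation loss."""
    opt = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=0.01)
    best_val, stale = float("inf"), 0
    for step in range(max_steps):
        x, y = get_batch("train")
        _, loss = model(x, y)
        opt.zero_grad(set_to_none=True)
        loss.backward()
        opt.step()
        if step % eval_every == 0:
            model.eval()
            with torch.no_grad():
                xv, yv = get_batch("val")
                _, val_loss = model(xv, yv)
            model.train()
            if val_loss < best_val:
                best_val, stale = val_loss, 0
                torch.save(model.state_dict(), "best.pt")  # keep the best checkpoint
            else:
                stale += 1
                if stale >= patience:  # early stopping: no improvement for a while
                    break
```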
✅ Architecturally faithful to GPT-2, but scaled down for fast training and clarity.
- True causal masked attention (can't see future tokens)
- Custom top-p (nucleus) sampling during inference (see the sketch after this list)
- Full training, evaluation, checkpointing, and wandb logging
- Exportable via TorchScript for future deployment
- Lightweight and easy to train on a single GPU
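Nucleus (top-p) sampling keeps only the smallest set of tokens whose cumulative probability exceeds `p`, renormalizes, and samples from that set. A minimal sketch, with an illustrative function name and defaults rather than the repo's exact code:

```python
import torch
import torch.nn.functional as F


def sample_top_p(logits: torch.Tensor, p: float = 0.9, temperature: float = 1.0) -> torch.Tensor:
    """Sample one token id from logits of shape (vocab_size,) with nucleus (top-p) sampling."""
    probs = F.softmax(logits / temperature, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Drop tokens once the probability mass *before* them already exceeds p
    # (this always keeps at least the most likely token).
    cutoff = cumulative - sorted_probs > p
    sorted_probs[cutoff] = 0.0
    sorted_probs = sorted_probs / sorted_probs.sum()
    next_sorted = torch.multinomial(sorted_probs, num_samples=1)
    return sorted_idx[next_sorted]
```

Temperature scales the logits before the softmax, so lower values sharpen the distribution while higher values flatten it.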
You can see the full training progress here:
🔗 Project WandB Dashboard
(includes loss curves, model samples during training, and metrics)
1. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

2. Prepare the dataset:

   - Add your corpus file to `data/High Valyrian.txt`.

3. Train the BPE tokenizer:

   ```bash
   python tokenizer/train_tokenizer.py
   ```

4. Train the model:

   ```bash
   python train.py
   ```

5. Generate text:

   ```bash
   python inference/generate.py --prompt "valar morghulis"
   ```

6. Export to TorchScript:

   ```bash
   python scripts/export_torchscript.py
   ```
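TorchScript export usually boils down to tracing or scripting the trained model and saving the result. The sketch below shows the general pattern only, using a tiny stand-in module so it runs on its own; it is not the contents of `scripts/export_torchscript.py`:

```python
import torch
import torch.nn as nn

# Stand-in module so this snippet is self-contained; in the project the
# traced object would be the trained MiniGPT loaded from a checkpoint.
model = nn.Embedding(500, 128)
model.eval()

example_tokens = torch.randint(0, 500, (1, 64))     # (batch, block_size) dummy input
scripted = torch.jit.trace(model, example_tokens)   # torch.jit.script(model) also works
scripted.save("minigpt_valyrian.ts")                 # reload later with torch.jit.load(...)
```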
- Building LLMs is about small careful engineering choices: masking, layer ordering, normalization, sampling.
- Training tricks (dropout, optimizer tweaks) make a huge difference.
- Inference matters as much as training — generation quality is the true test.
- Tooling (like wandb) is essential to debug and understand model behavior.
- Scaling up (to bigger GPTs) is mostly about data, compute, and model depth — the core principles stay the same.
This project demonstrates not just how to use LLMs, but how to build them — layer by layer, loss by loss, token by token.
If you truly understand something, you can build it yourself.
That is the spirit of this project. 👨‍🔧🔥