This project demonstrates how to build, train, and deploy a small GPT (Generative Pre-trained Transformer) language model from scratch using PyTorch. The model is trained on a short story and can generate text in a similar style.
Note: This code is adapted from Sebastian Raschka's book "Build a Large Language Model From Scratch" (2023).
This project includes:
- A custom GPT implementation with transformer architecture
- Training pipeline with a sliding window approach
- Text generation with temperature control and top-k sampling
- A web interface using Chainlit for interactive text generation
The plot above shows the training and validation loss over the course of training. The x-axis shows both epochs and tokens seen; the y-axis shows the cross-entropy loss. Lower loss indicates better performance: as training progresses, the model learns to predict the next token in the story more accurately, so the loss decreases.
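Under the hood, this loss is the standard next-token cross-entropy. Below is a minimal sketch of how a per-batch loss is typically computed; the function name and details are illustrative and may differ from the notebook:

```python
import torch

def calc_loss_batch(input_batch, target_batch, model, device):
    # input_batch / target_batch: (batch_size, context_length) token IDs,
    # where targets are the inputs shifted one position to the right
    input_batch = input_batch.to(device)
    target_batch = target_batch.to(device)
    logits = model(input_batch)  # (batch_size, context_length, vocab_size)
    # Flatten batch and sequence dimensions before applying cross-entropy
    return torch.nn.functional.cross_entropy(
        logits.flatten(0, 1), target_batch.flatten()
    )
```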
This plot demonstrates how temperature scaling affects token probability distributions:
- Low temperature (0.1): Creates a sharper distribution where the model consistently selects the most likely tokens, resulting in more deterministic but potentially repetitive text
- Medium temperature (1.0): Provides a balanced distribution that follows the model's learned probabilities
- High temperature (5.0): Creates a more uniform distribution, increasing randomness and diversity in generated text but potentially reducing coherence
Temperature scaling is a key technique implemented in the project for controlling the creativity versus coherence tradeoff in text generation.
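Conceptually, temperature scaling just divides the logits by the temperature before applying softmax. A minimal sketch:

```python
import torch

def softmax_with_temperature(logits, temperature):
    # temperature < 1.0 sharpens the distribution (more deterministic);
    # temperature > 1.0 flattens it (more random)
    return torch.softmax(logits / temperature, dim=-1)

logits = torch.tensor([4.0, 2.0, 1.0])
for t in (0.1, 1.0, 5.0):
    print(t, softmax_with_temperature(logits, t))
```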
- `pretrain.ipynb`: Jupyter notebook that contains the model training code and detailed explanations
- `gpt.py`: Core GPT model implementation including architecture and training utilities
- `utils.py`: Helper functions for text generation and tokenization
- `app.py`: Chainlit web application for interactive text generation
- `dataset/story.txt`: The dataset used to train the model
- `pretrain-story-model.pth`: Pre-trained model weights
- Custom GPT Architecture: Implements a small but complete GPT model with multiple transformer blocks
- Sliding Window Training: Uses a sliding window approach to create training examples from the text (see the sketch after this list)
- Decoding Strategies: Implements temperature scaling and top-k sampling for controlled text generation
- Interactive Interface: Web interface for text generation using the pre-trained model
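To illustrate the sliding-window idea, here is a minimal sketch of a dataset that turns tokenized text into overlapping input/target pairs. It mirrors the common approach from Raschka's book; the exact code in this project may differ:

```python
import torch
from torch.utils.data import Dataset

class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        token_ids = tokenizer.encode(txt)
        self.input_ids = []
        self.target_ids = []
        # Slide a window of max_length tokens over the text;
        # the target is the input shifted right by one token
        for i in range(0, len(token_ids) - max_length, stride):
            self.input_ids.append(torch.tensor(token_ids[i:i + max_length]))
            self.target_ids.append(torch.tensor(token_ids[i + 1:i + max_length + 1]))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]
```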
- Python 3.8+
- PyTorch
- tiktoken
- chainlit
```bash
pip install torch tiktoken chainlit
```
Model training is done in the `pretrain.ipynb` notebook. You can run it to:
- Initialize the GPT model
- Load and prepare the text data
- Train the model on the story dataset
- Evaluate model performance
- Generate text with the trained model
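In outline, the training step follows the standard PyTorch pattern. A simplified sketch, reusing the hypothetical `calc_loss_batch` shown earlier:

```python
def train_simple(model, train_loader, optimizer, device, num_epochs):
    model.to(device)
    model.train()
    for epoch in range(num_epochs):
        for input_batch, target_batch in train_loader:
            optimizer.zero_grad()
            loss = calc_loss_batch(input_batch, target_batch, model, device)
            loss.backward()   # compute gradients
            optimizer.step()  # update weights
        print(f"Epoch {epoch + 1}: last batch loss {loss.item():.3f}")
```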
To interact with the trained model through a web interface:
```bash
chainlit run app.py
```
This will start a local web server where you can input prompts and receive generated text from the model.
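For orientation, a Chainlit app in this style boils down to a message handler that feeds the user's prompt to `generate`. The following is a hedged sketch, not the project's exact `app.py`:

```python
import chainlit as cl
import tiktoken
import torch

from gpt import GPTModel
from utils import generate, text_to_token_ids, token_ids_to_text

# Configuration as listed in the next section of this README
GPT_CONFIG_124M = {
    "vocab_size": 50257, "context_length": 256, "emb_dim": 768,
    "n_heads": 12, "n_layers": 12, "drop_rate": 0.1, "qkv_bias": False,
}

tokenizer = tiktoken.get_encoding("gpt2")
model = GPTModel(GPT_CONFIG_124M)
model.load_state_dict(torch.load("pretrain-story-model.pth", map_location="cpu"))
model.eval()

@cl.on_message
async def main(message: cl.Message):
    token_ids = generate(
        model=model,
        idx=text_to_token_ids(message.content, tokenizer),
        max_new_tokens=50,
        context_size=256,
        top_k=10,
        temperature=0.8,
    )
    await cl.Message(content=token_ids_to_text(token_ids, tokenizer)).send()
```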
The model uses the following configuration:
```python
GPT_CONFIG_124M = {
    "vocab_size": 50257,     # Vocabulary size
    "context_length": 256,   # Context length
    "emb_dim": 768,          # Embedding dimension
    "n_heads": 12,           # Number of attention heads
    "n_layers": 12,          # Number of layers
    "drop_rate": 0.1,        # Dropout rate
    "qkv_bias": False        # Query-key-value bias
}
```
The project implements two text generation methods:
- Basic generation (`generate_text_simple`): uses argmax to pick the most likely next token
- Advanced generation (`generate`): supports temperature scaling and top-k sampling for more diverse outputs
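Internally, one decoding step of the advanced method keeps only the top-k logits, applies temperature, then samples. A minimal sketch; the actual `generate` in `utils.py` may differ in detail:

```python
import torch

def sample_next_token(logits, top_k=10, temperature=0.8):
    # logits: (vocab_size,) scores for the next token
    if top_k is not None:
        top_logits, _ = torch.topk(logits, top_k)
        # Mask out everything below the k-th largest logit
        logits = torch.where(
            logits < top_logits[-1],
            torch.tensor(float("-inf"), device=logits.device),
            logits,
        )
    if temperature > 0.0:
        probs = torch.softmax(logits / temperature, dim=-1)
        return torch.multinomial(probs, num_samples=1)   # stochastic sampling
    return torch.argmax(logits, dim=-1, keepdim=True)    # greedy fallback
```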
Example of generating text:
```python
import tiktoken
import torch

from gpt import GPTModel
from utils import generate, text_to_token_ids, token_ids_to_text

# Load model (GPT_CONFIG_124M is the configuration dict shown above)
model = GPTModel(GPT_CONFIG_124M)
model.load_state_dict(torch.load("pretrain-story-model.pth", map_location="cpu"))
model.eval()

# Generate text
tokenizer = tiktoken.get_encoding("gpt2")
token_ids = generate(
    model=model,
    idx=text_to_token_ids("Everyone knew who the Babbingtons were.", tokenizer),
    max_new_tokens=50,
    context_size=256,
    top_k=10,
    temperature=0.8
)

# Convert token IDs back to text
generated_text = token_ids_to_text(token_ids, tokenizer)
print(generated_text)
```
This project demonstrates concepts like:
- Building transformer-based language models
- Cross-entropy loss and perplexity
- Implementing temperature scaling and top-k sampling
- Training language models on custom datasets
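On the second point, perplexity is simply the exponential of the cross-entropy loss, which makes it easy to interpret:

```python
import torch

loss = torch.tensor(0.5)   # example cross-entropy value
perplexity = torch.exp(loss)
print(perplexity)          # tensor(1.6487): the model is effectively
                           # choosing among ~1.65 equally likely tokens
```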
This project is based on educational materials and should be used for learning purposes only.