This project demonstrates how to build, train, and deploy a small GPT (Generative Pre-trained Transformer) language model from scratch using PyTorch. The model is trained on a short story and can generate text in a similar style.
Note: This code is adapted from Sebastian Raschka's book "Build a Large Language Model From Scratch" (2023).
This project includes:
- A custom GPT implementation with transformer architecture
- Training pipeline with a sliding window approach
- Text generation with temperature control and top-k sampling
- A web interface using Chainlit for interactive text generation
The plot above shows the training and validation loss over the course of training. The x-axis shows both epochs and tokens seen; the y-axis shows the cross-entropy loss. Lower loss indicates better performance: as training progresses, the model learns to predict the next token in the story more accurately, so the loss decreases.
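Under the hood, this loss is the standard next-token cross-entropy. Below is a minimal sketch of how a per-batch loss is typically computed; the function name and details are illustrative and may differ from the notebook:

```python
import torch

def calc_loss_batch(input_batch, target_batch, model, device):
    # input_batch / target_batch: (batch_size, context_length) token IDs,
    # where targets are the inputs shifted one position to the right
    input_batch = input_batch.to(device)
    target_batch = target_batch.to(device)
    logits = model(input_batch)  # (batch_size, context_length, vocab_size)
    # Flatten batch and sequence dimensions before applying cross-entropy
    return torch.nn.functional.cross_entropy(
        logits.flatten(0, 1), target_batch.flatten()
    )
```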
This plot demonstrates how temperature scaling affects token probability distributions:
- Low temperature (0.1): Creates a sharper distribution where the model consistently selects the most likely tokens, resulting in more deterministic but potentially repetitive text
- Medium temperature (1.0): Provides a balanced distribution that follows the model's learned probabilities
- High temperature (5.0): Creates a more uniform distribution, increasing randomness and diversity in generated text but potentially reducing coherence
Temperature scaling is a key technique implemented in the project for controlling the creativity versus coherence tradeoff in text generation.
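Conceptually, temperature scaling just divides the logits by the temperature before applying softmax. A minimal sketch:

```python
import torch

def softmax_with_temperature(logits, temperature):
    # temperature < 1.0 sharpens the distribution (more deterministic);
    # temperature > 1.0 flattens it (more random)
    return torch.softmax(logits / temperature, dim=-1)

logits = torch.tensor([4.0, 2.0, 1.0])
for t in (0.1, 1.0, 5.0):
    print(t, softmax_with_temperature(logits, t))
```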
- `pretrain.ipynb`: Jupyter notebook that contains the model training code and detailed explanations
- `gpt.py`: Core GPT model implementation including architecture and training utilities
- `utils.py`: Helper functions for text generation and tokenization
- `app.py`: Chainlit web application for interactive text generation
- `dataset/story.txt`: The dataset used to train the model
- `pretrain-story-model.pth`: Pre-trained model weights
- Custom GPT Architecture: Implements a small but complete GPT model with multiple transformer blocks
- Sliding Window Training: Uses a sliding window approach to create training examples from the text (see the sketch after this list)
- Decoding Strategies: Implements temperature scaling and top-k sampling for controlled text generation
- Interactive Interface: Web interface for text generation using the pre-trained model
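To illustrate the sliding-window idea, here is a minimal sketch of a dataset that turns tokenized text into overlapping input/target pairs. It mirrors the common approach from Raschka's book; the exact code in this project may differ:

```python
import torch
from torch.utils.data import Dataset

class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        token_ids = tokenizer.encode(txt)
        self.input_ids = []
        self.target_ids = []
        # Slide a window of max_length tokens over the text;
        # the target is the input shifted right by one token
        for i in range(0, len(token_ids) - max_length, stride):
            self.input_ids.append(torch.tensor(token_ids[i:i + max_length]))
            self.target_ids.append(torch.tensor(token_ids[i + 1:i + max_length + 1]))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]
```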
- Python 3.8+
- PyTorch
- tiktoken
- chainlit
```bash
pip install torch tiktoken chainlit
```
Model training is done in the `pretrain.ipynb` notebook. You can run it to:
- Initialize the GPT model
- Load and prepare the text data
- Train the model on the story dataset
- Evaluate model performance
- Generate text with the trained model
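In outline, the training step follows the standard PyTorch pattern. A simplified sketch, reusing the hypothetical `calc_loss_batch` shown earlier:

```python
def train_simple(model, train_loader, optimizer, device, num_epochs):
    model.to(device)
    model.train()
    for epoch in range(num_epochs):
        for input_batch, target_batch in train_loader:
            optimizer.zero_grad()
            loss = calc_loss_batch(input_batch, target_batch, model, device)
            loss.backward()   # compute gradients
            optimizer.step()  # update weights
        print(f"Epoch {epoch + 1}: last batch loss {loss.item():.3f}")
```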
To interact with the trained model through a web interface:
```bash
chainlit run app.py
```
This will start a local web server where you can input prompts and receive generated text from the model.
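For orientation, a Chainlit app in this style boils down to a message handler that feeds the user's prompt to `generate`. The following is a hedged sketch, not the project's exact `app.py`:

```python
import chainlit as cl
import tiktoken
import torch

from gpt import GPTModel
from utils import generate, text_to_token_ids, token_ids_to_text

# Configuration as listed in the next section of this README
GPT_CONFIG_124M = {
    "vocab_size": 50257, "context_length": 256, "emb_dim": 768,
    "n_heads": 12, "n_layers": 12, "drop_rate": 0.1, "qkv_bias": False,
}

tokenizer = tiktoken.get_encoding("gpt2")
model = GPTModel(GPT_CONFIG_124M)
model.load_state_dict(torch.load("pretrain-story-model.pth", map_location="cpu"))
model.eval()

@cl.on_message
async def main(message: cl.Message):
    token_ids = generate(
        model=model,
        idx=text_to_token_ids(message.content, tokenizer),
        max_new_tokens=50,
        context_size=256,
        top_k=10,
        temperature=0.8,
    )
    await cl.Message(content=token_ids_to_text(token_ids, tokenizer)).send()
```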
The model uses the following configuration:
```python
GPT_CONFIG_124M = {
    "vocab_size": 50257,     # Vocabulary size
    "context_length": 256,   # Context length
    "emb_dim": 768,          # Embedding dimension
    "n_heads": 12,           # Number of attention heads
    "n_layers": 12,          # Number of layers
    "drop_rate": 0.1,        # Dropout rate
    "qkv_bias": False        # Query-key-value bias
}
```
The project implements two text generation methods:
- Basic generation (`generate_text_simple`): uses argmax to pick the most likely next token
- Advanced generation (`generate`): supports temperature scaling and top-k sampling for more diverse outputs
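Internally, one decoding step of the advanced method keeps only the top-k logits, applies temperature, then samples. A minimal sketch; the actual `generate` in `utils.py` may differ in detail:

```python
import torch

def sample_next_token(logits, top_k=10, temperature=0.8):
    # logits: (vocab_size,) scores for the next token
    if top_k is not None:
        top_logits, _ = torch.topk(logits, top_k)
        # Mask out everything below the k-th largest logit
        logits = torch.where(
            logits < top_logits[-1],
            torch.tensor(float("-inf"), device=logits.device),
            logits,
        )
    if temperature > 0.0:
        probs = torch.softmax(logits / temperature, dim=-1)
        return torch.multinomial(probs, num_samples=1)   # stochastic sampling
    return torch.argmax(logits, dim=-1, keepdim=True)    # greedy fallback
```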
Example of generating text:
```python
import tiktoken
import torch

from gpt import GPTModel
from utils import generate, text_to_token_ids, token_ids_to_text

# Load model (GPT_CONFIG_124M is the configuration dict shown above)
model = GPTModel(GPT_CONFIG_124M)
model.load_state_dict(torch.load("pretrain-story-model.pth", map_location="cpu"))
model.eval()

# Generate text
tokenizer = tiktoken.get_encoding("gpt2")
token_ids = generate(
    model=model,
    idx=text_to_token_ids("Everyone knew who the Babbingtons were.", tokenizer),
    max_new_tokens=50,
    context_size=256,
    top_k=10,
    temperature=0.8
)

# Convert token IDs back to text
generated_text = token_ids_to_text(token_ids, tokenizer)
print(generated_text)
```
This project demonstrates concepts like:
- Building transformer-based language models
- Cross-entropy loss and perplexity
- Implementing temperature scaling and top-k sampling
- Training language models on custom datasets
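On the second point, perplexity is simply the exponential of the cross-entropy loss, which makes it easy to interpret:

```python
import torch

loss = torch.tensor(0.5)   # example cross-entropy value
perplexity = torch.exp(loss)
print(perplexity)          # tensor(1.6487): the model is effectively
                           # choosing among ~1.65 equally likely tokens
```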
This project is based on educational materials and should be used for learning purposes only.