LLM From Scratch

Project Structure Updated!

  • All Jupyter notebooks are now in the notebooks/ folder.
  • All Python scripts are in the scripts/ folder.
  • Data files (text, vocab) are in the data/ folder.
  • Model/tokenizer files are in the models/ folder.

This project is a hands-on exploration of building language models and tokenizers from scratch, with a focus on PyTorch and modern NLP techniques. It includes Jupyter notebooks, training scripts, and utilities for experimenting with bigram models, BPE tokenization, and GPT-style transformers.

Table of Contents

  • Project Structure
  • Requirements
  • Setup
  • Notebooks Overview
  • Scripts Overview
  • Tokenizer Training
  • Training and Chatbot Usage
  • Data
  • License

Project Structure

.
├── notebooks/
│   ├── bigram.ipynb           # Bigram language model notebook
│   ├── bpe-v1.ipynb           # Byte Pair Encoding (BPE) tokenizer notebook
│   ├── gpt-v1.ipynb           # GPT-style transformer (v1) notebook
│   ├── gpt-v2.ipynb           # GPT-style transformer (v2, with flash attention) notebook
│   └── torch-examples.ipynb   # PyTorch tensor and function examples
├── scripts/
│   ├── chatbot.py             # Command-line chatbot using a trained model
│   ├── training.py            # Script for training a transformer model
│   ├── data-extract-v2.py     # Data extraction utility (v2)
│   └── data-extract-v3.py     # Data extraction utility (v3)
├── data/
│   ├── wizard_of_oz.txt       # Example training text (Wizard of Oz)
│   └── vocab.txt              # Vocabulary file
├── models/
│   └── bpe_tokenizer.json     # Trained BPE tokenizer
├── requirements.txt           # Python dependencies
└── ...

Requirements

  • Python 3.8+
  • See requirements.txt for Python packages:
    • torch
    • pylzma
    • tokenizers
    • tqdm
    • transformers

Install dependencies with:

pip install -r requirements.txt

Setup

Clone the repository and install the requirements as above. Make sure you have your data files (e.g., data/wizard_of_oz.txt) in the data/ directory.


Notebooks Overview

  • notebooks/bigram.ipynb
    Build and train a simple bigram language model on data/wizard_of_oz.txt. Great for understanding the basics of language modeling; a minimal sketch of the idea follows this list.

  • notebooks/bpe-v1.ipynb
    Learn and train a Byte Pair Encoding (BPE) tokenizer from scratch, and experiment with HuggingFace tokenizers.

  • notebooks/gpt-v1.ipynb
    Implements a basic GPT-style transformer model in PyTorch, including training and text generation.

  • notebooks/gpt-v2.ipynb
    An advanced GPT-style transformer with flash attention and integration with HuggingFace's BERT tokenizer.

  • notebooks/torch-examples.ipynb
    A playground for PyTorch tensor operations, matrix math, and neural network building blocks.
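
A minimal sketch of the bigram idea, assuming a character-level setup like the notebook's (the class name and shapes here are illustrative, not necessarily identical to the notebook's code):

import torch
import torch.nn as nn
from torch.nn import functional as F

class BigramLanguageModel(nn.Module):
    """Predict the next token from the current token alone, via a
    vocab_size x vocab_size embedding table of next-token logits."""

    def __init__(self, vocab_size):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        logits = self.token_embedding_table(idx)  # (B, T, vocab_size)
        loss = None
        if targets is not None:
            B, T, C = logits.shape
            loss = F.cross_entropy(logits.view(B * T, C), targets.view(B * T))
        return logits, loss

    @torch.no_grad()
    def generate(self, idx, max_new_tokens):
        # Sample one token at a time from the distribution at the last position.
        for _ in range(max_new_tokens):
            logits, _ = self(idx)
            probs = F.softmax(logits[:, -1, :], dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, idx_next), dim=1)
        return idx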


Scripts Overview

  • scripts/chatbot.py
    Command-line chatbot interface. Loads a trained model and generates completions for user prompts.

  • scripts/training.py
    Script for training a transformer model on your dataset. Handles batching, the training loop, and model saving; see the training-loop sketch after this list.

  • scripts/data-extract-v2.py / scripts/data-extract-v3.py
    Utilities for extracting and preparing text data for training.
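
For orientation, here is a minimal sketch of the batch-sampling and training-loop pattern such a script follows. It reuses the BigramLanguageModel sketched above; the character-level encoding, hyperparameter values, and save path are assumptions, not necessarily what scripts/training.py does.

import torch

# Character-level encoding of the sample corpus (assumed setup).
text = open("data/wizard_of_oz.txt", encoding="utf-8").read()
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
data = torch.tensor([stoi[ch] for ch in text], dtype=torch.long)

block_size, batch_size, max_iters = 64, 32, 3000  # illustrative values

def get_batch():
    # Random context windows; y is x shifted one position to the right.
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])
    return x, y

model = BigramLanguageModel(len(chars))
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
for step in range(max_iters):
    xb, yb = get_batch()
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

torch.save(model.state_dict(), "models/bigram.pt")  # save path is illustrative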


Tokenizer Training

To train a BPE tokenizer on your data, use the code in notebooks/bpe-v1.ipynb (a minimal sketch follows the list below). This notebook demonstrates how to:

  • Sample a subset of your data for tokenizer training.
  • Train a BPE tokenizer using the tokenizers library.
  • Save and load the tokenizer for later use (see models/bpe_tokenizer.json).
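
A minimal sketch of that workflow with the tokenizers library; the vocabulary size, special tokens, and pre-tokenizer choice are assumptions, so check the notebook for the exact settings:

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Train a BPE tokenizer on the sample corpus (settings are illustrative).
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=5000, special_tokens=["[UNK]", "[PAD]"])
tokenizer.train(files=["data/wizard_of_oz.txt"], trainer=trainer)
tokenizer.save("models/bpe_tokenizer.json")

# Reload it later and encode text.
tokenizer = Tokenizer.from_file("models/bpe_tokenizer.json")
print(tokenizer.encode("Dorothy lived in the midst of the great Kansas prairies.").tokens)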

Training and Chatbot Usage

  1. Prepare your data
    Place your training text (e.g., data/wizard_of_oz.txt) in the data/ directory.

  2. Train a model
    Use scripts/training.py or the relevant notebook (e.g., notebooks/gpt-v1.ipynb) to train a model.
    Example:

    python scripts/training.py -batch_size 32
  3. Chat with your model
    After training, use scripts/chatbot.py to interact with your model; a sketch of the kind of prompt/generate loop it wraps appears below:

    python scripts/chatbot.py -batch_size 32
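
Under the hood, a chatbot like this is a small read/generate loop around a trained model. A minimal sketch, assuming the character-level stoi mapping and BigramLanguageModel from the sketches above (the real script loads whatever model you trained):

import torch

model = BigramLanguageModel(len(chars))
model.load_state_dict(torch.load("models/bigram.pt"))  # path is illustrative
model.eval()
itos = {i: ch for ch, i in stoi.items()}

while True:
    # Prompt characters must appear in the training corpus's character set.
    prompt = input("Prompt:\n")
    idx = torch.tensor([[stoi[ch] for ch in prompt]], dtype=torch.long)
    out = model.generate(idx, max_new_tokens=150)[0].tolist()
    print("Completion:\n" + "".join(itos[i] for i in out))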

Data

  • data/wizard_of_oz.txt is provided as a sample dataset.
  • data/vocab.txt is a vocabulary file generated from your data; a sketch of one way it can be produced follows this list.
  • You can use your own text files for training and tokenizer creation.
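
One plausible way a character-level vocab.txt gets produced is by collecting the unique characters of the corpus; a sketch under that assumption (the actual extraction scripts may differ):

# Write the corpus's unique characters, deduplicated and sorted, to data/vocab.txt.
text = open("data/wizard_of_oz.txt", encoding="utf-8").read()
vocab = sorted(set(text))
with open("data/vocab.txt", "w", encoding="utf-8") as f:
    f.write("".join(vocab))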

License

This project is for educational and research purposes.
Feel free to modify and experiment!
