## Project Structure Updated!

- All Jupyter notebooks are now in the `notebooks/` folder.
- All Python scripts are in the `scripts/` folder.
- Data files (text, vocab) are in the `data/` folder.
- Model/tokenizer files are in the `models/` folder.
This project is a hands-on exploration of building language models and tokenizers from scratch, with a focus on PyTorch and modern NLP techniques. It includes Jupyter notebooks, training scripts, and utilities for experimenting with bigram models, BPE tokenization, and GPT-style transformers.
## Table of Contents

- Project Structure
- Requirements
- Setup
- Notebooks Overview
- Scripts Overview
- Tokenizer Training
- Training and Chatbot Usage
- Data
- License
## Project Structure

```
.
├── notebooks/
│   ├── bigram.ipynb          # Bigram language model notebook
│   ├── bpe-v1.ipynb          # Byte Pair Encoding (BPE) tokenizer notebook
│   ├── gpt-v1.ipynb          # GPT-style transformer (v1) notebook
│   ├── gpt-v2.ipynb          # GPT-style transformer (v2, with flash attention) notebook
│   └── torch-examples.ipynb  # PyTorch tensor and function examples
├── scripts/
│   ├── chatbot.py            # Command-line chatbot using a trained model
│   ├── training.py           # Script for training a transformer model
│   ├── data-extract-v2.py    # Data extraction utility
│   └── data-extract-v3.py    # Data extraction utility (v3)
├── data/
│   ├── wizard_of_oz.txt      # Example training text (Wizard of Oz)
│   └── vocab.txt             # Vocabulary file
├── models/
│   └── bpe_tokenizer.json    # Trained BPE tokenizer
├── requirements.txt          # Python dependencies
└── ...
```
## Requirements

- Python 3.8+
- See `requirements.txt` for Python packages:
  - torch
  - pylzma
  - tokenizers
  - tqdm
  - transformers

Install dependencies with:

```bash
pip install -r requirements.txt
```
## Setup

Clone the repository and install the requirements as above. Make sure you have your data files (e.g., `data/wizard_of_oz.txt`) in the `data/` directory.
## Notebooks Overview

- `notebooks/bigram.ipynb`: Build and train a simple bigram language model on `data/wizard_of_oz.txt`. Great for understanding the basics of language modeling (a minimal sketch of the model follows this list).
- `notebooks/bpe-v1.ipynb`: Learn and train a Byte Pair Encoding (BPE) tokenizer from scratch, and experiment with HuggingFace tokenizers.
- `notebooks/gpt-v1.ipynb`: Implements a basic GPT-style transformer model in PyTorch, including training and text generation.
- `notebooks/gpt-v2.ipynb`: An advanced GPT-style transformer with flash attention and integration with HuggingFace's BERT tokenizer.
- `notebooks/torch-examples.ipynb`: A playground for PyTorch tensor operations, matrix math, and neural network building blocks.
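As a taste of what `notebooks/bigram.ipynb` covers, here is a minimal character-level bigram model sketch in PyTorch. The notebook's actual architecture and hyperparameters may differ; this only illustrates the core idea of predicting the next token from an embedding lookup.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

class BigramLanguageModel(nn.Module):
    """Predicts the next token directly from a vocab_size x vocab_size lookup table."""

    def __init__(self, vocab_size):
        super().__init__()
        # Each row of the table holds the logits for the token that follows.
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        logits = self.token_embedding_table(idx)  # (B, T, vocab_size)
        if targets is None:
            return logits, None
        B, T, C = logits.shape
        loss = F.cross_entropy(logits.view(B * T, C), targets.view(B * T))
        return logits, loss

    def generate(self, idx, max_new_tokens):
        # Repeatedly sample the next token from the distribution at the last position.
        for _ in range(max_new_tokens):
            logits, _ = self(idx)
            probs = F.softmax(logits[:, -1, :], dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, idx_next), dim=1)
        return idx
```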
## Scripts Overview

- `scripts/chatbot.py`: Command-line chatbot interface. Loads a trained model and generates completions for user prompts.
- `scripts/training.py`: Script for training a transformer model on your dataset. Handles batching, the training loop, and model saving (a sketch of the batching pattern follows this list).
- `scripts/data-extract-v2.py` / `scripts/data-extract-v3.py`: Utilities for extracting and preparing text data for training.
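The batching pattern in this kind of training script is usually random block sampling over a 1-D tensor of token ids. A minimal sketch (the names `block_size`, `batch_size`, and `get_batch`, and their values, are illustrative, not necessarily what `scripts/training.py` uses):

```python
import torch

block_size = 64   # context length (illustrative value)
batch_size = 32   # sequences per batch (illustrative value)

def get_batch(data):
    """Sample a random batch of (input, target) blocks from a 1-D tensor of token ids."""
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])  # targets shifted by one
    return x, y

# Example usage, assuming `encode` maps text to token ids:
# data = torch.tensor(encode(text), dtype=torch.long)
# xb, yb = get_batch(data)
```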
## Tokenizer Training

To train a BPE tokenizer on your data, use the code in `notebooks/bpe-v1.ipynb`. This notebook demonstrates how to:

- Sample a subset of your data for tokenizer training.
- Train a BPE tokenizer using the `tokenizers` library.
- Save and load the tokenizer for later use (see `models/bpe_tokenizer.json`).
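For reference, training and saving a BPE tokenizer with the `tokenizers` library typically looks like the sketch below. The file paths match this repo's layout, but the vocab size and special tokens are assumptions; check the notebook for the actual settings.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Start from an empty BPE model with an unknown-token fallback.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# vocab_size and special_tokens are illustrative defaults.
trainer = BpeTrainer(vocab_size=5000, special_tokens=["[UNK]", "[PAD]"])
tokenizer.train(files=["data/wizard_of_oz.txt"], trainer=trainer)

# Save so scripts can reload it later.
tokenizer.save("models/bpe_tokenizer.json")

# Reload and test a round trip.
tok = Tokenizer.from_file("models/bpe_tokenizer.json")
print(tok.encode("The Wizard of Oz").tokens)
```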
## Training and Chatbot Usage

1. **Prepare your data**
   Place your training text (e.g., `data/wizard_of_oz.txt`) in the `data/` directory.
2. **Train a model**
   Use `scripts/training.py` or the relevant notebook (e.g., `notebooks/gpt-v1.ipynb`) to train a model. Example:

   ```bash
   python scripts/training.py -batch_size 32
   ```

3. **Chat with your model**
   After training, use `scripts/chatbot.py` to interact with your model:

   ```bash
   python scripts/chatbot.py -batch_size 32
   ```
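Conceptually, the chatbot is a load-then-generate loop. Here is a rough sketch of that shape, assuming a character-level codec built from `data/vocab.txt` and a fully pickled model checkpoint; the checkpoint filename and codec below are placeholders, and the repo's actual `scripts/chatbot.py` may load and decode differently.

```python
import torch

# Char-level codec built from data/vocab.txt (one plausible setup).
with open("data/vocab.txt", "r", encoding="utf-8") as f:
    chars = sorted(set(f.read()))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}
encode = lambda s: [stoi[c] for c in s]
decode = lambda ids: "".join(itos[i] for i in ids)

# Hypothetical checkpoint path; adjust to wherever training.py saves the model.
# Assumes the whole model object was pickled with torch.save(model, path).
model = torch.load("models/model-01.pt", map_location="cpu")
model.eval()

while True:
    prompt = input("Prompt:\n")
    context = torch.tensor([encode(prompt)], dtype=torch.long)
    with torch.no_grad():
        out = model.generate(context, max_new_tokens=150)
    print(decode(out[0].tolist()))
```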
## Data

- `data/wizard_of_oz.txt` is provided as a sample dataset.
- `data/vocab.txt` is a vocabulary file generated from your data (a sketch for regenerating it follows this list).
- You can use your own text files for training and tokenizer creation.
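If you need to regenerate `data/vocab.txt` for a character-level setup, one simple approach is to collect the unique characters in the training text. This is an assumption about how the file is produced; adapt it to your data pipeline.

```python
# Build a character vocabulary from the training text and write it out.
with open("data/wizard_of_oz.txt", "r", encoding="utf-8") as f:
    text = f.read()
vocab = "".join(sorted(set(text)))
with open("data/vocab.txt", "w", encoding="utf-8") as f:
    f.write(vocab)
```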
## License

This project is for educational and research purposes. Feel free to modify and experiment!