## Project Structure Updated!

- All Jupyter notebooks are now in the `notebooks/` folder.
- All Python scripts are in the `scripts/` folder.
- Data files (text, vocab) are in the `data/` folder.
- Model/tokenizer files are in the `models/` folder.
This project is a hands-on exploration of building language models and tokenizers from scratch, with a focus on PyTorch and modern NLP techniques. It includes Jupyter notebooks, training scripts, and utilities for experimenting with bigram models, BPE tokenization, and GPT-style transformers.
## Table of Contents

- Project Structure
- Requirements
- Setup
- Notebooks Overview
- Scripts Overview
- Tokenizer Training
- Training and Chatbot Usage
- Data
- License
## Project Structure

```
.
├── notebooks/
│   ├── bigram.ipynb          # Bigram language model notebook
│   ├── bpe-v1.ipynb          # Byte Pair Encoding (BPE) tokenizer notebook
│   ├── gpt-v1.ipynb          # GPT-style transformer (v1) notebook
│   ├── gpt-v2.ipynb          # GPT-style transformer (v2, with flash attention) notebook
│   └── torch-examples.ipynb  # PyTorch tensor and function examples
├── scripts/
│   ├── chatbot.py            # Command-line chatbot using a trained model
│   ├── training.py           # Script for training a transformer model
│   ├── data-extract-v2.py    # Data extraction utility
│   └── data-extract-v3.py    # Data extraction utility (v3)
├── data/
│   ├── wizard_of_oz.txt      # Example training text (Wizard of Oz)
│   └── vocab.txt             # Vocabulary file
├── models/
│   └── bpe_tokenizer.json    # Trained BPE tokenizer
├── requirements.txt          # Python dependencies
└── ...
```
## Requirements

- Python 3.8+
- See `requirements.txt` for Python packages:
  - torch
  - pylzma
  - tokenizers
  - tqdm
  - transformers

Install dependencies with:

```bash
pip install -r requirements.txt
```
## Setup

Clone the repository and install the requirements as above. Make sure you have your data files (e.g., `data/wizard_of_oz.txt`) in the `data/` directory.
## Notebooks Overview

- `notebooks/bigram.ipynb`: Build and train a simple bigram language model on `data/wizard_of_oz.txt`. Great for understanding the basics of language modeling (a minimal sketch of the model follows this list).
- `notebooks/bpe-v1.ipynb`: Learn and train a Byte Pair Encoding (BPE) tokenizer from scratch, and experiment with HuggingFace tokenizers.
- `notebooks/gpt-v1.ipynb`: Implements a basic GPT-style transformer model in PyTorch, including training and text generation.
- `notebooks/gpt-v2.ipynb`: An advanced GPT-style transformer with flash attention and integration with HuggingFace's BERT tokenizer.
- `notebooks/torch-examples.ipynb`: A playground for PyTorch tensor operations, matrix math, and neural network building blocks.
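As a taste of what `notebooks/bigram.ipynb` covers, here is a minimal character-level bigram model sketch in PyTorch. The notebook's actual architecture and hyperparameters may differ; this only illustrates the core idea of predicting the next token from an embedding lookup.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

class BigramLanguageModel(nn.Module):
    """Predicts the next token directly from a vocab_size x vocab_size lookup table."""

    def __init__(self, vocab_size):
        super().__init__()
        # Each row of the table holds the logits for the token that follows.
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        logits = self.token_embedding_table(idx)  # (B, T, vocab_size)
        if targets is None:
            return logits, None
        B, T, C = logits.shape
        loss = F.cross_entropy(logits.view(B * T, C), targets.view(B * T))
        return logits, loss

    def generate(self, idx, max_new_tokens):
        # Repeatedly sample the next token from the distribution at the last position.
        for _ in range(max_new_tokens):
            logits, _ = self(idx)
            probs = F.softmax(logits[:, -1, :], dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, idx_next), dim=1)
        return idx
```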
## Scripts Overview

- `scripts/chatbot.py`: Command-line chatbot interface. Loads a trained model and generates completions for user prompts.
- `scripts/training.py`: Script for training a transformer model on your dataset. Handles batching, the training loop, and model saving (a sketch of the batching pattern follows this list).
- `scripts/data-extract-v2.py` / `scripts/data-extract-v3.py`: Utilities for extracting and preparing text data for training.
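The batching pattern in this kind of training script is usually random block sampling over a 1-D tensor of token ids. A minimal sketch (the names `block_size`, `batch_size`, and `get_batch`, and their values, are illustrative, not necessarily what `scripts/training.py` uses):

```python
import torch

block_size = 64   # context length (illustrative value)
batch_size = 32   # sequences per batch (illustrative value)

def get_batch(data):
    """Sample a random batch of (input, target) blocks from a 1-D tensor of token ids."""
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])  # targets shifted by one
    return x, y

# Example usage, assuming `encode` maps text to token ids:
# data = torch.tensor(encode(text), dtype=torch.long)
# xb, yb = get_batch(data)
```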
## Tokenizer Training

To train a BPE tokenizer on your data, use the code in `notebooks/bpe-v1.ipynb`. This notebook demonstrates how to:

- Sample a subset of your data for tokenizer training.
- Train a BPE tokenizer using the `tokenizers` library.
- Save and load the tokenizer for later use (see `models/bpe_tokenizer.json`).
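For reference, training and saving a BPE tokenizer with the `tokenizers` library typically looks like the sketch below. The file paths match this repo's layout, but the vocab size and special tokens are assumptions; check the notebook for the actual settings.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Start from an empty BPE model with an unknown-token fallback.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# vocab_size and special_tokens are illustrative defaults.
trainer = BpeTrainer(vocab_size=5000, special_tokens=["[UNK]", "[PAD]"])
tokenizer.train(files=["data/wizard_of_oz.txt"], trainer=trainer)

# Save so scripts can reload it later.
tokenizer.save("models/bpe_tokenizer.json")

# Reload and test a round trip.
tok = Tokenizer.from_file("models/bpe_tokenizer.json")
print(tok.encode("The Wizard of Oz").tokens)
```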
## Training and Chatbot Usage

1. **Prepare your data**
   Place your training text (e.g., `data/wizard_of_oz.txt`) in the `data/` directory.
2. **Train a model**
   Use `scripts/training.py` or the relevant notebook (e.g., `notebooks/gpt-v1.ipynb`) to train a model. Example:

   ```bash
   python scripts/training.py -batch_size 32
   ```

3. **Chat with your model**
   After training, use `scripts/chatbot.py` to interact with your model:

   ```bash
   python scripts/chatbot.py -batch_size 32
   ```
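Conceptually, the chatbot is a load-then-generate loop. Here is a rough sketch of that shape, assuming a character-level codec built from `data/vocab.txt` and a fully pickled model checkpoint; the checkpoint filename and codec below are placeholders, and the repo's actual `scripts/chatbot.py` may load and decode differently.

```python
import torch

# Char-level codec built from data/vocab.txt (one plausible setup).
with open("data/vocab.txt", "r", encoding="utf-8") as f:
    chars = sorted(set(f.read()))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}
encode = lambda s: [stoi[c] for c in s]
decode = lambda ids: "".join(itos[i] for i in ids)

# Hypothetical checkpoint path; adjust to wherever training.py saves the model.
# Assumes the whole model object was pickled with torch.save(model, path).
model = torch.load("models/model-01.pt", map_location="cpu")
model.eval()

while True:
    prompt = input("Prompt:\n")
    context = torch.tensor([encode(prompt)], dtype=torch.long)
    with torch.no_grad():
        out = model.generate(context, max_new_tokens=150)
    print(decode(out[0].tolist()))
```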
## Data

- `data/wizard_of_oz.txt` is provided as a sample dataset.
- `data/vocab.txt` is a vocabulary file generated from your data (a sketch for regenerating it follows this list).
- You can use your own text files for training and tokenizer creation.
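If you need to regenerate `data/vocab.txt` for a character-level setup, one simple approach is to collect the unique characters in the training text. This is an assumption about how the file is produced; adapt it to your data pipeline.

```python
# Build a character vocabulary from the training text and write it out.
with open("data/wizard_of_oz.txt", "r", encoding="utf-8") as f:
    text = f.read()
vocab = "".join(sorted(set(text)))
with open("data/vocab.txt", "w", encoding="utf-8") as f:
    f.write(vocab)
```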
## License

This project is for educational and research purposes. Feel free to modify and experiment!