This repository contains a from-scratch implementation of a GPT-2 style Large Language Model (LLM) written entirely in Rust. It utilizes the powerful and flexible Burn deep learning framework.
The primary goal of this project is to serve as an educational resource, demonstrating the core components of a modern LLM in a performant, type-safe language. It is a Rust-based counterpart to the excellent Python/PyTorch implementation by Sebastian Raschka.
- ✅ GPT-2 Architecture: A clean implementation of the decoder-only transformer architecture from the ground up.
- 🧠 Text Generation: Generate new text from a given prompt.
- 🌡️ Temperature Sampling: Control the creativity and randomness of the output.
- 🔝 Top-K Sampling: Limit token selection to the k most likely next tokens to improve coherence (both sampling options are illustrated in the sketch after this list).
- 🏋️ Model Training: Includes functionality to train the model on a custom text corpus.
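To make the two sampling knobs concrete, here is a minimal, dependency-free Rust sketch of temperature scaling combined with top-k filtering. The function name, its signature, and the fixed `uniform` draw are illustrative assumptions, not this repository's actual API:

```rust
/// Scale logits by temperature, keep only the k most likely entries,
/// and sample an index given a uniform random value in [0, 1).
/// A sketch only; the repository's real sampling code may differ.
fn sample_top_k(logits: &[f32], temperature: f32, k: usize, uniform: f32) -> usize {
    // Temperature < 1.0 sharpens the distribution; > 1.0 flattens it.
    let scaled: Vec<f32> = logits.iter().map(|&l| l / temperature).collect();

    // Keep the indices of the k largest logits; everything else is excluded.
    let mut indices: Vec<usize> = (0..scaled.len()).collect();
    indices.sort_by(|&a, &b| scaled[b].partial_cmp(&scaled[a]).unwrap());
    indices.truncate(k.max(1));

    // Softmax over the survivors (subtract the max for numerical stability).
    let max = scaled[indices[0]];
    let exps: Vec<f32> = indices.iter().map(|&i| (scaled[i] - max).exp()).collect();
    let total: f32 = exps.iter().sum();

    // Walk the cumulative distribution until it crosses `uniform`.
    let mut cumulative = 0.0;
    for (pos, &i) in indices.iter().enumerate() {
        cumulative += exps[pos] / total;
        if uniform < cumulative {
            return i;
        }
    }
    *indices.last().unwrap()
}

fn main() {
    let logits = [2.0_f32, 1.0, 0.5, -1.0, 0.0];
    // A real generator would draw `uniform` from an RNG seeded via --seed.
    let next = sample_top_k(&logits, 0.7, 3, 0.42);
    println!("sampled token index: {next}");
}
```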
This work is heavily inspired by and serves as a Rust-based companion to the following amazing resources:
- LLMs from Scratch by Sebastian Raschka: The original Python/PyTorch guide that this project is based on.
  - GitHub repository: rasbt/LLMs-from-scratch
- The Burn Framework: A modern, flexible, and efficient deep learning framework for Rust.
  - Official website: burn.dev
To get started, you'll need the Rust toolchain installed on your system. You can install it via rustup.rs.
- Clone the repository:

  ```
  git clone https://github.com/Nan-Do/GPT2-Rust
  cd GPT2-Rust
  ```

- Build the project in release mode (highly recommended for performance):

  ```
  cargo build --release
  ```
The tool provides two main subcommands: `generate` for creating text and `train` for training the model.
You can generate text from a prompt using a pre-trained model. The generation process can be customized with several options.
Command:
```
cargo run --release -- generate [OPTIONS]
```
Options:
```
--vocab-file     vocab file used with the BPE Tokenizer (vocab.json by default)
--merges-file    merges file used with the BPE Tokenizer (merges.txt by default)
--text           text to be continued by the model (Hello world! by default)
--seed           random seed (123 by default)
--weights        file path for the model weights (weights by default)
--num-tokens     number of tokens to generate (25 by default)
--top-k          top-k tokens to consider when generating text; accepts 0 to
                 50257 (disabled by default)
--temperature    temperature used when generating text; accepts 0.0 to 1.0
                 (disabled by default)
```
Example:
```
cargo run --release -- generate \
    --text "The last time I saw" \
    --num-tokens 100 \
    --temperature 0.7 \
    --top-k 150 \
    --weights weights
```
You can train the model from scratch on your own dataset (e.g., a single large .txt file).
Command:
```
cargo run --release -- train [OPTIONS]
```
Options:
```
--context-length      context length for the GPT model (1024 by default)
--emb-dim             embedding dimension for the GPT model (768 by default)
--num-layers          number of Transformer blocks (12 by default)
--num-heads           number of heads for the Multi-Head Attention block
                      (12 by default)
--epochs              number of epochs to train (10 by default)
--batch-size          batch size (2 by default)
--training-file-name  text file that will be used to train the model
                      (the-verdict.txt by default)
--vocab-file          vocab file used with the BPE Tokenizer (vocab.json by default)
--merges-file         merges file used with the BPE Tokenizer (merges.txt by default)
--seed                random seed (123 by default)
--train-ratio         fraction of the data used for training (0.9 by default)
--weights             file path for the model weights (weights by default)
```
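For orientation, the architecture defaults above (context length 1024, embedding dimension 768, 12 layers, 12 heads) match the standard GPT-2 124M configuration. A minimal sketch of how these options might be grouped; the `GptConfig` struct and its field names are illustrative assumptions, not this repository's actual types:

```rust
/// Hypothetical grouping of the training CLI's architecture options.
/// Defaults mirror the values listed above (GPT-2 124M).
struct GptConfig {
    context_length: usize, // --context-length
    emb_dim: usize,        // --emb-dim
    num_layers: usize,     // --num-layers
    num_heads: usize,      // --num-heads
}

impl Default for GptConfig {
    fn default() -> Self {
        Self {
            context_length: 1024,
            emb_dim: 768,
            num_layers: 12,
            num_heads: 12,
        }
    }
}

fn main() {
    let cfg = GptConfig::default();
    println!(
        "{} layers, {} heads, emb dim {}, context {}",
        cfg.num_layers, cfg.num_heads, cfg.emb_dim, cfg.context_length
    );
}
```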
Example:
```
cargo run --release -- train \
    --context-length 256 \
    --epochs 5 \
    --batch-size 4 \
    --weights weights
```
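To make `--train-ratio` concrete, here is a rough sketch of splitting a corpus into training and validation portions. The `split_corpus` function and the character-level split point are illustrative assumptions; the repository's actual data pipeline may differ:

```rust
/// Split a raw text corpus into training and validation portions.
/// With the default --train-ratio of 0.9, roughly the first 90% of
/// the text goes to training and the remaining 10% to validation.
fn split_corpus(text: &str, train_ratio: f64) -> (&str, &str) {
    let mut idx = ((text.len() as f64) * train_ratio) as usize;
    idx = idx.min(text.len());
    // Back off to the nearest char boundary so both slices are valid UTF-8.
    while !text.is_char_boundary(idx) {
        idx -= 1;
    }
    text.split_at(idx)
}

fn main() {
    let corpus = std::fs::read_to_string("the-verdict.txt")
        .expect("training file not found");
    let (train, val) = split_corpus(&corpus, 0.9);
    println!(
        "train: {} chars, validation: {} chars",
        train.chars().count(),
        val.chars().count()
    );
}
```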