A small package with BPE tokenization utilities for LLMs, inspired by Andrej Karpathy's minbpe repository.
- GPT-4 split pattern
- Encoding/decoding methods with a custom dict of special tokens (see the usage sketch after this list)
- Trainable, even starting from an existing tokenizer (useful if you want to train on multiple corpora)
- Optimizable to eliminate unused tokens/merges
- Maps the tokens actually in use, so the LLM's vocab size can be reduced
- Supports saving to / loading from a folder
- Counts the number of tokens in a text
- Checks whether special tokens are present in a text
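
A minimal sketch of how these features could fit together. All module, class, and method names below (`bpe_tokenizer`, `Tokenizer`, `train`, `encode`, `decode`, `optimize`, `save`, `load`, `count_tokens`, `has_special_tokens`) are illustrative assumptions, not the package's confirmed API; check the actual module for the real names.

```python
# Hypothetical usage sketch: every name here is an assumption for
# illustration, not the package's confirmed API.
from bpe_tokenizer import Tokenizer  # assumed import path

# Custom special tokens, passed as a dict of token string -> id.
special = {"<|endoftext|>": 100257}
tok = Tokenizer(special_tokens=special)

# Training can continue from the tokenizer's current state, so
# multiple corpora can be fed one after another.
tok.train(open("corpus_a.txt", encoding="utf-8").read(), vocab_size=4096)
tok.train(open("corpus_b.txt", encoding="utf-8").read(), vocab_size=8192)

# Encode/decode round-trip, with special tokens honored.
ids = tok.encode("Hello world <|endoftext|>")
assert tok.decode(ids) == "Hello world <|endoftext|>"

# Drop tokens/merges never produced on the given text, shrinking the vocab.
tok.optimize(open("corpus_a.txt", encoding="utf-8").read())

# Persist to / restore from a folder.
tok.save("tokenizer_dir/")
tok = Tokenizer.load("tokenizer_dir/")

print(tok.count_tokens("How long is this sentence?"))  # number of tokens
print(tok.has_special_tokens("Hello <|endoftext|>"))   # True
```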
Planned:
- Train/optimize/build the vocab on multiple corpora at once
- Rewrite everything in C++ to make it faster (already in progress, but it takes time)
- Add other utilities (e.g. a .vocab file)