Welcome to a fully open and beginner-friendly repository for training high-quality SentencePiece tokenizers, specifically optimized for Sinhala (and extendable to any language).
This project teaches tokenizer training from scratch, with code, experiments, logging, evaluation, and visualization all covered in detail.
All scripts are heavily commented so that even someone new to NLP can understand what's going on.
| Component | Description |
|---|---|
| `tokenizer_trainer.py` | Trains multiple tokenizers (grid search over vocab sizes, model types, and character coverages) using SentencePiece. |
| `optuna_tokenizer_search.py` | Uses Optuna to intelligently find the best tokenizer configuration. |
| `tokenizer_evaluator.py` | Evaluates all trained tokenizers on held-out test data using metrics like `<unk>` ratio and average token count. |
| `tokenizer_results_visualizer.py` | Visualizes tokenizer evaluation results as bar charts for easy comparison. |
| `tokenizer_reconstruction_evaulator.py` | Measures how accurately the tokenizer can reconstruct the original text (via Levenshtein distance). |
| `preprocessing_utils/` | Contains small utility scripts to clean/normalize raw text datasets. Read its `README.md` before using. |
- Python 3.8+
- `sentencepiece`, `optuna`, `pandas`, `matplotlib`, `seaborn`, `tqdm` (available in the `requirements.txt` file)
- Multicore CPU recommended (parallel training supported)
- GPU not required (only for downstream LM training)
It's highly recommended to use a Python virtual environment.

- Create a virtual environment

  ```bash
  python -m venv venv
  ```

- Activate the virtual environment

  Linux/macOS:

  ```bash
  source venv/bin/activate
  ```

  Windows:

  ```bash
  venv\Scripts\activate
  ```

- Install the required Python packages

  ```bash
  pip install -r requirements.txt
  ```
- Create a plain `.txt` file with one normalized sentence per line.
- Preprocess it using the tools inside `preprocessing_utils/` (a minimal normalization sketch follows below).
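The real cleaning logic lives in `preprocessing_utils/`, but as a rough illustration of the kind of normalization that matters for Sinhala text, here is a minimal, hypothetical sketch. The file name `minimal_normalize.py` and each step in it are assumptions, not the repo's actual scripts:

```python
# minimal_normalize.py - illustrative sketch only; the real cleaning steps
# live in preprocessing_utils/ and may differ.
import sys
import unicodedata

def normalize_line(line: str) -> str:
    # NFC-normalize Unicode (important for Sinhala combining marks)
    line = unicodedata.normalize("NFC", line)
    # Collapse runs of whitespace into single spaces
    return " ".join(line.split())

if __name__ == "__main__":
    # Usage: python minimal_normalize.py raw.txt > corpus.txt
    with open(sys.argv[1], encoding="utf-8") as f:
        for raw in f:
            cleaned = normalize_line(raw)
            if cleaned:  # drop empty lines
                print(cleaned)
```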
```bash
python tokenizer_trainer.py
```

This will train tokenizers using all combinations of:

- `model_type`: `unigram`, `bpe`
- `vocab_size`: `4000` to `64000`
- `character_coverage`: `0.998`, `0.9995`, `1.0`

Results are saved to `trained_tokenizers/`.
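To make the grid search concrete, here is a minimal sketch of training over such a grid with the SentencePiece Python API. The corpus path, output naming, and the exact `vocab_size` values are assumptions; `tokenizer_trainer.py` is the authoritative implementation:

```python
# Illustrative grid-search sketch; tokenizer_trainer.py is the real version.
import itertools
import os
import sentencepiece as spm

CORPUS = "corpus.txt"  # assumed path: one normalized sentence per line

model_types = ["unigram", "bpe"]
vocab_sizes = [4000, 8000, 16000, 32000, 64000]   # subset of the swept range
char_coverages = [0.998, 0.9995, 1.0]

os.makedirs("trained_tokenizers", exist_ok=True)

for model_type, vocab_size, coverage in itertools.product(
    model_types, vocab_sizes, char_coverages
):
    prefix = f"trained_tokenizers/{model_type}_{vocab_size}_{coverage}"
    spm.SentencePieceTrainer.train(
        input=CORPUS,
        model_prefix=prefix,          # writes <prefix>.model and <prefix>.vocab
        model_type=model_type,
        vocab_size=vocab_size,
        character_coverage=coverage,
    )
```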
```bash
python optuna_tokenizer_search.py
```

This will run an Optuna search (50 trials by default) and find the best-performing tokenizer based on `<unk>` ratio and token count.
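For intuition, here is a rough sketch of what an Optuna objective for this kind of search could look like. The corpus/test paths and the way the `<unk>` ratio and token count are blended into a single score are assumptions; the real objective is defined in `optuna_tokenizer_search.py`:

```python
# Illustrative Optuna objective; the real search lives in optuna_tokenizer_search.py.
import os
import optuna
import sentencepiece as spm

os.makedirs("optuna_trials", exist_ok=True)

def objective(trial: optuna.Trial) -> float:
    model_type = trial.suggest_categorical("model_type", ["unigram", "bpe"])
    vocab_size = trial.suggest_int("vocab_size", 4000, 64000, step=4000)
    coverage = trial.suggest_categorical("character_coverage", [0.998, 0.9995, 1.0])

    prefix = f"optuna_trials/trial_{trial.number}"
    spm.SentencePieceTrainer.train(
        input="corpus.txt",            # assumed training corpus path
        model_prefix=prefix,
        model_type=model_type,
        vocab_size=vocab_size,
        character_coverage=coverage,
    )

    sp = spm.SentencePieceProcessor(model_file=f"{prefix}.model")
    total_tokens = unk_tokens = lines = 0
    with open("test.txt", encoding="utf-8") as f:  # assumed held-out test file
        for line in f:
            ids = sp.encode(line.strip())
            total_tokens += len(ids)
            unk_tokens += sum(1 for i in ids if i == sp.unk_id())
            lines += 1

    unk_ratio = unk_tokens / max(total_tokens, 1)
    avg_tokens = total_tokens / max(lines, 1)
    # Lower is better for both metrics; this particular weighting is an assumption.
    return unk_ratio * 100 + avg_tokens

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```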
```bash
python tokenizer_evaluator.py
```

- Evaluates all models on a test set
- Saves per-model JSONs plus one master `all_results.json`
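The metrics themselves are cheap to compute. Below is a hedged sketch of a per-model evaluation; the file paths and output keys are assumptions and need not match what `tokenizer_evaluator.py` actually writes:

```python
# Illustrative evaluation sketch; tokenizer_evaluator.py defines the real metrics and output format.
import json
import time
import sentencepiece as spm

def evaluate(model_path: str, test_path: str) -> dict:
    sp = spm.SentencePieceProcessor(model_file=model_path)
    total_tokens = unk_tokens = total_chars = lines = 0
    start = time.perf_counter()
    with open(test_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            ids = sp.encode(line)
            total_tokens += len(ids)
            unk_tokens += sum(1 for i in ids if i == sp.unk_id())
            total_chars += len(line)
            lines += 1
    elapsed = time.perf_counter() - start
    return {
        "avg_token_count": total_tokens / max(lines, 1),
        "unk_ratio": unk_tokens / max(total_tokens, 1),
        "time_per_line": elapsed / max(lines, 1),
        "compression_ratio": total_chars / max(total_tokens, 1),  # chars per token
    }

if __name__ == "__main__":
    # Hypothetical model and test-set paths for illustration.
    results = evaluate("trained_tokenizers/unigram_16000_0.9995.model", "test.txt")
    print(json.dumps(results, indent=2))
```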
```bash
python tokenizer_results_visualizer.py
```

- Generates `.pdf` bar charts comparing all tokenizers (e.g., average token count, `<unk>` ratio, time per line, compression ratio).
- Outputs are saved to `tokenizer_charts/`.
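As a rough idea of how one such chart can be produced from the evaluation output (the structure of `all_results.json` assumed here is hypothetical; `tokenizer_results_visualizer.py` generates the repo's actual charts):

```python
# Illustrative chart sketch; not the repo's actual plotting code.
import json
import os
import matplotlib.pyplot as plt

# Assumed shape: {"model_name": {"avg_token_count": ..., "unk_ratio": ..., ...}, ...}
with open("all_results.json", encoding="utf-8") as f:
    results = json.load(f)

names = list(results.keys())
avg_tokens = [results[n]["avg_token_count"] for n in names]

os.makedirs("tokenizer_charts", exist_ok=True)
plt.figure(figsize=(10, 4))
plt.bar(names, avg_tokens)
plt.ylabel("Average tokens per line")
plt.xticks(rotation=90, fontsize=6)
plt.tight_layout()
plt.savefig("tokenizer_charts/avg_token_count.pdf")
```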
```bash
python tokenizer_reconstruction_evaulator.py
```

This compares how close the reconstructed text is to the original using Levenshtein distance. A great sanity check.
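A minimal version of this check might look like the sketch below, with a small pure-Python Levenshtein implementation so the example has no extra dependency (the repo's script may use a dedicated library and different model/test paths):

```python
# Illustrative reconstruction check; tokenizer_reconstruction_evaulator.py is the real script.
import sentencepiece as spm

def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance, O(len(a) * len(b)).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# Hypothetical model and test-set paths for illustration.
sp = spm.SentencePieceProcessor(model_file="trained_tokenizers/unigram_16000_0.9995.model")
with open("test.txt", encoding="utf-8") as f:
    for line in f:
        original = line.strip()
        reconstructed = sp.decode(sp.encode(original))
        dist = levenshtein(original, reconstructed)
        if dist:
            print(f"distance={dist}  {original!r} -> {reconstructed!r}")
```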
Models are compared on:

- Average token count
- `<unk>` token ratio
- Tokenization speed
- Compression ratio (characters per token)
- Reconstruction accuracy (optional)

All evaluations are stored and visualized for transparency and reproducibility.
If you have suggestions, questions, or want to improve the repo, open an issue or pull request. I'm happy to learn together.