Welcome to a fully open and beginner-friendly repository for training high-quality SentencePiece tokenizers, specifically optimized for Sinhala (and extendable to any language).
This project teaches tokenizer training from scratch, with code, experiments, logging, evaluation, and visualization all covered in detail.
All scripts are heavily commented so that even someone new to NLP can understand what's going on.
| Component | Description |
|---|---|
| `tokenizer_trainer.py` | Trains multiple tokenizers (grid search over vocab sizes, model types, and character coverages) using SentencePiece. |
| `optuna_tokenizer_search.py` | Uses Optuna to intelligently find the best tokenizer configuration. |
| `tokenizer_evaluator.py` | Evaluates all trained tokenizers on held-out test data using metrics like `<unk>` ratio and average token count. |
| `tokenizer_results_visualizer.py` | Visualizes tokenizer evaluation results as bar charts for easy comparison. |
| `tokenizer_reconstruction_evaulator.py` | Measures how accurately the tokenizer can reconstruct the original text (via Levenshtein distance). |
| `preprocessing_utils/` | Contains small utility scripts to clean/normalize raw text datasets. Read its `README.md` before using. |
- Python 3.8+
- `sentencepiece`, `optuna`, `pandas`, `matplotlib`, `seaborn`, `tqdm` (available in the `requirements.txt` file)
- Multicore CPU recommended (parallel training supported)
- GPU not required (only for downstream LM training)
It's highly recommended to use a Python virtual environment.

- Create a virtual environment

  ```bash
  python -m venv venv
  ```

- Activate the virtual environment

  Linux/macOS:

  ```bash
  source venv/bin/activate
  ```

  Windows:

  ```bash
  venv\Scripts\activate
  ```

- Install the required Python packages

  ```bash
  pip install -r requirements.txt
  ```
- Create a plain `.txt` file with one normalized sentence per line.
- Preprocess it using the tools inside `preprocessing_utils/` (a minimal normalization sketch follows below).
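The real cleaning logic lives in `preprocessing_utils/`, but as a rough illustration of the kind of normalization that matters for Sinhala text, here is a minimal, hypothetical sketch. The file name `minimal_normalize.py` and each step in it are assumptions, not the repo's actual scripts:

```python
# minimal_normalize.py - illustrative sketch only; the real cleaning steps
# live in preprocessing_utils/ and may differ.
import sys
import unicodedata

def normalize_line(line: str) -> str:
    # NFC-normalize Unicode (important for Sinhala combining marks)
    line = unicodedata.normalize("NFC", line)
    # Collapse runs of whitespace into single spaces
    return " ".join(line.split())

if __name__ == "__main__":
    # Usage: python minimal_normalize.py raw.txt > corpus.txt
    with open(sys.argv[1], encoding="utf-8") as f:
        for raw in f:
            cleaned = normalize_line(raw)
            if cleaned:  # drop empty lines
                print(cleaned)
```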
```bash
python tokenizer_trainer.py
```

This will train tokenizers using all combinations of:

- `model_type`: `unigram`, `bpe`
- `vocab_size`: `4000` to `64000`
- `character_coverage`: `0.998`, `0.9995`, `1.0`

Results are saved to `trained_tokenizers/`.
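To make the grid search concrete, here is a minimal sketch of training over such a grid with the SentencePiece Python API. The corpus path, output naming, and the exact `vocab_size` values are assumptions; `tokenizer_trainer.py` is the authoritative implementation:

```python
# Illustrative grid-search sketch; tokenizer_trainer.py is the real version.
import itertools
import os
import sentencepiece as spm

CORPUS = "corpus.txt"  # assumed path: one normalized sentence per line

model_types = ["unigram", "bpe"]
vocab_sizes = [4000, 8000, 16000, 32000, 64000]   # subset of the swept range
char_coverages = [0.998, 0.9995, 1.0]

os.makedirs("trained_tokenizers", exist_ok=True)

for model_type, vocab_size, coverage in itertools.product(
    model_types, vocab_sizes, char_coverages
):
    prefix = f"trained_tokenizers/{model_type}_{vocab_size}_{coverage}"
    spm.SentencePieceTrainer.train(
        input=CORPUS,
        model_prefix=prefix,          # writes <prefix>.model and <prefix>.vocab
        model_type=model_type,
        vocab_size=vocab_size,
        character_coverage=coverage,
    )
```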
```bash
python optuna_tokenizer_search.py
```

This will run an Optuna search (50 trials by default) and find the best-performing tokenizer based on `<unk>` ratio and token count.
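For intuition, here is a rough sketch of what an Optuna objective for this kind of search could look like. The corpus/test paths and the way the `<unk>` ratio and token count are blended into a single score are assumptions; the real objective is defined in `optuna_tokenizer_search.py`:

```python
# Illustrative Optuna objective; the real search lives in optuna_tokenizer_search.py.
import os
import optuna
import sentencepiece as spm

os.makedirs("optuna_trials", exist_ok=True)

def objective(trial: optuna.Trial) -> float:
    model_type = trial.suggest_categorical("model_type", ["unigram", "bpe"])
    vocab_size = trial.suggest_int("vocab_size", 4000, 64000, step=4000)
    coverage = trial.suggest_categorical("character_coverage", [0.998, 0.9995, 1.0])

    prefix = f"optuna_trials/trial_{trial.number}"
    spm.SentencePieceTrainer.train(
        input="corpus.txt",            # assumed training corpus path
        model_prefix=prefix,
        model_type=model_type,
        vocab_size=vocab_size,
        character_coverage=coverage,
    )

    sp = spm.SentencePieceProcessor(model_file=f"{prefix}.model")
    total_tokens = unk_tokens = lines = 0
    with open("test.txt", encoding="utf-8") as f:  # assumed held-out test file
        for line in f:
            ids = sp.encode(line.strip())
            total_tokens += len(ids)
            unk_tokens += sum(1 for i in ids if i == sp.unk_id())
            lines += 1

    unk_ratio = unk_tokens / max(total_tokens, 1)
    avg_tokens = total_tokens / max(lines, 1)
    # Lower is better for both metrics; this particular weighting is an assumption.
    return unk_ratio * 100 + avg_tokens

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```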
```bash
python tokenizer_evaluator.py
```

- Evaluates all models on a test set
- Saves per-model JSONs plus one master `all_results.json`
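The metrics themselves are cheap to compute. Below is a hedged sketch of a per-model evaluation; the file paths and output keys are assumptions and need not match what `tokenizer_evaluator.py` actually writes:

```python
# Illustrative evaluation sketch; tokenizer_evaluator.py defines the real metrics and output format.
import json
import time
import sentencepiece as spm

def evaluate(model_path: str, test_path: str) -> dict:
    sp = spm.SentencePieceProcessor(model_file=model_path)
    total_tokens = unk_tokens = total_chars = lines = 0
    start = time.perf_counter()
    with open(test_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            ids = sp.encode(line)
            total_tokens += len(ids)
            unk_tokens += sum(1 for i in ids if i == sp.unk_id())
            total_chars += len(line)
            lines += 1
    elapsed = time.perf_counter() - start
    return {
        "avg_token_count": total_tokens / max(lines, 1),
        "unk_ratio": unk_tokens / max(total_tokens, 1),
        "time_per_line": elapsed / max(lines, 1),
        "compression_ratio": total_chars / max(total_tokens, 1),  # chars per token
    }

if __name__ == "__main__":
    # Hypothetical model and test-set paths for illustration.
    results = evaluate("trained_tokenizers/unigram_16000_0.9995.model", "test.txt")
    print(json.dumps(results, indent=2))
```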
```bash
python tokenizer_results_visualizer.py
```

- Generates `.pdf` bar charts comparing all tokenizers (e.g., average token count, `<unk>` ratio, time per line, compression ratio).
- Outputs are saved to `tokenizer_charts/`.
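As a rough idea of how one such chart can be produced from the evaluation output (the structure of `all_results.json` assumed here is hypothetical; `tokenizer_results_visualizer.py` generates the repo's actual charts):

```python
# Illustrative chart sketch; not the repo's actual plotting code.
import json
import os
import matplotlib.pyplot as plt

# Assumed shape: {"model_name": {"avg_token_count": ..., "unk_ratio": ..., ...}, ...}
with open("all_results.json", encoding="utf-8") as f:
    results = json.load(f)

names = list(results.keys())
avg_tokens = [results[n]["avg_token_count"] for n in names]

os.makedirs("tokenizer_charts", exist_ok=True)
plt.figure(figsize=(10, 4))
plt.bar(names, avg_tokens)
plt.ylabel("Average tokens per line")
plt.xticks(rotation=90, fontsize=6)
plt.tight_layout()
plt.savefig("tokenizer_charts/avg_token_count.pdf")
```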
```bash
python tokenizer_reconstruction_evaulator.py
```

This compares how close the reconstructed text is to the original using Levenshtein distance. A great sanity check.
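A minimal version of this check might look like the sketch below, with a small pure-Python Levenshtein implementation so the example has no extra dependency (the repo's script may use a dedicated library and different model/test paths):

```python
# Illustrative reconstruction check; tokenizer_reconstruction_evaulator.py is the real script.
import sentencepiece as spm

def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance, O(len(a) * len(b)).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# Hypothetical model and test-set paths for illustration.
sp = spm.SentencePieceProcessor(model_file="trained_tokenizers/unigram_16000_0.9995.model")
with open("test.txt", encoding="utf-8") as f:
    for line in f:
        original = line.strip()
        reconstructed = sp.decode(sp.encode(original))
        dist = levenshtein(original, reconstructed)
        if dist:
            print(f"distance={dist}  {original!r} -> {reconstructed!r}")
```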
Models are compared on:

- Average token count
- `<unk>` token ratio
- Tokenization speed
- Compression ratio (characters per token)
- Reconstruction accuracy (optional)

All evaluations are stored and visualized for transparency and reproducibility.
If you have suggestions, questions, or want to improve the repo, open an issue or pull request. I'm happy to learn together.