
A complete, beginner-friendly pipeline to train, evaluate, and select the best SentencePiece tokenizer - with detailed comments and utilities.


Sinhala Tokenizer Training Suite

Welcome to a fully open and beginner-friendly repository for training high-quality SentencePiece tokenizers, specifically optimized for Sinhala (and extendable to any language).

This project is built from the ground up to teach tokenizer training from scratch - with code, experiments, logging, evaluation, and visualization all covered in detail.

All scripts are heavily commented so that even someone new to NLP can understand what's going on.


What's Inside?

  • tokenizer_trainer.py - Trains multiple tokenizers (grid search over vocab sizes, model types, and character coverages) using SentencePiece.
  • optuna_tokenizer_search.py - Uses Optuna to intelligently find the best tokenizer configuration.
  • tokenizer_evaluator.py - Evaluates all trained tokenizers on held-out test data using metrics like <unk> ratio and average token count.
  • tokenizer_results_visualizer.py - Visualizes tokenizer evaluation results as bar charts for easy comparison.
  • tokenizer_reconstruction_evaulator.py - Measures how accurately the tokenizer can reconstruct the original text (via Levenshtein distance).
  • preprocessing_utils/ - Contains small utility scripts to clean and normalize raw text datasets. Read its README.md before using.

System Requirements

  • Python 3.8+
  • sentencepiece, optuna, pandas, matplotlib, seaborn, tqdm (all listed in requirements.txt)
  • Multicore CPU recommended (parallel training supported)
  • GPU not required (only needed for downstream LM training)

How To Use

1. Setup Environment and Install Dependencies

It's highly recommended to use a Python virtual environment.

  • Create a virtual environment

    python -m venv venv
    
  • Activate the virtual environment

    Linux/macOS:

    source venv/bin/activate
    

    Windows:

    venv\Scripts\activate
    
  • Install required Python packages

    pip install -r requirements.txt
    

2. Prepare Your Dataset

  • Create a plain .txt file with one normalized sentence per line.
  • Preprocess it using tools inside preprocessing_utils/.
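
The utilities in preprocessing_utils/ handle cleanup for you; purely to illustrate the expected input format, a minimal normalization pass might look like the sketch below (file names are placeholders, not the repo's actual tools):

    # Minimal illustration of the expected format: one cleaned sentence per line.
    # This is NOT preprocessing_utils/ - read that folder's README for the real tools.
    import unicodedata

    def clean_lines(in_path: str, out_path: str) -> None:
        with open(in_path, encoding="utf-8") as src, open(out_path, "w", encoding="utf-8") as dst:
            for line in src:
                text = unicodedata.normalize("NFC", line)  # normalize Unicode
                text = " ".join(text.split())              # collapse whitespace
                if text:                                   # drop empty lines
                    dst.write(text + "\n")

    clean_lines("raw_corpus.txt", "corpus.txt")  # hypothetical file names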

3. Train Multiple Tokenizers (Grid Search)

python tokenizer_trainer.py

This will train tokenizers using all combinations of:

  • model_type: unigram, bpe
  • vocab_size: 4000 to 64000
  • character_coverage: 0.998, 0.9995, 1.0

Results are saved to trained_tokenizers/.
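
For a rough idea of what this grid search does (a simplified sketch, not the actual contents of tokenizer_trainer.py; the exact grid steps and paths are assumptions):

    # Sketch of a SentencePiece grid search over model type, vocab size, and coverage.
    import itertools
    import os

    import sentencepiece as spm

    model_types = ["unigram", "bpe"]
    vocab_sizes = [4000, 8000, 16000, 32000, 64000]   # assumed steps within 4000-64000
    coverages = [0.998, 0.9995, 1.0]

    os.makedirs("trained_tokenizers", exist_ok=True)

    for model_type, vocab_size, coverage in itertools.product(model_types, vocab_sizes, coverages):
        prefix = f"trained_tokenizers/{model_type}_{vocab_size}_{coverage}"
        spm.SentencePieceTrainer.train(
            input="corpus.txt",        # one sentence per line
            model_prefix=prefix,       # writes <prefix>.model and <prefix>.vocab
            model_type=model_type,
            vocab_size=vocab_size,
            character_coverage=coverage,
        )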

4. Or Use Optuna for Best Model Search

python optuna_tokenizer_search.py

This will run an Optuna search (50 trials by default) and find the best-performing tokenizer based on <unk> ratio and token count.
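
As a sketch of how such a search can be wired up (the objective and weighting below are assumptions for illustration; see optuna_tokenizer_search.py for the real one):

    # Optuna objective combining <unk> ratio and average token count (lower is better).
    import os

    import optuna
    import sentencepiece as spm

    def objective(trial: optuna.Trial) -> float:
        vocab_size = trial.suggest_int("vocab_size", 4000, 64000, step=4000)
        model_type = trial.suggest_categorical("model_type", ["unigram", "bpe"])
        coverage = trial.suggest_categorical("character_coverage", [0.998, 0.9995, 1.0])

        os.makedirs("optuna_models", exist_ok=True)
        prefix = f"optuna_models/trial_{trial.number}"
        spm.SentencePieceTrainer.train(
            input="corpus.txt", model_prefix=prefix,
            model_type=model_type, vocab_size=vocab_size, character_coverage=coverage,
        )

        sp = spm.SentencePieceProcessor(model_file=f"{prefix}.model")
        total_tokens = unk_tokens = lines = 0
        with open("test.txt", encoding="utf-8") as f:   # hypothetical held-out file
            for line in f:
                ids = sp.encode(line.strip())
                total_tokens += len(ids)
                unk_tokens += sum(1 for i in ids if i == sp.unk_id())
                lines += 1
        unk_ratio = unk_tokens / max(total_tokens, 1)
        avg_tokens = total_tokens / max(lines, 1)
        return unk_ratio * 100 + avg_tokens   # assumed weighting

    study = optuna.create_study(direction="minimize")
    study.optimize(objective, n_trials=50)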

5. Evaluate Tokenizers

python tokenizer_evaluator.py
  • Evaluates all models on a test set
  • Saves per-model JSONs + one master all_results.json
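
To illustrate how these metrics can be computed with SentencePiece (a simplified stand-in for tokenizer_evaluator.py; file names and JSON keys are assumptions):

    # Evaluate every trained model on a held-out test file and dump the metrics.
    import glob
    import json
    import time

    import sentencepiece as spm

    test_lines = [l.strip() for l in open("test.txt", encoding="utf-8") if l.strip()]
    results = {}

    for model_path in glob.glob("trained_tokenizers/*.model"):
        sp = spm.SentencePieceProcessor(model_file=model_path)

        start = time.perf_counter()
        encoded = [sp.encode(line) for line in test_lines]
        elapsed = time.perf_counter() - start

        total_tokens = sum(len(ids) for ids in encoded)
        unk_tokens = sum(ids.count(sp.unk_id()) for ids in encoded)
        total_chars = sum(len(line) for line in test_lines)

        results[model_path] = {
            "avg_token_count": total_tokens / len(test_lines),
            "unk_ratio": unk_tokens / total_tokens,
            "time_per_line": elapsed / len(test_lines),
            "compression_ratio": total_chars / total_tokens,   # chars per token
        }

    with open("all_results.json", "w", encoding="utf-8") as f:
        json.dump(results, f, indent=2)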

6. Visualize the Results

python tokenizer_results_visualizer.py
  • Generates .pdf bar charts comparing all tokenizers (e.g., avg token count, <unk> ratio, time per line, compression ratio).
  • Outputs saved to tokenizer_charts/.
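
A minimal example of plotting a single metric from all_results.json (the repo's visualizer produces a fuller set of charts; the JSON keys used here are assumptions):

    # Bar chart of average token count per tokenizer, saved as a PDF.
    import json
    import os

    import matplotlib.pyplot as plt

    with open("all_results.json", encoding="utf-8") as f:
        results = json.load(f)

    names = list(results)
    values = [results[name]["avg_token_count"] for name in names]

    os.makedirs("tokenizer_charts", exist_ok=True)
    plt.figure(figsize=(10, 4))
    plt.bar(names, values)
    plt.ylabel("Average tokens per line")
    plt.xticks(rotation=90)
    plt.tight_layout()
    plt.savefig("tokenizer_charts/avg_token_count.pdf")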

7. Measure Reconstruction Fidelity

python tokenizer_reconstruction_evaulator.py

This measures how close the reconstructed text is to the original using Levenshtein distance. A great sanity check.
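
For illustration, the check amounts to round-tripping each line through the tokenizer and measuring the edit distance (a plain-Python sketch; the model path is a placeholder, and the actual script may use a Levenshtein library instead):

    # Encode, decode, and compare against the original with Levenshtein distance.
    import sentencepiece as spm

    def levenshtein(a: str, b: str) -> int:
        # Classic dynamic-programming edit distance, kept simple for illustration.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
            prev = curr
        return prev[-1]

    sp = spm.SentencePieceProcessor(model_file="trained_tokenizers/best.model")  # placeholder path
    total_dist = total_chars = 0
    with open("test.txt", encoding="utf-8") as f:
        for line in f:
            original = line.strip()
            reconstructed = sp.decode(sp.encode(original))
            total_dist += levenshtein(original, reconstructed)
            total_chars += len(original)

    print(f"Character error rate: {total_dist / max(total_chars, 1):.4%}")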


How the Best Tokenizer Was Selected

Models are compared on:

  • Average token count
  • <unk> token ratio
  • Tokenization speed
  • Compression ratio (chars/token)
  • Reconstruction accuracy (optional)

All evaluations are stored and visualized for transparency and reproducibility.
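
As an illustration only, a selection rule over the stored results might look like this (the weights and JSON keys are assumptions, not the repo's actual rule):

    # Pick the model with the lowest combined penalty for <unk> usage and token count.
    import json

    with open("all_results.json", encoding="utf-8") as f:
        results = json.load(f)

    def score(metrics: dict) -> float:
        # Lower is better: penalize <unk> tokens heavily, then long tokenizations.
        return metrics["unk_ratio"] * 100 + metrics["avg_token_count"]

    best = min(results, key=lambda name: score(results[name]))
    print("Best tokenizer:", best)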


Contributions & Feedback

If you have suggestions, questions, or want to improve the repo, open an issue or pull request. I'm happy to learn together.
