Language models (LMs) are hard to adapt to new data distributions through simple finetuning because their subword tokenizers remain fixed during adaptation. FLEXITOKENS addresses this with a simplified training objective that enables significantly greater tokenization flexibility when adapting to new domains.
An example of tokenized medical text, where FLEXITOKENS produces a less fragmented token sequence than BPE. Unlike BPE, which applies a fixed tokenization, FLEXITOKENS adapts its tokenization to the medical domain, capturing domain-specific patterns more effectively.
flexitokens/
├── README.md # This file
├── requirements.txt # Python dependencies
├── .gitignore # Git ignore rules
│
├── src/ # Source code
│ ├── model/ # Model implementations
│ ├── train/ # Training scripts and utilities
│ ├── eval/ # Evaluation scripts
│ ├── finetune/ # Finetuning utilities
│ └── utils/ # Common utilities
│
├── configs/ # Configuration files
│ ├── train/ # Training configurations
│ ├── finetune/ # Finetuning configurations
│ └── accelerate/ # Accelerate configurations
│
├── scripts/ # Execution scripts
│ ├── run_train.sh # Main training script
│ ├── eval/ # Evaluation scripts
│ │ └── eval_pretrain.sh
│ └── finetune/ # Finetuning scripts
│ ├── sib200_routing.sh
│ └── ner_wikiann.sh
│
├── data/ # Dataset directory (created after download)
├── model_ckpts/ # Model checkpoints
├── results/ # Experimental results
└── paper/ # Paper and documentation
# Create conda environment
conda create -n fxt python=3.8
conda activate fxt
# Install dependencies
pip install -r requirements.txt
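As a quick sanity check after installation, you can confirm the core libraries import cleanly. The exact package set depends on `requirements.txt`; treating `torch`, `datasets`, and `accelerate` as pinned dependencies is an assumption here.

```bash
# Sanity check: assumes torch, datasets, and accelerate are among the pinned dependencies
python -c "import torch, datasets, accelerate; print('torch', torch.__version__)"
```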
We use multilingual data sampled from FineWeb and FineWeb2. Our training data includes multiple languages with different scripts:
- Languages: English (en), Spanish (es), Russian (ru), Ukrainian (uk), Hindi (hi), Telugu (te)
- Scripts: Latin, Cyrillic, Devanagari, Telugu
- Data will be automatically downloaded from HuggingFace on first run
- Set `load_from_disk: false` in the config for the initial download
- Set `load_from_disk: true` for subsequent runs to use the downloaded data
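In a config file the toggle is just this key; where exactly it sits in the config hierarchy is not shown in this README, so the snippet below is a minimal sketch.

```yaml
# Initial download from HuggingFace:
load_from_disk: false
# After the data is on disk, switch to:
# load_from_disk: true
```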
Configuration files are located in `configs/`. Key sections to modify:
- `boundaries`: Script-specific tokenization settings
- `prior_list`: Tokenization priors controlling compression rates
- `data`: Dataset paths and language settings
- `model`: Model architecture parameters
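The sketch below illustrates how these sections might fit together in a training config. The top-level section names, the language codes, and the `cache_dir`/`load_from_disk` keys come from this README; all other keys and values are illustrative assumptions, not the repository's actual schema.

```yaml
# Illustrative layout only; inner keys and values are assumptions
boundaries:            # script-specific tokenization settings
  scripts: [latin, cyrillic, devanagari, telugu]
prior_list:            # tokenization priors controlling compression rates
  - 0.2
  - 0.25
data:                  # dataset paths and language settings
  languages: [en, es, ru, uk, hi, te]
  cache_dir: /path/to/cache
  load_from_disk: false
model:                 # model architecture parameters
  hidden_size: 768
  num_layers: 12
```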
- Select Configuration: Choose or create a config file in `configs/train/` (see available examples with `ls configs/train/`)
- Set Directories: Update paths in your config:
  - `data`: Path to your data directory
  - `cache_dir`: Path for caching processed datasets
  - Experiment output directory
- First Run Setup:
  - Set `load_from_disk: false` to download datasets from HuggingFace
  - This downloads the exact dataset we used to your data directory
- Subsequent Runs:
  - Set `load_from_disk: true` to use the downloaded data after the initial download (see the end-to-end sketch below)
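Putting these steps together, a first run might look like the following sketch; that `run_train.sh` picks up the edited config is an assumption, and the editing step is described in comments rather than automated.

```bash
# Illustrative first-run workflow (config wiring is an assumption)
ls configs/train/              # pick or copy a starting config
# edit the config: data path, cache_dir, output directory, load_from_disk: false
bash scripts/run_train.sh      # first run downloads the datasets
# for later runs, set load_from_disk: true in the same config and re-run
```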
# Run pretraining with FlexiTokens
bash scripts/run_train.sh
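Since the repository ships Accelerate configurations under `configs/accelerate/`, multi-GPU runs presumably go through `accelerate launch`; the config and entry-point filenames below are hypothetical placeholders, not actual paths from the repo.

```bash
# Hypothetical multi-GPU launch; both filenames are placeholders
accelerate launch --config_file configs/accelerate/multi_gpu.yaml \
  src/train/train.py --config configs/train/my_config.yaml
```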
# Evaluate pretrained model
bash scripts/eval/eval_pretrain.sh
# SIB-200 multilingual benchmark
bash scripts/finetune/sib200_routing.sh
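The repository layout above also lists a WikiAnn NER finetuning script, which presumably runs the same way:

```bash
# NER on WikiAnn
bash scripts/finetune/ner_wikiann.sh
```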
If you use FlexiTokens in your research, please cite our paper:
@article{owodunni2025flexitokens,
  title={FLEXITOKENS: Flexible Tokenization for Evolving Language Models},
  author={Owodunni, Abraham Toluase and Ahia, Orevaoghene and Kumar, Sachin},
  journal={arXiv preprint arXiv:2507.12720},
  year={2025}
}