A high-performance BPE (Byte Pair Encoding) tokenizer with Python bindings and a header-only C++ implementation.
- 🚀 Fast C++ Core: Header-only C++ implementation for maximum performance
- 🐍 Python Bindings: Easy-to-use Python API with pybind11
- 💾 Serialization: Save and load trained tokenizer models
- 🔧 Header-Only: Simple integration into C++ projects
- 📊 BPE Algorithm: Efficient subword tokenization
- 🎯 Cross-Platform: Works on Windows, macOS, and Linux
Version 0.1.1 delivers significantly faster tokenization than the initial 0.1.0 release.
| Tokens Processed | Time (v0.1.0) | Time (v0.1.1) | Speedup |
|---|---|---|---|
| 0 | 84,157 µs | 5,502 µs | ~15× |
| 100 | 6,977,301 µs | 642,335 µs | ~10.9× |
| 200 | 14,437,683 µs | 1,370,924 µs | ~10.5× |
| 300 | 20,902,067 µs | 2,154,547 µs | ~9.7× |
| 400 | 26,554,987 µs | 2,967,434 µs | ~8.9× |
| 500 | 32,350,267 µs | 3,798,688 µs | ~8.5× |
| 600 | 38,075,928 µs | 4,630,268 µs | ~8.2× |
| 700 | 43,831,217 µs | 5,471,428 µs | ~8.0× |
| 800 | 49,559,857 µs | 6,316,320 µs | ~7.8× |
| 900 | 56,149,850 µs | 7,166,352 µs | ~7.8× |
| 1000 | 62,877,499 µs | (Pending) | (N/A) |
⚡ Overall, version 0.1.1 is 7–15× faster across the board thanks to internal optimizations and improved data structures.
💡 Benchmarks were run with a vocabulary size of 1000 tokens. Measurements are approximate and may vary based on system specs.
*(Chart: tokenization time for v0.1.0 vs. v0.1.1 across token counts, visualizing the speedups above.)*
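The exact harness behind these numbers isn't shown here, but a minimal timing sketch along these lines, assuming a local `dataset.txt` and the Python API from the usage section below, reproduces the basic measurement:

```python
# Minimal timing sketch (not the original benchmark harness).
# Assumes a local dataset.txt and the ShaTokenizer API shown below.
import time
from shatokenizer import ShaTokenizer

tokenizer = ShaTokenizer()
tokenizer.train('dataset.txt', 1000)  # vocab size matching the table above

text = open('dataset.txt').read()
start = time.perf_counter()
ids = tokenizer.encode(text)
elapsed_us = (time.perf_counter() - start) * 1e6
print(f'encoded {len(ids)} tokens in {elapsed_us:,.0f} µs')
```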
# Install from PyPI
pip install shatokenizer

# Or install from source
git clone https://github.com/shaheen-coder/shatokenizer.git
cd shatokenizer
pip install .
from shatokenizer import ShaTokenizer
# Create tokenizer instance
tokenizer = ShaTokenizer()
# Train on your dataset
tokenizer.train('dataset.txt', 1000)
# Encode text to token IDs
tokens = tokenizer.encode('hello this tokenizer')
print(tokens) # [123, 43, 1211]
# Decode token IDs back to text
text = tokenizer.decode(tokens)
print(text) # "hello this tokenizer"
# Save trained model
tokenizer.save("shatokenizer.pkl")
# Load trained model
tokenizer2 = ShaTokenizer.load("shatokenizer.pkl")
#include <iostream>
#include <shatokenizer/tokenizer.hpp>

int main() {
    // Train a tokenizer, then round-trip a string through encode/decode.
    ShaTokenizer tokenizer;
    tokenizer.train("data.txt", 1000);
    auto ids = tokenizer.encode("hello");
    std::cout << "decode: " << tokenizer.decode(ids) << std::endl;
    return 0;
}
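Since the library is header-only, there is nothing to link against: assuming the headers are vendored under an `include/` directory (a hypothetical layout; adjust the path to wherever you placed them), a single translation unit should build with something like `g++ -std=c++17 -Iinclude main.cpp -o demo`.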
`ShaTokenizer()`

Creates a new tokenizer instance.

`train(dataset_path, vocab_size)`

Trains the tokenizer on the provided dataset.
- `dataset_path`: Path to the training text file
- `vocab_size`: Target vocabulary size

`encode(text)`

Encodes text into token IDs.
- `text`: Input text to encode
- Returns: List of token IDs

`decode(tokens)`

Decodes token IDs back to text.
- `tokens`: List of token IDs
- Returns: Decoded text string

`save(path)`

Saves the trained tokenizer model.
- `path`: Output file path

`ShaTokenizer.load(path)`

Loads a trained tokenizer model.
- `path`: Path to saved model file
- Returns: Loaded tokenizer instance
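Per the usage example above, `decode` reverses `encode` and a loaded model reproduces a saved one. A small illustrative round-trip check (hypothetical test code, not shipped with the project):

```python
# Illustrative round-trip check (hypothetical test, not part of the project).
from shatokenizer import ShaTokenizer

tokenizer = ShaTokenizer()
tokenizer.train('dataset.txt', 1000)

text = 'hello this tokenizer'
assert tokenizer.decode(tokenizer.encode(text)) == text  # lossless round trip

tokenizer.save('shatokenizer.pkl')
restored = ShaTokenizer.load('shatokenizer.pkl')
assert restored.encode(text) == tokenizer.encode(text)  # load matches save
```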
- Python 3.7+
- C++17 compatible compiler
- pybind11
- CMake (for C++ development)
# Clone the repository
git clone https://github.com/shaheen-coder/shatokenizer.git
cd shatokenizer
# Install in development mode
pip install -e .
We welcome contributions! Please see our Contributing Guide for details.
- Fork the repository
- Create a virtual environment:
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
- Install development dependencies:
pip install -e ".[dev]"
- Run tests:
pytest
- Python code should follow PEP 8
- C++ code should follow Google C++ Style Guide
- Use `black` for Python formatting
- Use `clang-format` for C++ formatting
- Create a feature branch:
git checkout -b feature-name
- Make your changes and add tests
- Ensure all tests pass:
pytest
- Format your code:
black .
clang-format -i src/*.cpp src/*.hpp
- Commit your changes:
git commit -m "Add feature"
- Push to your fork:
git push origin feature-name
- Open a Pull Request
ShaTokenizer is designed for high performance:
- C++ core implementation for speed-critical operations
- Minimal Python overhead with pybind11
- Efficient memory usage with header-only design
- Optimized BPE algorithm implementation
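At its core, BPE training is a simple loop: count adjacent symbol pairs across the corpus, merge the most frequent pair into a new token, and repeat until the merge budget (vocabulary size) is spent. A toy Python sketch of that idea, purely illustrative and unrelated to the optimized C++ core:

```python
# Toy BPE training loop (illustrative only; not the library's C++ core).
from collections import Counter

def train_bpe(words, num_merges):
    # Each word starts as a tuple of single-character symbols.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair wins
        merges.append(best)
        # Rewrite the corpus with every occurrence of `best` merged.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return merges

print(train_bpe(['low', 'lower', 'lowest', 'low'], 3))
# e.g. [('l', 'o'), ('lo', 'w'), ('low', 'e')]
```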
This project is licensed under the MIT License - see the LICENSE file for details.
- Built with pybind11
- Inspired by modern tokenization libraries
- Thanks to all contributors
Made with ❤️ by Shaheen