ShaTokenizer 0.1.2


A high-performance BPE (Byte Pair Encoding) tokenizer with Python bindings and a header-only C++ implementation.

Features

  • 🚀 Fast C++ Core: Header-only C++ implementation for maximum performance
  • 🐍 Python Bindings: Easy-to-use Python API with pybind11
  • 💾 Serialization: Save and load trained tokenizer models
  • 🔧 Header-Only: Simple integration into C++ projects
  • 📊 BPE Algorithm: Efficient subword tokenization
  • 🎯 Cross-Platform: Works on Windows, macOS, and Linux
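
As a rough illustration of what a BPE core does, here is a minimal pure-Python sketch of training merges: repeatedly find the most frequent adjacent symbol pair and fuse it into one symbol. This is illustrative only, not ShaTokenizer's actual C++ implementation.

```python
from collections import Counter

def most_frequent_pair(symbols):
    """Count adjacent symbol pairs and return the most common one (or None)."""
    pairs = Counter(zip(symbols, symbols[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(symbols, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

# Start from individual characters and apply a few merges.
symbols = list("low lower lowest")
for _ in range(3):
    pair = most_frequent_pair(symbols)
    if pair is None:
        break
    symbols = merge_pair(symbols, pair)
print(symbols)
```

Each merge grows the vocabulary by one learned subword; a real trainer repeats this until the target vocabulary size is reached.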

⏱️ Tokenization Time Benchmark

🔄 Version Comparison: v0.1.1 vs v0.1.0

Version 0.1.1 brings significant tokenization-speed improvements over the initial 0.1.0 release.

| Tokens Processed | Time (v0.1.0)  | Time (v0.1.1) | Speedup |
|------------------|----------------|---------------|---------|
| 0                | 84,157 µs      | 5,502 µs      | ~15×    |
| 100              | 6,977,301 µs   | 642,335 µs    | ~10.9×  |
| 200              | 14,437,683 µs  | 1,370,924 µs  | ~10.5×  |
| 300              | 20,902,067 µs  | 2,154,547 µs  | ~9.7×   |
| 400              | 26,554,987 µs  | 2,967,434 µs  | ~8.9×   |
| 500              | 32,350,267 µs  | 3,798,688 µs  | ~8.5×   |
| 600              | 38,075,928 µs  | 4,630,268 µs  | ~8.2×   |
| 700              | 43,831,217 µs  | 5,471,428 µs  | ~8.0×   |
| 800              | 49,559,857 µs  | 6,316,320 µs  | ~7.8×   |
| 900              | 56,149,850 µs  | 7,166,352 µs  | ~7.8×   |
| 1000             | 62,877,499 µs  | (Pending)     | (N/A)   |

Overall, version 0.1.1 is 7–15× faster across the board due to internal optimizations and improved data structures.

💡 Benchmark run with vocab size of 1000 tokens. Measurements are approximate and may vary slightly based on system specs.
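
Timings like these can be collected with a simple wall-clock harness. The sketch below wraps `time.perf_counter` around a stand-in `encode` function; to benchmark the real thing, swap in a trained `ShaTokenizer` instance (the harness is what is shown here, not the tokenizer).

```python
import time

def encode(text):
    # Stand-in for tokenizer.encode(); replace with a trained tokenizer.
    return [ord(c) for c in text]

def time_encode(text, repeats=5):
    """Return the best-of-N wall-clock time for encode(text), in microseconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        encode(text)
        best = min(best, time.perf_counter() - start)
    return best * 1_000_000

sample = "hello this tokenizer " * 1000
print(f"best of {5}: {time_encode(sample):.1f} µs")
```

Taking the best of several repeats reduces noise from OS scheduling; absolute numbers will still vary with hardware, as the note above says.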


📈 Visual Benchmark

You can visualize the performance improvements in the chart below:

[Chart: Tokenization Time Comparison, v0.1.0 vs v0.1.1 (lower is better)]


Installation

From PyPI

pip install shatokenizer

From Source

git clone https://github.com/shaheen-coder/shatokenizer.git
cd shatokenizer
pip install .

Quick Start

Python Usage

from shatokenizer import ShaTokenizer

# Create tokenizer instance
tokenizer = ShaTokenizer()

# Train on your dataset
tokenizer.train('dataset.txt', 1000)

# Encode text to token IDs
tokens = tokenizer.encode('hello this tokenizer')
print(tokens)  # [123, 43, 1211]

# Decode token IDs back to text
text = tokenizer.decode(tokens)
print(text)  # "hello this tokenizer"

# Save trained model
tokenizer.save("shatokenizer.pkl")

# Load trained model
tokenizer2 = ShaTokenizer.load("shatokenizer.pkl")

C++ Usage

#include <iostream>
#include <shatokenizer/tokenizer.hpp>

int main() {
    ShaTokenizer tokenizer;  // stack allocation; no manual new/delete needed
    tokenizer.train("data.txt", 1000);

    auto ids = tokenizer.encode("hello");
    std::cout << "decode: " << tokenizer.decode(ids) << std::endl;
    return 0;
}

API Reference

Python API

ShaTokenizer()

Creates a new tokenizer instance.

train(dataset_path: str, vocab_size: int) -> None

Trains the tokenizer on the provided dataset.

  • dataset_path: Path to the training text file
  • vocab_size: Target vocabulary size

encode(text: str) -> List[int]

Encodes text into token IDs.

  • text: Input text to encode
  • Returns: List of token IDs

decode(tokens: List[int]) -> str

Decodes token IDs back to text.

  • tokens: List of token IDs
  • Returns: Decoded text string

save(path: str) -> None

Saves the trained tokenizer model.

  • path: Output file path

load(path: str) -> ShaTokenizer (static method)

Loads a trained tokenizer model.

  • path: Path to saved model file
  • Returns: Loaded tokenizer instance
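
The `.pkl` extension used in the Quick Start suggests a pickle-style serialization. The sketch below shows the save/load round-trip pattern on a toy vocabulary dict; this is an assumption for illustration, not ShaTokenizer's actual on-disk format.

```python
import os
import pickle
import tempfile

# Toy stand-in for a trained tokenizer's state (NOT ShaTokenizer's real format).
model = {"vocab": {"hello": 123, "this": 43, "tokenizer": 1211}}

# Save, then load, mirroring tokenizer.save(path) / ShaTokenizer.load(path).
path = os.path.join(tempfile.mkdtemp(), "toy.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)
with open(path, "rb") as f:
    restored = pickle.load(f)

assert restored == model  # the round trip preserves the model exactly
```

Whatever the real format is, the contract implied by the API is the same: `load(path)` must reconstruct a tokenizer whose `encode`/`decode` behavior matches the one that was saved.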

Building from Source

Requirements

  • Python 3.7+
  • C++17 compatible compiler
  • pybind11
  • CMake (for C++ development)

Build Instructions

# Clone the repository
git clone https://github.com/shaheen-coder/shatokenizer.git
cd shatokenizer

# Install in development mode
pip install -e .

Contributing

We welcome contributions! Please see our Contributing Guide for details.

Development Setup

  1. Fork the repository
  2. Create a virtual environment:
    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install development dependencies:
    pip install -e ".[dev]"
  4. Run tests:
    pytest

Code Style

  • Python code should follow PEP 8
  • C++ code should follow Google C++ Style Guide
  • Use black for Python formatting
  • Use clang-format for C++ formatting

Submitting Changes

  1. Create a feature branch: git checkout -b feature-name
  2. Make your changes and add tests
  3. Ensure all tests pass: pytest
  4. Format your code: black . and clang-format -i src/*.cpp src/*.hpp
  5. Commit your changes: git commit -m "Add feature"
  6. Push to your fork: git push origin feature-name
  7. Open a Pull Request

Performance

ShaTokenizer is designed for high performance:

  • C++ core implementation for speed-critical operations
  • Minimal Python overhead with pybind11
  • Efficient memory usage with header-only design
  • Optimized BPE algorithm implementation

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • Built with pybind11
  • Inspired by modern tokenization libraries
  • Thanks to all contributors


Made with ❤️ by Shaheen
