A high-performance, header-only C++17/20 text tokenizer for NLP and machine learning. Supports UTF-8, vocabulary encoding, and special tokens like [CLS], [SEP]. Ideal for BERT, DistilBERT, and transformer models. No dependencies!
Unlike HuggingFace Tokenizers (Python) or ICU, it is a lightweight, pure-C++ alternative you can drop into any project.
Looking to build a custom tokenizer vocabulary? Use Tiny BPE Trainer - a fast, header-only Byte Pair Encoding (BPE) trainer in modern C++.
- Fast: Zero-copy processing with std::string_view
- UTF-8 Ready: Proper handling of Unicode without heavy dependencies
- Configurable: Fluent API for customizing tokenization behavior
- Header-Only: Single file, easy to integrate
- ASCII Optimized: Smart handling of ASCII vs UTF-8 characters
- Modern C++: Uses C++17/20 features for clean, efficient code
- Vocabulary Support: Load/save vocabularies, encode/decode to token IDs
- Special Tokens: Support for [CLS], [SEP], [PAD], [UNK] tokens
- ML Ready: Sequence encoding for transformer models
- C++17/20 compatible compiler (GCC 7+, Clang 5+, MSVC 2017+)
- No external dependencies - uses only standard library
#include "Modern-Text-Tokenizer.hpp"
using namespace MecanikDev;
// Simple tokenization
auto tokens = TextTokenizer::simple_split("Hello, world!");
// Advanced configuration with vocabulary
TextTokenizer tokenizer;
// Load vocabulary file
tokenizer.load_vocab("vocab.txt");
auto token_ids = tokenizer.encode("Hello, world!");
std::string decoded = tokenizer.decode(token_ids);
// Static method for simple whitespace splitting
std::vector<std::string> tokens = TextTokenizer::simple_split(text);
// Fully configurable, instance-based tokenization
TextTokenizer tokenizer;
std::vector<std::string> tokens = tokenizer.tokenize(text);
All configuration methods return TextTokenizer& for method chaining:
using namespace MecanikDev;
TextTokenizer tokenizer;
tokenizer
.set_lowercase(true) // Convert to lowercase
.set_keep_punctuation(true) // Keep punctuation as separate tokens
.set_split_on_punctuation(true) // Split on punctuation marks
.add_delimiter(',') // Add custom delimiter
.add_delimiters(".,!?") // Add multiple delimiters
.set_special_tokens("[UNK]", "[PAD]", "[CLS]", "[SEP]"); // Configure special tokens
// Load vocabulary from file
tokenizer.load_vocab("vocab.txt");
// Build vocabulary from training texts
std::vector<std::string> training_texts = {"Hello world", "Machine learning", /* ... */};
tokenizer.build_vocab_from_text(training_texts, 2, 30000); // min_freq=2, max_size=30000
// Save vocabulary
tokenizer.save_vocab("my_vocab.txt");
// Encoding and decoding
auto token_ids = tokenizer.encode("Hello world");
std::string text = tokenizer.decode(token_ids);
// Sequence encoding for ML models
auto sequence_ids = tokenizer.encode_sequence("Hello world", 512, true); // max_len=512, add_special_tokens=true
// Count tokens without storing them (memory efficient)
size_t count = tokenizer.count_tokens(text);
// Vocabulary information
size_t vocab_size = tokenizer.vocab_size();
bool has_vocab = tokenizer.has_vocab();
// Special token IDs
int unk_id = tokenizer.get_unk_id();
int pad_id = tokenizer.get_pad_id();
int cls_id = tokenizer.get_cls_id();
int sep_id = tokenizer.get_sep_id();
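These IDs come in handy when preparing fixed-length batches by hand. A minimal sketch of a padding helper built on this API (pad_batch is a hypothetical helper, not part of the library, and it assumes encode() returns std::vector<int>):
// Hypothetical helper (not part of the library): right-pad or truncate
// every encoded text in a batch to the same length using the [PAD] id.
std::vector<std::vector<int>> pad_batch(TextTokenizer& tokenizer,
                                        const std::vector<std::string>& texts,
                                        size_t max_len) {
    std::vector<std::vector<int>> batch;
    const int pad_id = tokenizer.get_pad_id();
    for (const auto& text : texts) {
        auto ids = tokenizer.encode(text);
        ids.resize(max_len, pad_id); // truncates if longer, pads if shorter
        batch.push_back(std::move(ids));
    }
    return batch;
}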
using namespace MecanikDev;
std::string text = "Natural language processing is amazing!";
// ["Natural", "language", "processing", "is", "amazing!"]
auto tokens = TextTokenizer::simple_split(text);
// Create tokenizer and build vocabulary from training data
TextTokenizer tokenizer;
std::vector<std::string> training_texts = {
"The quick brown fox jumps",
"Machine learning is fascinating",
"Natural language processing rocks"
};
tokenizer
.set_lowercase(true)
.set_split_on_punctuation(true)
.build_vocab_from_text(training_texts, 1, 1000);
// Save vocabulary for later use
tokenizer.save_vocab("my_vocab.txt");
// Encode text to token IDs
auto ids = tokenizer.encode("Machine learning rocks!");
// Example: [1, 156, 234, 445, 2] where 1=[CLS], 2=[SEP], etc.
// Decode back to text
std::string decoded = tokenizer.decode(ids);
// Load pre-trained vocabulary
TextTokenizer tokenizer;
tokenizer.load_vocab("bert_vocab.txt");
// Prepare sequence for BERT-style model
auto input_ids = tokenizer.encode_sequence(
"Hello world! How are you?",
128, // max_length
true // add_special_tokens ([CLS] and [SEP])
);
// Result: [101, 7592, 2088, 999, 2129, 2024, 2017, 1029, 102, ...]
// [CLS] Hello world ! How are you ? [SEP] ...
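Transformer models typically expect an attention mask alongside the input IDs. The library does not generate one, but it is easy to derive from the [PAD] id; a minimal sketch under that assumption:
// Hypothetical helper (not part of the library): 1 for real tokens,
// 0 for [PAD] positions, matching the usual transformer convention.
std::vector<int> attention_mask(const std::vector<int>& input_ids, int pad_id) {
    std::vector<int> mask(input_ids.size());
    for (size_t i = 0; i < input_ids.size(); ++i) {
        mask[i] = (input_ids[i] == pad_id) ? 0 : 1;
    }
    return mask;
}
auto mask = attention_mask(input_ids, tokenizer.get_pad_id());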
using namespace MecanikDev;
TextTokenizer preprocessor;
preprocessor
.set_lowercase(true)
.set_split_on_punctuation(true);
// ["hello", "world"]
auto tokens = preprocessor.tokenize("Hello, World!");
TextTokenizer analyzer;
analyzer
.set_keep_punctuation(true)
.set_split_on_punctuation(true);
// ["What", "?", "!", "Really", "?"]
auto tokens = analyzer.tokenize("What?! Really?");
TextTokenizer csv_tokenizer;
csv_tokenizer.add_delimiters(",;|");
// ["name", "age", "city", "country"]
auto fields = csv_tokenizer.tokenize("name,age;city|country");
using namespace MecanikDev;
std::string multilingual = "Hello 世界 🌍 مرحبا";
auto tokens = TextTokenizer::simple_split(multilingual);
// ["Hello", "世界", "🌍", "مرحبا"]
// Lowercase preserves non-ASCII characters
auto lower_tokens = TextTokenizer()
.set_lowercase(true)
.tokenize("Hello 世界");
// ["hello", "世界"] - Chinese characters preserved
# Download the DistilBERT vocabulary
curl -o vocab.txt https://huggingface.co/distilbert/distilbert-base-uncased/raw/main/vocab.txt
# Or using wget
wget https://huggingface.co/distilbert/distilbert-base-uncased/raw/main/vocab.txt
#include <iostream> // for std::cout
using namespace MecanikDev;
// Load DistilBERT vocabulary
TextTokenizer tokenizer;
if (tokenizer.load_vocab("vocab.txt")) {
std::cout << "Loaded " << tokenizer.vocab_size() << " tokens" << std::endl;
// Configure for DistilBERT-style tokenization
tokenizer
.set_lowercase(true) // DistilBERT uses lowercase
.set_split_on_punctuation(true)
.set_keep_punctuation(true);
// Test encoding
auto token_ids = tokenizer.encode("Hello, world!");
// Result: [7592, 1010, 2088, 999] (example IDs)
// Encode with special tokens for ML
auto sequence = tokenizer.encode_sequence("Hello, world!", 512, true);
// Result: [101, 7592, 1010, 2088, 999, 102] ([CLS] + tokens + [SEP])
// Decode back
std::string text = tokenizer.decode(token_ids);
// Result: "hello , world !"
}
- Zero Dependencies: No ICU, Boost, or other heavy libraries
- UTF-8 Safe: Detects UTF-8 boundaries without corrupting multibyte sequences
- ASCII Optimized: Fast path for ASCII operations (case conversion, punctuation)
- Memory Efficient: Minimal allocations during tokenization
- Configurable: Fluent interface for different use cases
- Time Complexity: O(n) where n is input length
- Space Complexity: O(t) where t is number of tokens
- UTF-8 Handling: O(1) character boundary detection (see the sketch below)
- Memory: Uses std::string_view for zero-copy input processing
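O(1) boundary detection is possible because UTF-8 encodes each character's byte length in its leading byte. A minimal sketch of the technique (illustrative only, not the library's actual internals):
// Leading-byte patterns determine sequence length in constant time:
// 0xxxxxxx = 1 byte (ASCII), 110xxxxx = 2, 1110xxxx = 3, 11110xxx = 4.
// Continuation bytes always match 10xxxxxx, so no scanning is needed.
inline size_t utf8_char_length(unsigned char lead) {
    if (lead < 0x80)         return 1; // ASCII fast path
    if ((lead >> 5) == 0x6)  return 2; // 110xxxxx
    if ((lead >> 4) == 0xE)  return 3; // 1110xxxx
    if ((lead >> 3) == 0x1E) return 4; // 11110xxx
    return 1; // invalid lead byte: treat as a single byte and move on
}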
Benchmark results on a typical text corpus:
Performance test with 174000 characters
Results:
Tokenization: 2159 μs (22000 tokens)
Encoding: 1900 μs
Decoding: 430 μs
Total time: 4.49 ms
Throughput: 36.97 MB/s
Benchmark on AMD Ryzen 9 5900X, compiled with -O3.
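Numbers like these depend on compiler, hardware, and corpus, so measure on your own data. A minimal harness for reproducing this kind of measurement might look as follows (a sketch assuming only the API shown above; loading the corpus is up to you):
#include "Modern-Text-Tokenizer.hpp"
#include <chrono>
#include <iostream>
#include <string>

using namespace MecanikDev;

void benchmark(TextTokenizer& tokenizer, const std::string& corpus) {
    using clock = std::chrono::steady_clock;
    auto t0 = clock::now();
    auto tokens = tokenizer.tokenize(corpus);   // tokenization pass
    auto t1 = clock::now();
    auto ids = tokenizer.encode(corpus);        // encoding pass
    auto t2 = clock::now();
    auto text = tokenizer.decode(ids);          // decoding pass
    auto t3 = clock::now();
    (void)text; // keep the result so the call is not optimized away

    auto us = [](auto a, auto b) {
        return std::chrono::duration_cast<std::chrono::microseconds>(b - a).count();
    };
    double total_s = us(t0, t3) / 1e6;
    std::cout << "Tokenization: " << us(t0, t1) << " us (" << tokens.size() << " tokens)\n"
              << "Encoding: " << us(t1, t2) << " us\n"
              << "Decoding: " << us(t2, t3) << " us\n"
              << "Throughput: " << (corpus.size() / 1e6) / total_s << " MB/s\n";
}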
Simply include the header:
#include "Modern-Text-Tokenizer.hpp"
# Add to your CMakeLists.txt
add_executable(your_app main.cpp Modern-Text-Tokenizer.hpp)
target_compile_features(your_app PRIVATE cxx_std_17)
g++ -std=c++17 -O3 -o tokenizer_demo main.cpp
clang++ -std=c++17 -O3 -o tokenizer_demo main.cpp
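For reference, a minimal main.cpp that these commands would build (a sketch using only the API shown above):
#include "Modern-Text-Tokenizer.hpp"
#include <iostream>

int main() {
    using namespace MecanikDev;
    // Split a sentence on whitespace and print one token per line.
    for (const auto& token : TextTokenizer::simple_split("Hello, tokenizer world!")) {
        std::cout << token << '\n';
    }
    return 0;
}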
The included demo shows various tokenization scenarios:
./tokenizer_demo
Expected output includes:
- Basic tokenization examples
- Unicode handling demonstration
- Performance benchmarks
- Configuration examples
- Regex Support: Pattern-based tokenization
- Streaming API: Process large files without loading into memory
- Parallel Processing: Multi-threaded batch tokenization
- Custom Normalizers: User-defined text preprocessing
- Subword Tokenization: BPE/WordPiece support
- Benchmark Suite: Comprehensive performance testing
- C++20 Features: Ranges, concepts, and modules
- SIMD Optimization: Vectorized string processing
- Memory Mapping: For huge file processing
- Language Detection: Automatic handling of different scripts
Contributions welcome! Areas of interest:
- Performance Optimization: SIMD, better algorithms
- Unicode Enhancement: Better normalization without ICU
- Testing: More edge cases and benchmarks
- Documentation: Examples and tutorials
MIT License - see LICENSE file for details.
- Inspired by modern tokenization libraries like HuggingFace Tokenizers
- UTF-8 handling techniques from various C++ Unicode resources
- Performance optimizations learned from high-performance text processing
⭐ Star this repo if you find it useful!
Built with ❤️ for the C++ and NLP community