A simple, efficient Tamil text tokenizer library with a modern Python package structure.
- Tamil Text Tokenization: Comprehensive tokenization for Tamil text
- Multiple Tokenization Methods: Word, sentence, character, syllable, and grapheme-level tokenization
- Enhanced Text Normalization: Unicode normalization, digit standardization, punctuation standardization
- Script Information Analysis: Comprehensive script detection, complexity scoring, and readability assessment
- Language Detection: Automatic Tamil language detection with confidence scores
- Text Validation: Tamil text validation with configurable thresholds
- Character Type Analysis: Detailed analysis of vowels, consonants, conjuncts, and other character types
- Modern Python API: Clean, type-hinted interface with both functional and object-oriented approaches
- Command Line Interface: Full-featured CLI tool for Tamil text processing
- Fast Processing: Efficient regex-based operations
- Error Handling: Comprehensive exception handling with meaningful error messages
- Well Tested: Extensive test suite with high coverage
- Type Hints: Full type annotation support for better IDE experience
pip install tamil-tokenizer
- Python 3.8+
- regex >= 2022.0.0
For development:
pip install tamil-tokenizer[dev]
from tamil_tokenizer import tokenize_words, tokenize_sentences, TamilTokenizer
# Quick tokenization
words = tokenize_words("தமிழ் மொழி அழகான மொழி")
print(f"Words: {words}")
sentences = tokenize_sentences("வணக்கம். நீங்கள் எப்படி இருக்கிறீர்கள்?")
print(f"Sentences: {sentences}")
# Using TamilTokenizer class
tokenizer = TamilTokenizer()
tokens = tokenizer.tokenize("தமிழ் உரை", method="words")
print(f"Tokens: {tokens}")
from tamil_tokenizer import tokenize_words, tokenize_sentences, tokenize_characters
# Word tokenization
text = "தமிழ் மொழி அழகான மொழி"
words = tokenize_words(text)
print(f"Words: {words}")
# Output: ['தமிழ்', 'மொழி', 'அழகான', 'மொழி']
# Sentence tokenization
text = "வணக்கம். நீங்கள் எப்படி இருக்கிறீர்கள்? நன்றாக இருக்கிறேன்!"
sentences = tokenize_sentences(text)
print(f"Sentences: {sentences}")
# Output: ['வணக்கம்', 'நீங்கள் எப்படி இருக்கிறீர்கள்', 'நன்றாக இருக்கிறேன்']
# Character tokenization
text = "தமிழ்"
characters = tokenize_characters(text)
print(f"Characters: {characters}")
# Output: ['த', 'ம', 'ி', 'ழ', '்']
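The library also supports syllable- and grapheme-level tokenization (see the changelog below). A minimal sketch using the general tokenize() method, which accepts "syllables" and "graphemes" as methods; exact token boundaries depend on the library's syllable rules, so no expected output is shown:

from tamil_tokenizer import TamilTokenizer

# Syllable and grapheme tokenization via the general method
tokenizer = TamilTokenizer()
syllables = tokenizer.tokenize("தமிழ்", method="syllables")
graphemes = tokenizer.tokenize("தமிழ்", method="graphemes")
print(f"Syllables: {syllables}")
print(f"Graphemes: {graphemes}")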
from tamil_tokenizer import TamilTokenizer
# Create tokenizer instance
tokenizer = TamilTokenizer()
# General tokenization method
text = "தமிழ் மொழி அழகான மொழி"
words = tokenizer.tokenize(text, method="words")
sentences = tokenizer.tokenize(text, method="sentences")
characters = tokenizer.tokenize(text, method="characters")
print(f"Words: {words}")
print(f"Sentences: {sentences}")
print(f"Characters: {characters}")
from tamil_tokenizer import clean_text, normalize_text, TamilTokenizer
# Clean text with extra whitespace
messy_text = " தமிழ் மொழி அழகு "
cleaned = clean_text(messy_text)
print(f"Cleaned: '{cleaned}'")
# Output: 'தமிழ் மொழி அழகு'
# Clean text and remove punctuation
tokenizer = TamilTokenizer()
text_with_punct = "தமிழ், மொழி! அழகு?"
cleaned_no_punct = tokenizer.clean_text(text_with_punct, remove_punctuation=True)
print(f"No punctuation: '{cleaned_no_punct}'")
# Output: 'தமிழ் மொழி அழகு'
# Normalize text
normalized = normalize_text(messy_text)
print(f"Normalized: '{normalized}'")
# Output: 'தமிழ் மொழி அழகு'
from tamil_tokenizer import normalize_text, TamilTokenizer
tokenizer = TamilTokenizer()
# Comprehensive normalization with all options
text = " தமிழ்—௧௨௩\u200Cமொழி…அழகான—மொழி "
normalized = tokenizer.normalize_text(
    text,
    form="NFC",                    # Unicode normalization form
    standardize_digits=True,       # Tamil digits to Arabic numerals
    standardize_punctuation=True,  # Standardize punctuation
    remove_zero_width=True,        # Remove invisible characters
)
print(f"Normalized: '{normalized}'")
# Output: 'தமிழ்-123மொழி...அழகான-மொழி'
# Tamil digit standardization
text_with_digits = "தமிழ் ௧௨௩௪ வருடங்கள் பழமையான மொழி"
standardized = normalize_text(text_with_digits, standardize_digits=True)
print(f"Standardized: {standardized}")
# Output: 'தமிழ் 1234 வருடங்கள் பழமையான மொழி'
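Normalization pairs naturally with tokenization as a preprocessing step. A minimal sketch combining the convenience functions shown above:

from tamil_tokenizer import normalize_text, tokenize_words

# Standardize the text first, then tokenize the result
raw = " தமிழ் ௧௨௩ மொழி "
words = tokenize_words(normalize_text(raw, standardize_digits=True))
print(f"Words: {words}")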
from tamil_tokenizer import get_script_info, TamilTokenizer
tokenizer = TamilTokenizer()
# Comprehensive script analysis
text = "தமிழ் மொழி உலகின் பழமையான மொழிகளில் ஒன்று"
info = tokenizer.get_script_info(text)
print(f"Tamil percentage: {info['tamil_percentage']:.1f}%")
print(f"Complexity score: {info['complexity_score']:.2f}")
print(f"Readability level: {info['readability_level']}")
print(f"Scripts detected: {info['scripts_detected']}")
print(f"Has conjuncts: {info['has_conjuncts']}")
print(f"Unicode blocks: {info['unicode_blocks']}")
# Character type analysis
char_types = info['character_types']
print(f"Vowels: {char_types['vowels']}")
print(f"Consonants: {char_types['consonants']}")
print(f"Vowel signs: {char_types['vowel_signs']}")
from tamil_tokenizer import detect_language, TamilTokenizer
tokenizer = TamilTokenizer()
# Detect language with confidence
texts = [
    "தமிழ் மொழி அழகான மொழி",
    "தமிழ் Tamil மொழி Language",
    "Hello World English Text",
]
for text in texts:
    result = tokenizer.detect_language(text)
    print(f"Text: {text}")
    print(f"Language: {result['primary_language']}")
    print(f"Confidence: {result['confidence']:.2f}")
    print(f"Is Tamil: {result['is_tamil']}")
    print("---")
from tamil_tokenizer import is_valid_tamil_text, TamilTokenizer
tokenizer = TamilTokenizer()
# Validate Tamil text with different thresholds
texts = [
    "தமிழ் மொழி அழகான மொழி",
    "தமிழ் Tamil மொழி",
    "Hello World",
]
for text in texts:
    # Default threshold (50%)
    is_valid_default = tokenizer.is_valid_tamil_text(text)
    # Strict threshold (80%)
    is_valid_strict = tokenizer.is_valid_tamil_text(text, min_tamil_percentage=80.0)
    print(f"Text: {text}")
    print(f"Valid (50%): {is_valid_default}")
    print(f"Valid (80%): {is_valid_strict}")
    print("---")
from tamil_tokenizer import TamilTokenizer
tokenizer = TamilTokenizer()
text = "தமிழ் மொழி அழகான மொழி. இது உலகின் பழமையான மொழிகளில் ஒன்று!"
stats = tokenizer.get_statistics(text)
print(f"Total characters: {stats['total_characters']}")
print(f"Tamil characters: {stats['tamil_characters']}")
print(f"Words: {stats['words']}")
print(f"Sentences: {stats['sentences']}")
print(f"Average word length: {stats['average_word_length']:.2f}")
print(f"Average sentence length: {stats['average_sentence_length']:.2f}")
from tamil_tokenizer import tokenize_words
from tamil_tokenizer.exceptions import InvalidTextError, TokenizationError
try:
    words = tokenize_words("")  # Empty text
except InvalidTextError as e:
    print(f"Invalid text: {e}")
try:
    words = tokenize_words(None)  # None text
except InvalidTextError as e:
    print(f"Invalid text: {e}")
The library includes a comprehensive CLI tool:
# Basic word tokenization (default)
tamil-tokenizer "தமிழ் மொழி அழகான மொழி"
# Sentence tokenization
tamil-tokenizer --method sentences "வணக்கம். நலமா?"
# Character tokenization
tamil-tokenizer --method characters "தமிழ்"
# Show text statistics
tamil-tokenizer --stats "தமிழ் உரை"
# Clean text
tamil-tokenizer --clean "தமிழ் உரை"
# Clean text and remove punctuation
tamil-tokenizer --clean --remove-punctuation "தமிழ், உரை!"
# JSON output
tamil-tokenizer --json "தமிழ் மொழி"
# Verbose output
tamil-tokenizer --verbose "தமிழ் மொழி"
# Basic tokenization
$ tamil-tokenizer "தமிழ் மொழி அழகான மொழி"
தமிழ்
மொழி
அழகான
மொழி
# Sentence tokenization with verbose output
$ tamil-tokenizer --method sentences --verbose "வணக்கம். நலமா?"
Tokenization method: sentences
Input text: வணக்கம். நலமா?
Token count: 2
Tokens:
--------------------
1. வணக்கம்
2. நலமா
# Text statistics
$ tamil-tokenizer --stats "தமிழ் மொழி"
Total characters: 9
Tamil characters: 8
Words: 2
Sentences: 1
Average word length: 4.00
Average sentence length: 2.00
# JSON output
$ tamil-tokenizer --json "தமிழ் மொழி"
{
  "method": "words",
  "input_text": "தமிழ் மொழி",
  "tokens": ["தமிழ்", "மொழி"],
  "token_count": 2
}
tokenize_words(text)
Tokenize Tamil text into words.
Parameters:
- text: Tamil text to tokenize
Returns: List of word tokens

tokenize_sentences(text)
Tokenize Tamil text into sentences.
Parameters:
- text: Tamil text to tokenize
Returns: List of sentence tokens

tokenize_characters(text)
Tokenize Tamil text into individual characters.
Parameters:
- text: Tamil text to tokenize
Returns: List of character tokens (Tamil characters only)

clean_text(text, remove_punctuation=False)
Clean Tamil text by normalizing whitespace and optionally removing punctuation.
Parameters:
- text: Text to clean
- remove_punctuation: Whether to remove non-Tamil punctuation
Returns: Cleaned text

normalize_text(text)
Normalize Tamil text by cleaning and standardizing its format.
Parameters:
- text: Text to normalize
Returns: Normalized text

TamilTokenizer
Main class for Tamil text tokenization operations.
Methods:
- tokenize(text, method="words"): General tokenization method
- tokenize_words(text): Tokenize into words
- tokenize_sentences(text): Tokenize into sentences
- tokenize_characters(text): Tokenize into characters
- clean_text(text, remove_punctuation=False): Clean text
- normalize_text(text): Normalize text
- get_statistics(text): Get text statistics

Exceptions (in tamil_tokenizer.exceptions):
- A base exception class for the tamil-tokenizer library.
- InvalidTextError: Raised when invalid text is provided (None, empty, or non-string).
- TokenizationError: Raised when tokenization fails due to processing errors.
# Clone the repository and install in editable mode with dev dependencies
git clone https://github.com/rajacsp/tamil-tokenizer.git
cd tamil-tokenizer
pip install -e ".[dev]"
# Run the test suite
pytest
# Run the tests with an HTML coverage report
pytest --cov=tamil_tokenizer --cov-report=html
# Format the code
black tamil_tokenizer tests examples
# Type-check the package
mypy tamil_tokenizer
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
- Fork the repository
- Create your feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add some amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
- Enhanced Text Normalization
- Comprehensive Unicode normalization (NFC, NFD, NFKC, NFKD)
- Tamil digit standardization (௦-௯ to 0-9)
- Punctuation standardization and zero-width character removal
- Script Information Analysis
- Added get_script_info() for comprehensive script analysis
- Added detect_language() for language detection with confidence scores
- Added is_valid_tamil_text() for Tamil text validation
- Character type analysis and complexity scoring
- Advanced Features
- Language detection with confidence scoring
- Text validation with configurable thresholds
- Unicode block identification and readability assessment
- Enhanced convenience functions with full parameter support
- Enhanced Tamil Tokenization
- Added syllable-level tokenization (tokenize_syllables())
- Added grapheme cluster tokenization (tokenize_graphemes())
- Added word structure analysis (analyze_word_structure())
- Improved character tokenization for better Unicode handling
- Enhanced text statistics with Tamil-specific metrics
- Better support for Tamil conjunct consonants and vowel signs
- Advanced Tamil script processing with improved regex patterns
- Fixed character tokenization test compatibility
- Enhanced tokenize() method to support "syllables" and "graphemes"
- Added comprehensive test coverage for new features
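A hedged sketch of the word-structure analysis added above, assuming analyze_word_structure() is exposed on TamilTokenizer and takes a single word (the shape of the returned analysis may differ):

from tamil_tokenizer import TamilTokenizer

# Inspect the internal structure of one Tamil word
tokenizer = TamilTokenizer()
structure = tokenizer.analyze_word_structure("தமிழ்")
print(structure)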
- Initial release
- Basic Tamil text tokenization (words, sentences, characters)
- Text cleaning and normalization
- Command-line interface
- Comprehensive test suite
- Type hints throughout the codebase
- Modern Python packaging with pyproject.toml
This library is specifically designed for Tamil text processing and uses Unicode ranges for Tamil script (U+0B80–U+0BFF). It handles:
- Tamil characters and diacritics
- Common Tamil punctuation
- Mixed Tamil-English text (extracts Tamil portions)
- Various sentence ending patterns
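For illustration, that Unicode block can be matched directly with the regex dependency; this standalone check is not the library's internal implementation:

import regex

# Match runs of code points in the Tamil block, U+0B80–U+0BFF
TAMIL_BLOCK = regex.compile(r"[\u0B80-\u0BFF]+")
print(TAMIL_BLOCK.findall("Tamil: தமிழ் text"))
# Output: ['தமிழ்']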
- The Tamil language community for inspiration
- The Python community for excellent libraries like regex
- Contributors and users who help improve this library
If you encounter any issues or have questions, please:
- Check the documentation
- Search existing issues
- Create a new issue if needed
For general questions, you can also reach out via email: raja.csp@gmail.com