tamil-tokenizer

A Tamil text tokenizer library for Python. It offers word, sentence, character, syllable, and grapheme tokenization through a type-hinted API and a command-line tool.

Features

Tamil Text Tokenization: Comprehensive tokenization for Tamil text
Multiple Tokenization Methods: Word, sentence, character, syllable, and grapheme-level tokenization
Enhanced Text Normalization: Unicode normalization, digit standardization, punctuation standardization
Script Information Analysis: Comprehensive script detection, complexity scoring, and readability assessment
Language Detection: Automatic Tamil language detection with confidence scores
Text Validation: Tamil text validation with configurable thresholds
Character Type Analysis: Detailed analysis of vowels, consonants, conjuncts, and other character types
Modern Python API: Clean, type-hinted interface with both functional and object-oriented approaches
Command Line Interface: Full-featured CLI tool for Tamil text processing
Fast Processing: Efficient regex-based operations
Error Handling: Comprehensive exception handling with meaningful error messages
Well Tested: Extensive test suite with high coverage
Type Hints: Full type annotation support for better IDE experience

Installation

pip install tamil-tokenizer

Dependencies

Python 3.8+
regex >= 2022.0.0

Optional Dependencies

For development:

pip install tamil-tokenizer[dev]

Quick Start

from tamil_tokenizer import tokenize_words, tokenize_sentences, TamilTokenizer

# Quick tokenization
words = tokenize_words("தமிழ் மொழி அழகான மொழி")
print(f"Words: {words}")

sentences = tokenize_sentences("வணக்கம். நீங்கள் எப்படி இருக்கிறீர்கள்?")
print(f"Sentences: {sentences}")

# Using TamilTokenizer class
tokenizer = TamilTokenizer()
tokens = tokenizer.tokenize("தமிழ் உரை", method="words")
print(f"Tokens: {tokens}")

Usage Examples

Basic Tokenization

from tamil_tokenizer import tokenize_words, tokenize_sentences, tokenize_characters

# Word tokenization
text = "தமிழ் மொழி அழகான மொழி"
words = tokenize_words(text)
print(f"Words: {words}")
# Output: ['தமிழ்', 'மொழி', 'அழகான', 'மொழி']

# Sentence tokenization
text = "வணக்கம். நீங்கள் எப்படி இருக்கிறீர்கள்? நன்றாக இருக்கிறேன்!"
sentences = tokenize_sentences(text)
print(f"Sentences: {sentences}")
# Output: ['வணக்கம்', 'நீங்கள் எப்படி இருக்கிறீர்கள்', 'நன்றாக இருக்கிறேன்']

# Character tokenization
text = "தமிழ்"
characters = tokenize_characters(text)
print(f"Characters: {characters}")
# Output: ['த', 'ம', 'ி', 'ழ', '்']
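
The character output above is split by Unicode code point, so vowel signs and the pulli (்) appear as separate items. The feature list also mentions grapheme-level tokenization. The following is a minimal standard-library sketch of that idea, not the library's implementation: `split_graphemes` is a hypothetical helper that attaches each combining mark to the base character before it, which only approximates full grapheme cluster segmentation.

```python
import unicodedata
from typing import List


def split_graphemes(text: str) -> List[str]:
    """Group each base character with the combining marks that follow it.

    Hypothetical helper for illustration only; it approximates grapheme
    clusters and is not the tamil-tokenizer implementation.
    """
    clusters: List[str] = []
    for ch in text:
        # Categories Mn/Mc/Me are combining marks (vowel signs, pulli).
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


print(list("தமிழ்"))             # code points: 5 items
print(split_graphemes("தமிழ்"))  # base + marks grouped: 3 items
```

Grouping by combining marks keeps the sketch dependency-free. A production tokenizer would more likely follow the Unicode text segmentation rules.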

Using TamilTokenizer Class

from tamil_tokenizer import TamilTokenizer

# Create tokenizer instance
tokenizer = TamilTokenizer()

# General tokenization method
text = "தமிழ் மொழி அழகான மொழி"
words = tokenizer.tokenize(text, method="words")
sentences = tokenizer.tokenize(text, method="sentences")
characters = tokenizer.tokenize(text, method="characters")

print(f"Words: {words}")
print(f"Sentences: {sentences}")
print(f"Characters: {characters}")

Text Cleaning and Normalization

from tamil_tokenizer import clean_text, normalize_text, TamilTokenizer

# Clean text with extra whitespace
messy_text = "  தமிழ்   மொழி   அழகு  "
cleaned = clean_text(messy_text)
print(f"Cleaned: '{cleaned}'")
# Output: 'தமிழ் மொழி அழகு'

# Clean text and remove punctuation
tokenizer = TamilTokenizer()
text_with_punct = "தமிழ், மொழி! அழகு?"
cleaned_no_punct = tokenizer.clean_text(text_with_punct, remove_punctuation=True)
print(f"No punctuation: '{cleaned_no_punct}'")
# Output: 'தமிழ் மொழி அழகு'

# Normalize text
normalized = normalize_text(messy_text)
print(f"Normalized: '{normalized}'")
# Output: 'தமிழ் மொழி அழகு'

Enhanced Text Normalization

from tamil_tokenizer import normalize_text, TamilTokenizer

tokenizer = TamilTokenizer()

# Comprehensive normalization with all options
text = "  தமிழ்—௧௨௩\u200Cமொழி…அழகான—மொழி  "
normalized = tokenizer.normalize_text(
    text,
    form="NFC",                    # Unicode normalization
    standardize_digits=True,       # Tamil digits to Arabic
    standardize_punctuation=True,  # Standardize punctuation
    remove_zero_width=True         # Remove invisible characters
)
print(f"Normalized: '{normalized}'")
# Output: 'தமிழ்-123மொழி...அழகான-மொழி'

# Tamil digit standardization
text_with_digits = "தமிழ் ௧௨௩௪ வருடங்கள் பழமையான மொழி"
standardized = normalize_text(text_with_digits, standardize_digits=True)
print(f"Standardized: {standardized}")
# Output: 'தமிழ் 1234 வருடங்கள் பழமையான மொழி'
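
To make the normalization steps concrete, here is a small standard-library sketch of digit standardization and zero-width character removal. It is an illustration under stated assumptions, not the library's `normalize_text()`: the name `normalize_sketch`, the set of zero-width characters, and the whitespace collapsing are all choices made for this example.

```python
import unicodedata

# Tamil digits ௦-௯ occupy U+0BE6..U+0BEF; map each to its ASCII digit.
TAMIL_TO_ASCII_DIGITS = {0x0BE6 + i: str(i) for i in range(10)}

# Assumed set of invisible characters: ZWSP, ZWNJ, ZWJ, and BOM/ZWNBSP.
# A value of None tells str.translate to delete the character.
ZERO_WIDTH = dict.fromkeys([0x200B, 0x200C, 0x200D, 0xFEFF])


def normalize_sketch(text: str, form: str = "NFC") -> str:
    """Hypothetical stand-in, not the library's normalize_text()."""
    text = unicodedata.normalize(form, text)      # canonical composition
    text = text.translate(ZERO_WIDTH)             # drop invisible characters
    text = text.translate(TAMIL_TO_ASCII_DIGITS)  # ௧௨௩ -> 123
    return " ".join(text.split())                 # collapse whitespace


print(normalize_sketch("  தமிழ் ௧௨௩௪\u200c வருடங்கள்  "))
```

Using `str.translate` with a lookup table handles every digit in a single pass over the text.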

Script Information Analysis

from tamil_tokenizer import get_script_info, TamilTokenizer

tokenizer = TamilTokenizer()

# Comprehensive script analysis
text = "தமிழ் மொழி உலகின் பழமையான மொழிகளில் ஒன்று"
info = tokenizer.get_script_info(text)

print(f"Tamil percentage: {info['tamil_percentage']:.1f}%")
print(f"Complexity score: {info['complexity_score']:.2f}")
print(f"Readability level: {info['readability_level']}")
print(f"Scripts detected: {info['scripts_detected']}")
print(f"Has conjuncts: {info['has_conjuncts']}")
print(f"Unicode blocks: {info['unicode_blocks']}")

# Character type analysis
char_types = info['character_types']
print(f"Vowels: {char_types['vowels']}")
print(f"Consonants: {char_types['consonants']}")
print(f"Vowel signs: {char_types['vowel_signs']}")
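
The character type counts can be pictured as a classification over code point ranges in the Tamil block. The sketch below is an assumption about how such an analysis might work, not the library's code. The helper names and the `pulli`, `other_tamil`, and `non_tamil` keys are hypothetical; only `vowels`, `consonants`, and `vowel_signs` come from the example above.

```python
from collections import Counter
from typing import Dict


def classify_tamil_char(ch: str) -> str:
    """Return a coarse class for one code point (hypothetical helper).

    The ranges are contiguous spans of the Tamil block and include a few
    unassigned code points, which is harmless for counting real text.
    """
    cp = ord(ch)
    if 0x0B85 <= cp <= 0x0B94:
        return "vowels"        # independent vowels
    if 0x0B95 <= cp <= 0x0BB9:
        return "consonants"
    if 0x0BBE <= cp <= 0x0BCC:
        return "vowel_signs"   # dependent vowel marks
    if cp == 0x0BCD:
        return "pulli"         # virama
    if 0x0B80 <= cp <= 0x0BFF:
        return "other_tamil"   # digits, symbols, anusvara, ...
    return "non_tamil"


def count_character_types(text: str) -> Dict[str, int]:
    """Count classes over all non-whitespace characters."""
    return dict(Counter(classify_tamil_char(ch) for ch in text if not ch.isspace()))


print(count_character_types("தமிழ்"))
```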

Language Detection

from tamil_tokenizer import detect_language, TamilTokenizer

tokenizer = TamilTokenizer()

# Detect language with confidence
texts = [
    "தமிழ் மொழி அழகான மொழி",
    "தமிழ் Tamil மொழி Language",
    "Hello World English Text"
]

for text in texts:
    result = tokenizer.detect_language(text)
    print(f"Text: {text}")
    print(f"Language: {result['primary_language']}")
    print(f"Confidence: {result['confidence']:.2f}")
    print(f"Is Tamil: {result['is_tamil']}")
    print("---")

Text Validation

from tamil_tokenizer import is_valid_tamil_text, TamilTokenizer

tokenizer = TamilTokenizer()

# Validate Tamil text with different thresholds
texts = [
    "தமிழ் மொழி அழகான மொழி",
    "தமிழ் Tamil மொழி",
    "Hello World"
]

for text in texts:
    # Default threshold (50%)
    is_valid_default = tokenizer.is_valid_tamil_text(text)
    
    # Strict threshold (80%)
    is_valid_strict = tokenizer.is_valid_tamil_text(text, min_tamil_percentage=80.0)
    
    print(f"Text: {text}")
    print(f"Valid (50%): {is_valid_default}")
    print(f"Valid (80%): {is_valid_strict}")
    print("---")
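
As a rough picture of what a percentage threshold means, the sketch below computes the share of characters that fall in the Tamil Unicode block. How the library actually counts characters (for example whether whitespace or punctuation is included) is not stated here, so the function names and the counting rule are assumptions.

```python
def tamil_percentage(text: str) -> float:
    """Share of non-whitespace characters in U+0B80..U+0BFF, from 0 to 100.

    Hypothetical helper; the counting rule is an assumption.
    """
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    tamil = sum(1 for ch in chars if "\u0b80" <= ch <= "\u0bff")
    return 100.0 * tamil / len(chars)


def is_mostly_tamil(text: str, min_tamil_percentage: float = 50.0) -> bool:
    """True when the Tamil share meets or exceeds the threshold."""
    return tamil_percentage(text) >= min_tamil_percentage


for sample in ("தமிழ் மொழி", "தமிழ் Tamil மொழி", "Hello World"):
    print(f"{sample}: {tamil_percentage(sample):.1f}%")
```

Under this counting rule, mixed text can pass a 50% threshold and still fail an 80% one.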

Text Statistics

from tamil_tokenizer import TamilTokenizer

tokenizer = TamilTokenizer()
text = "தமிழ் மொழி அழகான மொழி. இது உலகின் பழமையான மொழிகளில் ஒன்று!"

stats = tokenizer.get_statistics(text)
print(f"Total characters: {stats['total_characters']}")
print(f"Tamil characters: {stats['tamil_characters']}")
print(f"Words: {stats['words']}")
print(f"Sentences: {stats['sentences']}")
print(f"Average word length: {stats['average_word_length']:.2f}")
print(f"Average sentence length: {stats['average_sentence_length']:.2f}")

Error Handling

from tamil_tokenizer import tokenize_words
from tamil_tokenizer.exceptions import InvalidTextError, TokenizationError

try:
    words = tokenize_words("")  # Empty text
except InvalidTextError as e:
    print(f"Invalid text: {e}")

try:
    words = tokenize_words(None)  # None text
except InvalidTextError as e:
    print(f"Invalid text: {e}")

Command Line Interface

The library includes a comprehensive CLI tool:

# Basic word tokenization (default)
tamil-tokenizer "தமிழ் மொழி அழகான மொழி"

# Sentence tokenization
tamil-tokenizer --method sentences "வணக்கம். நலமா?"

# Character tokenization
tamil-tokenizer --method characters "தமிழ்"

# Show text statistics
tamil-tokenizer --stats "தமிழ் உரை"

# Clean text
tamil-tokenizer --clean "தமிழ்   உரை"

# Clean text and remove punctuation
tamil-tokenizer --clean --remove-punctuation "தமிழ், உரை!"

# JSON output
tamil-tokenizer --json "தமிழ் மொழி"

# Verbose output
tamil-tokenizer --verbose "தமிழ் மொழி"

CLI Examples

# Basic tokenization
$ tamil-tokenizer "தமிழ் மொழி அழகான மொழி"
தமிழ்
மொழி
அழகான
மொழி

# Sentence tokenization with verbose output
$ tamil-tokenizer --method sentences --verbose "வணக்கம். நலமா?"
Tokenization method: sentences
Input text: வணக்கம். நலமா?
Token count: 2
Tokens:
--------------------
  1. வணக்கம்
  2. நலமா

# Text statistics
$ tamil-tokenizer --stats "தமிழ் மொழி"
Total characters: 9
Tamil characters: 8
Words: 2
Sentences: 1
Average word length: 4.00
Average sentence length: 2.00

# JSON output
$ tamil-tokenizer --json "தமிழ் மொழி"
{
  "method": "words",
  "input_text": "தமிழ் மொழி",
  "tokens": ["தமிழ்", "மொழி"],
  "token_count": 2
}
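
The JSON output is convenient for scripting. The sketch below parses the sample shown above with the standard `json` module. The string is pasted in as a literal so the example runs without the CLI; in practice it would be the captured standard output of the command.

```python
import json

# Sample copied from the --json example; stands in for captured CLI output.
raw = """
{
  "method": "words",
  "input_text": "தமிழ் மொழி",
  "tokens": ["தமிழ்", "மொழி"],
  "token_count": 2
}
"""

result = json.loads(raw)

# Sanity check: the reported count matches the token list.
assert result["token_count"] == len(result["tokens"])

for index, token in enumerate(result["tokens"], start=1):
    print(f"{index}. {token}")
```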

API Reference

Functions

`tokenize_words(text: str) -> List[str]`

Tokenize Tamil text into words.

Parameters:

text: Tamil text to tokenize

Returns: List of word tokens

`tokenize_sentences(text: str) -> List[str]`

Tokenize Tamil text into sentences.

Parameters:

text: Tamil text to tokenize

Returns: List of sentence tokens

`tokenize_characters(text: str) -> List[str]`

Tokenize Tamil text into individual characters.

Parameters:

text: Tamil text to tokenize

Returns: List of character tokens (Tamil characters only)

`clean_text(text: str, remove_punctuation: bool = False) -> str`

Clean Tamil text by normalizing whitespace and optionally removing punctuation.

Parameters:

text: Text to clean
remove_punctuation: Whether to remove non-Tamil punctuation

Returns: Cleaned text

`normalize_text(text: str) -> str`

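Normalize Tamil text by cleaning and standardizing format.

Parameters:

text: Text to normalize

The Enhanced Text Normalization example also passes these optional keyword arguments, which the signature above does not list (their defaults are not documented here):

form: Unicode normalization form, for example "NFC"
standardize_digits: Convert Tamil digits to Arabic digits
standardize_punctuation: Standardize punctuation
remove_zero_width: Remove zero-width (invisible) characters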

Returns: Normalized text

Classes

`TamilTokenizer()`

Main class for Tamil text tokenization operations.

Methods:

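tokenize(text, method="words"): General tokenization method; method can be "words", "sentences", "characters", "syllables", or "graphemes"
tokenize_words(text): Tokenize into words
tokenize_sentences(text): Tokenize into sentences
tokenize_characters(text): Tokenize into characters
clean_text(text, remove_punctuation=False): Clean text
normalize_text(text): Normalize text
get_statistics(text): Get text statistics
get_script_info(text): Analyze scripts, character types, complexity, and readability
detect_language(text): Detect the primary language with a confidence score
is_valid_tamil_text(text, min_tamil_percentage=50.0): Check whether the text meets a Tamil-content threshold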

Exceptions

`TamilTokenizerError`

Base exception class for tamil-tokenizer library.

`InvalidTextError`

Raised when invalid text is provided (None, empty, or non-string).

`TokenizationError`

Raised when tokenization fails due to processing errors.
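
The descriptions above suggest a small hierarchy with one common base class. The sketch below uses local stand-in classes (nothing is imported from the library) to show why catching the base class is enough to handle the specific errors. It assumes that `InvalidTextError` subclasses `TamilTokenizerError`, which the phrase "base exception class" implies but does not state. `validate_text` is a hypothetical helper.

```python
class TamilTokenizerError(Exception):
    """Stand-in for the library's base exception."""


class InvalidTextError(TamilTokenizerError):
    """Stand-in: raised for None, empty, or non-string input."""


def validate_text(text: object) -> str:
    """Hypothetical validator mirroring the documented failure cases."""
    if not isinstance(text, str):
        raise InvalidTextError(f"expected str, got {type(text).__name__}")
    # Treating whitespace-only text as empty is an assumption.
    if not text.strip():
        raise InvalidTextError("text is empty")
    return text


for candidate in (None, "", 42, "தமிழ்"):
    try:
        validate_text(candidate)
        print(f"{candidate!r}: ok")
    except TamilTokenizerError as exc:  # base class also catches subclasses
        print(f"{candidate!r}: {exc}")
```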

Development

Setup Development Environment

git clone https://github.com/rajacsp/tamil-tokenizer.git
cd tamil-tokenizer
pip install -e ".[dev]"

Run Tests

pytest

Run Tests with Coverage

pytest --cov=tamil_tokenizer --cov-report=html

Code Formatting

black tamil_tokenizer tests examples

Type Checking

mypy tamil_tokenizer

Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

Fork the repository
Create your feature branch (git checkout -b feature/amazing-feature)
Commit your changes (git commit -m 'Add some amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Changelog

v0.2.0 (2025-01-07)

Enhanced Text Normalization:
  Comprehensive Unicode normalization (NFC, NFD, NFKC, NFKD)
  Tamil digit standardization (௦-௯ to 0-9)
  Punctuation standardization and zero-width character removal
Script Information Analysis:
  Added get_script_info() for comprehensive script analysis
  Added detect_language() for language detection with confidence scores
  Added is_valid_tamil_text() for Tamil text validation
  Character type analysis and complexity scoring
Advanced Features:
  Language detection with confidence scoring
  Text validation with configurable thresholds
  Unicode block identification and readability assessment
  Enhanced convenience functions with full parameter support

v0.1.1 (2025-01-07)

Enhanced Tamil Tokenization:
  Added syllable-level tokenization (tokenize_syllables())
  Added grapheme cluster tokenization (tokenize_graphemes())
  Added word structure analysis (analyze_word_structure())
  Improved character tokenization for better Unicode handling
  Enhanced text statistics with Tamil-specific metrics
  Better support for Tamil conjunct consonants and vowel signs
  Advanced Tamil script processing with improved regex patterns
  Fixed character tokenization test compatibility
  Enhanced tokenize() method to support "syllables" and "graphemes"
  Added comprehensive test coverage for new features

v0.1.0 (2025-01-07)

Initial release
Basic Tamil text tokenization (words, sentences, characters)
Text cleaning and normalization
Command-line interface
Comprehensive test suite
Type hints throughout the codebase
Modern Python packaging with pyproject.toml

Tamil Language Support

This library is specifically designed for Tamil text processing and uses Unicode ranges for Tamil script (U+0B80–U+0BFF). It handles:

Tamil characters and diacritics
Common Tamil punctuation
Mixed Tamil-English text (extracts Tamil portions)
Various sentence ending patterns
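
The Unicode range mentioned above can be used directly in a regular expression. As a minimal sketch of extracting the Tamil portions of mixed text (an illustration using the standard `re` module, not the library's internal pattern), `extract_tamil` below is a hypothetical helper that returns every run of consecutive code points in U+0B80–U+0BFF.

```python
import re
from typing import List

# One or more consecutive code points from the Tamil block.
TAMIL_RUN = re.compile(r"[\u0B80-\u0BFF]+")


def extract_tamil(text: str) -> List[str]:
    """Return the Tamil runs in text, skipping everything else."""
    return TAMIL_RUN.findall(text)


print(extract_tamil("தமிழ் Tamil மொழி Language"))
# ['தமிழ்', 'மொழி']
```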

Acknowledgments

The Tamil language community for inspiration
The Python community for excellent libraries like regex
Contributors and users who help improve this library

Support

If you encounter any issues or have questions, please:

Check the documentation
Search existing issues
Create a new issue if needed

For general questions, you can also reach out via email: raja.csp@gmail.com
