kactlabs/tamil-tokenizer

tamil-tokenizer

A simple, efficient Python library for tokenizing Tamil text.

Python: 3.8+ | License: Apache-2.0

Features

  • Tamil Text Tokenization: Comprehensive tokenization for Tamil text
  • Multiple Tokenization Methods: Word, sentence, character, syllable, and grapheme-level tokenization
  • Enhanced Text Normalization: Unicode normalization, digit standardization, punctuation standardization
  • Script Information Analysis: Comprehensive script detection, complexity scoring, and readability assessment
  • Language Detection: Automatic Tamil language detection with confidence scores
  • Text Validation: Tamil text validation with configurable thresholds
  • Character Type Analysis: Detailed analysis of vowels, consonants, conjuncts, and other character types
  • Modern Python API: Clean, type-hinted interface with both functional and object-oriented approaches
  • Command Line Interface: Full-featured CLI tool for Tamil text processing
  • Fast Processing: Efficient regex-based operations
  • Error Handling: Comprehensive exception handling with meaningful error messages
  • Well Tested: Extensive test suite with high coverage
  • Type Hints: Full type annotation support for better IDE experience

Installation

pip install tamil-tokenizer

Dependencies

  • Python 3.8+
  • regex >= 2022.0.0

Optional Dependencies

For development:

pip install tamil-tokenizer[dev]

Quick Start

from tamil_tokenizer import tokenize_words, tokenize_sentences, TamilTokenizer

# Quick tokenization
words = tokenize_words("தமிழ் மொழி அழகான மொழி")
print(f"Words: {words}")

sentences = tokenize_sentences("வணக்கம். நீங்கள் எப்படி இருக்கிறீர்கள்?")
print(f"Sentences: {sentences}")

# Using TamilTokenizer class
tokenizer = TamilTokenizer()
tokens = tokenizer.tokenize("தமிழ் உரை", method="words")
print(f"Tokens: {tokens}")

Usage Examples

Basic Tokenization

from tamil_tokenizer import tokenize_words, tokenize_sentences, tokenize_characters

# Word tokenization
text = "தமிழ் மொழி அழகான மொழி"
words = tokenize_words(text)
print(f"Words: {words}")
# Output: ['தமிழ்', 'மொழி', 'அழகான', 'மொழி']

# Sentence tokenization
text = "வணக்கம். நீங்கள் எப்படி இருக்கிறீர்கள்? நன்றாக இருக்கிறேன்!"
sentences = tokenize_sentences(text)
print(f"Sentences: {sentences}")
# Output: ['வணக்கம்', 'நீங்கள் எப்படி இருக்கிறீர்கள்', 'நன்றாக இருக்கிறேன்']

# Character tokenization
text = "தமிழ்"
characters = tokenize_characters(text)
print(f"Characters: {characters}")
# Output: ['த', 'ம', 'ி', 'ழ', '்']
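
The character output above splits the word into Unicode code points, so vowel signs and the pulli appear as separate items. The sketch below uses only the standard library to show those code points and then regroups each combining mark with its base letter. The helper names are hypothetical, and the grouping rule is a simplification of grapheme clustering, not the library's own implementation.

```python
import unicodedata

def describe_code_points(text):
    """Return (code point, Unicode name) pairs for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

def group_marks(text):
    """Attach each combining mark (general category M*) to the preceding character."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

for code, name in describe_code_points("தமிழ்"):
    print(code, name)

# Three clusters: a bare consonant, consonant + vowel sign, consonant + pulli
print(group_marks("தமிழ்"))
```

Grouping by general category is enough for simple words like this one; full grapheme segmentation follows the Unicode text segmentation rules and handles more cases.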

Using TamilTokenizer Class

from tamil_tokenizer import TamilTokenizer

# Create tokenizer instance
tokenizer = TamilTokenizer()

# General tokenization method
text = "தமிழ் மொழி அழகான மொழி"
words = tokenizer.tokenize(text, method="words")
sentences = tokenizer.tokenize(text, method="sentences")
characters = tokenizer.tokenize(text, method="characters")

print(f"Words: {words}")
print(f"Sentences: {sentences}")
print(f"Characters: {characters}")

Text Cleaning and Normalization

from tamil_tokenizer import clean_text, normalize_text, TamilTokenizer

# Clean text with extra whitespace
messy_text = "  தமிழ்   மொழி   அழகு  "
cleaned = clean_text(messy_text)
print(f"Cleaned: '{cleaned}'")
# Output: 'தமிழ் மொழி அழகு'

# Clean text and remove punctuation
tokenizer = TamilTokenizer()
text_with_punct = "தமிழ், மொழி! அழகு?"
cleaned_no_punct = tokenizer.clean_text(text_with_punct, remove_punctuation=True)
print(f"No punctuation: '{cleaned_no_punct}'")
# Output: 'தமிழ் மொழி அழகு'

# Normalize text
normalized = normalize_text(messy_text)
print(f"Normalized: '{normalized}'")
# Output: 'தமிழ் மொழி அழகு'
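
For readers curious what this kind of cleaning involves, here is a minimal standard-library sketch. It assumes "punctuation" means ASCII punctuation only and that cleaning collapses runs of whitespace into single spaces; `clean_sketch` is a hypothetical name, and the library's actual rules may differ.

```python
import string

# Map every ASCII punctuation character to a space (assumption: ASCII only).
PUNCT_TO_SPACE = str.maketrans({p: " " for p in string.punctuation})

def clean_sketch(text, remove_punctuation=False):
    """Collapse whitespace and optionally replace ASCII punctuation with spaces."""
    if remove_punctuation:
        text = text.translate(PUNCT_TO_SPACE)
    # str.split() with no arguments splits on any whitespace run and drops empties.
    return " ".join(text.split())

print(repr(clean_sketch("  தமிழ்   மொழி   அழகு  ")))
print(repr(clean_sketch("தமிழ், மொழி! அழகு?", remove_punctuation=True)))
```

Replacing punctuation with a space rather than deleting it keeps words apart when the input has no space after a comma.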

Enhanced Text Normalization

from tamil_tokenizer import normalize_text, TamilTokenizer

tokenizer = TamilTokenizer()

# Comprehensive normalization with all options
text = "  தமிழ்—௧௨௩\u200Cமொழி…அழகான—மொழி  "
normalized = tokenizer.normalize_text(
    text,
    form="NFC",                    # Unicode normalization
    standardize_digits=True,       # Tamil digits to Arabic
    standardize_punctuation=True,  # Standardize punctuation
    remove_zero_width=True         # Remove invisible characters
)
print(f"Normalized: '{normalized}'")
# Output: 'தமிழ்-123மொழி...அழகான-மொழி'

# Tamil digit standardization
text_with_digits = "தமிழ் ௧௨௩௪ வருடங்கள் பழமையான மொழி"
standardized = normalize_text(text_with_digits, standardize_digits=True)
print(f"Standardized: {standardized}")
# Output: 'தமிழ் 1234 வருடங்கள் பழமையான மொழி'
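
The normalization options above can be approximated with the standard library alone. The following is a sketch under stated assumptions, not the library's code: the set of zero-width characters and the punctuation mappings are guesses chosen to reproduce the example output, and `normalize_sketch` is a hypothetical name.

```python
import unicodedata

# Tamil digits occupy U+0BE6 (zero) through U+0BEF (nine).
TAMIL_DIGITS = {0x0BE6 + i: str(i) for i in range(10)}
# Assumed zero-width set: ZWSP, ZWNJ, ZWJ, and BOM; None means "delete".
ZERO_WIDTH = dict.fromkeys([0x200B, 0x200C, 0x200D, 0xFEFF])
# Assumed punctuation mappings: em dash, en dash, and ellipsis to ASCII.
PUNCTUATION = {0x2014: "-", 0x2013: "-", 0x2026: "..."}

def normalize_sketch(text, form="NFC"):
    """Apply Unicode normalization, then digit, zero-width, and punctuation mapping."""
    text = unicodedata.normalize(form, text)
    text = text.translate(TAMIL_DIGITS)
    text = text.translate(ZERO_WIDTH)
    text = text.translate(PUNCTUATION)
    return " ".join(text.split())  # trim and collapse whitespace

print(normalize_sketch("தமிழ் ௧௨௩௪ வருடங்கள்"))
```

Each step is a single `str.translate` pass over the text, so individual steps are easy to switch on or off.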

Script Information Analysis

from tamil_tokenizer import get_script_info, TamilTokenizer

tokenizer = TamilTokenizer()

# Comprehensive script analysis
text = "தமிழ் மொழி உலகின் பழமையான மொழிகளில் ஒன்று"
info = tokenizer.get_script_info(text)

print(f"Tamil percentage: {info['tamil_percentage']:.1f}%")
print(f"Complexity score: {info['complexity_score']:.2f}")
print(f"Readability level: {info['readability_level']}")
print(f"Scripts detected: {info['scripts_detected']}")
print(f"Has conjuncts: {info['has_conjuncts']}")
print(f"Unicode blocks: {info['unicode_blocks']}")

# Character type analysis
char_types = info['character_types']
print(f"Vowels: {char_types['vowels']}")
print(f"Consonants: {char_types['consonants']}")
print(f"Vowel signs: {char_types['vowel_signs']}")
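
Character-type counts like these can be derived from code point ranges within the Tamil block. The sketch below is an illustration only: the range boundaries come from the Unicode Tamil block layout, the "virama" and "other" labels are assumptions, and the library may classify characters differently.

```python
from collections import Counter

def char_type(ch):
    """Classify one character by its position in the Unicode Tamil block."""
    cp = ord(ch)
    if 0x0B85 <= cp <= 0x0B94:
        return "vowels"        # independent vowels
    if 0x0B95 <= cp <= 0x0BB9:
        return "consonants"
    if 0x0BBE <= cp <= 0x0BCC:
        return "vowel_signs"   # dependent vowel signs
    if cp == 0x0BCD:
        return "virama"        # pulli; label chosen for this sketch
    return "other"

def count_char_types(text):
    """Count character types, ignoring whitespace."""
    return Counter(char_type(ch) for ch in text if not ch.isspace())

print(count_char_types("தமிழ்"))
```

The ranges include a few unassigned code points, which this sketch does not filter out.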

Language Detection

from tamil_tokenizer import detect_language, TamilTokenizer

tokenizer = TamilTokenizer()

# Detect language with confidence
texts = [
    "தமிழ் மொழி அழகான மொழி",
    "தமிழ் Tamil மொழி Language",
    "Hello World English Text"
]

for text in texts:
    result = tokenizer.detect_language(text)
    print(f"Text: {text}")
    print(f"Language: {result['primary_language']}")
    print(f"Confidence: {result['confidence']:.2f}")
    print(f"Is Tamil: {result['is_tamil']}")
    print("---")

Text Validation

from tamil_tokenizer import is_valid_tamil_text, TamilTokenizer

tokenizer = TamilTokenizer()

# Validate Tamil text with different thresholds
texts = [
    "தமிழ் மொழி அழகான மொழி",
    "தமிழ் Tamil மொழி",
    "Hello World"
]

for text in texts:
    # Default threshold (50%)
    is_valid_default = tokenizer.is_valid_tamil_text(text)
    
    # Strict threshold (80%)
    is_valid_strict = tokenizer.is_valid_tamil_text(text, min_tamil_percentage=80.0)
    
    print(f"Text: {text}")
    print(f"Valid (50%): {is_valid_default}")
    print(f"Valid (80%): {is_valid_strict}")
    print("---")
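
To make the threshold idea concrete, here is a small sketch that measures the share of non-whitespace characters falling in the Unicode Tamil block (U+0B80 to U+0BFF) and compares it with a minimum percentage. How the library itself counts characters is not documented here, so the counting rule and the function names are assumptions.

```python
def tamil_percentage(text):
    """Percentage of non-whitespace characters in the Tamil block."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    tamil = sum(1 for ch in chars if "\u0b80" <= ch <= "\u0bff")
    return 100.0 * tamil / len(chars)

def is_valid_sketch(text, min_tamil_percentage=50.0):
    """True when the Tamil share meets the threshold."""
    return tamil_percentage(text) >= min_tamil_percentage

mixed = "தமிழ் Tamil மொழி"
print(f"{tamil_percentage(mixed):.1f}%")
print(is_valid_sketch(mixed), is_valid_sketch(mixed, min_tamil_percentage=80.0))
```

Under this counting rule the mixed example has 9 Tamil code points out of 14 non-space characters, so it passes a 50% threshold and fails an 80% one.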

Text Statistics

from tamil_tokenizer import TamilTokenizer

tokenizer = TamilTokenizer()
text = "தமிழ் மொழி அழகான மொழி. இது உலகின் பழமையான மொழிகளில் ஒன்று!"

stats = tokenizer.get_statistics(text)
print(f"Total characters: {stats['total_characters']}")
print(f"Tamil characters: {stats['tamil_characters']}")
print(f"Words: {stats['words']}")
print(f"Sentences: {stats['sentences']}")
print(f"Average word length: {stats['average_word_length']:.2f}")
print(f"Average sentence length: {stats['average_sentence_length']:.2f}")

Error Handling

from tamil_tokenizer import tokenize_words
from tamil_tokenizer.exceptions import InvalidTextError, TokenizationError

try:
    words = tokenize_words("")  # Empty text
except InvalidTextError as e:
    print(f"Invalid text: {e}")

try:
    words = tokenize_words(None)  # None text
except InvalidTextError as e:
    print(f"Invalid text: {e}")

Command Line Interface

The library includes a comprehensive CLI tool:

# Basic word tokenization (default)
tamil-tokenizer "தமிழ் மொழி அழகான மொழி"

# Sentence tokenization
tamil-tokenizer --method sentences "வணக்கம். நலமா?"

# Character tokenization
tamil-tokenizer --method characters "தமிழ்"

# Show text statistics
tamil-tokenizer --stats "தமிழ் உரை"

# Clean text
tamil-tokenizer --clean "தமிழ்   உரை"

# Clean text and remove punctuation
tamil-tokenizer --clean --remove-punctuation "தமிழ், உரை!"

# JSON output
tamil-tokenizer --json "தமிழ் மொழி"

# Verbose output
tamil-tokenizer --verbose "தமிழ் மொழி"

CLI Examples

# Basic tokenization
$ tamil-tokenizer "தமிழ் மொழி அழகான மொழி"
தமிழ்
மொழி
அழகான
மொழி

# Sentence tokenization with verbose output
$ tamil-tokenizer --method sentences --verbose "வணக்கம். நலமா?"
Tokenization method: sentences
Input text: வணக்கம். நலமா?
Token count: 2
Tokens:
--------------------
  1. வணக்கம்
  2. நலமா

# Text statistics
$ tamil-tokenizer --stats "தமிழ் மொழி"
Total characters: 9
Tamil characters: 8
Words: 2
Sentences: 1
Average word length: 4.00
Average sentence length: 2.00

# JSON output
$ tamil-tokenizer --json "தமிழ் மொழி"
{
  "method": "words",
  "input_text": "தமிழ் மொழி",
  "tokens": ["தமிழ்", "மொழி"],
  "token_count": 2
}
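
The JSON output is convenient for scripting. The sketch below parses the sample output shown above with the standard `json` module; in a real script the string would come from the command's standard output (for example via `subprocess.run` with `capture_output=True`), which is left out here so the example runs without the CLI installed.

```python
import json

# Sample copied from the --json example above.
sample_output = """
{
  "method": "words",
  "input_text": "தமிழ் மொழி",
  "tokens": ["தமிழ்", "மொழி"],
  "token_count": 2
}
"""

result = json.loads(sample_output)
tokens = result["tokens"]

# Sanity check: the reported count should match the token list.
assert result["token_count"] == len(tokens)
print(result["method"], tokens)
```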

API Reference

Functions

tokenize_words(text: str) -> List[str]

Tokenize Tamil text into words.

Parameters:

  • text: Tamil text to tokenize

Returns: List of word tokens

tokenize_sentences(text: str) -> List[str]

Tokenize Tamil text into sentences.

Parameters:

  • text: Tamil text to tokenize

Returns: List of sentence tokens

tokenize_characters(text: str) -> List[str]

Tokenize Tamil text into individual characters.

Parameters:

  • text: Tamil text to tokenize

Returns: List of character tokens (Tamil characters only)

clean_text(text: str, remove_punctuation: bool = False) -> str

Clean Tamil text by normalizing whitespace and optionally removing punctuation.

Parameters:

  • text: Text to clean
  • remove_punctuation: Whether to remove non-Tamil punctuation

Returns: Cleaned text

normalize_text(text: str) -> str

Normalize Tamil text by cleaning and standardizing format.

Parameters:

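  • text: Text to normalize
  • form, standardize_digits, standardize_punctuation, remove_zero_width: Optional keyword arguments (see Enhanced Text Normalization above)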

Returns: Normalized text

Classes

TamilTokenizer()

Main class for Tamil text tokenization operations.

Methods:

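  • tokenize(text, method="words"): General tokenization method; method may be "words", "sentences", "characters", "syllables", or "graphemes"
  • tokenize_words(text): Tokenize into words
  • tokenize_sentences(text): Tokenize into sentences
  • tokenize_characters(text): Tokenize into characters
  • clean_text(text, remove_punctuation=False): Clean text
  • normalize_text(text): Normalize text
  • get_statistics(text): Get text statistics
  • get_script_info(text): Analyze script composition, complexity, and character types
  • detect_language(text): Detect the primary language with a confidence score
  • is_valid_tamil_text(text, min_tamil_percentage=50.0): Check whether the text meets a minimum Tamil percentage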

Exceptions

TamilTokenizerError

Base exception class for tamil-tokenizer library.

InvalidTextError

Raised when invalid text is provided (None, empty, or non-string).

TokenizationError

Raised when tokenization fails due to processing errors.
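
The descriptions above suggest a hierarchy in which the specific errors derive from TamilTokenizerError, so callers can catch the base class to handle any library error. The sketch below uses stand-in classes to illustrate that pattern; the inheritance relationship and the `validate_text` helper are assumptions, not taken from the library source.

```python
class TamilTokenizerError(Exception):
    """Stand-in for the library's base exception."""

class InvalidTextError(TamilTokenizerError):
    """Stand-in: raised for None, empty, or non-string input."""

def validate_text(text):
    """Hypothetical input check mirroring the documented conditions."""
    if not isinstance(text, str) or not text:
        raise InvalidTextError(f"expected a non-empty str, got {text!r}")
    return text

for bad in (None, "", 42):
    try:
        validate_text(bad)
    except TamilTokenizerError as exc:  # catching the base class covers subclasses
        print(f"{type(exc).__name__}: {exc}")
```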

Development

Setup Development Environment

git clone https://github.com/rajacsp/tamil-tokenizer.git
cd tamil-tokenizer
pip install -e ".[dev]"

Run Tests

pytest

Run Tests with Coverage

pytest --cov=tamil_tokenizer --cov-report=html

Code Formatting

black tamil_tokenizer tests examples

Type Checking

mypy tamil_tokenizer

Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add some amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Changelog

v0.2.0 (2025-01-07)

  • Enhanced Text Normalization
      • Comprehensive Unicode normalization (NFC, NFD, NFKC, NFKD)
      • Tamil digit standardization (௦-௯ to 0-9)
      • Punctuation standardization and zero-width character removal
  • Script Information Analysis
      • Added get_script_info() for comprehensive script analysis
      • Added detect_language() for language detection with confidence scores
      • Added is_valid_tamil_text() for Tamil text validation
      • Character type analysis and complexity scoring
  • Advanced Features
      • Language detection with confidence scoring
      • Text validation with configurable thresholds
      • Unicode block identification and readability assessment
      • Enhanced convenience functions with full parameter support

v0.1.1 (2025-01-07)

  • Enhanced Tamil Tokenization
  • Added syllable-level tokenization (tokenize_syllables())
  • Added grapheme cluster tokenization (tokenize_graphemes())
  • Added word structure analysis (analyze_word_structure())
  • Improved character tokenization for better Unicode handling
  • Enhanced text statistics with Tamil-specific metrics
  • Better support for Tamil conjunct consonants and vowel signs
  • Advanced Tamil script processing with improved regex patterns
  • Fixed character tokenization test compatibility
  • Enhanced tokenize() method to support "syllables" and "graphemes"
  • Added comprehensive test coverage for new features

v0.1.0 (2025-01-07)

  • Initial release
  • Basic Tamil text tokenization (words, sentences, characters)
  • Text cleaning and normalization
  • Command-line interface
  • Comprehensive test suite
  • Type hints throughout the codebase
  • Modern Python packaging with pyproject.toml

Tamil Language Support

This library is specifically designed for Tamil text processing and uses Unicode ranges for Tamil script (U+0B80–U+0BFF). It handles:

  • Tamil characters and diacritics
  • Common Tamil punctuation
  • Mixed Tamil-English text (extracts Tamil portions)
  • Various sentence ending patterns
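
As an illustration of extracting Tamil portions from mixed text, the sketch below matches runs of code points in the Tamil block with the standard `re` module. It is a minimal example of the range-based approach described above, with a hypothetical function name; the library's own patterns may be more elaborate.

```python
import re

# One or more consecutive code points from the Tamil block (U+0B80 to U+0BFF).
TAMIL_RUN = re.compile(r"[\u0B80-\u0BFF]+")

def extract_tamil(text):
    """Return the Tamil runs found in text, in order of appearance."""
    return TAMIL_RUN.findall(text)

print(extract_tamil("தமிழ் Tamil மொழி Language"))
```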

Acknowledgments

  • The Tamil language community for inspiration
  • The Python community for excellent libraries like regex
  • Contributors and users who help improve this library

Support

If you encounter any issues or have questions, please:

  1. Check the documentation
  2. Search existing issues
  3. Create a new issue if needed

For general questions, you can also reach out via email: raja.csp@gmail.com
