A library for Thai word segmentation in PHP.
- Thai word segmentation with high accuracy
- Word suggestions for typos and misspellings
- Dictionary loading from local file, remote file, and remote URL
- Performance optimizations with caching and memory management
- Batch processing for large text volumes
- Custom configuration with caching, memory limit, and batch size
- Mixed content support (Thai, English, numbers, punctuation)
- PHP 8.4+
- Composer
You can install the package via composer:
composer require farzai/thai-word
use Farzai\ThaiWord\Composer;
// Simple text segmentation
$words = Composer::segment('สวัสดีครับผมชื่อสมชาย');
// Result: ['สวัสดี', 'ครับ', 'ผม', 'ชื่อ', 'สมชาย']
// Segment with custom delimiter
$text = Composer::segmentToString('สวัสดีครับผมชื่อสมชาย', ' ');
// Result: 'สวัสดี ครับ ผม ชื่อ สมชาย'
// Batch processing for multiple texts
$results = Composer::segmentBatch(['สวัสดีครับ', 'ขอบคุณค่ะ']);
// Result: [['สวัสดี', 'ครับ'], ['ขอบคุณ', 'ค่ะ']]
// Enable word suggestions via facade
// Use threshold 0.4-0.5 for single characters, 0.6-0.7 for multi-character words
Composer::enableSuggestions(['threshold' => 0.5]);
// Get suggestions for misspelled words
$suggestions = Composer::suggest('สวัสด');
// Result: [
// ['word' => 'สวัสดี', 'score' => 0.833],
// ['word' => 'สวัสดิ์', 'score' => 0.714],
// ['word' => 'สวัสติ', 'score' => 0.667]
// ]
// Segment with automatic suggestions for single unrecognized characters
$result = Composer::segmentWithSuggestions('โอเคอไร');
// Result: [
// ['word' => 'โอเค'],
// ['word' => 'อ', 'suggestions' => [
// ['word' => 'กอ', 'score' => 0.5],
// ['word' => 'ขอ', 'score' => 0.5],
// ['word' => 'คอ', 'score' => 0.5]
// ]],
// ['word' => 'ไร']
// ]
// Get performance statistics
$stats = Composer::getStats();
use Farzai\ThaiWord\Segmenter\ThaiSegmenter;
$segmenter = new ThaiSegmenter();
$words = $segmenter->segment('สวัสดีครับผมชื่อสมชาย');
// Result: ['สวัสดี', 'ครับ', 'ผม', 'ชื่อ', 'สมชาย']
use Farzai\ThaiWord\Segmenter\ThaiSegmenter;
$segmenter = new ThaiSegmenter();
// Enable word suggestions
$segmenter->enableSuggestions([
'threshold' => 0.5, // Minimum similarity score (0.0-1.0)
'max_suggestions' => 5 // Maximum suggestions per word
]);
// Get suggestions for a misspelled word
$suggestions = $segmenter->suggest('สวัสด'); // Missing last character
// Result: [
// ['word' => 'สวัสดี', 'score' => 0.833],
// ['word' => 'สวัสดิ์', 'score' => 0.714],
// ['word' => 'สวัสติ', 'score' => 0.667]
// ]
// Segment text with automatic suggestions for single unrecognized characters
$result = $segmenter->segmentWithSuggestions('ชื่ออไรนะ'); // 'อ' is unrecognized single character
// Result: [
// ['word' => 'ชื่อ'],
// ['word' => 'อ', 'suggestions' => [
// ['word' => 'กอ', 'score' => 0.5],
// ['word' => 'ขอ', 'score' => 0.5],
// ['word' => 'คอ', 'score' => 0.5]
// ]],
// ['word' => 'ไร'],
// ['word' => 'นะ']
// ]
This library segments Thai text into words and provides intelligent word suggestions through a highly optimized process. Here's how it works step by step:
- You provide Thai text as a string to the
ThaiSegmenter
- Example:
'สวัสดีครับผมชื่อสมชาย'
- The library validates UTF-8 encoding and handles empty strings
The library automatically loads Thai words using several sources with intelligent fallback:
- LibreOffice Thai Dictionary: Downloads from official LibreOffice repository (primary source)
- Local Dictionary Files: Falls back to local dictionary files if available
- Basic Dictionary: Uses built-in common Thai words as last resort
The dictionary is stored in a HashDictionary
with O(1) lookup performance.
The LongestMatchingStrategy
algorithm processes text intelligently:
Character Classification:
- Thai characters: Unicode range 0x0E00-0x0E7F for fast detection
- English words: Handled as complete word units
- Numbers: Processed as number sequences (with decimals, commas)
- Punctuation: Handled appropriately with whitespace normalization
Input: สวัสดีครับผมชื่อสมชาย
↓
Position 0: Check สวัสดี (6 chars) → Found in dictionary ✓
Position 6: Check ครับ (4 chars) → Found in dictionary ✓
Position 10: Check ผม (2 chars) → Found in dictionary ✓
Position 12: Check ชื่อ (3 chars) → Found in dictionary ✓
Position 15: Check สมชาย (5 chars) → Found in dictionary ✓
↓
Output: ['สวัสดี', 'ครับ', 'ผม', 'ชื่อ', 'สมชาย']
When enabled, the library can suggest corrections for typos using advanced similarity algorithms:
Levenshtein Distance Algorithm:
Input: สวัสด (missing last character)
↓
1. Filter dictionary words by length similarity (±3 characters)
2. Calculate Unicode-aware Levenshtein distance for each candidate
3. Convert distance to similarity score (0.0 to 1.0)
4. Filter by threshold (default 0.6) and sort by score
↓
Output: [
['word' => 'สวัสดี', 'score' => 0.833], // 1 character difference
['word' => 'สวัสดิ์', 'score' => 0.714], // 2 character difference
['word' => 'สวัสติ', 'score' => 0.667] // 2 character difference
]
Smart Suggestion Integration:
- Single-character only:
segmentWithSuggestions()
only provides suggestions for single-character segments that are NOT in the dictionary - Multi-character words: Use
suggest()
method directly for multi-character word suggestions - Threshold requirements: Single-character similarities max out at 0.5, so use threshold ≤ 0.5 for best results
- Configurable similarity thresholds: 0.4-0.5 for single characters, 0.6-0.7 for multi-character words
- Performance-optimized: Caching and length-based filtering for large dictionaries
- Unicode-aware for proper Thai character handling
The library includes several optimizations:
- Caching: Recently segmented texts are cached for faster repeat processing
- Batch Processing: Large texts are processed in chunks to manage memory
- Memory Management: Automatic garbage collection and memory optimization
- Adaptive Processing: Different strategies for short, medium, and long texts
- Suggestion Caching: Distance calculations cached for repeated similarity checks
$segmenter = new ThaiSegmenter();
$result = $segmenter->segment('ผมใช้ Computer ทำงาน');
// Result: ['ผม', 'ใช้', 'Computer', 'ทำงาน']
- Thai words are processed with dictionary lookup
- English words are kept as complete units
- Numbers and punctuation are handled appropriately
- ThaiSegmenter: Main orchestrator with performance monitoring and suggestion integration
- HashDictionary: O(1) hash-based word lookup with 70% less memory usage than trie structures
- LongestMatchingStrategy: Optimized algorithm with character classification
- LevenshteinSuggestionStrategy: Unicode-aware word suggestion algorithm with caching
- DictionaryLoaderService: Handles loading from files, URLs, and remote sources
- 3-5x faster processing speed with optimized algorithms
- 50% lower memory usage with hash-based dictionary
- Intelligent suggestions with configurable accuracy thresholds
- Automatic optimization based on text characteristics
- Built-in statistics for performance monitoring
Using the Facade (Simple & Clean)
use Farzai\ThaiWord\Composer;
// Basic segmentation
$words = Composer::segment('สวัสดีครับผมชื่อสมชาย');
// Result: ['สวัสดี', 'ครับ', 'ผม', 'ชื่อ', 'สมชาย']
// Get performance statistics
$stats = Composer::getStats();
echo "Processing time: {$stats['avg_processing_time']}ms";
// Add custom words
Composer::getDictionary()->add('คำใหม่');
// Batch processing for multiple texts
$results = Composer::segmentBatch(['ข้อความ1', 'ข้อความ2']);
// Custom configuration
Composer::updateConfig([
'enable_caching' => true,
'memory_limit_mb' => 200
]);
Using ThaiSegmenter Directly (Advanced Control)
use Farzai\ThaiWord\Segmenter\ThaiSegmenter;
// Create segmenter with custom configuration
$segmenter = new ThaiSegmenter(null, null, [
'enable_caching' => true,
'batch_size' => 500
]);
// Or use the facade to create custom instances
$customSegmenter = Composer::create(null, null, ['memory_limit_mb' => 150]);
// Set custom segmenter for facade
Composer::setSegmenter($customSegmenter);
This architecture ensures both accuracy and performance while remaining simple to use.
use Farzai\ThaiWord\Segmenter\ThaiSegmenter;
use Farzai\ThaiWord\Suggestions\Strategies\LevenshteinSuggestionStrategy;
// Create custom suggestion strategy
$suggestionStrategy = new LevenshteinSuggestionStrategy;
$suggestionStrategy->setThreshold(0.8) // Higher accuracy
->setMaxWordLengthDiff(2); // Stricter length filtering
// Initialize segmenter with custom strategy
$segmenter = new ThaiSegmenter(null, null, $suggestionStrategy);
// Or set strategy later
$segmenter->setSuggestionStrategy($suggestionStrategy);
$segmenter = new ThaiSegmenter();
$segmenter->enableSuggestions();
// Process text
$result = $segmenter->segmentWithSuggestions('สวัสดีครบผมชื่อโจน');
// Get detailed statistics
$stats = $segmenter->getStats();
echo "Cache hit ratio: " . ($stats['cache_hit_ratio'] * 100) . "%\n";
// Get suggestion-specific statistics
$suggestionStrategy = $segmenter->getSuggestionStrategy();
if ($suggestionStrategy instanceof LevenshteinSuggestionStrategy) {
$cacheStats = $suggestionStrategy->getCacheStats();
echo "Suggestion cache size: " . $cacheStats['cache_size'] . "\n";
echo "Memory usage: " . $cacheStats['memory_usage_mb'] . "MB\n";
}
$texts = [
'สวัสดีครบ', // Contains typo
'ขอบคนครับ', // Contains typo
'ผมชื่อโจน' // Might need suggestions
];
$segmenter = new ThaiSegmenter();
$segmenter->enableSuggestions(['threshold' => 0.7]);
foreach ($texts as $text) {
$result = $segmenter->segmentWithSuggestions($text);
foreach ($result as $item) {
if (isset($item['suggestions'])) {
echo "'{$item['word']}' → Suggested: '{$item['suggestions'][0]['word']}'\n";
}
}
}
// Example output:
// 'ครบ' → Suggested: 'ครับ'
// 'คน' → Suggested: 'คุณ'
// 'โจน' → Suggested: 'โจ้'
Important: The segmentWithSuggestions()
method only provides suggestions for single-character segments that are NOT found in the dictionary.
$segmenter = new ThaiSegmenter();
$segmenter->enableSuggestions(['threshold' => 0.5]);
// ✅ Will get suggestions - 'อ' is single character not in dictionary
$result = $segmenter->segmentWithSuggestions('โอเคอไร');
// 'อ' gets suggestions: ['กอ', 'ขอ', 'คอ', ...]
// ❌ Won't get suggestions - 'ครบ' is multi-character and in dictionary
$result = $segmenter->segmentWithSuggestions('สวัสดีครบ');
// 'ครบ' gets NO suggestions (even though 'ครับ' might be intended)
// ✅ For multi-character suggestions, use suggest() directly
$suggestions = $segmenter->suggest('ครบ');
// Returns: ['ครับ', 'ครอบ', 'คราบ', ...]
Threshold Guidelines:
- Single characters: Use 0.4-0.5 (similarities max out at 0.5)
- Multi-character words: Use 0.6-0.7 (higher precision possible)
$segmenter = new ThaiSegmenter();
// Enable suggestions with proper threshold for single characters
$segmenter->enableSuggestions([
'threshold' => 0.5, // Optimal for single characters
'max_suggestions' => 3 // Maximum suggestions per word
]);
// Update segmenter configuration
$segmenter->updateConfig([
'enable_caching' => true,
'memory_limit_mb' => 150,
'suggestion_threshold' => 0.5, // Adjusted for single characters
'max_suggestions' => 5
]);
// Disable suggestions when not needed
$segmenter->disableSuggestions();
composer test
Please see CHANGELOG for more information on what has changed recently.
Please see CONTRIBUTING for details.
Please review our security policy on how to report security vulnerabilities.
- LibreOffice Thai Dictionary - Primary Thai word dictionary source
The MIT License (MIT). Please see License File for more information.