A sophisticated multi-language subtitle processing tool that implements Netflix-compliant subtitle standards with intelligent line breaking, enhanced bilingual validation, and default SDH removal for improved readability.
- Multi-language Support: Chinese, English, Korean, Japanese with auto-detection
- Netflix Standard Compliance: Character limits and reading speeds per Netflix guidelines
- Intelligent Line Breaking: Context-aware breaking with language-specific rules
- Smart SDH Processing: Default removal of audio/music markers for cleaner subtitles (v2.6)
- Default SDH Removal: Automatically removes audio/music markers for cleaner subtitles
- Unicode SDH Support: Handles both ASCII
()
and full-width๏ผ๏ผ
parentheses in multilingual content - Smart Content Preservation: Preserves dialogue while removing SDH markers from mixed content
- Flexible SDH Control: Use
--keep-sdh
flag when SDH markers are needed
- Bilingual Validation Fix: Per-line language detection for accurate character limit validation
- Language-Specific Violations: Enhanced violation output showing language codes and specific limits
- Violation Output Export: New
--output-violation
parameter to save violations to separate SRT files - Reading Speed Analysis: Improved bilingual content reading speed calculation and reporting
- Mixed-Language Support: Correct handling of Chinese-English, Korean-English mixed content
- Validation-Only Mode: New
--check-only
parameter for compliance checking without processing - Detailed Validation Reports: Comprehensive violation analysis with categorized warnings
- Batch Validation: Check multiple files simultaneously with summary statistics
- Enhanced CLI Output: Improved verbose mode with validation status indicators
- Compliance Scoring: Visual compliance rate indicators (โ
โ ๏ธ โ) based on violation percentages
- Complete Korean Language Support: Full Korean processor implementation with dialogue formatting and intelligent line breaking
- Korean Dialogue Formatting: Proper spacing after Korean dialogue markers (e.g.,
-์ฌ๊ธฐ ์๋ค
โ- ์ฌ๊ธฐ ์๋ค
) - Korean Particle Detection: Intelligent line breaking at Korean particles and endings (์/๋, ์ด/๊ฐ, ์/๋ฅผ, etc.)
- Korean Text Validation: Proper character counting and reading speed validation for Korean content
- Bilingual KO-CN Support: Enhanced processing for Korean-Chinese mixed content
- Dialogue Format Optimization: Auto-add spaces after "-" markers
- Bilingual Processing: Intelligent handling of mixed Chinese-English content
- Smart English Line Merging: Merge short continuation lines (e.g., "Do that, then.")
- Chinese Punctuation Intelligence: Prevent unwanted periods in sentence continuations
- Smart Threshold Detection:
- Chinese: Don't break if remaining < 3 characters
- English: Don't break if remaining < 4 complete words or creating lines < 20 chars
- Minimum Line Length: Ensure Chinese post-break lines โฅ 5 characters
- SDH Audio Merging: Combine repeated audio markers (โชโช,โชโช)
- Context-Aware Punctuation: Add missing sentence-ending punctuation only for complete sentences
- Assistant Word Breaking: Optimize breaks at Chinese helper words (็ใๅฐใๅพ)
Language | Character Limit | SDH Limit | Reading Speed (Adult/Children) |
---|---|---|---|
Chinese | 16 | 18 | 9/7 chars/sec |
English | 42 | 42 | 20/17 chars/sec |
Korean | 16 | 16 | 12/9 chars/sec |
Japanese | 13 | 16 | 4/4 chars/sec (SDH: 7/7) |
# Set up & activate virtual environment
uv venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Install dependencies
uv pip install -r requirements.txt
# Install development dependencies (optional)
uv pip install -r requirements-dev.txt
# Exit when done
deactivate
# Process single file with default SDH removal (output auto-generated if not specified)
python src/main.py input.srt [output.srt]
# Using module syntax (alternative)
python -m src.srt_processor.cli input.srt [output.srt]
# Keep SDH markers (disable default removal)
python src/main.py input.srt --keep-sdh
# Auto-detect language (default)
python src/main.py input.srt --language auto
# Specify language explicitly
python src/main.py input.srt --language ko # Korean
python src/main.py input.srt --language zh # Chinese
python src/main.py input.srt --language en # English
# Enable SDH mode with SDH markers preserved
python src/main.py input.srt --sdh --keep-sdh
# Disable speed checking (useful for development/testing)
python src/main.py input.srt --no-speed-check
# Disable punctuation correction
python src/main.py input.srt --no-punct-fix
# Batch process directory (default SDH removal)
python src/main.py --batch /path/to/srt/files
# Batch process directory keeping SDH markers
python src/main.py --batch /path/to/srt/files --keep-sdh
# Verbose output with detailed processing info
python src/main.py input.srt --verbose
# Validation-only mode with default SDH removal
python src/main.py input.srt --check-only
# Validation with SDH markers preserved
python src/main.py input.srt --check-only --keep-sdh
# Validation with speed checking disabled
python src/main.py input.srt --check-only --no-speed-check
# Export violations to separate file
python src/main.py input.srt --check-only --output-violation
# Export violations with custom filename
python src/main.py input.srt --check-only --output-violation violations.srt
# Batch validation for quality assurance (default SDH removal)
python src/main.py --batch /path/to/srt/files --check-only
# Batch validation keeping SDH markers
python src/main.py --batch /path/to/srt/files --check-only --keep-sdh
# Combine options
python src/main.py input.srt --language ko --verbose --no-speed-check --keep-sdh
from src.srt_processor.core.processor import SRTProcessor
from src.srt_processor.models.subtitle import ProcessingConfig, Language
# Create configuration with default SDH removal
config = ProcessingConfig(
language=Language.AUTO,
sdh_mode=False,
no_punct_fix=False,
remove_sdh=True # Default in v2.6
)
# Create configuration keeping SDH markers
config_keep_sdh = ProcessingConfig(
language=Language.AUTO,
sdh_mode=False,
no_punct_fix=False,
remove_sdh=False # Explicitly disable SDH removal
)
# Process file with default SDH removal
processor = SRTProcessor(config)
result = processor.process_file("input.srt", "output.srt")
# Process file keeping SDH markers
processor_keep_sdh = SRTProcessor(config_keep_sdh)
result_keep_sdh = processor_keep_sdh.process_file("input_with_sdh.srt", "output_with_sdh.srt")
# Run all tests
pytest
# Run with coverage
pytest --cov=src --cov-report=html
# Run specific test
pytest tests/test_processors.py
# Format code
black .
# Sort imports
isort .
# Lint code
flake8
# Type checking
mypy src/
# Run demo with sample data
python demo.py
src/
โโโ srt_processor/
โ โโโ models/ # Data models and configurations
โ โโโ core/ # Parsing, language detection, main processor
โ โโโ processors/ # Language-specific processors
โ โโโ utils/ # Utility functions
โ โโโ cli.py # Command line interface
tests/ # Comprehensive test suite
demo.py # Demonstration script
- Threshold Rule: Don't break if remaining < 3 characters
- Minimum Length: Ensure post-break lines โฅ 5 characters
- Helper Word Breaking: Prioritize breaks at ็ใๅฐใๅพใไบใๅงใๅข
- Smart Punctuation: Add missing "ใ" only for complete sentences, not continuations
- Continuation Detection: Identify sentence continuations to prevent unwanted punctuation
- Enhanced Word Threshold: Don't break if remaining < 4 complete words or creating lines < 20 chars
- Smart Line Merging: Automatically merge short continuation lines in bilingual content
- Grammar Optimization: Break before conjunctions and prepositions
- Word Boundary: Maintain complete word integrity
- Dialogue Continuation: Merge dialogue lines with non-dialogue continuations
- Particle Detection: Intelligent breaking at Korean particles (์/๋, ์ด/๊ฐ, ์/๋ฅผ, ์์, ๋ก, ๊ณผ/์, etc.)
- Dialogue Formatting: Automatic spacing adjustment for Korean dialogue markers
- Minimum Length: Ensure post-break lines โฅ 4 characters for Korean readability
- Space Preservation: Maintain word boundaries for Korean mixed-script content
- Continuation Patterns: Detect Korean connectors (๊ณ , ์, ๋ฉด, ๋ฉฐ, ๋๋ฐ, ์ง๋ง, ํ๊ณ , ๊ฐ์ง๊ณ )
- Marker Detection: Identify โช, [audio], (sound) patterns
- Auto Merging: Combine repeated markers with comma separation
- Enhanced Limits: Apply SDH-specific character limits
- Netflix Compliance: 100% adherence to official subtitle standards
- Multi-language: Automatic detection with 95%+ accuracy
- Processing Speed: Optimized for batch processing large subtitle libraries
- Memory Efficient: Streaming processing for large files
This project maintains high code quality and security standards through automated checks:
- Formatting: Code formatted with Black (88-character line length)
- Import Sorting: Imports organized with isort
- Linting: Code quality checked with flake8
- Complexity: Maximum cyclomatic complexity of 10
- Vulnerability Detection: Dependencies scanned with Safety
- Security Linting: Source code analyzed with Bandit
- Automated Reports: Security findings uploaded as CI artifacts
- Quality Check: Automated formatting and security scanning on every commit
- Basic Functionality: CLI functionality tested with sample files
- Cross-Platform: Compatible with Linux, macOS, and Windows
All checks must pass before code can be merged to the main branch.
- Follow existing code style (Black, isort, flake8)
- Add comprehensive tests for new features
- Update documentation for API changes
- Ensure Netflix standard compliance
- All CI checks must pass
This implementation follows Netflix's publicly available subtitle standards and best practices for accessibility and international content distribution.
Parameter | Description | Default |
---|---|---|
--language |
Target language (auto/zh/en/ko/ja) | auto |
--content-type |
Adult or children content | adult |
--sdh |
Enable SDH mode | false |
--keep-sdh |
Keep SDH markers (disable default removal) (NEW v2.6) | false |
--no-speed-check |
Disable reading speed validation | false |
--no-punct-fix |
Disable auto punctuation | false |
--force-encoding |
Override output encoding | auto-detect |
--verbose |
Enable detailed output | false |
--check-only |
Validate without processing | false |
--output-violation |
Export violations to file | false |
Note: As of v2.6, SDH removal is enabled by default to improve subtitle readability. Use --keep-sdh
when SDH markers are required.
Processing with Default SDH Removal:
$ python src/main.py bilingual_with_sdh.srt
Processing: bilingual_with_sdh.srt
Language detected: zh
SDH removal: Enabled (default)
Removed 15 SDH-only blocks
Cleaned SDH markers from 8 mixed content blocks
Processed: bilingual_with_sdh.srt -> bilingual_with_sdh_processed.srt
Processing with SDH Markers Preserved:
$ python src/main.py bilingual_with_sdh.srt --keep-sdh
Processing: bilingual_with_sdh.srt
Language detected: zh
SDH removal: Disabled (--keep-sdh)
Processed: bilingual_with_sdh.srt -> bilingual_with_sdh_processed.srt
Before (with SDH markers):
1
00:00:04,967 --> 00:00:07,467
โชโช
2
00:00:07,600 --> 00:00:14,333
๏ผ้ณไนๅ่ตท๏ผ
(MUSIC PLAYS)
3
00:01:11,733 --> 00:01:12,800
- ไฝ ๅฅฝ๏ผ๏ผ็ฌๅฃฐ๏ผ
- Hello! (LAUGHTER)
After (v2.6 default behavior):
1
00:01:11,733 --> 00:01:12,800
- ไฝ ๅฅฝ๏ผ
- Hello!
Before v2.5 (Incorrect):
$ python src/main.py bilingual.srt --check-only
Block 65: Line 2 exceeds character limit (20 > 16) # โ English text wrongly validated against Chinese limit
After v2.5 (Correct):
$ python src/main.py bilingual.srt --check-only
Block 65: Reading speed too fast (17.5 > 9.0 chars/sec) # โ
Only legitimate violations shown
Violation Export (NEW v2.5):
$ python src/main.py bilingual.srt --check-only --output-violation
Violations saved to: bilingual-violation.srt
# Content of bilingual-violation.srt:
1
00:00:01,000 --> 00:00:03,000
# VIOLATIONS SUMMARY
# Reading Speed Violations: 1
# Character Limit Violations: 0
# Language Detection: Chinese (primary)
2
00:04:31,667 --> 00:04:33,333
-ไธบไปไน่ฎฉๅฅน้ ่ฟๅฐธไฝ๏ผ
-Why did you let her
near the body?
# VIOLATIONS: Reading speed (17.5 > 9.0 chars/sec)
Single File Validation:
$ python src/main.py samples/CHS-KOR.srt --check-only
Checking: samples/CHS-KOR.srt
Language detected: ko
Total blocks: 1680
=== VALIDATION REPORT ===
Character Limit Violations: 1026
๐ Block 1: Exceeds character limit (34 > 16)
๐ Block 2: Exceeds character limit (19 > 16)
... and 1016 more character limit violations
Reading Speed Violations: 905
โฑ๏ธ Block 1: Reading speed too fast (16.6 > 9.0 chars/sec)
โฑ๏ธ Block 2: Reading speed too fast (13.8 > 12.0 chars/sec)
... and 895 more speed violations
=== SUMMARY ===
โ Compliance: 30.4% (511/1680 blocks)
โ ๏ธ Total Violations: 1931
๐ Character Limit: 1026 violations
โฑ๏ธ Reading Speed: 905 violations
Batch Validation:
$ python src/main.py --batch samples --check-only
Checking 15 SRT files in samples
โ CHS-KOR.srt - 30.4% (1931 violations)
โ ๏ธ Phanteam.chs-kor.srt - 80.6% (253 violations)
โ
Bouquet.CHS.srt - 95.8% (72 violations)
Batch checking complete:
Checked: 15
Total violations: 18723
Average violations per file: 1248.2
Korean Dialogue Formatting (New in v2.3):
# Before
-์ฌ๊ธฐ ์๋ค
-์, ๊น์ง์ด์ผ
# After
- ์ฌ๊ธฐ ์๋ค
- ์, ๊น์ง์ด์ผ.
English Line Merging (Fixed in v2.2):
# Before
-่กๅ๏ผ้ไฝ ไพฟใ
-Fine, go ahead.
Do that, then.
# After
- ่กๅ๏ผ้ไฝ ไพฟใ
- Fine, go ahead. Do that, then.
Chinese Punctuation Intelligence (Fixed in v2.2):
# Before (unwanted period)
้ฟ้ฝๅฐ็ปไบๆใ
่ฟไบๆฐ่ฏ็๏ผ
# After (no unwanted period)
้ฟ้ฝๅฐ็ปไบๆ
่ฟไบๆฐ่ฏ็๏ผ
Dialogue Format Optimization:
# Before
-ๆไน่ฟไนๆ๏ผ
-What kept you?
# After
- ๆไน่ฟไนๆ๏ผ
- What kept you?
With intelligent line breaking, reading speed validation, and Netflix compliance.