A sophisticated multi-language subtitle processing tool implementing Netflix-compliant standards with intelligent line breaking, enhanced bilingual validation, and default SDH removal for improved readability.

MaurUppi/srt-subtitle-processor

SRT Subtitle Processor v2.6

Quality Check Python 3.8+ Code style: black License: MIT Maintenance

A sophisticated multi-language subtitle processing tool that implements Netflix-compliant subtitle standards with intelligent line breaking, enhanced bilingual validation, and default SDH removal for improved readability.

🌟 Features

Core Functionality

  • Multi-language Support: Chinese, English, Korean, Japanese with auto-detection
  • Netflix Standard Compliance: Character limits and reading speeds per Netflix guidelines
  • Intelligent Line Breaking: Context-aware breaking with language-specific rules
  • Smart SDH Processing: Default removal of audio/music markers for cleaner subtitles (v2.6)

v2.6 Enhanced Features

  • Default SDH Removal: Automatically removes audio/music markers for cleaner subtitles
  • Unicode SDH Support: Handles both ASCII () and full-width （） parentheses in multilingual content
  • Smart Content Preservation: Preserves dialogue while removing SDH markers from mixed content
  • Flexible SDH Control: Use --keep-sdh flag when SDH markers are needed
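
A minimal sketch of how this kind of marker removal could work, assuming SDH markers are bracketed spans (square brackets, ASCII parentheses, full-width parentheses) or runs of music notes. The pattern and the helper names `strip_sdh` and `clean_block` are illustrative assumptions, not the project's actual implementation.

```python
import re

# Assumed marker shapes: [..], (..), full-width （..）, and runs of music notes.
SDH_PATTERN = re.compile(r"\[[^\]]*\]|\([^)]*\)|（[^）]*）|[♪♫]+")


def strip_sdh(line: str) -> str:
    """Remove SDH markers from one line and tidy the whitespace left behind."""
    cleaned = SDH_PATTERN.sub("", line)
    return re.sub(r"\s{2,}", " ", cleaned).strip()


def clean_block(lines):
    """Strip markers from every line and drop lines that held only markers.

    An empty result means the whole block was SDH-only and can be removed.
    """
    stripped = (strip_sdh(line) for line in lines)
    return [line for line in stripped if line and line != "-"]


if __name__ == "__main__":
    print(clean_block(["（音乐响起）", "(MUSIC PLAYS)"]))
    print(clean_block(["- 你好！（笑声）", "- Hello! (LAUGHTER)"]))
```

Dialogue is kept because only the matched spans are deleted. A line survives as long as something other than a bare dialogue dash remains after stripping.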

v2.5 Enhanced Features

  • Bilingual Validation Fix: Per-line language detection for accurate character limit validation
  • Language-Specific Violations: Enhanced violation output showing language codes and specific limits
  • Violation Output Export: New --output-violation parameter to save violations to separate SRT files
  • Reading Speed Analysis: Improved bilingual content reading speed calculation and reporting
  • Mixed-Language Support: Correct handling of Chinese-English, Korean-English mixed content
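
The idea behind per-line detection can be sketched as follows. This is a simplified illustration that assumes language is guessed from Unicode ranges and that length is counted with `len()`. The function names and the detection heuristic are hypothetical; only the character limits come from the Language Standards table in this README.

```python
# Character limits per language, taken from the Language Standards table.
CHAR_LIMITS = {"zh": 16, "en": 42, "ko": 16, "ja": 13}


def detect_line_language(line: str) -> str:
    """Guess the dominant script of a single line (assumed heuristic)."""
    counts = {"ko": 0, "ja": 0, "zh": 0, "en": 0}
    for ch in line:
        cp = ord(ch)
        if 0xAC00 <= cp <= 0xD7A3:        # Hangul syllables
            counts["ko"] += 1
        elif 0x3040 <= cp <= 0x30FF:      # Hiragana and Katakana
            counts["ja"] += 1
        elif 0x4E00 <= cp <= 0x9FFF:      # CJK unified ideographs
            counts["zh"] += 1
        elif ch.isascii() and ch.isalpha():
            counts["en"] += 1
    if counts["ja"]:
        # Kana present: treat the ideographs on this line as Japanese kanji.
        counts["ja"] += counts.pop("zh")
    best = max(counts, key=counts.get)
    return best if counts[best] else "en"


def limit_violations(lines):
    """Check each line against the limit of its own detected language."""
    violations = []
    for number, line in enumerate(lines, start=1):
        language = detect_line_language(line)
        limit = CHAR_LIMITS[language]
        if len(line) > limit:
            violations.append((number, language, len(line), limit))
    return violations


if __name__ == "__main__":
    block = ["-为什么让她靠近尸体？", "-Why did you let her"]
    print(limit_violations(block))
```

With this approach a 20-character English line in a Chinese-English block is compared against the English limit of 42 rather than the Chinese limit of 16.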

v2.4 Enhanced Features

  • Validation-Only Mode: New --check-only parameter for compliance checking without processing
  • Detailed Validation Reports: Comprehensive violation analysis with categorized warnings
  • Batch Validation: Check multiple files simultaneously with summary statistics
  • Enhanced CLI Output: Improved verbose mode with validation status indicators
  • Compliance Scoring: Visual compliance rate indicators (✅ ⚠️ ❌) based on violation percentages
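
As a rough illustration of compliance scoring, the sketch below maps a compliance rate to an indicator. The README does not state the cut-off values, so the thresholds of 95% and 80% are assumptions chosen only to be consistent with the sample output later in this document. The function names are hypothetical.

```python
PASS, WARN, FAIL = "✅", "⚠️", "❌"


def compliance_rate(compliant_blocks: int, total_blocks: int) -> float:
    """Percentage of blocks without violations; an empty file counts as 100%."""
    if total_blocks == 0:
        return 100.0
    return 100.0 * compliant_blocks / total_blocks


def indicator(rate: float, pass_at: float = 95.0, warn_at: float = 80.0) -> str:
    """Map a rate to an indicator. Both thresholds are assumed values."""
    if rate >= pass_at:
        return PASS
    if rate >= warn_at:
        return WARN
    return FAIL


if __name__ == "__main__":
    rate = compliance_rate(511, 1680)
    print(f"{indicator(rate)} Compliance: {rate:.1f}% (511/1680 blocks)")
```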

v2.3 Enhanced Features

  • Complete Korean Language Support: Full Korean processor implementation with dialogue formatting and intelligent line breaking
  • Korean Dialogue Formatting: Proper spacing after Korean dialogue markers (e.g., -여기 왔다 → - 여기 왔다)
  • Korean Particle Detection: Intelligent line breaking at Korean particles and endings (은/는, 이/가, 을/를, etc.)
  • Korean Text Validation: Proper character counting and reading speed validation for Korean content
  • Bilingual KO-CN Support: Enhanced processing for Korean-Chinese mixed content

v2.2 Enhanced Features

  • Dialogue Format Optimization: Auto-add spaces after "-" markers
  • Bilingual Processing: Intelligent handling of mixed Chinese-English content
  • Smart English Line Merging: Merge short continuation lines (e.g., "Do that, then.")
  • Chinese Punctuation Intelligence: Prevent unwanted periods in sentence continuations
  • Smart Threshold Detection:
    • Chinese: Don't break if fewer than 3 characters would remain
    • English: Don't break if fewer than 4 complete words would remain or if a resulting line would be shorter than 20 characters
  • Minimum Line Length: Ensure Chinese post-break lines are ≥ 5 characters
  • SDH Audio Merging: Combine repeated audio markers (♪♪,♪♪)
  • Context-Aware Punctuation: Add missing sentence-ending punctuation only for complete sentences
  • Assistant Word Breaking: Optimize breaks at Chinese helper words (的、地、得)

📊 Language Standards

Language   Character Limit   SDH Limit   Reading Speed (Adult/Children)
Chinese    16                18          9/7 chars/sec
English    42                42          20/17 chars/sec
Korean     16                16          12/9 chars/sec
Japanese   13                16          4/4 chars/sec (SDH: 7/7)
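
To show how the reading speed limits in the table could be applied, here is a small sketch. It assumes speed is the number of characters on all lines of a block divided by the display duration. The project may count characters differently (for example per language in bilingual blocks), and the function names are hypothetical.

```python
# Reading speed limits in characters per second, from the table above.
READING_SPEED = {
    "zh": {"adult": 9, "children": 7},
    "en": {"adult": 20, "children": 17},
    "ko": {"adult": 12, "children": 9},
    "ja": {"adult": 4, "children": 4},
}


def to_seconds(timestamp: str) -> float:
    """Convert an SRT timestamp such as 00:04:31,667 to seconds."""
    clock, millis = timestamp.split(",")
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return hours * 3600 + minutes * 60 + seconds + int(millis) / 1000


def reading_speed(text: str, start: str, end: str) -> float:
    """Characters per second, counting every character except line breaks."""
    duration = to_seconds(end) - to_seconds(start)
    if duration <= 0:
        raise ValueError("end timestamp must be later than start timestamp")
    chars = sum(len(line) for line in text.splitlines())
    return chars / duration


def is_too_fast(text, start, end, language, audience="adult") -> bool:
    """True when the block exceeds the limit for its language and audience."""
    return reading_speed(text, start, end) > READING_SPEED[language][audience]


if __name__ == "__main__":
    print(reading_speed("abcdefghij", "00:00:01,000", "00:00:03,000"))
```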

🚀 Installation

# Set up & activate virtual environment
uv venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install dependencies
uv pip install -r requirements.txt

# Install development dependencies (optional)
uv pip install -r requirements-dev.txt

# Exit when done
deactivate

💻 Usage

Command Line Interface

# Process single file with default SDH removal (output auto-generated if not specified)
python src/main.py input.srt [output.srt]

# Using module syntax (alternative)
python -m src.srt_processor.cli input.srt [output.srt]

# Keep SDH markers (disable default removal)
python src/main.py input.srt --keep-sdh

# Auto-detect language (default)
python src/main.py input.srt --language auto

# Specify language explicitly
python src/main.py input.srt --language ko  # Korean
python src/main.py input.srt --language zh  # Chinese
python src/main.py input.srt --language en  # English

# Enable SDH mode with SDH markers preserved
python src/main.py input.srt --sdh --keep-sdh

# Disable speed checking (useful for development/testing)
python src/main.py input.srt --no-speed-check

# Disable punctuation correction
python src/main.py input.srt --no-punct-fix

# Batch process directory (default SDH removal)
python src/main.py --batch /path/to/srt/files

# Batch process directory keeping SDH markers
python src/main.py --batch /path/to/srt/files --keep-sdh

# Verbose output with detailed processing info
python src/main.py input.srt --verbose

# Validation-only mode with default SDH removal
python src/main.py input.srt --check-only

# Validation with SDH markers preserved
python src/main.py input.srt --check-only --keep-sdh

# Validation with speed checking disabled
python src/main.py input.srt --check-only --no-speed-check

# Export violations to separate file
python src/main.py input.srt --check-only --output-violation

# Export violations with custom filename
python src/main.py input.srt --check-only --output-violation violations.srt

# Batch validation for quality assurance (default SDH removal)
python src/main.py --batch /path/to/srt/files --check-only

# Batch validation keeping SDH markers
python src/main.py --batch /path/to/srt/files --check-only --keep-sdh

# Combine options
python src/main.py input.srt --language ko --verbose --no-speed-check --keep-sdh

Programmatic Usage

from src.srt_processor.core.processor import SRTProcessor
from src.srt_processor.models.subtitle import ProcessingConfig, Language

# Create configuration with default SDH removal
config = ProcessingConfig(
    language=Language.AUTO,
    sdh_mode=False,
    no_punct_fix=False,
    remove_sdh=True  # Default in v2.6
)

# Create configuration keeping SDH markers
config_keep_sdh = ProcessingConfig(
    language=Language.AUTO,
    sdh_mode=False,
    no_punct_fix=False,
    remove_sdh=False  # Explicitly disable SDH removal
)

# Process file with default SDH removal
processor = SRTProcessor(config)
result = processor.process_file("input.srt", "output.srt")

# Process file keeping SDH markers
processor_keep_sdh = SRTProcessor(config_keep_sdh)
result_keep_sdh = processor_keep_sdh.process_file("input_with_sdh.srt", "output_with_sdh.srt")

🧪 Testing

# Run all tests
pytest

# Run with coverage
pytest --cov=src --cov-report=html

# Run specific test
pytest tests/test_processors.py

🛠 Development

Code Quality

# Format code
black .

# Sort imports
isort .

# Lint code
flake8

# Type checking
mypy src/

Demo

# Run demo with sample data
python demo.py

📁 Project Structure

src/
├── srt_processor/
│   ├── models/          # Data models and configurations
│   ├── core/            # Parsing, language detection, main processor
│   ├── processors/      # Language-specific processors
│   ├── utils/           # Utility functions
│   └── cli.py          # Command line interface
tests/                   # Comprehensive test suite
demo.py                 # Demonstration script

🎯 Key Algorithms

Chinese Processing

  • Threshold Rule: Don't break if fewer than 3 characters would remain
  • Minimum Length: Ensure post-break lines are ≥ 5 characters
  • Helper Word Breaking: Prioritize breaks at 的、地、得、了、吧、呢
  • Smart Punctuation: Add missing "。" only for complete sentences, not continuations
  • Continuation Detection: Identify sentence continuations to prevent unwanted punctuation
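
The breaking rules above can be sketched roughly as follows. This is a simplified single-break illustration. It assumes that "remaining" means the overflow beyond the 16-character limit and that a helper word stays at the end of the first line. The function name and parameters are hypothetical, not the project's API.

```python
HELPER_WORDS = "的地得了吧呢"


def break_chinese(text: str, limit: int = 16, threshold: int = 3, min_len: int = 5):
    """Split one line into two, preferring a cut right after a helper word."""
    overflow = len(text) - limit
    if overflow < threshold:
        # Threshold rule (assumed reading): too little would remain, so keep
        # the line whole even if it slightly exceeds the limit.
        return [text]
    # Latest cut that respects the limit and leaves at least min_len characters.
    latest = min(limit, len(text) - min_len)
    for cut in range(latest, min_len - 1, -1):
        if text[cut - 1] in HELPER_WORDS:
            return [text[:cut], text[cut:]]
    # No helper word in range: fall back to the latest permissible cut.
    return [text[:latest], text[latest:]]


if __name__ == "__main__":
    sample = "甲" * 10 + "的" + "乙" * 9
    print(break_chinese(sample))
```

Scanning from the latest permissible cut backwards keeps the first line as full as possible while still ending it on a helper word when one is available.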

English Processing

  • Enhanced Word Threshold: Don't break if fewer than 4 complete words would remain or if a resulting line would be shorter than 20 characters
  • Smart Line Merging: Automatically merge short continuation lines in bilingual content
  • Grammar Optimization: Break before conjunctions and prepositions
  • Word Boundary: Maintain complete word integrity
  • Dialogue Continuation: Merge dialogue lines with non-dialogue continuations
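
A rough sketch of how these English rules could combine, assuming a 42-character limit and a single break per line. The word list in `BREAK_BEFORE` and the function name are illustrative assumptions; the project's actual list of conjunctions and prepositions is not given in this README.

```python
# Assumed sample of conjunctions and prepositions to break before.
BREAK_BEFORE = {"and", "but", "or", "so", "because", "in", "on", "at",
                "to", "for", "with", "of", "from"}


def break_english(text: str, limit: int = 42, min_words: int = 4, min_chars: int = 20):
    """Split at a word boundary, preferring a cut before a conjunction or preposition."""
    if len(text) <= limit:
        return [text]
    words = text.split()
    best = None  # (is_preferred, head, tail)
    for i in range(1, len(words)):
        head = " ".join(words[:i])
        tail = " ".join(words[i:])
        if len(head) > limit:
            break
        # Threshold rules: enough words must remain and neither line may be short.
        if len(words) - i < min_words or len(head) < min_chars or len(tail) < min_chars:
            continue
        preferred = words[i].lower() in BREAK_BEFORE
        # Later cuts replace earlier ones, but never demote a preferred cut.
        if preferred or best is None or not best[0]:
            best = (preferred, head, tail)
    return [best[1], best[2]] if best else [text]


if __name__ == "__main__":
    line = ("I wanted to tell you the truth yesterday "
            "but I was afraid of what you would say")
    print(break_english(line))
```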

Korean Processing (NEW v2.3)

  • Particle Detection: Intelligent breaking at Korean particles (은/는, 이/가, 을/를, 에서, 로, 과/와, etc.)
  • Dialogue Formatting: Automatic spacing adjustment for Korean dialogue markers
  • Minimum Length: Ensure post-break lines ≥ 4 characters for Korean readability
  • Space Preservation: Maintain word boundaries for Korean mixed-script content
  • Continuation Patterns: Detect Korean connectors (고, 서, 면, 며, 는데, 지만, 하고, 가지고)
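
The particle rule can be illustrated with a simple suffix check. This sketch assumes breaks happen only at spaces and that a word "ends with a particle" when its last characters match the list above. That is a naive heuristic, since a stem can end in the same syllable, and the function name is hypothetical.

```python
PARTICLES = ("에서", "은", "는", "이", "가", "을", "를", "로", "과", "와")


def break_korean(text: str, limit: int = 16, min_len: int = 4):
    """Split at a space, preferring the latest cut that follows a particle."""
    if len(text) <= limit:
        return [text]
    words = text.split(" ")
    preferred = None
    fallback = None
    for i in range(1, len(words)):
        head = " ".join(words[:i])
        tail = " ".join(words[i:])
        if len(head) > limit:
            break
        if len(head) < min_len or len(tail) < min_len:
            continue  # Minimum length rule for both resulting lines.
        fallback = (head, tail)
        if words[i - 1].endswith(PARTICLES):
            preferred = (head, tail)
    chosen = preferred or fallback
    return list(chosen) if chosen else [text]


if __name__ == "__main__":
    print(break_korean("나는 어제 친구와 학교에서 공부를 했어요"))
```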

SDH Processing

  • Marker Detection: Identify ♪, [audio], (sound) patterns
  • Auto Merging: Combine repeated markers with comma separation
  • Enhanced Limits: Apply SDH-specific character limits
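
One possible reading of the auto-merging rule is that consecutive marker-only lines in a block are joined onto one line with a comma. The sketch below implements that reading. Both the interpretation and the names `MARKER_ONLY` and `merge_marker_lines` are assumptions.

```python
import re
from itertools import groupby

# A line counts as marker-only if it is nothing but notes or one bracketed span.
MARKER_ONLY = re.compile(r"^(?:[♪♫]+|\[[^\]]*\]|\([^)]*\)|（[^）]*）)$")


def merge_marker_lines(lines):
    """Join runs of consecutive marker-only lines with a comma."""
    merged = []
    stripped = (line.strip() for line in lines)
    for is_marker, run in groupby(stripped, key=lambda t: bool(MARKER_ONLY.match(t))):
        run = list(run)
        merged.extend([",".join(run)] if is_marker else run)
    return merged


if __name__ == "__main__":
    print(merge_marker_lines(["♪♪", "♪♪"]))
```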

📈 Performance

  • Netflix Compliance: 100% adherence to official subtitle standards
  • Multi-language: Automatic detection with 95%+ accuracy
  • Processing Speed: Optimized for batch processing large subtitle libraries
  • Memory Efficient: Streaming processing for large files

🛡️ Quality & Security

This project maintains high code quality and security standards through automated checks:

Code Quality

  • Formatting: Code formatted with Black (88-character line length)
  • Import Sorting: Imports organized with isort
  • Linting: Code quality checked with flake8
  • Complexity: Maximum cyclomatic complexity of 10

Security Scanning

  • Vulnerability Detection: Dependencies scanned with Safety
  • Security Linting: Source code analyzed with Bandit
  • Automated Reports: Security findings uploaded as CI artifacts

Continuous Integration

  • Quality Check: Automated formatting and security scanning on every commit
  • Basic Functionality: CLI functionality tested with sample files
  • Cross-Platform: Compatible with Linux, macOS, and Windows

All checks must pass before code can be merged to the main branch.

🤝 Contributing

  1. Follow existing code style (Black, isort, flake8)
  2. Add comprehensive tests for new features
  3. Update documentation for API changes
  4. Ensure Netflix standard compliance
  5. All CI checks must pass

📄 License

This implementation follows Netflix's publicly available subtitle standards and best practices for accessibility and international content distribution.

🔧 Configuration Options

Parameter            Description                                              Default
--language           Target language (auto/zh/en/ko/ja)                       auto
--content-type       Adult or children content                                adult
--sdh                Enable SDH mode                                          false
--keep-sdh           Keep SDH markers (disable default removal) (NEW v2.6)    false
--no-speed-check     Disable reading speed validation                         false
--no-punct-fix       Disable auto punctuation                                 false
--force-encoding     Override output encoding                                 auto-detect
--verbose            Enable detailed output                                   false
--check-only         Validate without processing                              false
--output-violation   Export violations to file                                false

Note: As of v2.6, SDH removal is enabled by default to improve subtitle readability. Use --keep-sdh when SDH markers are required.
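
For readers who want to see how the options in the table fit together, here is a sketch of an equivalent `argparse` declaration. It mirrors only the flags and defaults documented above. The function name `build_parser`, the help strings, and the handling of the optional filename for `--output-violation` are assumptions, and the project's own `cli.py` may be structured differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the documented flags (illustrative, not the project's cli.py)."""
    parser = argparse.ArgumentParser(description="SRT subtitle processor (sketch)")
    parser.add_argument("input", nargs="?", help="input .srt file")
    parser.add_argument("output", nargs="?", help="output .srt file (optional)")
    parser.add_argument("--batch", metavar="DIR", help="process every .srt file in DIR")
    parser.add_argument("--language", choices=["auto", "zh", "en", "ko", "ja"],
                        default="auto")
    parser.add_argument("--content-type", choices=["adult", "children"],
                        default="adult")
    # Boolean switches that default to false, as listed in the table.
    for flag in ("--sdh", "--keep-sdh", "--no-speed-check",
                 "--no-punct-fix", "--verbose", "--check-only"):
        parser.add_argument(flag, action="store_true")
    parser.add_argument("--force-encoding", default=None, metavar="ENCODING")
    # With no value the flag is True; with a value it carries the filename.
    parser.add_argument("--output-violation", nargs="?", const=True,
                        default=False, metavar="FILE")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["input.srt", "--check-only", "--keep-sdh"])
    print(args.language, args.check_only, args.keep_sdh)
```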

🎬 Sample Output

v2.6 Default SDH Removal Examples

Processing with Default SDH Removal:

$ python src/main.py bilingual_with_sdh.srt
Processing: bilingual_with_sdh.srt
Language detected: zh
SDH removal: Enabled (default)
Removed 15 SDH-only blocks
Cleaned SDH markers from 8 mixed content blocks
Processed: bilingual_with_sdh.srt -> bilingual_with_sdh_processed.srt

Processing with SDH Markers Preserved:

$ python src/main.py bilingual_with_sdh.srt --keep-sdh
Processing: bilingual_with_sdh.srt
Language detected: zh
SDH removal: Disabled (--keep-sdh)
Processed: bilingual_with_sdh.srt -> bilingual_with_sdh_processed.srt

Before (with SDH markers):

1
00:00:04,967 --> 00:00:07,467
♪♪

2
00:00:07,600 --> 00:00:14,333
（音乐响起）
(MUSIC PLAYS)

3
00:01:11,733 --> 00:01:12,800
- 你好！（笑声）
- Hello! (LAUGHTER)

After (v2.6 default behavior):

1
00:01:11,733 --> 00:01:12,800
- 你好！
- Hello!

v2.5 Bilingual Validation Examples

Before v2.5 (Incorrect):

$ python src/main.py bilingual.srt --check-only
Block 65: Line 2 exceeds character limit (20 > 16)  # ❌ English text wrongly validated against Chinese limit

After v2.5 (Correct):

$ python src/main.py bilingual.srt --check-only
Block 65: Reading speed too fast (17.5 > 9.0 chars/sec)  # ✅ Only legitimate violations shown

Violation Export (NEW v2.5):

$ python src/main.py bilingual.srt --check-only --output-violation
Violations saved to: bilingual-violation.srt

# Content of bilingual-violation.srt:
1
00:00:01,000 --> 00:00:03,000
# VIOLATIONS SUMMARY
# Reading Speed Violations: 1
# Character Limit Violations: 0
# Language Detection: Chinese (primary)

2
00:04:31,667 --> 00:04:33,333
-为什么让她靠近尸体？
-Why did you let her
near the body?
# VIOLATIONS: Reading speed (17.5 > 9.0 chars/sec)

v2.4 Validation-Only Mode Examples

Single File Validation:

$ python src/main.py samples/CHS-KOR.srt --check-only
Checking: samples/CHS-KOR.srt
Language detected: ko
Total blocks: 1680

=== VALIDATION REPORT ===
Character Limit Violations: 1026
  📏 Block 1: Exceeds character limit (34 > 16)
  📏 Block 2: Exceeds character limit (19 > 16)
  ... and 1016 more character limit violations

Reading Speed Violations: 905
  ⏱️  Block 1: Reading speed too fast (16.6 > 9.0 chars/sec)
  ⏱️  Block 2: Reading speed too fast (13.8 > 12.0 chars/sec)
  ... and 895 more speed violations

=== SUMMARY ===
❌ Compliance: 30.4% (511/1680 blocks)
⚠️  Total Violations: 1931
📊 Character Limit: 1026 violations
⏱️  Reading Speed: 905 violations

Batch Validation:

$ python src/main.py --batch samples --check-only
Checking 15 SRT files in samples

❌ CHS-KOR.srt - 30.4% (1931 violations)
⚠️ Phanteam.chs-kor.srt - 80.6% (253 violations)
✅ Bouquet.CHS.srt - 95.8% (72 violations)

Batch checking complete:
  Checked: 15
  Total violations: 18723
  Average violations per file: 1248.2

v2.3 Bilingual Processing Examples

Korean Dialogue Formatting (New in v2.3):

# Before
-여기 왔다
-아, 깜짝이야

# After  
- 여기 왔다
- 아, 깜짝이야.

English Line Merging (Fixed in v2.2):

# Before
-行啊，随你便。
-Fine, go ahead.
Do that, then.

# After  
- 行啊，随你便。
- Fine, go ahead. Do that, then.

Chinese Punctuation Intelligence (Fixed in v2.2):

# Before (unwanted period)
阿齐尔给了我。
这些新药片，

# After (no unwanted period)
阿齐尔给了我
这些新药片，

Dialogue Format Optimization:

# Before
-怎么这么晚？
-What kept you?

# After
- 怎么这么晚？
- What kept you?
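
The dialogue marker spacing shown in these examples can be approximated with one regular expression. This is a sketch under the assumption that a dialogue marker is a single hyphen at the start of a line followed directly by text. The function name is hypothetical.

```python
import re

# A hyphen at the start of a line, followed by a character that is neither
# whitespace nor another hyphen.
_DIALOGUE_MARKER = re.compile(r"^-(?=[^\s-])", re.MULTILINE)


def space_dialogue_markers(block_text: str) -> str:
    """Insert a space after leading dialogue hyphens that lack one."""
    return _DIALOGUE_MARKER.sub("- ", block_text)


if __name__ == "__main__":
    print(space_dialogue_markers("-怎么这么晚？\n-What kept you?"))
```

Lines that already have a space after the hyphen are left unchanged, so the function is safe to run repeatedly. A line that starts with a negative number would also gain a space, which a fuller implementation would need to guard against.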

With intelligent line breaking, reading speed validation, and Netflix compliance.
