Watch the TTSizer Demo & See It In Action:
(The demo above showcases the AnimeVox Character TTS Corpus, a dataset created using TTSizer.)
TTSizer automates the tedious process of creating high-quality Text-To-Speech datasets from raw media. Input a video or audio file, and get back perfectly aligned audio-text pairs for each speaker.
- 🎯 **End-to-End Automation**: From raw media files to cleaned, TTS-ready datasets
- 🗣️ **Advanced Multi-Speaker Diarization**: Handles complex audio with multiple speakers
- 🤖 **State-of-the-Art Models**: MelBandRoformer, Gemini, CTC-Aligner, Wespeaker
- 🧐 **Quality Control**: Automatic outlier detection and flagging
- ⚙️ **Fully Configurable**: Control every aspect via `config.yaml`
```mermaid
graph LR
    A[🎬 Raw Media] --> B[🎤 Extract Audio]
    B --> C[🎵 Vocal Separation]
    C --> D[🔊 Normalize Volume]
    D --> E[🗣️ Speaker Diarization]
    E --> F[⏱️ Forced Alignment]
    F --> G[🧐 Outlier Detection]
    G --> H[🚩 ASR Validation]
    H --> I[✅ TTS Dataset]
```
```bash
git clone https://github.com/taresh18/TTSizer.git
cd TTSizer
pip install -r requirements.txt
```
- Download the pre-trained models (see Setup Guide)
- Add `GEMINI_API_KEY` to a `.env` file in the project root:

  ```
  GEMINI_API_KEY="YOUR_API_KEY_HERE"
  ```
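To sanity-check that the key is readable before launching the pipeline, you can parse the `.env` file with a small stdlib-only helper (a hypothetical snippet for verification; how TTSizer itself loads the file is not shown here):

```python
# check_env.py -- minimal sketch using only the standard library;
# a hypothetical helper, not part of TTSizer itself.
import os
from pathlib import Path

def load_env_file(path=".env"):
    """Load simple KEY="value" lines from a .env file into os.environ."""
    for raw in Path(path).read_text().splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and lines without an assignment
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Existing environment variables take precedence
        os.environ.setdefault(key.strip(), value.strip().strip('"'))

# Usage: load_env_file(); assert os.getenv("GEMINI_API_KEY") is not None
```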
Edit `configs/config.yaml`:

```yaml
project_setup:
  video_input_base_dir: "/path/to/your/videos"
  output_base_dir: "/path/to/output"
  target_speaker_labels: ["Speaker1", "Speaker2"]
```
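If you want to check your paths and speaker labels programmatically before a run, the `project_setup` block can be read with PyYAML (assuming PyYAML is installed; any config keys beyond the three shown above are outside this sketch):

```python
# inspect_config.py -- minimal sketch for reading TTSizer's config file.
# Assumes PyYAML (pip install pyyaml); keys beyond project_setup are not covered.
import yaml

def load_project_setup(path="configs/config.yaml"):
    """Return the project_setup mapping from the YAML config."""
    with open(path) as f:
        cfg = yaml.safe_load(f)
    return cfg["project_setup"]

# Usage:
# setup = load_project_setup()
# print(setup["video_input_base_dir"], setup["target_speaker_labels"])
```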
```bash
python -m ttsizer.main
```
<details>
<summary>Click to expand detailed setup instructions</summary>

**Prerequisites**

- Python 3.9+
- CUDA-enabled GPU (>4 GB VRAM)
- FFmpeg (must be installed and accessible in your system's PATH)
- Google Gemini API key

**Model weights**

- Vocal Extraction: download `kimmel_unwa_ft2_bleedless.ckpt` from HuggingFace
- Speaker Embeddings: download from `wespeaker-voxceleb-resnet293-LM`

Update the model paths in `configs/config.yaml`.

</details>
<details>
<summary>Click for pipeline control and other advanced options</summary>

You can control which parts of the pipeline run, which is useful for debugging or reprocessing:

```yaml
pipeline_control:
  run_only_stage: "ctc_align"     # Run a specific stage only
  start_stage: "llm_diarize"      # Start from a specific stage
  end_stage: "outlier_detect"     # Stop at a specific stage
```

</details>
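The three options above can be understood as selecting a slice of the ordered stage list. The sketch below illustrates the intended semantics; only the three stage names shown in the config comments come from this README, so the rest of the stage list (and TTSizer's actual resolution code) is an assumption:

```python
# Hypothetical sketch of how pipeline_control could select stages.
# Only "llm_diarize", "ctc_align", and "outlier_detect" are named in the
# config above; the other stage names here are illustrative assumptions.
STAGES = ["extract_audio", "vocal_sep", "normalize", "llm_diarize",
          "ctc_align", "outlier_detect", "asr_validate"]

def select_stages(run_only_stage=None, start_stage=None, end_stage=None):
    """Return the ordered subset of STAGES to execute."""
    if run_only_stage is not None:
        # run_only_stage takes precedence over start/end
        return [run_only_stage]
    start = STAGES.index(start_stage) if start_stage else 0
    end = STAGES.index(end_stage) if end_stage else len(STAGES) - 1
    return STAGES[start:end + 1]
```

For example, `select_stages(start_stage="llm_diarize", end_stage="outlier_detect")` would run diarization, alignment, and outlier detection while skipping audio extraction and ASR validation.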
The project is organized as follows:

```
TTSizer/
├── configs/
│   └── config.yaml        # Pipeline & model configurations
├── ttsizer/
│   ├── __init__.py
│   ├── main.py            # Main script to run the pipeline
│   ├── core/              # Core components of the pipeline
│   ├── models/            # Vocal removal models
│   └── utils/             # Utility programs
├── .env                   # For API keys
├── README.md              # This file
├── requirements.txt       # Python package dependencies
└── weights/               # For storing downloaded model weights (gitignored)
```
This project is released under the Apache License 2.0. See the LICENSE file for details.
- Vocal Extraction: pcunwa/Kim-Mel-Band-Roformer-FT by Unwa
- Forced Alignment: ctc-forced-aligner by MahmoudAshraf97
- ASR: NVIDIA NeMo Parakeet
- Speaker Embeddings: Wespeaker/wespeaker-voxceleb-resnet293-LM from Wespeaker