πŸŽ™οΈ Automatically transcribe audio/video into high-quality, speaker-specific Text-To-Speech datasets ✨

TTSizer πŸŽ™οΈβœ¨

Transform Raw Audio/Video into Production-Ready TTS Datasets

License: Apache 2.0 Python Version

Watch TTSizer in action: TTSizer Demo Video. The demo showcases the AnimeVox Character TTS Corpus, a dataset created with TTSizer.

🎯 What It Does

TTSizer automates the tedious process of creating high-quality Text-To-Speech datasets from raw media. Input a video or audio file, and get back perfectly aligned audio-text pairs for each speaker.

✨ Key Features

🎯 End-to-End Automation: From raw media files to cleaned, TTS-ready datasets
πŸ—£οΈ Advanced Multi-Speaker Diarization: Handles complex audio with multiple speakers
πŸ€– State-of-the-Art Models: MelBandRoformer, Gemini, CTC-Aligner, Wespeaker
🧐 Quality Control: Automatic outlier detection and flagging
βš™οΈ Fully Configurable: Control every aspect via config.yaml

πŸ“Š Pipeline Flow

graph LR
    A[🎬 Raw Media] --> B[🎀 Extract Audio]
    B --> C[πŸ”‡ Vocal Separation]  
    C --> D[πŸ”Š Normalize Volume]
    D --> E[✍️ Speaker Diarization]
    E --> F[⏱️ Forced Alignment]
    F --> G[🧐 Outlier Detection]
    G --> H[🚩 ASR Validation]
    H --> I[βœ… TTS Dataset]
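The stages above run sequentially, each consuming the previous stage's output. A minimal sketch of that pattern in Python (the stage functions and their string-path outputs are illustrative placeholders, not TTSizer's actual API):

```python
from typing import Callable

# Each stage takes the artifact path produced by the previous stage and
# returns the path to its own output. These two stages are illustrative.
def extract_audio(path: str) -> str:
    return path + ".wav"

def separate_vocals(path: str) -> str:
    return path + ".vocals"

def run_pipeline(media_path: str, stages: list[Callable[[str], str]]) -> str:
    out = media_path
    for stage in stages:
        out = stage(out)  # output of one stage feeds the next
    return out

result = run_pipeline("episode1.mkv", [extract_audio, separate_vocals])
```

This chain-of-stages design is what makes the selective stage execution described later possible: any contiguous window of the list can be run on its own.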

πŸƒ Quick Start

1. Clone & Install

git clone https://github.com/taresh18/TTSizer.git
cd TTSizer
pip install -r requirements.txt

2. Setup Models & API Key

  • Download pre-trained models (see Setup Guide)
  • Add GEMINI_API_KEY to a .env file in the project root:
GEMINI_API_KEY="YOUR_API_KEY_HERE"
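A .env file like this is typically read into the process environment at startup. Whether TTSizer uses python-dotenv or its own loader isn't shown here, so treat this stdlib-only parser as a generic sketch of the idea:

```python
import os

def load_env(path: str = ".env") -> None:
    """Read KEY="value" lines from a .env file into os.environ."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # Strip optional surrounding quotes from the value
            os.environ[key.strip()] = value.strip().strip('"').strip("'")

if os.path.exists(".env"):
    load_env()
api_key = os.environ.get("GEMINI_API_KEY")
```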

3. Configure

Edit configs/config.yaml:

project_setup:
  video_input_base_dir: "/path/to/your/videos"
  output_base_dir: "/path/to/output"
  target_speaker_labels: ["Speaker1", "Speaker2"]
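Before kicking off a long run, it can help to sanity-check these values. The field names below follow the config.yaml excerpt above, but the validation helper itself is a hypothetical sketch, not part of TTSizer:

```python
from dataclasses import dataclass

@dataclass
class ProjectSetup:
    video_input_base_dir: str
    output_base_dir: str
    target_speaker_labels: list[str]

    def validate(self) -> None:
        # Fail fast on obviously bad values before the pipeline starts.
        if not self.video_input_base_dir or not self.output_base_dir:
            raise ValueError("input/output directories must be set")
        if not self.target_speaker_labels:
            raise ValueError("at least one target speaker label is required")

cfg = ProjectSetup(
    video_input_base_dir="/path/to/your/videos",
    output_base_dir="/path/to/output",
    target_speaker_labels=["Speaker1", "Speaker2"],
)
cfg.validate()
```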

4. Run TTSizer!

python -m ttsizer.main

πŸ› οΈ Setup & Installation


Prerequisites

  • Python 3.9+
  • CUDA-enabled GPU (>4 GB VRAM)
  • FFmpeg (Must be installed and accessible in your system's PATH)
  • Google Gemini API key
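The prerequisites above can be verified up front rather than discovered mid-run. A small illustrative helper (not part of TTSizer) for that check:

```python
import os
import shutil
import sys

def on_path(tool: str) -> bool:
    """True if `tool` resolves to an executable on PATH."""
    return shutil.which(tool) is not None

def missing_prereqs() -> list[str]:
    """Return a human-readable list of unmet prerequisites."""
    problems = []
    if sys.version_info < (3, 9):
        problems.append("Python 3.9+")
    if not on_path("ffmpeg"):
        problems.append("FFmpeg on PATH")
    if not os.environ.get("GEMINI_API_KEY"):
        problems.append("GEMINI_API_KEY in environment")
    return problems
```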

Manual Model Downloads

  1. Vocal Extraction: Download kimmel_unwa_ft2_bleedless.ckpt from HuggingFace
  2. Speaker Embeddings: Download from wespeaker-voxceleb-resnet293-LM

Update model paths in config.yaml.

βš™οΈ Advanced Configuration


Selective Stage Execution

You can control which parts of the pipeline run, which is useful for debugging or reprocessing:

pipeline_control:
  run_only_stage: "ctc_align"      # Run specific stage only
  start_stage: "llm_diarize"       # Start from specific stage  
  end_stage: "outlier_detect"      # Stop at specific stage
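These three options can be resolved into a window over the ordered stage list. A sketch of that resolution logic; only "llm_diarize", "ctc_align", and "outlier_detect" appear in the config excerpt above, so the other stage names here are inferred from the pipeline diagram and may differ from TTSizer's internals:

```python
# Ordered stage names (partly illustrative, see lead-in above).
STAGES = [
    "extract_audio", "vocal_separation", "normalize",
    "llm_diarize", "ctc_align", "outlier_detect", "asr_validate",
]

def stages_to_run(run_only=None, start=None, end=None):
    """Resolve run_only_stage / start_stage / end_stage into a stage list."""
    if run_only is not None:
        return [run_only]
    i = STAGES.index(start) if start else 0
    j = STAGES.index(end) + 1 if end else len(STAGES)
    return STAGES[i:j]
```

With no options set, every stage runs; run_only_stage takes precedence over the start/end window.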

πŸ—οΈ Project Structure

The project is organized as follows:

TTSizer/
β”œβ”€β”€ configs/
β”‚   └── config.yaml                 # Pipeline & model configurations
β”œβ”€β”€ ttsizer/
β”‚   β”œβ”€β”€ __init__.py
β”‚   β”œβ”€β”€ main.py                     # Main script to run the pipeline
β”‚   │── core/                       # Core components of the pipeline
β”‚   β”œβ”€β”€ models/                     # Vocal removal models
β”‚   └── utils/                      # Utility programs
β”œβ”€β”€ .env                            # For API keys
β”œβ”€β”€ README.md                       # This file
β”œβ”€β”€ requirements.txt                # Python package dependencies
└── weights/                        # For storing downloaded model weights (gitignored)

πŸ“œ License

This project is released under the Apache License 2.0. See the LICENSE file for details.
