
TTS Dataset Generator using Whisper Diarization

This is a fork of whisper-diarization.

Purpose

The primary goal of this fork is to adapt the original speaker diarization pipeline to streamline the creation of Text-To-Speech (TTS) datasets.

While the original tool provides timed speaker transcripts (.srt), this fork modifies the process to directly output speaker-specific audio chunks and their corresponding transcription text files, reducing the manual steps needed for TTS data preparation.

It still leverages the core strengths of the original pipeline:

  1. Accurate transcription via Whisper.
  2. Speaker segment identification via NeMo diarization.
  3. Precise timing information.

Key Changes in this Fork

  • Configuration File: Uses a config.yaml file to specify input audio and output paths, instead of relying solely on command-line arguments.
  • Automated Output: Directly generates segmented audio chunks (.wav) and individual transcription files (.txt) organized by speaker, in addition to the standard .srt file (a sketch of this step follows below).
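
To make the flow concrete, here is a minimal sketch of what the configuration-driven chunk export might look like. This is illustrative only, not the fork's actual code: the segments list is placeholder data standing in for the diarization result, and pydub is an assumed audio backend (the real pipeline may slice audio differently). Only the audio_path and output_directory keys come from the config.yaml example shown under Usage.

    # Minimal sketch (illustrative, not the fork's actual code):
    # read config.yaml, then cut per-speaker chunks with pydub.
    import os
    import yaml
    from pydub import AudioSegment  # assumed dependency for slicing

    with open("config.yaml", "r", encoding="utf-8") as f:
        config = yaml.safe_load(f)

    audio = AudioSegment.from_file(config["audio_path"])

    # Placeholder for the diarization result: (start_ms, end_ms, speaker, text)
    segments = [(0, 4200, "speaker_0", "Hello there."),
                (4200, 9100, "speaker_1", "Hi, how are you?")]

    counters = {}
    for start_ms, end_ms, speaker, text in segments:
        speaker_dir = os.path.join(config["output_directory"], speaker)
        os.makedirs(speaker_dir, exist_ok=True)
        counters[speaker] = counters.get(speaker, 0) + 1
        stem = f"segment_{counters[speaker]:03d}"
        # pydub slices by millisecond offsets; write the .wav/.txt pair side by side
        audio[start_ms:end_ms].export(os.path.join(speaker_dir, stem + ".wav"), format="wav")
        with open(os.path.join(speaker_dir, stem + ".txt"), "w", encoding="utf-8") as out:
            out.write(text)

Numbering each speaker's chunks independently matches the output layout described under Usage below.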

Installation

This fork requires the same dependencies as the original whisper-diarization. Please follow the installation instructions provided in the original repository's README.

(This fork adds no extra dependencies beyond those of the original project, unless your own code changes introduce them.)

Usage for TTS Dataset Preparation

  1. Configure config.yaml: Create or edit the config.yaml file in the repository root. Specify at least the following:
    # Example config.yaml structure (adjust based on your actual implementation)
    audio_path: "/path/to/your/input/audio.mp3"
    output_directory: "/path/to/your/output/dataset_folder"
  2. Run the Script: Execute the main diarization script from the repository root:
    python diarize.py
  3. Check the Output: Navigate to the output_directory you specified in config.yaml. You should find the following (a sketch for consuming this layout appears after the list):
    • An .srt file containing the full diarized transcript (e.g., audio.srt).
    • Subdirectories for each identified speaker (e.g., speaker_0/, speaker_1/, etc.).
    • Inside each speaker directory:
      • Numbered audio chunk files (e.g., segment_001.wav, segment_002.wav, ...).
      • Corresponding transcription text files (e.g., segment_001.txt, segment_002.txt, ...).
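
Because the per-speaker layout pairs each .wav with a same-named .txt, assembling the output into a TTS training manifest is a short script. The sketch below is an illustration, not something this repository produces: the "wav_path|text" manifest format is a common convention for TTS toolkits, assumed here for the example.

    # Illustrative only: pair each audio chunk with its transcript
    # and write a simple "wav_path|text" manifest for TTS training.
    from pathlib import Path

    output_dir = Path("/path/to/your/output/dataset_folder")  # matches config.yaml
    with open(output_dir / "manifest.txt", "w", encoding="utf-8") as manifest:
        for wav in sorted(output_dir.glob("speaker_*/segment_*.wav")):
            txt = wav.with_suffix(".txt")  # transcript written alongside the chunk
            if txt.exists():
                text = txt.read_text(encoding="utf-8").strip()
                manifest.write(f"{wav}|{text}\n")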
