A speech processing toolkit designed for easy configuration of complex speech data preparation pipelines and rapid prototyping of text-to-speech (TTS) models.
This project provides a comprehensive solution for TTS development, featuring:
- Multilingual text processing frontend
- Forced alignment models
- Modular framework for building TTS systems from reusable components
Check out these examples to learn more about the framework design.
- April 2025:
- 🔥 SpeechFlow 1.0 is now available!
- Install Anaconda
- Clone the repository and update submodules
git clone https://github.com/just-ai/speechflow
cd speechflow
git submodule update --init --recursive -f
- Install system dependencies:
sudo apt-get update
sudo apt-get install -y libssl1.1 g++ wget sox ffmpeg
- Configure Python environment:
conda create -n py310 python=3.10
conda activate py310
pip install -r requirements.txt
pip install fairseq==0.12.2 --no-deps
- Install multilingual frontend dependencies:
# Install .NET SDK
wget https://packages.microsoft.com/config/ubuntu/20.04/packages-microsoft-prod.deb -O packages-microsoft-prod.deb
sudo dpkg -i packages-microsoft-prod.deb
rm packages-microsoft-prod.deb
sudo apt-get install -y apt-transport-https && sudo apt-get update
sudo apt-get install -y dotnet-sdk-5.0 aspnetcore-runtime-5.0 dotnet-runtime-5.0 nuget
# install eSpeak
sudo apt-get install -y espeak-ng
- Complete installation:
sh libs/install.sh
pytest tests # run verification tests
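The annotation and training steps below assume a CUDA-capable GPU. Optionally, you can confirm that PyTorch sees your device with a quick one-liner:
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"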
- Install Python 3.10
- Install additional components:
- Install Python packages:
pip install -r requirements.txt
pip install fairseq==0.12.2 --no-deps
pip install -Ue libs/multilingual_text_parser
For containerized deployment:
sh env/singularity.sh # Install Singularity
sh install.sh
singularity shell --nv --writable --no-home -B .:/src --pwd /src torch_*.img
source /ext3/miniconda3/etc/profile.d/conda.sh && conda activate py310
Generate Praat/TextGrid annotation files for training datasets.
Our annotation pipeline automates:
- Audio segmentation into utterances
- Text normalization and phonetic transcription
- Forced alignment
- Audio postprocessing (sample rate conversion, volume normalization)
Supported languages: RU, EN, IT, ES, FR-FR, DE, PT, PT-BR, KK (additional languages via eSpeak-NG)
1) Prepare dataset structure
dataset_root:
- languages.yml
- language_code_1
  - speakers.yml
  - single-speaker-dataset
    - file_1.wav
    - file_1.txt
    ...
    - file_n.wav
    - file_n.txt
  - multi-speaker-dataset
    - speaker_1
      - file_1.wav
      - file_1.txt
      ...
      - file_n.wav
      - file_n.txt
    ...
    - speaker_n
      - file_1.wav
      - file_1.txt
      ...
      - file_n.wav
      - file_n.txt
- language_code_n
  - speakers.yml
  - dataset_1
  ...
  - dataset_n
We recommend using normalized transcriptions that exclude numbers and abbreviations. For supported languages, this package will automatically handle text normalization.
Transcription files are optional. If only audio files are provided, transcriptions will be generated automatically using the Whisper Large v2 ASR model.
For optimal processing, split large audio files into 20–30 minute segments.
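For example, a long recording can be cut into roughly 25-minute chunks with ffmpeg before annotation (the chunk length and output naming here are only an illustration; the pipeline itself re-segments audio into utterances):
ffmpeg -i long_recording.wav -f segment -segment_time 1500 -c copy long_recording_%03d.wav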
The tool supports annotation of datasets with single or multiple speakers. To better understand the structure of source data directories and the formats of the languages.yml and speakers.yml configuration files, refer to the provided example.
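As a minimal sanity check of the layout above (this helper is not part of the toolkit, and the dataset path is a placeholder), you can verify that every audio file either has a paired transcription or will fall back to automatic transcription:
from pathlib import Path

dataset_root = Path("dataset_root")  # placeholder: path to your dataset root
for wav in sorted(dataset_root.rglob("*.wav")):
    txt = wav.with_suffix(".txt")
    status = "ok" if txt.exists() else "no transcription (ASR will be used)"
    print(f"{wav.relative_to(dataset_root)}: {status}")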
2) Run annotation processing
The annotation process includes segmenting the audio file into single utterances, normalizing the text, generating a phonetic transcription, performing forced alignment of the transcription with the audio chunk, detecting silence, converting the audio sample rate, and equalizing the volume.
We provide pre-trained multilingual forced alignment models at the phoneme level. These models were trained on 1,500 hours of audio (from over 8,000 speakers across 9 languages), including datasets such as LJSpeech, VCTK, LibriTTS, Hi-Fi TTS, and others.
Run this script to get segmentations:
Setup for a single GPU (minimum requirements: 64 GB of RAM and 24 GB of VRAM):
python -m annotator.runner
-d source_dataset_path
-o segmentation_dataset_name_or_path
-ngpu 1 -nproc 16
--pretrained_models mfa_stage1_epoch=29-step=468750.pt mfa_stage2_epoch=59-step=937500.pt
[--batch_size <int>] # adjust batch size to match your device’s capabilities (bs=16 by default)
Setup for multiple GPUs (minimum requirements: 256 GB of RAM and 24 GB of VRAM per GPU):
python -m annotator.runner
-d source_dataset_path
-o segmentation_dataset_name_or_path
-ngpu 4 -nproc 32 -ngw 8
--pretrained_models mfa_stage1_epoch=29-step=468750.pt mfa_stage2_epoch=59-step=937500.pt
[--batch_size <int>]
To improve the alignment of your data, use the --finetune_model flag (or omit the checkpoint to train from scratch):
python -m annotator.runner
-d source_dataset_path
-o segmentation_dataset_name_or_path
-ngpu 1 -nproc 16
[--finetune_model mfa_stage1_epoch=29-step=468750.pt]
[--batch_size <int>]
To process individual audio files, use this interface.
The *.TextGrid files can be opened in Praat. Additional examples are available here.
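If you prefer to inspect the annotations programmatically rather than in Praat, a minimal sketch using the third-party textgrid package (pip install textgrid; tier names and contents depend on your data) could look like this:
import textgrid

tg = textgrid.TextGrid.fromFile("file_1.TextGrid")
print([tier.name for tier in tg.tiers])         # available annotation tiers
for interval in tg.tiers[0]:                    # intervals of the first tier
    print(f"{interval.minTime:.3f}-{interval.maxTime:.3f}  {interval.mark}")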
Our alignment models are based on the Glow-TTS codebase. Unlike CTC-based alignment methods, the MAS-based approach (Monotonic Alignment Search) provides fine-grained token positioning. Additionally, we implemented a two-stage training scheme to detect silences of varying durations between words in the speech signal and to insert SIL tokens at the corresponding positions in the text transcription.
We also address the long-standing issue of instability when training speech synthesis models, which occurs when voiced phrases contain prolonged silence at the beginning or end. This issue is resolved by annotating such silent segments with BOS/EOS tokens and subsequently removing them when audio files are loaded in the data preparation pipeline.
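A minimal sketch of this trimming step outside the pipeline (the tier name "phones" and the literal "BOS"/"EOS" marks are assumptions; the toolkit's own loader may use different names):
import soundfile as sf
import textgrid

def crop_bos_eos(wav_path, tg_path, tier_name="phones"):
    audio, sr = sf.read(wav_path)
    tg = textgrid.TextGrid.fromFile(tg_path)
    start, end = 0.0, tg.maxTime
    for interval in tg.getFirst(tier_name):  # assumed tier with BOS/EOS marks
        if interval.mark == "BOS":
            start = interval.maxTime   # speech starts after the leading silence
        elif interval.mark == "EOS":
            end = interval.minTime     # speech ends before the trailing silence
    return audio[int(start * sr):int(end * sr)], sr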
Note
The default batch size for the training configs is set for a single A100 80GB GPU.
Tip
If you encounter any issues, set the environment variable VERBOSE=1 to enable extended logging.
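For example, any of the commands in this document can be prefixed this way:
VERBOSE=1 python -m annotator.runner -d source_dataset_path -o segmentation_dataset_name_or_path -ngpu 1 -nproc 16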
- Build a dump of precomputed features for the TTS task.
Calculating certain features (e.g., biometric embeddings or SSL features) can be computationally expensive. To optimize batch processing, we precompute these features using a GPU for each data sample and store them on disk. For details about which handlers are cached, refer to the dump section.
python -m tts.acoustic_models.scripts.dump
-cd tts/acoustic_models/configs/tts/tts_data_24khz.yml
-nproc 5 -ngpu 1 # [-nproc 20 -ngpu 4] for multi-GPU configuration
[--data_root segmentation_dataset_path] # replaces the default dataset path in the config
[--value_select ru] # for the Russian language
- Training a Conditional Flow Matching (CFM) model
After the dump is created, run the model training script.
python -m tts.acoustic_models.scripts.train
-cd tts/acoustic_models/configs/tts/tts_data_24khz.yml
-c tts/acoustic_models/configs/tts/cfm_bigvgan.yml
[--data_root segmentation_dataset_path] [--value_select ru]
[--batch_size <int>] # adjust the batch size to match your device’s capabilities
You can use BigVGAN to convert the acoustic model's output mel-spectrogram into an audio signal. However, we recommend fine-tuning this vocoder on your voices.
python -m tts.vocoders.scripts.train
-cd tts/vocoders/configs/vocos/mel_bigvgan_data_24khz.yml
-c tts/vocoders/configs/vocos/mel_bigvgan.yml
[--data_root segmentation_dataset_path] [--batch_size <int>]
You can also perform joint training of the acoustic model and the vocoder using a GAN-like scheme.
python -m tts.vocoders.scripts.train
-cd tts/vocoders/configs/vocos/e2e_tts_data_24khz.yml
-c tts/vocoders/configs/vocos/styletts2_bigvgan.yml
[--data_root segmentation_dataset_path] [--value_select ru] [--batch_size <int>]
Important
First, create a feature dump for the e2e_tts_data_24khz.yml configuration file.
You can build a prosodic model to enhance the expressiveness of synthetic voices. For further details on this method, please refer to our paper.
- Build a dump of the required features
python -m tts.acoustic_models.scripts.dump
-cd tts/acoustic_models/configs/prosody/prosody_data_24khz.yml
-nproc 5 -ngpu 1 # [-nproc 20 -ngpu 4] for multi-GPU configuration
[--data_root segmentation_dataset_path] [--value_select ru]
- Training the prosody model
python -m tts.acoustic_models.scripts.train
-cd tts/acoustic_models/configs/prosody/prosody_data_24khz.yml
-c tts/acoustic_models/configs/prosody/prosody_model.yml
[--data_root segmentation_dataset_path] [--value_select ru] [--batch_size <int>]
- Update datasets
Here, we add a prosody tier to the *.TextGridStage3 segmentation files, which will contain indices of prosodic contours at the word level (see the reading example after the command below).
python -m tts.acoustic_models.scripts.prosody_annotation
-ckpt /path/to/prosody_model_checkpoint
-nproc 5 -ngpu 1
[--data_root segmentation_dataset_path] [--batch_size <int>]
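To check the result, a short sketch that reads the new tier with the textgrid package (the tier name "prosody" is an assumption; use whatever name appears in your updated files):
from collections import Counter
import textgrid

tg = textgrid.TextGrid.fromFile("file_1.TextGridStage3")
prosody_tier = tg.getFirst("prosody")                    # assumed tier name
print(Counter(i.mark for i in prosody_tier if i.mark))   # contour-index distribution over words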
- Training the prosody prediction model using text
python -m nlp.prosody_prediction.scripts.train
-cd nlp/prosody_prediction/configs/data.yml
-c nlp/prosody_prediction/configs/model.yml
[--data_root segmentation_dataset_path] [--value_select ru] [--batch_size <int>]
- Training TTS models
Similar to the steps discussed above.
See eval.py
@inproceedings{korotkova24_interspeech,
title = {Word-level Text Markup for Prosody Control in Speech Synthesis},
author = {Yuliya Korotkova and Ilya Kalinovskiy and Tatiana Vakhrusheva},
year = {2024},
booktitle = {Interspeech 2024},
pages = {2280--2284},
doi = {10.21437/Interspeech.2024-715},
issn = {2958-1796},
}
Our TTS models heavily rely on insights and code from various projects.
Glow-TTS | FastSpeech | ForwardTacotron | Tacotron2 | StyleTTS2 | StableTTS | XTTS | Vocos | HiFi-GAN | BigVGAN