A speech processing toolkit designed for easy configuration of complex speech data preparation pipelines and rapid prototyping of text-to-speech (TTS) models.
This project provides a comprehensive solution for TTS development, featuring:
- Multilingual text processing frontend
- Forced alignment models
- Modular framework for building TTS systems from reusable components
Check out these examples to learn more about the framework design.
- April 2025:
- 🔥 SpeechFlow 1.0 is now available!
- Install Anaconda
- Clone the repository and update submodules
git clone https://github.com/just-ai/speechflow
cd speechflow
git submodule update --init --recursive -f
- Install system dependencies:
sudo apt-get update
sudo apt-get install -y libssl1.1 g++ wget sox ffmpeg
- Configure Python environment:
conda create -n py310 python=3.10
conda activate py310
pip install -r requirements.txt
pip install fairseq==0.12.2 --no-deps
- Install multilingual frontend dependencies:
# Install .NET SDK
wget https://packages.microsoft.com/config/ubuntu/20.04/packages-microsoft-prod.deb -O packages-microsoft-prod.deb
sudo dpkg -i packages-microsoft-prod.deb
rm packages-microsoft-prod.deb
sudo apt-get install -y apt-transport-https && sudo apt-get update
sudo apt-get install -y dotnet-sdk-5.0 aspnetcore-runtime-5.0 dotnet-runtime-5.0 nuget
# install eSpeak
sudo apt-get install -y espeak-ng
- Complete installation:
sh libs/install.sh
pytest tests # run verification tests
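The annotation and training steps below assume a CUDA-capable GPU. Optionally, you can confirm that PyTorch sees your device with a quick one-liner:
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"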
- Install Python 3.10
- Install additional components:
- Install Python packages:
pip install -r requirements.txt
pip install fairseq==0.12.2 --no-deps
pip install -Ue libs/multilingual_text_parser
For containerized deployment:
sh env/singularity.sh # Install Singularity
sh install.sh
singularity shell --nv --writable --no-home -B .:/src --pwd /src torch_*.img
source /ext3/miniconda3/etc/profile.d/conda.sh && conda activate py310
Generate Praat/TextGrid annotation files for training datasets.
Our annotation pipeline automates:
- Audio segmentation into utterances
- Text normalization and phonetic transcription
- Forced alignment
- Audio postprocessing (sample rate conversion, volume normalization)
Supported languages: RU, EN, IT, ES, FR-FR, DE, PT, PT-BR, KK (additional languages via eSpeak-NG)
1) Prepare dataset structure
dataset_root:
- languages.yml
- language_code_1
  - speakers.yml
  - single-speaker-dataset
    - file_1.wav
    - file_1.txt
    ...
    - file_n.wav
    - file_n.txt
  - multi-speaker-dataset
    - speaker_1
      - file_1.wav
      - file_1.txt
      ...
      - file_n.wav
      - file_n.txt
    ...
    - speaker_n
      - file_1.wav
      - file_1.txt
      ...
      - file_n.wav
      - file_n.txt
- language_code_n
  - speakers.yml
  - dataset_1
  ...
  - dataset_n
We recommend using normalized transcriptions that exclude numbers and abbreviations. For supported languages, this package will automatically handle text normalization.
Transcription files are optional. If only audio files are provided, transcriptions will be generated automatically using the Whisper Large v2 ASR model.
For optimal processing, split large audio files into 20–30 minute segments.
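For example, a long recording can be cut into roughly 25-minute chunks with ffmpeg before annotation (the chunk length and output naming here are only an illustration; the pipeline itself re-segments audio into utterances):
ffmpeg -i long_recording.wav -f segment -segment_time 1500 -c copy long_recording_%03d.wav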
The tool supports annotation of datasets with single or multiple speakers. To better understand the structure of source data directories and the formats of the languages.yml and speakers.yml configuration files, refer to the provided example.
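As a minimal sanity check of the layout above (this helper is not part of the toolkit, and the dataset path is a placeholder), you can verify that every audio file either has a paired transcription or will fall back to automatic transcription:
from pathlib import Path

dataset_root = Path("dataset_root")  # placeholder: path to your dataset root
for wav in sorted(dataset_root.rglob("*.wav")):
    txt = wav.with_suffix(".txt")
    status = "ok" if txt.exists() else "no transcription (ASR will be used)"
    print(f"{wav.relative_to(dataset_root)}: {status}")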
2) Run annotation processing
The annotation process includes segmenting the audio file into single utterances, normalizing the text, generating a phonetic transcription, performing forced alignment of the transcription with the audio chunk, detecting silence, converting the audio sample rate, and equalizing the volume.
We provide pre-trained multilingual forced alignment models at the phoneme level. These models were trained on 1,500 hours of audio (from over 8,000 speakers across 9 languages), including datasets such as LJSpeech, VCTK, LibriTTS, Hi-Fi TTS, and others.
Run this script to get segmentations:
Setup for a single GPU (minimum requirements: 64 GB of RAM and 24 GB of VRAM):
python -m annotator.runner
-d source_dataset_path
-o segmentation_dataset_name_or_path
-ngpu 1 -nproc 16
--pretrained_models mfa_stage1_epoch=29-step=468750.pt mfa_stage2_epoch=59-step=937500.pt
[--batch_size <int>] # adjust batch size to match your device’s capabilities (bs=16 by default)
Setup for multiple GPUs (minimum requirements: 256 GB of RAM and 24 GB of VRAM per GPU):
python -m annotator.runner
-d source_dataset_path
-o segmentation_dataset_name_or_path
-ngpu 4 -nproc 32 -ngw 8
--pretrained_models mfa_stage1_epoch=29-step=468750.pt mfa_stage2_epoch=59-step=937500.pt
[--batch_size <int>]
To improve the alignment of your data, use the --finetune_model flag (or omit the checkpoint to train from scratch):
python -m annotator.runner
-d source_dataset_path
-o segmentation_dataset_name_or_path
-ngpu 1 -nproc 16
[--finetune_model mfa_stage1_epoch=29-step=468750.pt]
[--batch_size <int>]
To process individual audio files, use this interface.
The *.TextGrid files can be opened in Praat. Additional examples are available here.
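If you prefer to inspect the annotations programmatically rather than in Praat, a minimal sketch using the third-party textgrid package (pip install textgrid; tier names and contents depend on your data) could look like this:
import textgrid

tg = textgrid.TextGrid.fromFile("file_1.TextGrid")
print([tier.name for tier in tg.tiers])         # available annotation tiers
for interval in tg.tiers[0]:                    # intervals of the first tier
    print(f"{interval.minTime:.3f}-{interval.maxTime:.3f}  {interval.mark}")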
Our alignment models are based on the Glow-TTS codebase. Unlike CTC-based alignment methods, the MAS-based approach (Monotonic Alignment Search) provides fine-grained token positioning. Additionally, we implemented a two-stage training scheme to detect silences of varying durations between words in the speech signal and to insert SIL tokens at the corresponding positions in the text transcription.
We also address the long-standing issue of instability when training speech synthesis models, which occurs when voiced phrases contain prolonged silence at the beginning or end. This issue is resolved by annotating such silent segments with BOS/EOS tokens and subsequently removing them when audio files are loaded in the data preparation pipeline.
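A minimal sketch of this trimming step outside the pipeline (the tier name "phones" and the literal "BOS"/"EOS" marks are assumptions; the toolkit's own loader may use different names):
import soundfile as sf
import textgrid

def crop_bos_eos(wav_path, tg_path, tier_name="phones"):
    audio, sr = sf.read(wav_path)
    tg = textgrid.TextGrid.fromFile(tg_path)
    start, end = 0.0, tg.maxTime
    for interval in tg.getFirst(tier_name):  # assumed tier with BOS/EOS marks
        if interval.mark == "BOS":
            start = interval.maxTime   # speech starts after the leading silence
        elif interval.mark == "EOS":
            end = interval.minTime     # speech ends before the trailing silence
    return audio[int(start * sr):int(end * sr)], sr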
Note
The default batch size for the training configs is set for a single A100 80GB GPU.
Tip
If you encounter any issues, set the environment variable VERBOSE=1 to enable extended logging.
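For example, any of the commands in this document can be prefixed this way:
VERBOSE=1 python -m annotator.runner -d source_dataset_path -o segmentation_dataset_name_or_path -ngpu 1 -nproc 16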
- Build a dump of precomputed features for the TTS task.
Calculating certain features (e.g., biometric embeddings or SSL features) can be computationally expensive. To optimize batch processing, we precompute these features using a GPU for each data sample and store them on disk. For details about which handlers are cached, refer to the dump section.
python -m tts.acoustic_models.scripts.dump
-cd tts/acoustic_models/configs/tts/tts_data_24khz.yml
-nproc 5 -ngpu 1 # [-nproc 20 -ngpu 4] for multi-GPU configuration
[--data_root segmentation_dataset_path] # replaces the default dataset path in the config
[--value_select ru] # for the Russian language
- Training a Conditional Flow Matching (CFM) model
After the dump is created, run the model training script.
python -m tts.acoustic_models.scripts.train
-cd tts/acoustic_models/configs/tts/tts_data_24khz.yml
-c tts/acoustic_models/configs/tts/cfm_bigvgan.yml
[--data_root segmentation_dataset_path] [--value_select ru]
[--batch_size <int>] # adjust the batch size to match your device’s capabilities
You can use BigVGAN to convert the acoustic model's output mel-spectrogram into an audio signal. However, we recommend fine-tuning this vocoder on your voices.
python -m tts.vocoders.scripts.train
-cd tts/vocoders/configs/vocos/mel_bigvgan_data_24khz.yml
-c tts/vocoders/configs/vocos/mel_bigvgan.yml
[--data_root segmentation_dataset_path] [--batch_size <int>]
You can also perform joint training of the acoustic model and the vocoder using a GAN-like scheme.
python -m tts.vocoders.scripts.train
-cd tts/vocoders/configs/vocos/e2e_tts_data_24khz.yml
-c tts/vocoders/configs/vocos/styletts2_bigvgan.yml
[--data_root segmentation_dataset_path] [--value_select ru] [--batch_size <int>]
Important
First, create a feature dump for the e2e_tts_data_24khz.yml configuration file.
You can build a prosodic model to enhance the expressiveness of synthetic voices. For further details on this method, please refer to our paper.
- Build a dump of the required features
python -m tts.acoustic_models.scripts.dump
-cd tts/acoustic_models/configs/prosody/prosody_data_24khz.yml
-nproc 5 -ngpu 1 # [-nproc 20 -ngpu 4] for multi-GPU configuration
[--data_root segmentation_dataset_path] [--value_select ru]
- Training the prosody model
python -m tts.acoustic_models.scripts.train
-cd tts/acoustic_models/configs/prosody/prosody_data_24khz.yml
-c tts/acoustic_models/configs/prosody/prosody_model.yml
[--data_root segmentation_dataset_path] [--value_select ru] [--batch_size <int>]
- Update datasets
Here, we add a prosody tier to the *.TextGridStage3 segmentation files, which will contain indices of prosodic contours at the word level (see the reading example after the command below).
python -m tts.acoustic_models.scripts.prosody_annotation
-ckpt /path/to/prosody_model_checkpoint
-nproc 5 -ngpu 1
[--data_root segmentation_dataset_path] [--batch_size <int>]
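To check the result, a short sketch that reads the new tier with the textgrid package (the tier name "prosody" is an assumption; use whatever name appears in your updated files):
from collections import Counter
import textgrid

tg = textgrid.TextGrid.fromFile("file_1.TextGridStage3")
prosody_tier = tg.getFirst("prosody")                    # assumed tier name
print(Counter(i.mark for i in prosody_tier if i.mark))   # contour-index distribution over words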
- Training the prosody prediction model using text
python -m nlp.prosody_prediction.scripts.train
-cd nlp/prosody_prediction/configs/data.yml
-c nlp/prosody_prediction/configs/model.yml
[--data_root segmentation_dataset_path] [--value_select ru] [--batch_size <int>]
- Training TTS models
Similar to the steps discussed above.
See eval.py
@inproceedings{korotkova24_interspeech,
title = {Word-level Text Markup for Prosody Control in Speech Synthesis},
author = {Yuliya Korotkova and Ilya Kalinovskiy and Tatiana Vakhrusheva},
year = {2024},
booktitle = {Interspeech 2024},
pages = {2280--2284},
doi = {10.21437/Interspeech.2024-715},
issn = {2958-1796},
}
Our TTS models heavily rely on insights and code from various projects.
Glow-TTS | FastSpeech | ForwardTacotron | Tacotron2 | StyleTTS2 | StableTTS | XTTS | Vocos | HiFi-GAN | BigVGAN