Implemented by: Abdelkrim Halimi, Walid Ghenaiet, Melissa Dahlia Attabi
Original paper by: Keon Lee, Dong Won Kim, Jaehyeon Kim, Seungjun Chung, Jaewoong Cho
Latent Diffusion Models (LDMs) have shown high performance in various tasks such as image, audio, and video generation. When applied to Text-To-Speech (TTS), these models typically rely on domain-specific factors (e.g., phonemes and durations) to ensure proper temporal alignment between text and speech. This dependency complicates data preparation and limits model scalability.
DiTTo-TTS introduces a novel approach that overcomes these limitations while achieving high performance. The method is built on a Diffusion Transformer (DiT) architecture and integrates a speech length predictor with the following characteristics (a minimal sketch follows the list):
- Predicts the total length of the audio signal for a given text.
- The text encoder processes the text with bidirectional (non-causal) attention.
- The decoder takes the encoded audio tokens from the neural audio codec (NAC) as input and applies a causal attention mask.
- Cross-attention between the encoded text and the audio enables the length prediction.
- It is trained separately from the diffusion model with a cross-entropy loss.
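The sketch below illustrates how such a predictor can be put together in PyTorch. All names and sizes (`SpeechLengthPredictor`, `d_model=512`, `max_len_bins`, a 1024-entry NAC codebook) are illustrative assumptions, not the authors' implementation: a bidirectional Transformer encoder over text, a causally masked decoder over audio codes with cross-attention to the text, and a classification head trained with cross-entropy over discretized lengths.

```python
import torch
import torch.nn as nn

class SpeechLengthPredictor(nn.Module):
    """Sketch of an encoder-decoder length predictor (hypothetical sizes)."""

    def __init__(self, vocab_size, codebook_size=1024, d_model=512,
                 n_heads=8, n_layers=4, max_len_bins=2048):
        super().__init__()
        self.text_emb = nn.Embedding(vocab_size, d_model)
        self.audio_emb = nn.Embedding(codebook_size, d_model)  # NAC code indices
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.text_encoder = nn.TransformerEncoder(enc_layer, n_layers)   # bidirectional
        dec_layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.audio_decoder = nn.TransformerDecoder(dec_layer, n_layers)  # causal + cross-attention
        self.length_head = nn.Linear(d_model, max_len_bins)  # one class per discretized length

    def forward(self, text_tokens, audio_tokens):
        # Bidirectional encoding of the text.
        text_h = self.text_encoder(self.text_emb(text_tokens))
        # Causal mask so each audio position only attends to the past.
        T = audio_tokens.size(1)
        causal = nn.Transformer.generate_square_subsequent_mask(T).to(audio_tokens.device)
        dec_h = self.audio_decoder(self.audio_emb(audio_tokens), text_h, tgt_mask=causal)
        # Logits over discretized total lengths; trained with cross-entropy.
        return self.length_head(dec_h)
```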

Neural Audio Codec (NAC): encodes the audio signal into latent representations aligned with the text, quantizes them, and decodes them back into audio. A minimal sketch of this interface follows the component list below.
Components:
- Encoder
- Vector Quantizer
- Decoder
- Language Model
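
The sketch below shows one possible shape of this interface. It is a simplified, hypothetical codec (single codebook, small convolutional stacks), not the codec used in the paper; the language-model alignment is only mentioned in the docstring.

```python
import torch
import torch.nn as nn

class NeuralAudioCodec(nn.Module):
    """Hypothetical single-codebook codec: encoder -> vector quantizer -> decoder.

    In DiTTo-TTS the codec is additionally trained with a pretrained language
    model so that the audio latents stay semantically aligned with the text;
    that training objective is omitted from this sketch.
    """

    def __init__(self, latent_dim=128, codebook_size=1024):
        super().__init__()
        self.encoder = nn.Sequential(  # placeholder convolutional encoder
            nn.Conv1d(1, latent_dim, kernel_size=7, stride=4, padding=3),
            nn.GELU(),
            nn.Conv1d(latent_dim, latent_dim, kernel_size=7, stride=4, padding=3),
        )
        self.codebook = nn.Embedding(codebook_size, latent_dim)  # vector quantizer
        self.decoder = nn.Sequential(  # placeholder convolutional decoder
            nn.ConvTranspose1d(latent_dim, latent_dim, kernel_size=8, stride=4, padding=2),
            nn.GELU(),
            nn.ConvTranspose1d(latent_dim, 1, kernel_size=8, stride=4, padding=2),
        )

    def quantize(self, z):
        # Nearest-neighbour codebook lookup (straight-through gradients omitted).
        dists = torch.cdist(z.transpose(1, 2), self.codebook.weight.unsqueeze(0))
        codes = dists.argmin(dim=-1)                # (B, T) discrete indices
        z_q = self.codebook(codes).transpose(1, 2)  # (B, latent_dim, T)
        return z_q, codes

    def forward(self, wav):                         # wav: (B, 1, samples)
        z = self.encoder(wav)
        z_q, codes = self.quantize(z)
        return self.decoder(z_q), z_q, codes
```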

- Latent Diffusion Model (DiT): generates speech from the text representation ( z_{text} ) and the audio latents ( z_{speech} ) through a denoising diffusion process.
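
As a rough illustration, a standard DDPM-style training step on the speech latents could look like the following, where `dit_model` is an assumed callable taking the noised latents, the timestep, and the text representation (injected via cross-attention). The linear noise schedule and noise-prediction objective here are generic assumptions and may differ from the paper's exact setup.

```python
import torch
import torch.nn as nn

def diffusion_training_step(dit_model, z_speech, z_text, num_steps=1000):
    """One denoising-diffusion training step on speech latents (sketch)."""
    b = z_speech.size(0)
    t = torch.randint(0, num_steps, (b,), device=z_speech.device)
    # Linear beta schedule (assumed); cumulative product gives alpha_bar_t.
    betas = torch.linspace(1e-4, 0.02, num_steps, device=z_speech.device)
    alphas_bar = torch.cumprod(1.0 - betas, dim=0)[t].view(b, 1, 1)
    # Corrupt the clean speech latents with Gaussian noise at step t.
    noise = torch.randn_like(z_speech)
    z_noisy = alphas_bar.sqrt() * z_speech + (1.0 - alphas_bar).sqrt() * noise
    # The DiT predicts the noise, conditioned on the text via cross-attention.
    pred_noise = dit_model(z_noisy, t, z_text)
    return nn.functional.mse_loss(pred_noise, noise)
```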

- Dataset: Multilingual LibriSpeech (MLS) – 10,000 selected French audio samples with their text transcriptions.
- Preprocessing (a sketch follows this list):
  - Text: tokenization (GPT-2 or ByT5 tokenizer).
  - Audio: resampled to 24 kHz.
- Audio signal reconstruction: BigVGAN vocoder.
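
A possible preprocessing routine is sketched below, assuming the Hugging Face `transformers` tokenizer (ByT5 here; a GPT-2 tokenizer can be loaded the same way with `"gpt2"`) and `torchaudio` for resampling. Reconstructing waveforms from generated features with BigVGAN happens at inference time and is not shown.

```python
import torchaudio
from transformers import AutoTokenizer

TARGET_SR = 24_000
tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")  # byte-level text tokens

def preprocess(wav_path, transcript):
    """Load one MLS sample, resample the audio to 24 kHz, and tokenize the text."""
    wav, sr = torchaudio.load(wav_path)
    if sr != TARGET_SR:
        wav = torchaudio.functional.resample(wav, orig_freq=sr, new_freq=TARGET_SR)
    text_ids = tokenizer(transcript, return_tensors="pt").input_ids
    return wav, text_ids
```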

For more details, refer to the original research or documentation.