Implemented by: Abdelkrim Halimi, Walid Ghenaiet, Melissa Dahlia Attabi
Original paper by: Keon Lee, Dong Won Kim, Jaehyeon Kim, Seungjun Chung, Jaewoong Cho
Latent Diffusion Models (LDMs) have shown high performance in various tasks such as image, audio, and video generation. When applied to Text-To-Speech (TTS), these models typically rely on domain-specific factors (e.g., phonemes and durations) to ensure proper temporal alignment between text and speech. This dependency complicates data preparation and limits model scalability.
DiTTo-TTS introduces a novel approach that overcomes these limitations while achieving high performance. The method is built on a Diffusion Transformer (DiT) architecture and integrates a speech length predictor with the following characteristics (a minimal sketch follows the list):
- Predicts the total length of the audio signal for a given text.
- The text encoder processes the text with bidirectional (non-causal) attention.
- The decoder takes the encoded audio tokens from the neural audio codec (NAC) as input and applies a causal attention mask.
- Cross-attention between the encoded text and the audio enables the length prediction.
- It is trained separately from the diffusion model with a cross-entropy loss.
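The sketch below illustrates how such a predictor can be put together in PyTorch. All names and sizes (`SpeechLengthPredictor`, `d_model=512`, `max_len_bins`, a 1024-entry NAC codebook) are illustrative assumptions, not the authors' implementation: a bidirectional Transformer encoder over text, a causally masked decoder over audio codes with cross-attention to the text, and a classification head trained with cross-entropy over discretized lengths.

```python
import torch
import torch.nn as nn

class SpeechLengthPredictor(nn.Module):
    """Sketch of an encoder-decoder length predictor (hypothetical sizes)."""

    def __init__(self, vocab_size, codebook_size=1024, d_model=512,
                 n_heads=8, n_layers=4, max_len_bins=2048):
        super().__init__()
        self.text_emb = nn.Embedding(vocab_size, d_model)
        self.audio_emb = nn.Embedding(codebook_size, d_model)  # NAC code indices
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.text_encoder = nn.TransformerEncoder(enc_layer, n_layers)   # bidirectional
        dec_layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.audio_decoder = nn.TransformerDecoder(dec_layer, n_layers)  # causal + cross-attention
        self.length_head = nn.Linear(d_model, max_len_bins)  # one class per discretized length

    def forward(self, text_tokens, audio_tokens):
        # Bidirectional encoding of the text.
        text_h = self.text_encoder(self.text_emb(text_tokens))
        # Causal mask so each audio position only attends to the past.
        T = audio_tokens.size(1)
        causal = nn.Transformer.generate_square_subsequent_mask(T).to(audio_tokens.device)
        dec_h = self.audio_decoder(self.audio_emb(audio_tokens), text_h, tgt_mask=causal)
        # Logits over discretized total lengths; trained with cross-entropy.
        return self.length_head(dec_h)
```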

Neural Audio Codec (NAC): encodes the audio signal into latent representations aligned with the text, quantizes them, and decodes them back into audio. A minimal sketch of this interface follows the component list below.
Components:
- Encoder
- Vector Quantizer
- Decoder
- Language Model
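
The sketch below shows one possible shape of this interface. It is a simplified, hypothetical codec (single codebook, small convolutional stacks), not the codec used in the paper; the language-model alignment is only mentioned in the docstring.

```python
import torch
import torch.nn as nn

class NeuralAudioCodec(nn.Module):
    """Hypothetical single-codebook codec: encoder -> vector quantizer -> decoder.

    In DiTTo-TTS the codec is additionally trained with a pretrained language
    model so that the audio latents stay semantically aligned with the text;
    that training objective is omitted from this sketch.
    """

    def __init__(self, latent_dim=128, codebook_size=1024):
        super().__init__()
        self.encoder = nn.Sequential(  # placeholder convolutional encoder
            nn.Conv1d(1, latent_dim, kernel_size=7, stride=4, padding=3),
            nn.GELU(),
            nn.Conv1d(latent_dim, latent_dim, kernel_size=7, stride=4, padding=3),
        )
        self.codebook = nn.Embedding(codebook_size, latent_dim)  # vector quantizer
        self.decoder = nn.Sequential(  # placeholder convolutional decoder
            nn.ConvTranspose1d(latent_dim, latent_dim, kernel_size=8, stride=4, padding=2),
            nn.GELU(),
            nn.ConvTranspose1d(latent_dim, 1, kernel_size=8, stride=4, padding=2),
        )

    def quantize(self, z):
        # Nearest-neighbour codebook lookup (straight-through gradients omitted).
        dists = torch.cdist(z.transpose(1, 2), self.codebook.weight.unsqueeze(0))
        codes = dists.argmin(dim=-1)                # (B, T) discrete indices
        z_q = self.codebook(codes).transpose(1, 2)  # (B, latent_dim, T)
        return z_q, codes

    def forward(self, wav):                         # wav: (B, 1, samples)
        z = self.encoder(wav)
        z_q, codes = self.quantize(z)
        return self.decoder(z_q), z_q, codes
```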

- Latent Diffusion Model (DiT): generates speech from the text representation ( z_{text} ) and the audio latents ( z_{speech} ) through a denoising diffusion process.
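
As a rough illustration, a standard DDPM-style training step on the speech latents could look like the following, where `dit_model` is an assumed callable taking the noised latents, the timestep, and the text representation (injected via cross-attention). The linear noise schedule and noise-prediction objective here are generic assumptions and may differ from the paper's exact setup.

```python
import torch
import torch.nn as nn

def diffusion_training_step(dit_model, z_speech, z_text, num_steps=1000):
    """One denoising-diffusion training step on speech latents (sketch)."""
    b = z_speech.size(0)
    t = torch.randint(0, num_steps, (b,), device=z_speech.device)
    # Linear beta schedule (assumed); cumulative product gives alpha_bar_t.
    betas = torch.linspace(1e-4, 0.02, num_steps, device=z_speech.device)
    alphas_bar = torch.cumprod(1.0 - betas, dim=0)[t].view(b, 1, 1)
    # Corrupt the clean speech latents with Gaussian noise at step t.
    noise = torch.randn_like(z_speech)
    z_noisy = alphas_bar.sqrt() * z_speech + (1.0 - alphas_bar).sqrt() * noise
    # The DiT predicts the noise, conditioned on the text via cross-attention.
    pred_noise = dit_model(z_noisy, t, z_text)
    return nn.functional.mse_loss(pred_noise, noise)
```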

- Dataset: Multilingual LibriSpeech (MLS) – 10,000 selected French audio samples with their text transcriptions.
- Preprocessing (a sketch follows this list):
  - Text: tokenization (GPT-2 or ByT5 tokenizer).
  - Audio: resampled to 24 kHz.
- Audio signal reconstruction: BigVGAN vocoder.
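
A possible preprocessing routine is sketched below, assuming the Hugging Face `transformers` tokenizer (ByT5 here; a GPT-2 tokenizer can be loaded the same way with `"gpt2"`) and `torchaudio` for resampling. Reconstructing waveforms from generated features with BigVGAN happens at inference time and is not shown.

```python
import torchaudio
from transformers import AutoTokenizer

TARGET_SR = 24_000
tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")  # byte-level text tokens

def preprocess(wav_path, transcript):
    """Load one MLS sample, resample the audio to 24 kHz, and tokenize the text."""
    wav, sr = torchaudio.load(wav_path)
    if sr != TARGET_SR:
        wav = torchaudio.functional.resample(wav, orig_freq=sr, new_freq=TARGET_SR)
    text_ids = tokenizer(transcript, return_tensors="pt").input_ids
    return wav, text_ids
```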

For more details, refer to the original research or documentation.