Intra- and Inter-modal Context Interaction Modeling for Conversational Speech Synthesis (III-CSS)

Introduction

This repository implements the paper "Intra- and Inter-modal Context Interaction Modeling for Conversational Speech Synthesis" (accepted by ICASSP 2025).

Zhenqi Jia, Rui Liu

Corresponding Author: Rui Liu

Demo Page

Speech Demo

Dataset

You can download the dataset from DailyTalk.

Pre-trained models

The Hugging Face URL of Sentence-BERT: https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v1

The Hugging Face URL of Wav2Vec2-IEMOCAP: https://huggingface.co/speechbrain/emotion-recognition-wav2vec2-IEMOCAP

Preprocessing

First, run

python3 prepare_align.py --dataset DailyTalk

to prepare the data for forced alignment.

Forced alignment uses the Montreal Forced Aligner (MFA) to align each utterance with its phoneme sequence. Pre-extracted alignments for the dataset are provided here; unzip the files into preprocessed_data/DailyTalk/TextGrid/. Alternatively, you can run the aligner yourself. Note that our pretrained models are not trained with supervised duration modeling (they are trained with learn_alignment: True).
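Before preprocessing, it can help to confirm that the alignments were unpacked where the later steps expect them. The following is a minimal sketch, not part of this repository: the helper name `count_textgrids` is hypothetical, and it assumes the archive unpacks to `.TextGrid` files somewhere under preprocessed_data/DailyTalk/TextGrid/. The subfolder layout is not specified here, so the search is recursive.

```python
from pathlib import Path


def count_textgrids(root="preprocessed_data/DailyTalk/TextGrid"):
    """Return the number of .TextGrid files found anywhere under root.

    Hypothetical helper: the directory comes from this README, while the
    recursive search and the .TextGrid suffix are assumptions.
    """
    root = Path(root)
    if not root.is_dir():
        # Nothing has been unzipped (or the path is wrong).
        return 0
    return sum(1 for p in root.rglob("*.TextGrid") if p.is_file())
```

Calling `count_textgrids()` from the repository root and getting 0 would suggest that the alignments still need to be unzipped or produced with MFA.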

After that, run the preprocessing script:

python3 preprocess.py --dataset DailyTalk

Training

Train III-CSS with

python3 train.py --dataset DailyTalk
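The three commands above run in a fixed order: prepare_align.py, then preprocess.py, then train.py. As a minimal sketch, they could be chained so that a failing step stops the run. The wrapper and the names `pipeline_commands` and `run_pipeline` are hypothetical and not part of this repository. It also assumes the alignments are already in preprocessed_data/DailyTalk/TextGrid/ before preprocess.py starts; the wrapper does not unzip them.

```python
import subprocess
import sys

# Script names and the --dataset flag are taken from the commands above.
STEPS = ("prepare_align.py", "preprocess.py", "train.py")


def pipeline_commands(dataset="DailyTalk", python=sys.executable):
    """Build one argument list per step, in the order the steps must run."""
    return [[python, script, "--dataset", dataset] for script in STEPS]


def run_pipeline(dataset="DailyTalk"):
    """Run each step in turn; check=True raises on the first failing step."""
    for cmd in pipeline_commands(dataset):
        print("Running:", " ".join(cmd))
        subprocess.run(cmd, check=True)
```

Calling `run_pipeline()` from the repository root would start the three steps one after another; `pipeline_commands()` alone only builds the argument lists, which is useful for a dry run.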

Inference

Only batch inference is supported, because generating a turn may require the contextual history of the conversation. Run

python3 synthesize.py --source preprocessed_data/DailyTalk/test_*.txt --restore_step RESTORE_STEP --mode batch --dataset DailyTalk

to synthesize all utterances in preprocessed_data/DailyTalk/test_*.txt.
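Because the `--source` path contains a wildcard, the shell expands it before synthesize.py sees it. As a sketch under assumptions, the matching files can be listed explicitly and one command built for each. The helper `synthesis_commands` is hypothetical, and issuing one call per matched file is an assumption, since how synthesize.py handles several `--source` values is not stated here. RESTORE_STEP remains a placeholder for the checkpoint step you want to restore.

```python
import glob
import sys


def synthesis_commands(pattern, restore_step, dataset="DailyTalk",
                       python=sys.executable):
    """Return one synthesize.py argument list per file matching pattern.

    The flags mirror the command shown above; matches are sorted so the
    order is reproducible across runs.
    """
    commands = []
    for source in sorted(glob.glob(pattern)):
        commands.append([
            python, "synthesize.py",
            "--source", source,
            "--restore_step", str(restore_step),
            "--mode", "batch",
            "--dataset", dataset,
        ])
    return commands
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root. An empty result means the pattern matched no files, which is worth checking before starting a long synthesis run.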

Citation

If you would like to use our dataset and code or refer to our paper, please cite as follows.

@INPROCEEDINGS{10890216,
  author={Jia, Zhenqi and Liu, Rui},
  booktitle={ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, 
  title={Intra- and Inter-modal Context Interaction Modeling for Conversational Speech Synthesis}, 
  year={2025},
  volume={},
  number={},
  pages={1-5},
  keywords={Training;Codes;Speech coding;Signal processing;Acoustics;Speech synthesis;History;Context modeling;Conversational Speech Synthesis;Contrastive Learning;Conversational Prosody;Intra-modal Interaction;Inter-modal Interaction},
  doi={10.1109/ICASSP49660.2025.10890216}}

Contact the Author

E-mail: jiazhenqi7@163.com

Homepage: https://coder-jzq.github.io/

S2LAB Homepage: https://ttslr.github.io/

