This is an implementation of the paper "Intra- and Inter-modal Context Interaction Modeling for Conversational Speech Synthesis" (accepted by ICASSP 2025).
Corresponding Author: Rui Liu
You can download the dataset from DailyTalk.
The Hugging Face URL of Sentence-BERT: https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v1
The Hugging Face URL of Wav2Vec2-IEMOCAP: https://huggingface.co/speechbrain/emotion-recognition-wav2vec2-IEMOCAP
First, run the preparation script with
python3 prepare_align.py --dataset DailyTalk
For forced alignment, Montreal Forced Aligner (MFA) is used to obtain the alignments between the utterances and the phoneme sequences. Pre-extracted alignments for the datasets are provided here. Unzip the files into preprocessed_data/DailyTalk/TextGrid/. Alternatively, you can run the aligner yourself. Please note that our pretrained models are not trained with supervised duration modeling (they are trained with learn_alignment: True).
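Before moving on to preprocessing, it can help to confirm that the alignment files are in the expected directory. The following is a minimal sketch using only the Python standard library. The helper name count_textgrids is hypothetical and not part of this repository, and the sketch assumes the alignment files carry the .TextGrid extension and may sit in subdirectories (hence the recursive search).

```python
from pathlib import Path


def count_textgrids(textgrid_dir="preprocessed_data/DailyTalk/TextGrid"):
    """Return the number of .TextGrid files found under textgrid_dir.

    Hypothetical helper, not part of the repository. The search is recursive
    because the archive layout (flat or nested) is an assumption here.
    """
    root = Path(textgrid_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"Alignment directory not found: {root}")
    return sum(1 for _ in root.rglob("*.TextGrid"))


if __name__ == "__main__":
    try:
        print(f"Found {count_textgrids()} TextGrid files")
    except FileNotFoundError as err:
        print(err)
```

A count of zero, or a missing directory, usually means the archive was unzipped somewhere else.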
After that, run the preprocessing script with
python3 preprocess.py --dataset DailyTalk
Train III-CSS with
python3 train.py --dataset DailyTalk
Only batch inference is supported, because generating a turn may require the contextual history of the conversation. Try
python3 synthesize.py --source preprocessed_data/DailyTalk/test_*.txt --restore_step RESTORE_STEP --mode batch --dataset DailyTalk
to synthesize all utterances in preprocessed_data/DailyTalk/test_*.txt.
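To check which files the test_*.txt pattern matches before starting synthesis, a small standard-library sketch such as the following may be useful. The function name list_test_sources is hypothetical and not part of this repository. The sketch only lists matching paths and does not call synthesize.py.

```python
import glob
import os


def list_test_sources(data_dir="preprocessed_data/DailyTalk"):
    """Return the sorted paths under data_dir that match test_*.txt.

    Hypothetical helper, not part of the repository.
    """
    pattern = os.path.join(data_dir, "test_*.txt")
    return sorted(glob.glob(pattern))


if __name__ == "__main__":
    matches = list_test_sources()
    if not matches:
        print("No files match preprocessed_data/DailyTalk/test_*.txt")
    for path in matches:
        print(path)
```

An empty result suggests that the preprocessing step has not been run yet or that it wrote its output to a different directory.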
If you would like to use our dataset and code or refer to our paper, please cite as follows.
@INPROCEEDINGS{10890216,
author={Jia, Zhenqi and Liu, Rui},
booktitle={ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
title={Intra- and Inter-modal Context Interaction Modeling for Conversational Speech Synthesis},
year={2025},
volume={},
number={},
pages={1-5},
keywords={Training;Codes;Speech coding;Signal processing;Acoustics;Speech synthesis;History;Context modeling;Conversational Speech Synthesis;Contrastive Learning;Conversational Prosody;Intra-modal Interaction;Inter-modal Interaction},
doi={10.1109/ICASSP49660.2025.10890216}}
E-mail: jiazhenqi7@163.com
Homepage: https://coder-jzq.github.io/
S2LAB Homepage: https://ttslr.github.io/