-# Emotional-FastSpeech2 - PyTorch Implementation
+# Expressive-FastSpeech2 - PyTorch Implementation
 
 ## Contributions
 
-1. **`Non-autoregressive Emotional TTS`**: This project aims to provide a cornerstone for future research and application of non-autoregressive emotional TTS. For the datasets, [AIHub Multimodal Video AI datasets](https://www.aihub.or.kr/aidata/137) and the [IEMOCAP database](https://sail.usc.edu/iemocap/) are picked for Korean and English, respectively.
+1. **`Non-autoregressive Expressive TTS`**: This project aims to provide a cornerstone for future research and application of non-autoregressive expressive TTS, including `Emotional TTS` and `Conversational TTS`. For the datasets, [AIHub Multimodal Video AI datasets](https://www.aihub.or.kr/aidata/137) and the [IEMOCAP database](https://sail.usc.edu/iemocap/) are picked for Korean and English, respectively.
 2. **`Annotated Data Processing`**: This project sheds light on how to handle a new dataset, even in a different language, for the successful training of non-autoregressive emotional TTS.
 3. **`English and Korean TTS`**: In addition to English, this project gives a broad view of treating Korean for non-autoregressive TTS, where additional data processing must be considered for language-specific features (e.g., training the Montreal Forced Aligner with your own language and dataset). Please look closely into `text/`.
 
-## Model Architecture
+## Repository Structure
 
-<p align="center">
-  <img src="img/model.png" width="80%">
-</p>
+In this project, FastSpeech2 is adapted as a base non-autoregressive multi-speaker TTS framework, so it would be helpful to read [the paper](https://arxiv.org/abs/2006.04558) and [code](https://github.com/ming024/FastSpeech2) first (also see the [FastSpeech2 branch](https://github.com/keonlee9420/Expressive-FastSpeech2/tree/FastSpeech2)).
 
 <p align="center">
-  <img src="img/model_emotional_tts.png" width="80%">
+  <img src="img/model.png" width="80%">
 </p>
 
-This project follows the basic conditioning paradigm of auxiliary inputs in addition to the text input. As presented in [Emotional End-to-End Neural Speech synthesizer](https://arxiv.org/pdf/1711.05447.pdf), the emotion embedding is conditioned at the utterance level. Based on the dataset, emotion, arousal, and valence are employed for the embedding. They are first projected into subspaces and concatenated channel-wise to keep the dependency among each other. The concatenated embedding is then passed through a single linear layer with ReLU activation for fusion and is consumed by the decoder to synthesize speech under the given emotional conditions. In this project, FastSpeech2 is adapted as a base multi-speaker TTS framework, so it would be helpful to read [the paper](https://arxiv.org/abs/2006.04558) and [code](https://github.com/ming024/FastSpeech2) first. There are two variants of the conditioning method:
-
-- `categorical` branch: only conditioning categorical emotional descriptors (such as happy, sad, etc.)
-- `continuous` branch: conditioning continuous emotional descriptors (such as arousal, valence, etc.) in addition to categorical emotional descriptors
-
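The conditioning paragraph above describes projecting emotion, arousal, and valence into subspaces, concatenating them channel-wise, and fusing them with a single linear layer plus ReLU before the decoder consumes the result. The following is a minimal PyTorch sketch of that fusion step; the module name, dimensions, and emotion count are illustrative assumptions, not the repository's actual code.

```python
import torch
import torch.nn as nn


class EmotionFusion(nn.Module):
    """Illustrative utterance-level emotion conditioning (assumed names and sizes,
    not the repository's actual module)."""

    def __init__(self, n_emotions=7, emo_dim=64, av_dim=32, out_dim=256):
        super().__init__()
        self.emotion_emb = nn.Embedding(n_emotions, emo_dim)  # categorical descriptor
        self.arousal_proj = nn.Linear(1, av_dim)              # continuous descriptors
        self.valence_proj = nn.Linear(1, av_dim)
        self.fusion = nn.Sequential(                          # single linear + ReLU fusion
            nn.Linear(emo_dim + 2 * av_dim, out_dim),
            nn.ReLU(),
        )

    def forward(self, emotion_id, arousal, valence):
        # emotion_id: (B,) long tensor; arousal, valence: (B,) float tensors
        e = self.emotion_emb(emotion_id)              # (B, emo_dim)
        a = self.arousal_proj(arousal.unsqueeze(-1))  # (B, av_dim)
        v = self.valence_proj(valence.unsqueeze(-1))  # (B, av_dim)
        # channel-wise concatenation keeps the descriptors' mutual dependency
        fused = self.fusion(torch.cat([e, a, v], dim=-1))  # (B, out_dim)
        return fused  # broadcast over time and added to the decoder input


if __name__ == "__main__":
    fusion = EmotionFusion()
    out = fusion(torch.tensor([3]), torch.tensor([0.2]), torch.tensor([-0.4]))
    print(out.shape)  # torch.Size([1, 256])
```

Under this sketch, the `categorical` branch would keep only the embedding lookup, while the `continuous` branch would also feed the arousal/valence projections, matching the two variants listed above.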
-# Dependencies
-
-Please install the Python dependencies given in `requirements.txt`.
-
-```bash
-pip3 install -r requirements.txt
-```
-
-# Synthesize Using Pre-trained Model
-
-We are not permitted to share the pre-trained models publicly due to the copyright of [AIHub Multimodal Video AI datasets](https://www.aihub.or.kr/aidata/137) and the [IEMOCAP database](https://sail.usc.edu/iemocap/).
-
-# Train
-
-## Data Preparation
-
-### Korean (Video → Audio)
-
-1. Download [AIHub Multimodal Video AI datasets](https://www.aihub.or.kr/aidata/137) and set `corpus_path` in `config/AIHub-MMV/preprocess.yaml`. You must get permission to download the dataset.
-2. Since the dataset contains raw videos, you need to convert and split each video clip into an audio utterance. For that, the following script will convert files from `.mp4` to `.wav` and then split each clip based on the `.json` metadata file. It also builds `filelist.txt` and `speaker_info.txt`.
-
-   ```bash
-   python3 prepare_data.py --extract_audio -p config/AIHub-MMV/preprocess.yaml
-   ```
-
-3. Update `corpus_path` to the preprocessed data path, e.g., from `AIHub-MMV` to `AIHub-MMV_preprocessed`.
-
-### English (Audio)
-
-1. Download the [IEMOCAP database](https://sail.usc.edu/iemocap/) and set `corpus_path` in `config/IEMOCAP/preprocess.yaml`. You must get permission to download the dataset.
-
-## Preprocess
-
-### Korean
-
-1. With the prepared dataset, set up some prerequisites. The following command will process the audio and transcripts. The transcripts are normalized to Korean graphemes by `korean_cleaners` in `text/cleaners.py`. The results will be located at the `raw_path` defined in `config/AIHub-MMV/preprocess.yaml`.
-
-   ```bash
-   python3 prepare_align.py config/AIHub-MMV/preprocess.yaml
-   ```
-
-2. As in FastSpeech2, the [Montreal Forced Aligner](https://montreal-forced-aligner.readthedocs.io/en/latest/) (MFA) is used to obtain the alignments between the utterances and the phoneme sequences. Download and set up the environment to use MFA following the [official document](https://montreal-forced-aligner.readthedocs.io/en/latest/installation.html). The version used in this project is `2.0.0a13`.
-
-   You can get alignments either by training MFA from scratch or by using a pre-trained model. Note that training MFA may take several hours or days, depending on the corpus size.
-
-   ### Train MFA from scratch
-
-   To train MFA, a grapheme-phoneme dictionary that covers all the words in the dataset is required. The following command will generate such a dictionary in `lexicon/`.
+1. `Emotional TTS`: The following branches contain implementations of the basic paradigm introduced by [Emotional End-to-End Neural Speech synthesizer](https://arxiv.org/pdf/1711.05447.pdf).
 
-   ```bash
-   python3 prepare_data.py --extract_lexicon -p config/AIHub-MMV/preprocess.yaml
-   ```
+   <p align="center">
+   <img src="img/model_emotional_tts.png" width="80%">
+   </p>
 
-   After that, train MFA.
+   - [categorical branch](https://github.com/keonlee9420/Expressive-FastSpeech2/tree/categorical): only conditioning categorical emotional descriptors (such as happy, sad, etc.)
+   - [continuous branch](https://github.com/keonlee9420/Expressive-FastSpeech2/tree/continuous): conditioning continuous emotional descriptors (such as arousal, valence, etc.) in addition to categorical emotional descriptors
+2. `Conversational TTS`: The following branch contains an implementation of [Conversational End-to-End TTS for Voice Agent](https://arxiv.org/abs/2005.10438).
 
-   ```bash
-   mfa train ./raw_data/AIHub-MMV/clips lexicon/aihub-mmv-lexicon.txt preprocessed_data/AIHub-MMV/TextGrid --output_model_path montreal-forced-aligner/aihub-mmv-aligner --speaker_characters prosodylab -j 8 --clean
-   ```
+   <p align="center">
+   <img src="img/model_conversational_tts.png" width="80%">
+   </p>
 
-   It will generate both the TextGrid files in `preprocessed_data/AIHub-MMV/TextGrid/` and the trained models in `montreal-forced-aligner/`. See the [official document](https://montreal-forced-aligner.readthedocs.io/en/latest/aligning.html#align-using-only-the-data-set) for details.
+   - [conversational branch](https://github.com/keonlee9420/Expressive-FastSpeech2/tree/conversational): conditioning chat history
 
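One generic way to realize the `conversational` branch's idea of conditioning on chat history is to encode each previous turn into a fixed-size vector and summarize the sequence with a GRU, as sketched below. The names and dimensions are assumptions for illustration; this is not the branch's actual code, nor necessarily the exact scheme of the cited paper.

```python
import torch
import torch.nn as nn


class ChatHistoryEncoder(nn.Module):
    """Generic chat-history conditioning sketch (illustrative assumptions only).

    Each past utterance is assumed to be pre-encoded into a fixed-size vector
    (e.g., by a sentence encoder); a GRU summarizes the history, and its final
    hidden state serves as an utterance-level context condition.
    """

    def __init__(self, utt_dim=256, hidden_dim=256):
        super().__init__()
        self.gru = nn.GRU(utt_dim, hidden_dim, batch_first=True)

    def forward(self, history):        # history: (B, n_turns, utt_dim)
        _, h_n = self.gru(history)     # h_n: (1, B, hidden_dim)
        return h_n.squeeze(0)          # (B, hidden_dim) context vector


if __name__ == "__main__":
    encoder = ChatHistoryEncoder()
    context = encoder(torch.randn(2, 5, 256))  # 2 dialogs, 5 past turns each
    print(context.shape)  # torch.Size([2, 256])
```

The resulting context vector could then be fused with the speaker and emotion conditions at the utterance level in the same way as sketched earlier.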
-   ### Using Pre-trained Models
-
-   If you want to re-align the dataset using the extracted lexicon dictionary and the trained MFA models from the previous step, run the following command.
-
-   ```bash
-   mfa align ./raw_data/AIHub-MMV/clips lexicon/aihub-mmv-lexicon.txt montreal-forced-aligner/aihub-mmv-aligner.zip preprocessed_data/AIHub-MMV/TextGrid --speaker_characters prosodylab -j 8 --clean
-   ```
-
-   It will generate the TextGrid files in `preprocessed_data/AIHub-MMV/TextGrid/`. See the [official document](https://montreal-forced-aligner.readthedocs.io/en/latest/aligning.html#align-using-pretrained-models) for details.
-
-3. Finally, run the preprocessing script. It will extract and save duration, energy, mel-spectrogram, and pitch in `preprocessed_data/AIHub-MMV/` from each audio file.
-
-   ```bash
-   python3 preprocess.py config/AIHub-MMV/preprocess.yaml
-   ```
-
-### English
-
-1. With the prepared dataset, set up some prerequisites. The following command will process the audio and transcripts. The transcripts are normalized to English graphemes by `english_cleaners` in `text/cleaners.py`. The results will be located at the `raw_path` defined in `config/IEMOCAP/preprocess.yaml`.
-
-   ```bash
-   python3 prepare_align.py config/IEMOCAP/preprocess.yaml
-   ```
-
-2. As in FastSpeech2, the [Montreal Forced Aligner](https://montreal-forced-aligner.readthedocs.io/en/latest/) (MFA) is used to obtain the alignments between the utterances and the phoneme sequences. Download and set up the environment to use MFA following the [official document](https://montreal-forced-aligner.readthedocs.io/en/latest/installation.html). The version used in this project is `2.0.0a13`.
-
-   You can get alignments either by training MFA from scratch or by using a pre-trained model. Note that training MFA may take several hours or days, depending on the corpus size.
-
-   ### Train MFA from scratch
-
-   To train MFA, a grapheme-phoneme dictionary that covers all the words in the dataset is required. The following command will generate such a dictionary in `lexicon/`.
-
-   ```bash
-   python3 prepare_data.py --extract_lexicon -p config/IEMOCAP/preprocess.yaml
-   ```
-
-   After that, train MFA.
-
-   ```bash
-   mfa train ./raw_data/IEMOCAP/sessions lexicon/iemocap-lexicon.txt preprocessed_data/IEMOCAP/TextGrid --output_model_path montreal-forced-aligner/iemocap-aligner --speaker_characters prosodylab -j 8 --clean
-   ```
-
-   It will generate both the TextGrid files in `preprocessed_data/IEMOCAP/TextGrid/` and the trained models in `montreal-forced-aligner/`. See the [official document](https://montreal-forced-aligner.readthedocs.io/en/latest/aligning.html#align-using-only-the-data-set) for details.
-
-   ### Using Pre-trained Models
-
-   If you want to re-align the dataset using the extracted lexicon dictionary and the trained MFA models from the previous step, run the following command.
-
-   ```bash
-   mfa align ./raw_data/IEMOCAP/sessions lexicon/iemocap-lexicon.txt montreal-forced-aligner/iemocap-aligner.zip preprocessed_data/IEMOCAP/TextGrid --speaker_characters prosodylab -j 8 --clean
-   ```
-
-   It will generate the TextGrid files in `preprocessed_data/IEMOCAP/TextGrid/`. See the [official document](https://montreal-forced-aligner.readthedocs.io/en/latest/aligning.html#align-using-pretrained-models) for details.
-
-3. Finally, run the preprocessing script. It will extract and save duration, energy, mel-spectrogram, and pitch in `preprocessed_data/IEMOCAP/` from each audio file.
-
-   ```bash
-   python3 preprocess.py config/IEMOCAP/preprocess.yaml
-   ```
-
-## Model Training
-
-Now you have all the prerequisites! Train the model using the following command:
-
-### Korean
-
-```bash
-python3 train.py -p config/AIHub-MMV/preprocess.yaml -m config/AIHub-MMV/model.yaml -t config/AIHub-MMV/train.yaml
-```
+## Citation
 
-### English
+If you would like to use or refer to this implementation, please cite the repo.
 
 ```bash
-python3 train.py -p config/IEMOCAP/preprocess.yaml -m config/IEMOCAP/model.yaml -t config/IEMOCAP/train.yaml
+@misc{expressive_fastspeech22020,
+  author = {Lee, Keon},
+  title = {Expressive-FastSpeech2},
+  year = {2021},
+  publisher = {GitHub},
+  journal = {GitHub repository},
+  howpublished = {\url{https://github.com/keonlee9420/Expressive-FastSpeech2}}
+}
 ```
 
-# Inference
-
-### Korean
-
-To synthesize a single utterance, try
-
-```bash
-python3 synthesize.py --text "YOUR_DESIRED_TEXT" --speaker_id SPEAKER_ID --emotion_id EMOTION_ID --arousal AROUSAL --valence VALENCE --restore_step STEP --mode single -p config/AIHub-MMV/preprocess.yaml -m config/AIHub-MMV/model.yaml -t config/AIHub-MMV/train.yaml
-```
-
-All IDs can be found in the dictionary files (JSON files) in `preprocessed_data/AIHub-MMV/`, and the generated utterances will be put in `output/result/AIHub-MMV`.
-
-Batch inference is also supported; try
-
-```bash
-python3 synthesize.py --source preprocessed_data/AIHub-MMV/val.txt --restore_step STEP --mode batch -p config/AIHub-MMV/preprocess.yaml -m config/AIHub-MMV/model.yaml -t config/AIHub-MMV/train.yaml
-```
-
-to synthesize all utterances in `preprocessed_data/AIHub-MMV/val.txt`.
-
-### English
-
-To synthesize a single utterance, try
-
-```bash
-python3 synthesize.py --text "YOUR_DESIRED_TEXT" --speaker_id SPEAKER_ID --emotion_id EMOTION_ID --arousal AROUSAL --valence VALENCE --restore_step STEP --mode single -p config/IEMOCAP/preprocess.yaml -m config/IEMOCAP/model.yaml -t config/IEMOCAP/train.yaml
-```
-
-All IDs can be found in the dictionary files (JSON files) in `preprocessed_data/IEMOCAP/`, and the generated utterances will be put in `output/result/IEMOCAP`.
-
-Batch inference is also supported; try
-
-```bash
-python3 synthesize.py --source preprocessed_data/IEMOCAP/val.txt --restore_step STEP --mode batch -p config/IEMOCAP/preprocess.yaml -m config/IEMOCAP/model.yaml -t config/IEMOCAP/train.yaml
-```
-
-to synthesize all utterances in `preprocessed_data/IEMOCAP/val.txt`.
-
-# TensorBoard
-
-Use
-
-```bash
-tensorboard --logdir output/log
-```
-
-to serve TensorBoard on your localhost. The loss curves, synthesized mel-spectrograms, and audio samples are shown.
-
-<p align="center">
-  <img src="img/emotional-fastspeech2-scalars.png" width="100%">
-</p>
-
-<p align="center">
-  <img src="img/emotional-fastspeech2-images.png" width="100%">
-</p>
-
-<p align="center">
-  <img src="img/emotional-fastspeech2-audios.png" width="100%">
-</p>
-
-# Notes
-
-### Implementation Issues
-
-- (For Korean) Since the separator is learned only with 'sp' by MFA's nature ([official document](https://montreal-forced-aligner.readthedocs.io/en/latest/data_format.html#transcription-normalization-and-dictionary-lookup)), spacing becomes a critical issue. Therefore, after text normalization, the spacing is polished using a third-party module. The candidates were [PyKoSpacing](https://github.com/haven-jeon/PyKoSpacing) and [QuickSpacer](https://github.com/psj8252/quickspacer), but the latter was selected due to its higher accuracy (fewer errors than PyKoSpacing).
-- Some incorrect transcriptions can be fixed manually from `preparation/*_fixed.txt` during the run of `prepare_align.py`. Even after that, you can still expand `preparation/*_fixed.txt` with additional corrections and run the following command to apply them. It will update the raw text data and `filelist.txt` in `raw_path`, and the lexicon dictionary in `lexicon/`.
-
-  For Korean,
-
-  ```bash
-  python3 prepare_data.py --apply_fixed_text -p config/AIHub-MMV/preprocess.yaml
-  ```
-
-  For English,
-
-  ```bash
-  python3 prepare_data.py --apply_fixed_text -p config/IEMOCAP/preprocess.yaml
-  ```
-
-  Note that this should be done after running `prepare_align.py` at least once and before MFA alignment.
-
-- Also, some incorrect emotion labels, such as out-of-range values for either arousal or valence, are fixed manually (a simple range check is sketched after this list). These must be updated to build an efficient emotion embedding space.
-- I empirically found that the `TextGrid` files extracted during the training process are aligned worse than those obtained by re-aligning with the trained model after the first training. I'm not sure about the reason, but I can confirm that it's better to re-align the dataset using your trained model after finishing the first training, especially when there are too many unaligned corpora. You can also enlarge `beam` and `retry_beam` following this [issue](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner/issues/240#issuecomment-791172411) and the [official document](https://montreal-forced-aligner.readthedocs.io/en/latest/configuration_align.html#global-options) to get more of the corpus aligned at the cost of accuracy.
-
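A simple range check, as sketched below, can surface the out-of-range arousal/valence labels mentioned in the note above. The `(basename, arousal, valence)` tuple format and the `[1, 5]` scale are assumptions for illustration only and should be adapted to the actual annotation format of the dataset.

```python
# Hypothetical helper: flag arousal/valence annotations outside an assumed [1, 5] scale.
# The (basename, arousal, valence) tuple format is an illustrative assumption,
# not the repository's actual filelist format.
def find_out_of_range(annotations, lo=1.0, hi=5.0):
    bad = []
    for basename, arousal, valence in annotations:
        if not (lo <= arousal <= hi) or not (lo <= valence <= hi):
            bad.append((basename, arousal, valence))
    return bad


if __name__ == "__main__":
    sample = [("Ses01F_impro01_F000", 2.5, 3.0), ("Ses01F_impro01_F001", 6.0, 2.0)]
    for entry in find_out_of_range(sample):
        print("needs manual fixing:", entry)
```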
-### Training with your own dataset (own language)
-
-- First, you need to transliterate the dataset by fitting the `normalize()` function in `text/korean.py` and the dictionary in `text/korean_dict.py`. If you are interested in adapting another language, you may need to prepare a grapheme-to-phoneme converter for that language.
-- Get the files that contain the words to be checked manually with the following command. The results will be saved at `corpus_path/non*.txt`.
-
-  For Korean,
-
-  ```bash
-  python3 prepare_data.py --extract_nonkr -p config/AIHub-MMV/preprocess.yaml
-  ```
-
-  For English,
-
-  ```bash
-  python3 prepare_data.py --extract_nonen -p config/IEMOCAP/preprocess.yaml
-  ```
-
-  Based on it, prepare the correction filelist in `preparation/` just like `*_fixed.txt`.
-
-- Then, follow the Train section starting from Preprocess.
-
-# References
+## References
 
-* [ming024's FastSpeech2](https://github.com/ming024/FastSpeech2) (version after the 2021.02.26 updates)
-* [HGU-DLLAB's Korean-FastSpeech2-Pytorch](https://github.com/HGU-DLLAB/Korean-FastSpeech2-Pytorch)
-* [hccho2's Tacotron2-Wavenet-Korean-TTS](https://github.com/hccho2/Tacotron2-Wavenet-Korean-TTS)
-* [carpedm20's multi-speaker-tacotron-tensorflow](https://github.com/carpedm20/multi-speaker-tacotron-tensorflow)
+- [ming024's FastSpeech2](https://github.com/ming024/FastSpeech2) (later than the 2021.02.26 version)
+- [HGU-DLLAB's Korean-FastSpeech2-Pytorch](https://github.com/HGU-DLLAB/Korean-FastSpeech2-Pytorch)
+- [hccho2's Tacotron2-Wavenet-Korean-TTS](https://github.com/hccho2/Tacotron2-Wavenet-Korean-TTS)
+- [carpedm20's multi-speaker-tacotron-tensorflow](https://github.com/carpedm20/multi-speaker-tacotron-tensorflow)