
Commit a7fb06f

update README.md

1 parent 464a6d0 · commit a7fb06f

3 files changed: +31 −249 lines

README.md

Lines changed: 31 additions & 249 deletions
# Expressive-FastSpeech2 - PyTorch Implementation

## Contributions

1. **`Non-autoregressive Expressive TTS`**: This project aims to provide a cornerstone for future research on and applications of non-autoregressive expressive TTS, including `Emotional TTS` and `Conversational TTS`. As datasets, the [AIHub Multimodal Video AI datasets](https://www.aihub.or.kr/aidata/137) and the [IEMOCAP database](https://sail.usc.edu/iemocap/) are picked for Korean and English, respectively.
2. **`Annotated Data Processing`**: This project sheds light on how to handle a new dataset, even in a different language, for the successful training of non-autoregressive emotional TTS.
3. **`English and Korean TTS`**: In addition to English, this project shows how to handle Korean for non-autoregressive TTS, where additional data processing must account for language-specific features (e.g., training the Montreal Forced Aligner with your own language and dataset). Please look closely into `text/`.

## Repository Structure

In this project, FastSpeech2 is adapted as a base non-autoregressive multi-speaker TTS framework, so it would be helpful to read [the paper](https://arxiv.org/abs/2006.04558) and [code](https://github.com/ming024/FastSpeech2) first (also see the [FastSpeech2 branch](https://github.com/keonlee9420/Expressive-FastSpeech2/tree/FastSpeech2)).

<p align="center">
    <img src="img/model.png" width="80%">
</p>

1. `Emotional TTS`: The following branches contain implementations of the basic paradigm introduced by [Emotional End-to-End Neural Speech Synthesizer](https://arxiv.org/pdf/1711.05447.pdf). It follows the conditioning paradigm of auxiliary inputs in addition to the text input: an emotion embedding is conditioned at the utterance level. Based on the dataset, emotion, arousal, and valence are employed for the embedding. They are first projected into subspaces and concatenated channel-wise to keep the dependencies among them. The concatenated embedding is then passed through a single linear layer with ReLU activation for fusion and consumed by the decoder to synthesize speech under the given emotional conditions (see the sketch after this list).

    <p align="center">
        <img src="img/model_emotional_tts.png" width="80%">
    </p>

    - [categorical branch](https://github.com/keonlee9420/Expressive-FastSpeech2/tree/categorical): conditions only categorical emotional descriptors (such as happy, sad, etc.)
    - [continuous branch](https://github.com/keonlee9420/Expressive-FastSpeech2/tree/continuous): conditions continuous emotional descriptors (such as arousal and valence) in addition to the categorical ones

2. `Conversational TTS`: The following branch contains an implementation of [Conversational End-to-End TTS for Voice Agent](https://arxiv.org/abs/2005.10438).

    <p align="center">
        <img src="img/model_conversational_tts.png" width="80%">
    </p>

    - [conversational branch](https://github.com/keonlee9420/Expressive-FastSpeech2/tree/conversational): conditions the chat history

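To make the fusion step above concrete, here is a minimal PyTorch sketch of utterance-level conditioning under the stated scheme. The class name, dimensions, and the way the fused vector is injected into the decoder are assumptions for illustration, not the repository's actual modules.

```python
import torch
import torch.nn as nn

class EmotionConditioning(nn.Module):
    """Hypothetical sketch: project emotion / arousal / valence into subspaces,
    concatenate channel-wise, then fuse with a single Linear + ReLU layer."""

    def __init__(self, n_emotions: int, d_subspace: int = 64, d_model: int = 256):
        super().__init__()
        self.emotion_proj = nn.Embedding(n_emotions, d_subspace)
        self.arousal_proj = nn.Linear(1, d_subspace)
        self.valence_proj = nn.Linear(1, d_subspace)
        self.fusion = nn.Sequential(nn.Linear(3 * d_subspace, d_model), nn.ReLU())

    def forward(self, emotion_id, arousal, valence):
        # emotion_id: (B,) long tensor; arousal, valence: (B, 1) float tensors
        e = self.emotion_proj(emotion_id)                  # (B, d_subspace)
        a = self.arousal_proj(arousal)                     # (B, d_subspace)
        v = self.valence_proj(valence)                     # (B, d_subspace)
        fused = self.fusion(torch.cat([e, a, v], dim=-1))  # (B, d_model)
        # broadcast over time and add to (or concatenate with) the decoder input
        return fused.unsqueeze(1)                          # (B, 1, d_model)

# usage sketch
cond = EmotionConditioning(n_emotions=7)
out = cond(torch.tensor([3]), torch.tensor([[0.2]]), torch.tensor([[0.8]]))
```
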
# Dependencies

Please install the Python dependencies given in `requirements.txt`.

```bash
pip3 install -r requirements.txt
```
# Synthesize Using a Pre-trained Model

Sharing a pre-trained model publicly is not permitted due to the copyright of the [AIHub Multimodal Video AI datasets](https://www.aihub.or.kr/aidata/137) and the [IEMOCAP database](https://sail.usc.edu/iemocap/).
# Train

## Data Preparation

### Korean (Video → Audio)

1. Download the [AIHub Multimodal Video AI datasets](https://www.aihub.or.kr/aidata/137) and set `corpus_path` in `config/AIHub-MMV/preprocess.yaml`. You must get permission to download the dataset.
2. Since the dataset contains raw videos, you need to convert and split each video clip into an audio utterance. The following script converts files from `.mp4` to `.wav` and then splits each clip based on the `.json` metadata file. It also builds `filelist.txt` and `speaker_info.txt` (see the sketch after this list for the general idea).

```bash
python3 prepare_data.py --extract_audio -p config/AIHub-MMV/preprocess.yaml
```

3. Update `corpus_path` to the preprocessed data path, e.g., from `AIHub-MMV` to `AIHub-MMV_preprocessed`.

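For orientation only, the following sketch mirrors what such an extraction step typically does; the `.json` field names here are made up, and the repository's `prepare_data.py` is the authoritative implementation.

```python
import json
import subprocess
from pathlib import Path

def extract_utterances(mp4_path: Path, meta_path: Path, out_dir: Path, sr: int = 22050):
    """Demux one clip to mono wav, then cut utterance segments at the
    start/end times stored in the metadata (field names are assumptions)."""
    out_dir.mkdir(parents=True, exist_ok=True)
    full_wav = out_dir / f"{mp4_path.stem}.wav"
    # video -> mono wav at the target sampling rate
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(mp4_path), "-vn", "-ac", "1", "-ar", str(sr), str(full_wav)],
        check=True,
    )
    # split the long wav into utterances using the timestamps from the metadata
    meta = json.loads(meta_path.read_text(encoding="utf-8"))
    for i, utt in enumerate(meta.get("utterances", [])):
        seg_wav = out_dir / f"{mp4_path.stem}_{i:04d}.wav"
        subprocess.run(
            ["ffmpeg", "-y", "-i", str(full_wav),
             "-ss", str(utt["start"]), "-to", str(utt["end"]), str(seg_wav)],
            check=True,
        )
```
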
### English (Audio)

1. Download the [IEMOCAP database](https://sail.usc.edu/iemocap/) and set `corpus_path` in `config/IEMOCAP/preprocess.yaml`. You must get permission to download the dataset.
## Preprocess

### Korean

1. With the prepared dataset, set up some prerequisites. The following command processes the audio and transcripts. The transcripts are normalized to Korean graphemes by `korean_cleaners` in `text/cleaners.py`. The results will be located at the `raw_path` defined in `config/AIHub-MMV/preprocess.yaml`.

```bash
python3 prepare_align.py config/AIHub-MMV/preprocess.yaml
```

2. As in FastSpeech2, [Montreal Forced Aligner](https://montreal-forced-aligner.readthedocs.io/en/latest/) (MFA) is used to obtain the alignments between the utterances and the phoneme sequences. Download and set up the environment to use MFA following the [official document](https://montreal-forced-aligner.readthedocs.io/en/latest/installation.html). The version used in this project is `2.0.0a13`.

You can get alignments either by training MFA from scratch or by using a pre-trained model. Note that training MFA may take several hours or days, depending on the corpus size.

### Train MFA from scratch

To train MFA, a grapheme-phoneme dictionary that covers all the words in the dataset is required. The following command will generate such a dictionary in `lexicon/`.

```bash
python3 prepare_data.py --extract_lexicon -p config/AIHub-MMV/preprocess.yaml
```

After that, train MFA.

```bash
mfa train ./raw_data/AIHub-MMV/clips lexicon/aihub-mmv-lexicon.txt preprocessed_data/AIHub-MMV/TextGrid --output_model_path montreal-forced-aligner/aihub-mmv-aligner --speaker_characters prosodylab -j 8 --clean
```

It will generate both the TextGrid files in `preprocessed_data/AIHub-MMV/TextGrid/` and the trained models in `montreal-forced-aligner/`. See the [official document](https://montreal-forced-aligner.readthedocs.io/en/latest/aligning.html#align-using-only-the-data-set) for details.

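As a side note, FastSpeech2-style preprocessing typically turns these TextGrid files into per-phone durations counted in mel frames. A rough sketch is shown below; it assumes the `tgt` package and a tier named `phones`, either of which may differ in your setup.

```python
import tgt

def phones_and_durations(tg_path: str, sampling_rate: int = 22050, hop_length: int = 256):
    """Read an MFA TextGrid and convert each phone interval to a frame count."""
    textgrid = tgt.io.read_textgrid(tg_path)
    tier = textgrid.get_tier_by_name("phones")  # tier name is an assumption
    phones, durations = [], []
    for interval in tier.intervals:
        phones.append(interval.text if interval.text else "sp")
        n_frames = int(round(
            (interval.end_time - interval.start_time) * sampling_rate / hop_length
        ))
        durations.append(n_frames)
    return phones, durations
```
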

### Using Pre-trained Models

If you want to re-align the dataset using the extracted lexicon dictionary and the MFA models trained in the previous step, run the following command.

```bash
mfa align ./raw_data/AIHub-MMV/clips lexicon/aihub-mmv-lexicon.txt montreal-forced-aligner/aihub-mmv-aligner.zip preprocessed_data/AIHub-MMV/TextGrid --speaker_characters prosodylab -j 8 --clean
```

It will generate the TextGrid files in `preprocessed_data/AIHub-MMV/TextGrid/`. See the [official document](https://montreal-forced-aligner.readthedocs.io/en/latest/aligning.html#align-using-pretrained-models) for details.

3. Finally, run the preprocessing script. It extracts and saves the duration, energy, mel-spectrogram, and pitch of each audio clip in `preprocessed_data/AIHub-MMV/` (a conceptual sketch of these features follows below).

```bash
python3 preprocess.py config/AIHub-MMV/preprocess.yaml
```
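Conceptually, the extracted features look roughly like the sketch below, written with `librosa` and `pyworld` purely for illustration. The actual hyperparameters and normalization follow the values in `config/AIHub-MMV/preprocess.yaml` and FastSpeech2's preprocessing code, so treat this only as a reading aid.

```python
import librosa
import numpy as np
import pyworld as pw

def extract_features(wav_path: str, sr: int = 22050, n_fft: int = 1024,
                     hop_length: int = 256, n_mels: int = 80):
    wav, _ = librosa.load(wav_path, sr=sr)

    # mel-spectrogram (decoder target)
    mel = librosa.feature.melspectrogram(
        y=wav, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels
    )
    log_mel = np.log(np.clip(mel, 1e-5, None))

    # frame-level energy from the linear magnitude spectrogram
    stft = np.abs(librosa.stft(wav, n_fft=n_fft, hop_length=hop_length))
    energy = np.linalg.norm(stft, axis=0)

    # frame-level pitch (F0) with WORLD
    wav64 = wav.astype(np.float64)
    f0, t = pw.dio(wav64, sr, frame_period=hop_length / sr * 1000)
    f0 = pw.stonemask(wav64, f0, t, sr)

    return log_mel, energy, f0
```
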
### English

1. With the prepared dataset, set up some prerequisites. The following command processes the audio and transcripts. The transcripts are normalized to English graphemes by `english_cleaners` in `text/cleaners.py`. The results will be located at the `raw_path` defined in `config/IEMOCAP/preprocess.yaml`.

```bash
python3 prepare_align.py config/IEMOCAP/preprocess.yaml
```

2. As in FastSpeech2, [Montreal Forced Aligner](https://montreal-forced-aligner.readthedocs.io/en/latest/) (MFA) is used to obtain the alignments between the utterances and the phoneme sequences. Download and set up the environment to use MFA following the [official document](https://montreal-forced-aligner.readthedocs.io/en/latest/installation.html). The version used in this project is `2.0.0a13`.

You can get alignments either by training MFA from scratch or by using a pre-trained model. Note that training MFA may take several hours or days, depending on the corpus size.

### Train MFA from scratch

To train MFA, a grapheme-phoneme dictionary that covers all the words in the dataset is required. The following command will generate such a dictionary in `lexicon/`.

```bash
python3 prepare_data.py --extract_lexicon -p config/IEMOCAP/preprocess.yaml
```

After that, train MFA.

```bash
mfa train ./raw_data/IEMOCAP/sessions lexicon/iemocap-lexicon.txt preprocessed_data/IEMOCAP/TextGrid --output_model_path montreal-forced-aligner/iemocap-aligner --speaker_characters prosodylab -j 8 --clean
```

It will generate both the TextGrid files in `preprocessed_data/IEMOCAP/TextGrid/` and the trained models in `montreal-forced-aligner/`. See the [official document](https://montreal-forced-aligner.readthedocs.io/en/latest/aligning.html#align-using-only-the-data-set) for details.

### Using Pre-trained Models

If you want to re-align the dataset using the extracted lexicon dictionary and the MFA models trained in the previous step, run the following command.

```bash
mfa align ./raw_data/IEMOCAP/sessions lexicon/iemocap-lexicon.txt montreal-forced-aligner/iemocap-aligner.zip preprocessed_data/IEMOCAP/TextGrid --speaker_characters prosodylab -j 8 --clean
```

It will generate the TextGrid files in `preprocessed_data/IEMOCAP/TextGrid/`. See the [official document](https://montreal-forced-aligner.readthedocs.io/en/latest/aligning.html#align-using-pretrained-models) for details.

3. Finally, run the preprocessing script. It extracts and saves the duration, energy, mel-spectrogram, and pitch of each audio clip in `preprocessed_data/IEMOCAP/`.

```bash
python3 preprocess.py config/IEMOCAP/preprocess.yaml
```
## Model Training

Now you have all the prerequisites! Train the model using the following command:

### Korean

```bash
python3 train.py -p config/AIHub-MMV/preprocess.yaml -m config/AIHub-MMV/model.yaml -t config/AIHub-MMV/train.yaml
```

### English

```bash
python3 train.py -p config/IEMOCAP/preprocess.yaml -m config/IEMOCAP/model.yaml -t config/IEMOCAP/train.yaml
```
# Inference

### Korean

To synthesize a single utterance, try

```bash
python3 synthesize.py --text "YOUR_DESIRED_TEXT" --speaker_id SPEAKER_ID --emotion_id EMOTION_ID --arousal AROUSAL --valence VALENCE --restore_step STEP --mode single -p config/AIHub-MMV/preprocess.yaml -m config/AIHub-MMV/model.yaml -t config/AIHub-MMV/train.yaml
```

All ids can be found in the dictionary files (json files) in `preprocessed_data/AIHub-MMV/`, and the generated utterances will be put in `output/result/AIHub-MMV`.

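For example, the available speaker ids can be inspected as sketched below. The exact json file name is an assumption based on FastSpeech2's conventions (`speakers.json`); the emotion dictionary in the same folder has a similar shape.

```python
import json

# hypothetical file name following FastSpeech2's conventions
with open("preprocessed_data/AIHub-MMV/speakers.json", encoding="utf-8") as f:
    speakers = json.load(f)

print(len(speakers), "speakers")
print(list(speakers)[:10])  # pass one of these to --speaker_id
```
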
Batch inference is also supported; try

```bash
python3 synthesize.py --source preprocessed_data/AIHub-MMV/val.txt --restore_step STEP --mode batch -p config/AIHub-MMV/preprocess.yaml -m config/AIHub-MMV/model.yaml -t config/AIHub-MMV/train.yaml
```

to synthesize all utterances in `preprocessed_data/AIHub-MMV/val.txt`.
### English

To synthesize a single utterance, try

```bash
python3 synthesize.py --text "YOUR_DESIRED_TEXT" --speaker_id SPEAKER_ID --emotion_id EMOTION_ID --arousal AROUSAL --valence VALENCE --restore_step STEP --mode single -p config/IEMOCAP/preprocess.yaml -m config/IEMOCAP/model.yaml -t config/IEMOCAP/train.yaml
```

All ids can be found in the dictionary files (json files) in `preprocessed_data/IEMOCAP/`, and the generated utterances will be put in `output/result/IEMOCAP`.

Batch inference is also supported; try

```bash
python3 synthesize.py --source preprocessed_data/IEMOCAP/val.txt --restore_step STEP --mode batch -p config/IEMOCAP/preprocess.yaml -m config/IEMOCAP/model.yaml -t config/IEMOCAP/train.yaml
```

to synthesize all utterances in `preprocessed_data/IEMOCAP/val.txt`.
# TensorBoard

Use

```bash
tensorboard --logdir output/log
```

to serve TensorBoard on your localhost. The loss curves, synthesized mel-spectrograms, and audios are shown.

<p align="center">
    <img src="img/emotional-fastspeech2-scalars.png" width="100%">
</p>

<p align="center">
    <img src="img/emotional-fastspeech2-images.png" width="100%">
</p>

<p align="center">
    <img src="img/emotional-fastspeech2-audios.png" width="100%">
</p>
# Notes

### Implementation Issues

- (For Korean) Since the separator is learned only as 'sp' by MFA's nature ([official document](https://montreal-forced-aligner.readthedocs.io/en/latest/data_format.html#transcription-normalization-and-dictionary-lookup)), spacing becomes a critical issue. Therefore, after text normalization, the spacing is polished using a third-party module. The candidates were [PyKoSpacing](https://github.com/haven-jeon/PyKoSpacing) and [QuickSpacer](https://github.com/psj8252/quickspacer); the latter was selected due to its higher accuracy (fewer errors than PyKoSpacing).
- Some incorrect transcriptions can be fixed manually from `preparation/*_fixed.txt` during the run of `prepare_align.py`. Even after that, you can still expand `preparation/*_fixed.txt` with additional corrections and run the following command to apply them. It will update the raw text data and `filelist.txt` in `raw_path`, and the lexicon dictionary in `lexicon/`.

    For Korean,

    ```bash
    python3 prepare_data.py --apply_fixed_text -p config/AIHub-MMV/preprocess.yaml
    ```

    For English,

    ```bash
    python3 prepare_data.py --apply_fixed_text -p config/IEMOCAP/preprocess.yaml
    ```

    Note that this should be done after running `prepare_align.py` at least once and before MFA aligning.

- Also, some incorrect emotion labelings, such as out-of-range values for either arousal or valence, are fixed manually. These must be updated to build an efficient emotion embedding space.
- I empirically found that the `TextGrid` extracted during the MFA training process is aligned worse than one re-aligned with the trained model after the first training. I'm not sure about the reason, but I can confirm that it's better to re-align the dataset using your trained model after finishing the first training, especially when there are too many unaligned corpora. You can also enlarge `beam` and `retry_beam`, following this [issue](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner/issues/240#issuecomment-791172411) and the [official document](https://montreal-forced-aligner.readthedocs.io/en/latest/configuration_align.html#global-options), to get more aligned corpora at the cost of alignment accuracy.
### Training with your own dataset (own language)

- First, you need to transliterate the dataset by adapting the `normalize()` function in `text/korean.py` and the dictionary in `text/korean_dict.py`. If you are interested in adapting another language, you may need to prepare a grapheme-to-phoneme converter for that language (a generic skeleton is sketched at the end of this section).
- Get the files that contain the words to be checked manually with the following command. The results will be saved at `corpus_path/non*.txt`.

    For Korean,

    ```bash
    python3 prepare_data.py --extract_nonkr -p config/AIHub-MMV/preprocess.yaml
    ```

    For English,

    ```bash
    python3 prepare_data.py --extract_nonen -p config/IEMOCAP/preprocess.yaml
    ```

    Based on the results, prepare the correction filelist in `preparation/`, just like `*_fixed.txt`.

- Then, follow the Train section starting from Preprocess.

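None of the names below exist in this repository; it is only a generic skeleton of what a language-specific `normalize()` cleaner tends to do before language-dependent rules and a proper grapheme-to-phoneme step are plugged in.

```python
import re

# Hypothetical skeleton of a language-specific cleaner, loosely mirroring the role
# of normalize() in text/korean.py. Replace the rules with ones that fit your language.
_ABBREVIATIONS = {"Dr.": "Doctor", "Mt.": "Mount"}  # example mapping only

def normalize(text: str) -> str:
    text = text.strip()
    for abbr, expansion in _ABBREVIATIONS.items():
        text = text.replace(abbr, expansion)        # expand abbreviations
    text = re.sub(r"[\"\'()\[\]]", "", text)        # drop quotes and brackets
    text = re.sub(r"\s+", " ", text)                # collapse whitespace
    return text
```
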
# Citation

If you would like to use or refer to this implementation, please cite the repo.

    @misc{expressive_fastspeech22020,
      author = {Lee, Keon},
      title = {Expressive-FastSpeech2},
      year = {2021},
      publisher = {GitHub},
      journal = {GitHub repository},
      howpublished = {\url{https://github.com/keonlee9420/Expressive-FastSpeech2}}
    }

# References

- [ming024's FastSpeech2](https://github.com/ming024/FastSpeech2) (version after the 2021.02.26 updates)
- [HGU-DLLAB's Korean-FastSpeech2-Pytorch](https://github.com/HGU-DLLAB/Korean-FastSpeech2-Pytorch)
- [hccho2's Tacotron2-Wavenet-Korean-TTS](https://github.com/hccho2/Tacotron2-Wavenet-Korean-TTS)
- [carpedm20's multi-speaker-tacotron-tensorflow](https://github.com/carpedm20/multi-speaker-tacotron-tensorflow)

img/model_conversational_tts.png
