# Parallel Tacotron2

PyTorch implementation of Google's [Parallel Tacotron 2: A Non-Autoregressive Neural TTS Model with Differentiable Duration Modeling](https://arxiv.org/abs/2103.14574)

<p align="center">
    <img src="img/parallel_tacotron.png" width="80%">
</p>

<p align="center">
    <img src="img/parallel_tacotron2.png" width="40%">
</p>

# Updates

- 2021.05.15: Implementation done. Sanity checks on training and inference pass, but the model still does not converge.

  `I'm waiting for your contribution!` Please let me know if you find any mistakes in my implementation or have any advice for training the model successfully. See the Implementation Issues section below.

# Training

## Requirements

- You can install the Python dependencies with

  ```bash
  pip3 install -r requirements.txt
  ```

- In addition, install fairseq ([official document](https://fairseq.readthedocs.io/en/latest/index.html), [github](https://github.com/pytorch/fairseq)) to utilize `LConvBlock`, e.g. as below.
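
  A minimal sketch, assuming the PyPI release of fairseq suffices (building from source per the official document also works):

  ```bash
  pip3 install fairseq
  ```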

## Datasets

The supported datasets:

- [LJSpeech](https://keithito.com/LJ-Speech-Dataset/): a single-speaker English dataset consisting of 13,100 short audio clips of a female speaker reading passages from 7 non-fiction books, approximately 24 hours in total.
- (more to be added)

## Preprocessing

After downloading the datasets, set the `corpus_path` in `preprocess.yaml` and run the preparation script:

```
python3 prepare_data.py config/LJSpeech/preprocess.yaml
```

Then, run the preprocessing script:

```
python3 preprocess.py config/LJSpeech/preprocess.yaml
```

## Training

Train your model with

```
python3 train.py -p config/LJSpeech/preprocess.yaml -m config/LJSpeech/model.yaml -t config/LJSpeech/train.yaml
```

The model cannot converge yet. I'm still debugging, but it would be a great boost if your contribution is ready!

# TensorBoard

Use

```
tensorboard --logdir output/log/LJSpeech
```

to serve TensorBoard on your localhost.

# Implementation Issues

Overall, normalization and activation functions that are not suggested in the original paper are arranged as needed to prevent NaN values (gradients) in the forward and backward passes.

## Text Encoder

1. Use the `FFTBlock` of FastSpeech2 as the transformer block of the text encoder.
2. Use dropout `0.2` for the `ConvBlock` of the text encoder.
3. To stand in for the paper's "proprietary normalization engine",
   - Apply the same text normalization as in FastSpeech2.
   - Implement a `grapheme_to_phoneme` function (see `./text/__init__.py`); a sketch of the idea follows the list.
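
A minimal sketch of such a `grapheme_to_phoneme` function, assuming the `g2p_en` package (the actual implementation lives in `./text/__init__.py`; the `sp` pause token here is illustrative):

```python
from string import punctuation

from g2p_en import G2p  # pip3 install g2p-en

g2p = G2p()

def grapheme_to_phoneme(text):
    """Map raw English text to ARPAbet phoneme symbols, keeping word boundaries."""
    phones = g2p(text)  # e.g. "hi" -> ['HH', 'AY1']
    # Replace inter-word spaces with a short-pause token and drop punctuation tokens.
    return ["sp" if p == " " else p for p in phones if p not in punctuation]

print(grapheme_to_phoneme("Parallel Tacotron two"))
```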

## Residual Encoder

1. Use an `80`-channel mel-spectrogram instead of `128`-bin.
2. A regular sinusoidal positional embedding is used at the frame level instead of the combination of three positional embeddings in Parallel Tacotron. Since the model depends entirely on unsupervised learning for position, this choice may be one reason the model fails to converge. The embedding is sketched below.
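
For reference, a minimal sketch of the standard sinusoidal positional embedding applied at the frame level (names and shapes are illustrative):

```python
import math

import torch

def sinusoidal_positional_embedding(max_len, d_model):
    """Standard Transformer positional encoding, added to frame-level features."""
    position = torch.arange(max_len, dtype=torch.float).unsqueeze(1)        # [T, 1]
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float) * (-math.log(10000.0) / d_model)
    )                                                                       # [d_model / 2]
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe                                                               # [T, d_model]
```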

## Duration Predictor & Learned Upsampling (the most important but ambiguous part)

1. Use log durations with the prior that there should be at least one frame in total per sequence.
2. Use `nn.SiLU()` for the Swish activation.
3. When obtaining `W` and `C`, the concatenation is applied among `S`, `E`, and `V` after broadcasting `V` to the frame domain (T domain). Since the detailed procedure is not described in the original paper, this choice may be one reason the model fails to converge; one plausible reading is sketched after the list.
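
A minimal sketch of the broadcasting and concatenation in item 3, under assumed shapes (`S`, `E` as per-frame-per-token offset grids `[B, T, N]`, `V` as token hiddens `[B, N, d]`; the MLP sizes are illustrative):

```python
import torch
import torch.nn as nn

def concat_SEV(S, E, V):
    """Broadcast V over the frame (T) domain, then concatenate with S and E."""
    B, T, N = S.shape
    V_b = V.unsqueeze(1).expand(B, T, N, V.size(-1))                   # [B, T, N, d]
    return torch.cat([S.unsqueeze(-1), E.unsqueeze(-1), V_b], dim=-1)  # [B, T, N, d + 2]

B, T, N, d = 2, 50, 10, 8
S, E, V = torch.randn(B, T, N), torch.randn(B, T, N), torch.randn(B, N, d)

# A Swish-activated MLP scores each frame-token pair; softmax over the token
# axis yields the upsampling weights W, which spread token features over frames.
mlp_w = nn.Sequential(nn.Linear(d + 2, 16), nn.SiLU(), nn.Linear(16, 1))
W = torch.softmax(mlp_w(concat_SEV(S, E, V)).squeeze(-1), dim=-1)  # [B, T, N]
upsampled = torch.einsum("btn,bnd->btd", W, V)                     # [B, T, d]
```

`C` would be obtained analogously from the same concatenated grid with its own MLP, as one reading of the paper's learned upsampling.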

## Decoder

1. Use (multi-head) `Self-attention` and `LConvBlock`.
2. The iterative mel-spectrogram is projected by a linear layer.
3. Apply `nn.Tanh()` to each `LConvBlock` output (following the activation pattern of the decoder in FastSpeech2). A sketch of the per-layer pattern is below.
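
A minimal sketch of the per-layer pattern, assuming `lconv_block` is the fairseq-based `LConvBlock`; the exact layer composition here is illustrative, not the repo's implementation:

```python
import torch.nn as nn

class DecoderLayer(nn.Module):
    """Self-attention followed by a Tanh-activated LConvBlock, with a
    per-iteration linear projection to the mel-spectrogram."""
    def __init__(self, d_model, n_heads, lconv_block, n_mels=80):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.lconv = lconv_block              # fairseq-based lightweight conv block
        self.act = nn.Tanh()                  # applied to each LConvBlock output
        self.mel_proj = nn.Linear(d_model, n_mels)

    def forward(self, x):
        x = x + self.attn(x, x, x, need_weights=False)[0]
        x = self.act(self.lconv(x))
        return x, self.mel_proj(x)            # iterative mel prediction per layer
```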

## Loss

1. Use the optimizer & scheduler of FastSpeech2 (which, as described in the original paper, follows [Attention Is All You Need](https://arxiv.org/abs/1706.03762)).
2. Based on [pytorch-softdtw-cuda](https://github.com/Maghoumi/pytorch-softdtw-cuda) ([post](https://www.codefull.net/2020/05/fast-differentiable-soft-dtw-for-pytorch-using-cuda/)) for the soft-DTW.
   1. A customized soft-DTW is implemented in `model/soft_dtw_cuda.py`, reflecting the recursion suggested in the original paper.
   2. The original soft-DTW does not assume a final loss, so only `E` is computed. When employed as a loss function, a Jacobian product is added to return the derivative of `R` w.r.t. the input `X`.
   3. Currently, the maximum batch size is `6` on a 24 GiB GPU (TITAN RTX) due to the space complexity of the soft-DTW loss.
      - In the original paper, a custom differentiable diagonal-band operation was implemented and used to reduce the O(T^2) complexity, but this part has not been explored in the current implementation yet.
3. For stability, mel-spectrograms are compressed by a sigmoid function before the soft-DTW; if the sigmoid is removed, the soft-DTW value becomes too large and produces NaN in the backward pass. Usage is sketched after the list.
4. Guided attention loss is applied for fast convergence of the attention module in the residual encoder.
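
A minimal sketch of the sigmoid compression in item 3, assuming the upstream pytorch-softdtw-cuda API (the `gamma` value and shapes are illustrative):

```python
import torch
from soft_dtw_cuda import SoftDTW  # from Maghoumi/pytorch-softdtw-cuda

sdtw = SoftDTW(use_cuda=True, gamma=0.05)

mel_pred = torch.randn(6, 870, 80, device="cuda", requires_grad=True)  # [B, T_pred, n_mels]
mel_true = torch.randn(6, 850, 80, device="cuda")                      # [B, T_true, n_mels]

# Sigmoid compression bounds frame values to (0, 1), keeping the accumulated
# alignment cost small enough to avoid NaN in the backward pass.
loss = sdtw(torch.sigmoid(mel_pred), torch.sigmoid(mel_true)).mean()
loss.backward()
```

And for item 4, the guided attention weights in their standard form (the sharpness `g=0.2` is an assumed value; the loss is the mean of the attention matrix multiplied element-wise by these weights):

```python
import torch

def guided_attention_weights(N, T, g=0.2):
    """W[n, t] = 1 - exp(-((n/N - t/T)^2) / (2 g^2)): near zero on the
    diagonal, so off-diagonal attention mass is penalized."""
    n = torch.arange(N, dtype=torch.float).unsqueeze(1) / N  # token axis [N, 1]
    t = torch.arange(T, dtype=torch.float).unsqueeze(0) / T  # frame axis [1, T]
    return 1.0 - torch.exp(-((n - t) ** 2) / (2 * g * g))
```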

# Citation

```
@misc{lee2021parallel_tacotron2,
  author = {Lee, Keon},
  title = {Parallel-Tacotron2},
  year = {2021},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/keonlee9420/Parallel-Tacotron2}}
}
```

# References

- [ming024's FastSpeech2](https://github.com/ming024/FastSpeech2) (version later than 2021.02.26)
- [Parallel Tacotron: Non-Autoregressive and Controllable TTS](https://arxiv.org/abs/2010.11439)
- [Parallel Tacotron 2: A Non-Autoregressive Neural TTS Model with Differentiable Duration Modeling](https://arxiv.org/abs/2103.14574)