Commit a589311

Initial commit
0 parents  commit a589311

44 files changed: +17314 -0 lines changed

.gitignore

Lines changed: 121 additions & 0 deletions
@@ -0,0 +1,121 @@
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# pyenv
.python-version

# celery beat schedule file
celerybeat-schedule

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/

__pycache__
.vscode
.DS_Store

# MFA
montreal-forced-aligner/

# data, checkpoint, and models
raw_data/
output/
*.npy
TextGrid/
hifigan/*.pth.tar
*.out
fairseq/
soft_dtw_cuda/

LICENSE

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2021 Keon Lee

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

README.md

Lines changed: 129 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,129 @@
# Parallel Tacotron2

PyTorch implementation of Google's [Parallel Tacotron 2: A Non-Autoregressive Neural TTS Model with Differentiable Duration Modeling](https://arxiv.org/abs/2103.14574)

<p align="center">
    <img src="img/parallel_tacotron.png" width="80%">
</p>

<p align="center">
    <img src="img/parallel_tacotron2.png" width="40%">
</p>

# Updates

- 2021.05.15: Implementation done. Training and inference pass sanity checks, but the model does not converge yet.

`I'm waiting for your contribution!` Please let me know if you find any mistakes in my implementation or have any advice for training the model successfully. See the Implementation Issues section.

# Training

## Requirements

- You can install the Python dependencies with

```bash
pip3 install -r requirements.txt
```

- In addition, install fairseq ([official documentation](https://fairseq.readthedocs.io/en/latest/index.html), [github](https://github.com/pytorch/fairseq)) to use `LConvBlock`.

## Datasets

The supported datasets:

- [LJSpeech](https://keithito.com/LJ-Speech-Dataset/): a single-speaker English dataset consisting of 13,100 short audio clips of a female speaker reading passages from 7 non-fiction books, approximately 24 hours in total.
- (more will be added)

## Preprocessing

After downloading the datasets, set the `corpus_path` in `preprocess.yaml` and run the preparation script:

```
python3 prepare_data.py config/LJSpeech/preprocess.yaml
```

Then, run the preprocessing script:

```
python3 preprocess.py config/LJSpeech/preprocess.yaml
```

## Training

Train your model with

```
python3 train.py -p config/LJSpeech/preprocess.yaml -m config/LJSpeech/model.yaml -t config/LJSpeech/train.yaml
```

The model does not converge yet. I'm debugging it, and contributions that speed this up are very welcome!

# TensorBoard

Use

```
tensorboard --logdir output/log/LJSpeech
```

to serve TensorBoard on your localhost.

# Implementation Issues

Overall, normalization and activation layers that are not suggested in the original paper are added where needed to prevent NaN values (and NaN gradients) in the forward and backward passes.

## Text Encoder

1. Use the `FFTBlock` of FastSpeech2 for the transformer block of the text encoder.
2. Use dropout `0.2` for the `ConvBlock` of the text encoder.
3. To replace the "proprietary normalization engine":
    - Apply the same text normalization as in FastSpeech2.
    - Implement the `grapheme_to_phoneme` function (see `./text/__init__`).

## Residual Encoder

1. Use an `80-channel` mel-spectrogram instead of the `128-bin` one.
2. A regular sinusoidal positional embedding is used at the frame level instead of the combination of three positional embeddings used in Parallel Tacotron. As the model depends entirely on unsupervised learning for position, this choice may be one reason the model fails to converge.
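
For reference, the regular sinusoidal positional embedding mentioned above can be sketched in plain Python. This is only an illustration of the standard formula from Attention Is All You Need, not the code used in this repository; the function name `sinusoidal_positions` and its parameters are hypothetical.

```python
import math


def sinusoidal_positions(n_positions, d_model):
    """Return an n_positions x d_model table of sinusoidal position encodings."""
    table = []
    for pos in range(n_positions):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share the frequency 1 / 10000^(2k / d_model).
            angle = pos / (10000 ** (2 * (i // 2) / d_model))
            # Even dimensions use sine, odd dimensions use cosine.
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


# Hypothetical sizes, chosen only to keep the example small.
pe = sinusoidal_positions(n_positions=4, d_model=6)
print(len(pe), len(pe[0]))
```

Each frame index gets one fixed vector, so a single table like this carries less structure than the three combined embeddings of Parallel Tacotron.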

## Duration Predictor & Learned Upsampling (The most important but ambiguous part)

1. Use log durations with the prior: there should be at least one frame in total per sequence.
2. Use `nn.SiLU()` for the swish activation.
3. When obtaining `W` and `C`, `S`, `E`, and `V` are concatenated after broadcasting `V` along the frame domain (T domain). As the detailed process is not described in the original paper, this choice may be one reason the model fails to converge.
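
The concatenation in item 3 can be illustrated with a small plain-Python sketch. It assumes, based on one reading of the paper, that `S[t][k]` is the distance from the start of token `k` to frame `t` and `E[t][k]` the distance from frame `t` to the end of token `k`, with 0-based frame indices. The function name `boundary_features` is hypothetical, and the layers that turn these features into `W` and `C` are omitted.

```python
def boundary_features(durations, token_vectors):
    """Build per-(frame, token) features [S, E, V_k] from token durations."""
    ends, total = [], 0.0
    for d in durations:
        total += d
        ends.append(total)
    starts = [e - d for e, d in zip(ends, durations)]
    n_frames = int(round(total))

    features = []
    for t in range(n_frames):
        row = []
        for k, v in enumerate(token_vectors):
            s_tk = t - starts[k]  # distance from the token start to frame t
            e_tk = ends[k] - t    # distance from frame t to the token end
            # Broadcasting V over frames: the same token vector is reused for every t.
            row.append([s_tk, e_tk, *v])
        features.append(row)
    return features


# Two tokens lasting 2 and 3 frames, each with a 2-dimensional vector (made-up values).
feats = boundary_features([2.0, 3.0], [[0.1, 0.2], [0.3, 0.4]])
print(len(feats), len(feats[0]), len(feats[0][0]))
```

The result has shape frames x tokens x (2 + vector size), which is what a network predicting `W` and `C` would consume under these assumptions.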

## Decoder

1. Use (multi-head) `Self-attention` and `LConvBlock`.
2. Each iterative mel-spectrogram is projected by a linear layer.
3. Apply `nn.Tanh()` to each `LConvBlock` output (following the activation pattern of the decoder in FastSpeech2).

## Loss

1. Use the optimizer and scheduler of FastSpeech2 (which come from [Attention is all you need](https://arxiv.org/abs/1706.03762), as described in the original paper).
2. Build on [pytorch-softdtw-cuda](https://github.com/Maghoumi/pytorch-softdtw-cuda) ([post](https://www.codefull.net/2020/05/fast-differentiable-soft-dtw-for-pytorch-using-cuda/)) for the soft-DTW.
    1. Implement a customized soft-DTW in `model/soft_dtw_cuda.py`, reflecting the recursion suggested in the original paper.
    2. The original soft-DTW implementation does not assume that it is the final loss, so only `E` is computed. Since it is used as a loss function here, a Jacobian product is added to return the target derivative of `R` w.r.t. the input `X`.
    3. Currently, the maximum batch size is `6` on a 24 GiB GPU (TITAN RTX) due to the space complexity of the soft-DTW loss.
        - In the original paper, a custom differentiable diagonal band operation was implemented to address the O(T^2) complexity, but this part has not been explored in the current implementation yet.
3. For stability, mel-spectrograms are compressed by a sigmoid function before the soft-DTW. Without the sigmoid, the soft-DTW value becomes too large and produces NaN in the backward pass.
4. Guided attention loss is applied for fast convergence of the attention module in the residual encoder.
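
To make the memory issue above concrete, here is a minimal plain-Python sketch of the standard soft-DTW forward recursion on scalar sequences. It is not the customized recursion in `model/soft_dtw_cuda.py`; the names `soft_min` and `soft_dtw`, the squared-difference cost, and the default `gamma` are assumptions made for illustration.

```python
import math


def soft_min(values, gamma):
    """Smoothed minimum: -gamma * log(sum(exp(-v / gamma))), computed stably."""
    m = min(values)
    if math.isinf(m):
        return m
    return m - gamma * math.log(sum(math.exp(-(v - m) / gamma) for v in values))


def soft_dtw(x, y, gamma=0.01):
    """Soft-DTW value between two scalar sequences with squared-difference cost."""
    n, m = len(x), len(y)
    inf = float("inf")
    # The full (n + 1) x (m + 1) table is what makes memory grow as O(T^2).
    r = [[inf] * (m + 1) for _ in range(n + 1)]
    r[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            r[i][j] = cost + soft_min(
                [r[i - 1][j - 1], r[i - 1][j], r[i][j - 1]], gamma
            )
    return r[n][m]


same = soft_dtw([0.0, 1.0, 2.0], [0.0, 1.0, 2.0])
shifted = soft_dtw([0.0, 0.0], [1.0, 1.0])
print(same, shifted)
```

Restricting `j` to a band around the diagonal would shrink the table, which is the idea behind the diagonal band operation mentioned in the paper.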

# Citation

```
@misc{lee2021parallel_tacotron2,
  author = {Lee, Keon},
  title = {Parallel-Tacotron2},
  year = {2021},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/keonlee9420/Parallel-Tacotron2}}
}
```

# References

- [ming024's FastSpeech2](https://github.com/ming024/FastSpeech2) (versions later than 2021.02.26)
- [Parallel Tacotron: Non-Autoregressive and Controllable TTS](https://arxiv.org/abs/2010.11439)
- [Parallel Tacotron 2: A Non-Autoregressive Neural TTS Model with Differentiable Duration Modeling](https://arxiv.org/abs/2103.14574)

audio/__init__.py

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
import audio.tools
import audio.stft
import audio.audio_processing

audio/audio_processing.py

Lines changed: 100 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,100 @@
import torch
import numpy as np
import librosa.util as librosa_util
from scipy.signal import get_window


def window_sumsquare(
    window,
    n_frames,
    hop_length,
    win_length,
    n_fft,
    dtype=np.float32,
    norm=None,
):
    """
    # from librosa 0.6
    Compute the sum-square envelope of a window function at a given hop length.

    This is used to estimate modulation effects induced by windowing
    observations in short-time fourier transforms.

    Parameters
    ----------
    window : string, tuple, number, callable, or list-like
        Window specification, as in `get_window`

    n_frames : int > 0
        The number of analysis frames

    hop_length : int > 0
        The number of samples to advance between frames

    win_length : [optional]
        The length of the window function. By default, this matches `n_fft`.

    n_fft : int > 0
        The length of each analysis frame.

    dtype : np.dtype
        The data type of the output

    Returns
    -------
    wss : np.ndarray, shape=`(n_fft + hop_length * (n_frames - 1))`
        The sum-squared envelope of the window function
    """
    if win_length is None:
        win_length = n_fft

    n = n_fft + hop_length * (n_frames - 1)
    x = np.zeros(n, dtype=dtype)

    # Compute the squared window at the desired length
    win_sq = get_window(window, win_length, fftbins=True)
    win_sq = librosa_util.normalize(win_sq, norm=norm) ** 2
    # Pass `size` by keyword: newer librosa releases make it keyword-only
    win_sq = librosa_util.pad_center(win_sq, size=n_fft)

    # Fill the envelope
    for i in range(n_frames):
        sample = i * hop_length
        x[sample : min(n, sample + n_fft)] += win_sq[: max(0, min(n_fft, n - sample))]
    return x


def griffin_lim(magnitudes, stft_fn, n_iters=30):
    """
    PARAMS
    ------
    magnitudes: spectrogram magnitudes
    stft_fn: STFT class with transform (STFT) and inverse (ISTFT) methods
    """

    # Start from random phases in [0, 2*pi)
    angles = np.angle(np.exp(2j * np.pi * np.random.rand(*magnitudes.size())))
    angles = angles.astype(np.float32)
    # torch.autograd.Variable is deprecated; a plain tensor behaves the same
    angles = torch.from_numpy(angles)
    signal = stft_fn.inverse(magnitudes, angles).squeeze(1)

    # Alternate between re-estimating the phase and resynthesizing the signal
    for i in range(n_iters):
        _, angles = stft_fn.transform(signal)
        signal = stft_fn.inverse(magnitudes, angles).squeeze(1)
    return signal


def dynamic_range_compression(x, C=1, clip_val=1e-5):
    """
    PARAMS
    ------
    C: compression factor
    """
    return torch.log(torch.clamp(x, min=clip_val) * C)


def dynamic_range_decompression(x, C=1):
    """
    PARAMS
    ------
    C: compression factor used to compress
    """
    return torch.exp(x) / C
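
As a small illustration of the two dynamic-range helpers above, the following sketch uses scalar stand-ins built on `math` instead of `torch`. The names `compress` and `decompress` are hypothetical and only mirror the formulas; they are not part of the repository.

```python
import math


def compress(x, C=1.0, clip_val=1e-5):
    """Scalar version of log(clamp(x, min=clip_val) * C)."""
    return math.log(max(x, clip_val) * C)


def decompress(y, C=1.0):
    """Scalar version of exp(y) / C."""
    return math.exp(y) / C


# Values above clip_val survive the round trip (up to floating-point error).
restored = decompress(compress(0.5))
# Values below clip_val are clamped first, so the round trip returns clip_val.
clamped = decompress(compress(1e-7))
print(restored, clamped)
```

The clamp keeps the logarithm finite for silent bins, at the cost of losing any magnitude below `clip_val`.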
