README.md (10 additions, 9 deletions)

```diff
@@ -24,7 +24,7 @@ A notebook supposed to be executed on https://colab.research.google.com is available
 - Multi-speaker and single speaker versions of DeepVoice3
 - Audio samples and pre-trained models
 - Preprocessor for [LJSpeech (en)](https://keithito.com/LJ-Speech-Dataset/), [JSUT (jp)](https://sites.google.com/site/shinnosuketakamichi/publication/jsut) and [VCTK](http://homepages.inf.ed.ac.uk/jyamagis/page3/page58/page58.html) datasets, as well as [carpedm20/multi-speaker-tacotron-tensorflow](https://github.com/carpedm20/multi-Speaker-tacotron-tensorflow) compatible custom dataset (in JSON format)
-- Language-dependent frontend text processor for English and Japanese
+- Language-dependent frontend text processor for English and Japanese

 ### Samples
```
```diff
@@ -61,6 +61,7 @@ See "Synthesize from a checkpoint" section in the README for how to generate speech
 #### 1-2. Preprocessing custom english datasets with long silence. (Based on [vctk_preprocess](vctk_preprocess/))

-Some dataset, especially automatically generated dataset may include long silence and undesirable leading/trailing noises, undermining the char-level seq2seq model.
+Some dataset, especially automatically generated dataset may include long silence and undesirable leading/trailing noises, undermining the char-level seq2seq model.
 (e.g. VCTK, although this is covered in vctk_preprocess)
```
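The silence problem described above can be illustrated with a minimal sketch. This is a hypothetical helper, not code from this repository; the real pipeline trims silence using phoneme alignments rather than a raw amplitude threshold:

```python
def trim_silence(samples, threshold=0.01):
    """Drop leading and trailing samples whose absolute amplitude is
    below `threshold`. Illustrative only: it shows why long quiet spans
    at the edges of a clip are removable without touching the speech."""
    start, end = 0, len(samples)
    while start < end and abs(samples[start]) < threshold:
        start += 1
    while end > start and abs(samples[end - 1]) < threshold:
        end -= 1
    return samples[start:end]
```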
```diff
 To deal with the problem, `gentle_web_align.py` will
-- **Prepare phoneme alignments for all utterances**
-- Cut silences during preprocessing
+- **Prepare phoneme alignments for all utterances**
+- Cut silences during preprocessing

-`gentle_web_align.py` uses [Gentle](https://github.com/lowerquality/gentle), a kaldi based speech-text alignment tool. This accesses web-served Gentle application, aligns given sound segments with transcripts and converts the result to HTK-style label files, to be processed in `preprocess.py`. Gentle can be run in Linux/Mac/Windows(via Docker).
+`gentle_web_align.py` uses [Gentle](https://github.com/lowerquality/gentle), a kaldi based speech-text alignment tool. This accesses web-served Gentle application, aligns given sound segments with transcripts and converts the result to HTK-style label files, to be processed in `preprocess.py`. Gentle can be run in Linux/Mac/Windows(via Docker).

 Preliminary results show that while HTK/festival/merlin-based method in `vctk_preprocess/prepare_vctk_labels.py` works better on VCTK, Gentle is more stable with audio clips with ambient noise. (e.g. movie excerpts)
```
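The Gentle-to-HTK conversion mentioned above can be sketched roughly as follows. This is a simplified, hypothetical converter, not the code in `gentle_web_align.py`; the field names mirror Gentle's JSON word entries, and HTK label times are expressed in 100 ns units:

```python
def gentle_to_htk(words):
    """Convert Gentle-style word alignments (times in seconds) to
    HTK-style label lines (start end label, times in 100 ns units).
    Simplified sketch of the conversion idea only."""
    lines = []
    for w in words:
        start = int(round(w["start"] * 1e7))
        end = int(round(w["end"] * 1e7))
        lines.append("{} {} {}".format(start, end, w["alignedWord"]))
    return "\n".join(lines)
```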
```diff
@@ -182,7 +183,7 @@ python train.py --data-root=${data-root} --preset=<json> --hparams="parameters you want to override"
 Suppose you build a DeepVoice3-style model using LJSpeech dataset, then you can train your model by:
 Model checkpoints (.pth) and alignments (.png) are saved in `./checkpoints` directory per 10000 steps by default.
```
```diff
@@ -290,9 +291,9 @@ From my experience, it can get reasonable speech quality very quickly rather than
 There are two important options used above:

 - `--restore-parts=<N>`: It specifies where to load model parameters. The differences from the option `--checkpoint=<N>` are 1) `--restore-parts=<N>` ignores all invalid parameters, while `--checkpoint=<N>` doesn't. 2) `--restore-parts=<N>` tell trainer to start from 0-step, while `--checkpoint=<N>` tell trainer to continue from last step. `--checkpoint=<N>` should be ok if you are using exactly same model and continue to train, but it would be useful if you want to customize your model architecture and take advantages of pre-trained model.
-- `--speaker-id=<N>`: It specifies what speaker of data is used for training. This should only be specified if you are using multi-speaker dataset. As for VCTK, speaker id is automatically assigned incrementally (0, 1, ..., 107) according to the `speaker_info.txt` in the dataset.
+- `--speaker-id=<N>`: It specifies what speaker of data is used for training. This should only be specified if you are using multi-speaker dataset. As for VCTK, speaker id is automatically assigned incrementally (0, 1, ..., 107) according to the `speaker_info.txt` in the dataset.

-If you are training multi-speaker model, speaker adaptation will only work **when `n_speakers` is identical**.
+If you are training multi-speaker model, speaker adaptation will only work **when `n_speakers` is identical**.
```
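The idea behind `--restore-parts=<N>` (skip checkpoint parameters that no longer fit the current architecture) can be sketched with plain dicts standing in for state dicts. This is an illustration of the technique, not the trainer's actual code; shapes are modeled as tuples for simplicity:

```python
def restore_compatible(checkpoint, model):
    """Return a new state dict: start from the current model's parameters
    and copy over checkpoint entries whose name exists in the model and
    whose shape matches. Everything else (the "invalid parameters")
    is silently skipped, so a customized architecture can still reuse
    the pre-trained weights that fit."""
    restored = dict(model)
    for name, (shape, value) in checkpoint.items():
        if name in model and model[name][0] == shape:
            restored[name] = (shape, value)
    return restored
```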
hparams.py (7 additions, 4 deletions)

```diff
@@ -99,6 +99,7 @@
 adam_beta1=0.5,
 adam_beta2=0.9,
 adam_eps=1e-6,
+amsgrad=False,
 initial_learning_rate=5e-4,  # 0.001,
 lr_schedule="noam_learning_rate_decay",
 lr_schedule_kwargs={},
```
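The new `amsgrad=False` flag presumably toggles the AMSGrad variant of Adam (in PyTorch, `torch.optim.Adam(..., amsgrad=True)`). A scalar sketch of the difference, using the hyperparameters above; this is a simplified illustration of the update rule, not the optimizer's implementation:

```python
def adam_step(theta, grad, m, v, vmax, t,
              lr=5e-4, beta1=0.5, beta2=0.9, eps=1e-6, amsgrad=False):
    """One scalar Adam update. With amsgrad=True, the running maximum of
    the (bias-corrected) second-moment estimate is used in the
    denominator, so the effective step size can only shrink over time."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)  # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)  # bias-corrected second moment
    if amsgrad:
        vmax = max(vmax, v_hat)
        denom = vmax ** 0.5 + eps
    else:
        denom = v_hat ** 0.5 + eps
    return theta - lr * m_hat / denom, m, v, vmax
```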
```diff
@@ -125,14 +126,16 @@
 # Forced garbage collection probability
 # Use only when MemoryError continues in Windows (Disabled by default)
 #gc_probability = 0.001,
-
+
 # json_meta mode only
 # 0: "use all",
 # 1: "ignore only unmatched_alignment",
 # 2: "fully ignore recognition",
-ignore_recognition_level=2,
-min_text=20, # when dealing with non-dedicated speech dataset(e.g. movie excerpts), setting min_text above 15 is desirable. Can be adjusted by dataset.
-process_only_htk_aligned=False, # if true, data without phoneme alignment file(.lab) will be ignored
+ignore_recognition_level=2,
+# when dealing with non-dedicated speech dataset(e.g. movie excerpts), setting min_text above 15 is desirable. Can be adjusted by dataset.
+min_text=20,
+# if true, data without phoneme alignment file(.lab) will be ignored
+process_only_htk_aligned=False,
```
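Together, `ignore_recognition_level`, `min_text`, and `process_only_htk_aligned` act as dataset filters in json_meta mode. A hypothetical sketch of how such a filter could combine them, with made-up status labels; this is not the repository's actual preprocessing code, and the level-to-status mapping is an assumption based on the comments above:

```python
def keep_example(text, status, has_lab_file,
                 ignore_recognition_level=2, min_text=20,
                 process_only_htk_aligned=False):
    """Decide whether a json_meta example enters the training set.
    `status` is a hypothetical label: "confirmed", "unmatched_alignment",
    or "unrecognized"."""
    dropped = {
        0: set(),                                    # "use all"
        1: {"unmatched_alignment"},                  # ignore only unmatched_alignment
        2: {"unmatched_alignment", "unrecognized"},  # fully ignore recognition
    }[ignore_recognition_level]
    if status in dropped:
        return False
    if len(text) < min_text:  # very short texts are unreliable targets
        return False
    if process_only_htk_aligned and not has_lab_file:
        return False
    return True
```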