
Commit 2b58330

pytorch >= v0.4
1 parent b0a2e3e commit 2b58330


README.md

Lines changed: 10 additions & 9 deletions
@@ -24,7 +24,7 @@ A notebook supposed to be executed on https://colab.research.google.com is avail
 - Multi-speaker and single speaker versions of DeepVoice3
 - Audio samples and pre-trained models
 - Preprocessor for [LJSpeech (en)](https://keithito.com/LJ-Speech-Dataset/), [JSUT (jp)](https://sites.google.com/site/shinnosuketakamichi/publication/jsut) and [VCTK](http://homepages.inf.ed.ac.uk/jyamagis/page3/page58/page58.html) datasets, as well as [carpedm20/multi-speaker-tacotron-tensorflow](https://github.com/carpedm20/multi-Speaker-tacotron-tensorflow) compatible custom dataset (in JSON format)
-- Language-dependent frontend text processor for English and Japanese
+- Language-dependent frontend text processor for English and Japanese

 ### Samples

@@ -61,6 +61,7 @@ See "Synthesize from a checkpoint" section in the README for how to generate spe

 - Python 3
 - CUDA >= 8.0
+- PyTorch >= v0.4.0
 - TensorFlow >= v1.3
 - [nnmnkwii](https://github.com/r9y9/nnmnkwii) >= v0.0.11
 - [MeCab](http://taku910.github.io/mecab/) (Japanese only)
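The hunk above bumps the requirement to PyTorch >= v0.4.0. A minimal sketch of checking that the installed version satisfies the new minimum (illustrative only; the repository itself may not perform such a check):

```python
# Minimal sketch: confirm the installed PyTorch meets the >= v0.4.0 requirement
# introduced by this commit. Illustration only, not part of the repository.
from distutils.version import LooseVersion

import torch

if LooseVersion(torch.__version__) < LooseVersion("0.4.0"):
    raise RuntimeError("PyTorch >= v0.4.0 is required, found %s" % torch.__version__)
print("PyTorch %s looks fine" % torch.__version__)
```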
@@ -104,7 +105,7 @@ python train.py --preset=presets/deepvoice3_ljspeech.json --data-root=./data/ljs
 - LJSpeech (en): https://keithito.com/LJ-Speech-Dataset/
 - VCTK (en): http://homepages.inf.ed.ac.uk/jyamagis/page3/page58/page58.html
 - JSUT (jp): https://sites.google.com/site/shinnosuketakamichi/publication/jsut
-- NIKL (ko) (**Need korean cellphone number to access it**): http://www.korean.go.kr/front/board/boardStandardView.do?board_id=4&mn_id=17&b_seq=464
+- NIKL (ko) (**Need korean cellphone number to access it**): http://www.korean.go.kr/front/board/boardStandardView.do?board_id=4&mn_id=17&b_seq=464

 ### 1. Preprocessing

@@ -147,14 +148,14 @@ python preprocess.py json_meta "./datasets/datasetA/alignment.json,./datasets/da

 #### 1-2. Preprocessing custom english datasets with long silence. (Based on [vctk_preprocess](vctk_preprocess/))

-Some dataset, especially automatically generated dataset may include long silence and undesirable leading/trailing noises, undermining the char-level seq2seq model.
+Some dataset, especially automatically generated dataset may include long silence and undesirable leading/trailing noises, undermining the char-level seq2seq model.
 (e.g. VCTK, although this is covered in vctk_preprocess)

 To deal with the problem, `gentle_web_align.py` will
-- **Prepare phoneme alignments for all utterances**
-- Cut silences during preprocessing
+- **Prepare phoneme alignments for all utterances**
+- Cut silences during preprocessing

-`gentle_web_align.py` uses [Gentle](https://github.com/lowerquality/gentle), a kaldi based speech-text alignment tool. This accesses web-served Gentle application, aligns given sound segments with transcripts and converts the result to HTK-style label files, to be processed in `preprocess.py`. Gentle can be run in Linux/Mac/Windows(via Docker).
+`gentle_web_align.py` uses [Gentle](https://github.com/lowerquality/gentle), a kaldi based speech-text alignment tool. This accesses web-served Gentle application, aligns given sound segments with transcripts and converts the result to HTK-style label files, to be processed in `preprocess.py`. Gentle can be run in Linux/Mac/Windows(via Docker).

 Preliminary results show that while HTK/festival/merlin-based method in `vctk_preprocess/prepare_vctk_labels.py` works better on VCTK, Gentle is more stable with audio clips with ambient noise. (e.g. movie excerpts)

@@ -182,7 +183,7 @@ python train.py --data-root=${data-root} --preset=<json> --hparams="parameters y
 Suppose you build a DeepVoice3-style model using LJSpeech dataset, then you can train your model by:

 ```
-python train.py --preset=presets/deepvoice3_ljspeech.json --data-root=./data/ljspeech/
+python train.py --preset=presets/deepvoice3_ljspeech.json --data-root=./data/ljspeech/
 ```

 Model checkpoints (.pth) and alignments (.png) are saved in `./checkpoints` directory per 10000 steps by default.
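Since the checkpoints mentioned above are plain PyTorch `.pth` files, they can be inspected directly. A short sketch follows; the file name and the stored keys are assumptions about what `train.py` saves, shown only for illustration:

```python
# Minimal sketch: inspect a checkpoint saved under ./checkpoints by train.py.
# The file name and dictionary keys are assumptions, not guaranteed by the repo.
import torch

checkpoint = torch.load("checkpoints/checkpoint_step000010000.pth", map_location="cpu")
print(list(checkpoint.keys()))        # e.g. "state_dict", "global_step", ... (assumed layout)
print(checkpoint.get("global_step"))  # training step at which it was saved, if stored
```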
@@ -290,9 +291,9 @@ From my experience, it can get reasonable speech quality very quickly rather tha
 There are two important options used above:

 - `--restore-parts=<N>`: It specifies where to load model parameters. The differences from the option `--checkpoint=<N>` are 1) `--restore-parts=<N>` ignores all invalid parameters, while `--checkpoint=<N>` doesn't. 2) `--restore-parts=<N>` tell trainer to start from 0-step, while `--checkpoint=<N>` tell trainer to continue from last step. `--checkpoint=<N>` should be ok if you are using exactly same model and continue to train, but it would be useful if you want to customize your model architecture and take advantages of pre-trained model.
-- `--speaker-id=<N>`: It specifies what speaker of data is used for training. This should only be specified if you are using multi-speaker dataset. As for VCTK, speaker id is automatically assigned incrementally (0, 1, ..., 107) according to the `speaker_info.txt` in the dataset.
+- `--speaker-id=<N>`: It specifies what speaker of data is used for training. This should only be specified if you are using multi-speaker dataset. As for VCTK, speaker id is automatically assigned incrementally (0, 1, ..., 107) according to the `speaker_info.txt` in the dataset.

-If you are training multi-speaker model, speaker adaptation will only work **when `n_speakers` is identical**.
+If you are training multi-speaker model, speaker adaptation will only work **when `n_speakers` is identical**.

 ## Acknowledgements
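The behaviour of `--restore-parts=<N>` described in the hunk above (load whatever parameters fit, ignore the rest, restart from step 0) can be pictured with a short conceptual sketch. This is only an illustration of the idea, not the repository's actual implementation, and the `"state_dict"` key is an assumption about the checkpoint layout:

```python
# Conceptual sketch of "restore parts": copy only parameters whose names and
# shapes match the current model, silently skipping everything else.
# Illustration only; not the code behind --restore-parts in train.py.
import torch


def restore_matching_parameters(checkpoint_path, model):
    saved = torch.load(checkpoint_path, map_location="cpu")
    saved_state = saved.get("state_dict", saved)  # checkpoint layout assumption
    own_state = model.state_dict()
    for name, param in saved_state.items():
        if name in own_state and own_state[name].shape == param.shape:
            own_state[name].copy_(param)  # valid parameter: take it
        # mismatching or unknown parameters are ignored, as --restore-parts does
    model.load_state_dict(own_state)
```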
