
Commit e637ea2

Merge pull request #82 from r9y9/pytorch0.4: Support PyTorch >= v0.4

2 parents: a0f65c6 + 2b58330

File tree

14 files changed: +103 -120 lines

Note: several hunks below are whitespace-only; where a removed (-) line and an added (+) line show identical text, only trailing whitespace differs.

README.md

Lines changed: 10 additions & 9 deletions

@@ -24,7 +24,7 @@ A notebook supposed to be executed on https://colab.research.google.com is avail
 - Multi-speaker and single speaker versions of DeepVoice3
 - Audio samples and pre-trained models
 - Preprocessor for [LJSpeech (en)](https://keithito.com/LJ-Speech-Dataset/), [JSUT (jp)](https://sites.google.com/site/shinnosuketakamichi/publication/jsut) and [VCTK](http://homepages.inf.ed.ac.uk/jyamagis/page3/page58/page58.html) datasets, as well as [carpedm20/multi-speaker-tacotron-tensorflow](https://github.com/carpedm20/multi-Speaker-tacotron-tensorflow) compatible custom dataset (in JSON format)
-- Language-dependent frontend text processor for English and Japanese
+- Language-dependent frontend text processor for English and Japanese
 
 ### Samples
 
@@ -61,6 +61,7 @@ See "Synthesize from a checkpoint" section in the README for how to generate spe
 
 - Python 3
 - CUDA >= 8.0
+- PyTorch >= v0.4.0
 - TensorFlow >= v1.3
 - [nnmnkwii](https://github.com/r9y9/nnmnkwii) >= v0.0.11
 - [MeCab](http://taku910.github.io/mecab/) (Japanese only)
@@ -104,7 +105,7 @@ python train.py --preset=presets/deepvoice3_ljspeech.json --data-root=./data/ljs
 - LJSpeech (en): https://keithito.com/LJ-Speech-Dataset/
 - VCTK (en): http://homepages.inf.ed.ac.uk/jyamagis/page3/page58/page58.html
 - JSUT (jp): https://sites.google.com/site/shinnosuketakamichi/publication/jsut
-- NIKL (ko) (**Need korean cellphone number to access it**): http://www.korean.go.kr/front/board/boardStandardView.do?board_id=4&mn_id=17&b_seq=464
+- NIKL (ko) (**Need korean cellphone number to access it**): http://www.korean.go.kr/front/board/boardStandardView.do?board_id=4&mn_id=17&b_seq=464
 
 ### 1. Preprocessing
 
@@ -147,14 +148,14 @@ python preprocess.py json_meta "./datasets/datasetA/alignment.json,./datasets/da
 
 #### 1-2. Preprocessing custom english datasets with long silence. (Based on [vctk_preprocess](vctk_preprocess/))
 
-Some dataset, especially automatically generated dataset may include long silence and undesirable leading/trailing noises, undermining the char-level seq2seq model.
+Some dataset, especially automatically generated dataset may include long silence and undesirable leading/trailing noises, undermining the char-level seq2seq model.
 (e.g. VCTK, although this is covered in vctk_preprocess)
 
 To deal with the problem, `gentle_web_align.py` will
-- **Prepare phoneme alignments for all utterances**
-- Cut silences during preprocessing
+- **Prepare phoneme alignments for all utterances**
+- Cut silences during preprocessing
 
-`gentle_web_align.py` uses [Gentle](https://github.com/lowerquality/gentle), a kaldi based speech-text alignment tool. This accesses web-served Gentle application, aligns given sound segments with transcripts and converts the result to HTK-style label files, to be processed in `preprocess.py`. Gentle can be run in Linux/Mac/Windows(via Docker).
+`gentle_web_align.py` uses [Gentle](https://github.com/lowerquality/gentle), a kaldi based speech-text alignment tool. This accesses web-served Gentle application, aligns given sound segments with transcripts and converts the result to HTK-style label files, to be processed in `preprocess.py`. Gentle can be run in Linux/Mac/Windows(via Docker).
 
 Preliminary results show that while HTK/festival/merlin-based method in `vctk_preprocess/prepare_vctk_labels.py` works better on VCTK, Gentle is more stable with audio clips with ambient noise. (e.g. movie excerpts)
 
@@ -182,7 +183,7 @@ python train.py --data-root=${data-root} --preset=<json> --hparams="parameters y
 Suppose you build a DeepVoice3-style model using LJSpeech dataset, then you can train your model by:
 
 ```
-python train.py --preset=presets/deepvoice3_ljspeech.json --data-root=./data/ljspeech/
+python train.py --preset=presets/deepvoice3_ljspeech.json --data-root=./data/ljspeech/
 ```
 
 Model checkpoints (.pth) and alignments (.png) are saved in `./checkpoints` directory per 10000 steps by default.
@@ -290,9 +291,9 @@ From my experience, it can get reasonable speech quality very quickly rather tha
 There are two important options used above:
 
 - `--restore-parts=<N>`: It specifies where to load model parameters. The differences from the option `--checkpoint=<N>` are 1) `--restore-parts=<N>` ignores all invalid parameters, while `--checkpoint=<N>` doesn't. 2) `--restore-parts=<N>` tell trainer to start from 0-step, while `--checkpoint=<N>` tell trainer to continue from last step. `--checkpoint=<N>` should be ok if you are using exactly same model and continue to train, but it would be useful if you want to customize your model architecture and take advantages of pre-trained model.
-- `--speaker-id=<N>`: It specifies what speaker of data is used for training. This should only be specified if you are using multi-speaker dataset. As for VCTK, speaker id is automatically assigned incrementally (0, 1, ..., 107) according to the `speaker_info.txt` in the dataset.
+- `--speaker-id=<N>`: It specifies what speaker of data is used for training. This should only be specified if you are using multi-speaker dataset. As for VCTK, speaker id is automatically assigned incrementally (0, 1, ..., 107) according to the `speaker_info.txt` in the dataset.
 
-If you are training multi-speaker model, speaker adaptation will only work **when `n_speakers` is identical**.
+If you are training multi-speaker model, speaker adaptation will only work **when `n_speakers` is identical**.
 
 ## Acknowledgements
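The substantive README change is the explicit PyTorch >= v0.4.0 requirement. A minimal pre-flight check, assuming you want to verify the installed version before running train.py (this snippet is hypothetical, not part of the repo):

```python
import torch

# Hypothetical sanity check: confirm the installed PyTorch satisfies the
# new ">= v0.4.0" requirement before training.
major, minor = (int(v) for v in torch.__version__.split(".")[:2])
assert (major, minor) >= (0, 4), "PyTorch >= 0.4 required, found " + torch.__version__
```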

audio.py

Lines changed: 1 addition & 1 deletion

@@ -14,7 +14,7 @@ def load_wav(path):
 
 
 def save_wav(wav, path):
-    wav *= 32767 / max(0.01, np.max(np.abs(wav)))
+    wav = wav * 32767 / max(0.01, np.max(np.abs(wav)))
     wavfile.write(path, hparams.sample_rate, wav.astype(np.int16))
 
 
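The save_wav fix swaps in-place scaling for out-of-place scaling, so the function no longer mutates the caller's array and also works when the input is integer-typed. A small numpy illustration with hypothetical data:

```python
import numpy as np

# Hypothetical int16 waveform, the kind scipy.io.wavfile.read returns.
wav = np.array([1000, -2000, 3000], dtype=np.int16)
gain = 32767 / max(0.01, np.max(np.abs(wav)))

# wav *= gain   # in-place: raises a casting error, since the float result
#               # cannot be written back into the int16 array
wav = wav * gain  # out-of-place: new float64 array, caller's array untouched
print(wav.astype(np.int16))  # [ 10922 -21844  32767]
```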

deepvoice3_pytorch/conv.py

Lines changed: 1 addition & 1 deletion

@@ -40,7 +40,7 @@ def incremental_forward(self, input):
                 self.input_buffer[:, :-1, :] = self.input_buffer[:, 1:, :].clone()
             # append next input
             self.input_buffer[:, -1, :] = input[:, -1, :]
-            input = self.input_buffer.clone()
+            input = self.input_buffer
             if dilation > 1:
                 input = input[:, 0::dilation, :].contiguous()
         output = F.linear(input.view(bsz, -1), weight, self.bias)
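Dropping the trailing .clone() avoids copying the whole convolution buffer on every decoding step; aliasing the buffer is safe here because the following F.linear only reads it. A minimal sketch of the same shift-and-append pattern, with hypothetical shapes:

```python
import torch
import torch.nn.functional as F

B, kw, C = 2, 3, 4                  # hypothetical batch, kernel width, channels
input_buffer = torch.zeros(B, kw, C)
weight = torch.randn(5, kw * C)     # linearized conv weight, 5 output units

for t in range(4):
    frame = torch.full((B, 1, C), float(t))
    input_buffer[:, :-1, :] = input_buffer[:, 1:, :].clone()  # shift left
    input_buffer[:, -1, :] = frame[:, -1, :]                  # append newest
    x = input_buffer                # alias, no copy: F.linear only reads it
    out = F.linear(x.view(B, -1), weight)
```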

deepvoice3_pytorch/deepvoice3.py

Lines changed: 4 additions & 6 deletions

@@ -3,7 +3,6 @@
 import torch
 from torch import nn
 from torch.nn import functional as F
-from torch.autograd import Variable
 import math
 import numpy as np
 
@@ -207,9 +206,9 @@ def __init__(self, embed_dim, n_speakers, speaker_embed_dim,
 
         # Position encodings for query (decoder states) and keys (encoder states)
         self.embed_query_positions = SinusoidalEncoding(
-            max_positions, convolutions[0][0], padding_idx)
+            max_positions, convolutions[0][0])
         self.embed_keys_positions = SinusoidalEncoding(
-            max_positions, embed_dim, padding_idx)
+            max_positions, embed_dim)
         # Used for compute multiplier for positional encodings
         if n_speakers > 1:
             self.speaker_proj1 = Linear(speaker_embed_dim, 1, dropout=dropout)
@@ -393,12 +392,11 @@ def incremental_forward(self, encoder_out, text_positions, speaker_embed=None,
         num_attention_layers = sum([layer is not None for layer in self.attention])
         t = 0
         if initial_input is None:
-            initial_input = Variable(
-                keys.data.new(B, 1, self.in_dim * self.r).zero_())
+            initial_input = keys.data.new(B, 1, self.in_dim * self.r).zero_()
         current_input = initial_input
         while True:
             # frame pos start with 1.
-            frame_pos = Variable(keys.data.new(B, 1).fill_(t + 1)).long()
+            frame_pos = keys.data.new(B, 1).fill_(t + 1).long()
             w = self.query_position_rate
             if self.speaker_proj2 is not None:
                 w = w * F.sigmoid(self.speaker_proj2(speaker_embed)).view(-1)
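These hunks track the PyTorch 0.4 merge of Variable into Tensor: wrapping a tensor in Variable became a no-op, so the wrappers are simply dropped (nyanko.py below gets the identical treatment). A before/after sketch with a stand-in keys tensor and hypothetical sizes:

```python
import torch

B, in_dim, r = 2, 80, 4
keys = torch.zeros(B, 6, 128)   # stand-in for encoder output (B, T, D)

# PyTorch < 0.4 (removed by this commit):
#   initial_input = Variable(keys.data.new(B, 1, in_dim * r).zero_())
# PyTorch >= 0.4: tensors carry autograd state themselves.
initial_input = keys.data.new(B, 1, in_dim * r).zero_()
frame_pos = keys.data.new(B, 1).fill_(1).long()  # frame positions are 1-based
```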

deepvoice3_pytorch/modules.py

Lines changed: 7 additions & 7 deletions

@@ -32,24 +32,24 @@ def sinusoidal_encode(x, w):
 
 
 class SinusoidalEncoding(nn.Embedding):
-    def __init__(self, num_embeddings, embedding_dim, padding_idx=0,
+
+    def __init__(self, num_embeddings, embedding_dim,
                  *args, **kwargs):
         super(SinusoidalEncoding, self).__init__(num_embeddings, embedding_dim,
-                                                 padding_idx, *args, **kwargs)
+                                                 padding_idx=0,
+                                                 *args, **kwargs)
         self.weight.data = position_encoding_init(num_embeddings, embedding_dim,
                                                   position_rate=1.0,
                                                   sinusoidal=False)
 
     def forward(self, x, w=1.0):
         isscaler = np.isscalar(w)
-        padding_idx = self.padding_idx
-        if padding_idx is None:
-            padding_idx = -1
+        assert self.padding_idx is not None
 
         if isscaler or w.size(0) == 1:
             weight = sinusoidal_encode(self.weight, w)
             return F.embedding(
-                x, weight, padding_idx, self.max_norm,
+                x, weight, self.padding_idx, self.max_norm,
                 self.norm_type, self.scale_grad_by_freq, self.sparse)
         else:
             # TODO: cannot simply apply for batch
@@ -58,7 +58,7 @@ def forward(self, x, w=1.0):
             for batch_idx, we in enumerate(w):
                 weight = sinusoidal_encode(self.weight, we)
                 pe.append(F.embedding(
-                    x[batch_idx], weight, padding_idx, self.max_norm,
+                    x[batch_idx], weight, self.padding_idx, self.max_norm,
                     self.norm_type, self.scale_grad_by_freq, self.sparse))
             pe = torch.stack(pe)
             return pe
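SinusoidalEncoding now hard-codes padding_idx=0 in the super().__init__ call and asserts it is set, reserving embedding index 0 for padding so positions stay 1-based. A standalone demonstration of that contract with a plain nn.Embedding and hypothetical sizes:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

table = nn.Embedding(10, 4, padding_idx=0)  # index 0 reserved for padding
assert table.padding_idx is not None        # the invariant the commit asserts

positions = torch.tensor([[1, 2, 3, 0]])    # 1-based positions, 0 = pad slot
out = F.embedding(positions, table.weight, table.padding_idx,
                  table.max_norm, table.norm_type,
                  table.scale_grad_by_freq, table.sparse)
print(out[0, -1])                           # the padding row is all zeros
```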

deepvoice3_pytorch/nyanko.py

Lines changed: 2 additions & 4 deletions

@@ -3,7 +3,6 @@
 import torch
 from torch import nn
 from torch.nn import functional as F
-from torch.autograd import Variable
 import math
 import numpy as np
 
@@ -270,12 +269,11 @@ def incremental_forward(self, encoder_out, text_positions,
 
         t = 0
         if initial_input is None:
-            initial_input = Variable(
-                keys.data.new(B, 1, self.in_dim * self.r).zero_())
+            initial_input = keys.data.new(B, 1, self.in_dim * self.r).zero_()
         current_input = initial_input
         while True:
             # frame pos start with 1.
-            frame_pos = Variable(keys.data.new(B, 1).fill_(t + 1)).long()
+            frame_pos = keys.data.new(B, 1).fill_(t + 1).long()
             frame_pos_embed = self.embed_query_positions(frame_pos)
 
             if test_inputs is not None:

hparams.py

Lines changed: 7 additions & 4 deletions

@@ -99,6 +99,7 @@
     adam_beta1=0.5,
     adam_beta2=0.9,
     adam_eps=1e-6,
+    amsgrad=False,
     initial_learning_rate=5e-4,  # 0.001,
     lr_schedule="noam_learning_rate_decay",
     lr_schedule_kwargs={},
@@ -125,14 +126,16 @@
     # Forced garbage collection probability
     # Use only when MemoryError continues in Windows (Disabled by default)
     #gc_probability = 0.001,
-
+
     # json_meta mode only
     # 0: "use all",
     # 1: "ignore only unmatched_alignment",
     # 2: "fully ignore recognition",
-    ignore_recognition_level = 2,
-    min_text=20,  # when dealing with non-dedicated speech dataset(e.g. movie excerpts), setting min_text above 15 is desirable. Can be adjusted by dataset.
-    process_only_htk_aligned = False,  # if true, data without phoneme alignment file(.lab) will be ignored
+    ignore_recognition_level=2,
+    # when dealing with non-dedicated speech dataset(e.g. movie excerpts), setting min_text above 15 is desirable. Can be adjusted by dataset.
+    min_text=20,
+    # if true, data without phoneme alignment file(.lab) will be ignored
+    process_only_htk_aligned=False,
 )
 
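The new amsgrad hyperparameter presumably feeds the Adam constructor in the training code, which this diff does not show; a sketch of that wiring under that assumption, using the hyperparameter values above and a stand-in model:

```python
import torch
from torch import nn

model = nn.Linear(16, 16)                      # stand-in model
optimizer = torch.optim.Adam(model.parameters(),
                             lr=5e-4,           # initial_learning_rate
                             betas=(0.5, 0.9),  # adam_beta1, adam_beta2
                             eps=1e-6,          # adam_eps
                             amsgrad=False)     # the new hparam
```

The amsgrad option itself was introduced in PyTorch 0.4, which is likely why it appears in this commit.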

setup.py

Lines changed: 1 addition & 1 deletion

@@ -79,7 +79,7 @@ def create_readme_rst():
     install_requires=[
         "numpy",
         "scipy",
-        "torch >= 0.3.0",
+        "torch >= 0.4.0",
         "unidecode",
         "inflect",
         "librosa",

synthesis.py

Lines changed: 8 additions & 13 deletions

@@ -25,7 +25,6 @@
 import audio
 
 import torch
-from torch.autograd import Variable
 import numpy as np
 import nltk
 
@@ -36,6 +35,7 @@
 from tqdm import tqdm
 
 use_cuda = torch.cuda.is_available()
+device = torch.device("cuda" if use_cuda else "cpu")
 _frontend = None  # to be set later
 
 
@@ -46,25 +46,20 @@ def tts(model, text, p=0, speaker_id=None, fast=False):
         text (str) : Input text to be synthesized
         p (float) : Replace word to pronounciation if p > 0. Default is 0.
     """
-    if use_cuda:
-        model = model.cuda()
+    model = model.to(device)
     model.eval()
     if fast:
         model.make_generation_fast_()
 
     sequence = np.array(_frontend.text_to_sequence(text, p=p))
-    sequence = Variable(torch.from_numpy(sequence)).unsqueeze(0).long()
-    text_positions = torch.arange(1, sequence.size(-1) + 1).unsqueeze(0).long()
-    text_positions = Variable(text_positions)
-    speaker_ids = None if speaker_id is None else Variable(torch.LongTensor([speaker_id]))
-    if use_cuda:
-        sequence = sequence.cuda()
-        text_positions = text_positions.cuda()
-        speaker_ids = None if speaker_ids is None else speaker_ids.cuda()
+    sequence = torch.from_numpy(sequence).unsqueeze(0).long().to(device)
+    text_positions = torch.arange(1, sequence.size(-1) + 1).unsqueeze(0).long().to(device)
+    speaker_ids = None if speaker_id is None else torch.LongTensor([speaker_id]).to(device)
 
     # Greedy decoding
-    mel_outputs, linear_outputs, alignments, done = model(
-        sequence, text_positions=text_positions, speaker_ids=speaker_ids)
+    with torch.no_grad():
+        mel_outputs, linear_outputs, alignments, done = model(
+            sequence, text_positions=text_positions, speaker_ids=speaker_ids)
 
     linear_output = linear_outputs[0].cpu().data.numpy()
     spectrogram = audio._denormalize(linear_output)
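The tts() rewrite adopts the two central 0.4 idioms at once: a single torch.device drives placement via .to(device), and torch.no_grad() replaces Variable/volatile for inference so no autograd graph is built during decoding. The pattern in isolation, with a stand-in model:

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = torch.nn.Linear(8, 8).to(device)  # stand-in for the TTS model
model.eval()

x = torch.randn(1, 8, device=device)
with torch.no_grad():                      # no graph is built during decoding
    y = model(x)
print(y.cpu().numpy().shape)               # (1, 8)
```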

tests/test_conv.py

Lines changed: 1 addition & 2 deletions

@@ -3,7 +3,6 @@
 
 import torch
 from torch import nn
-from torch.autograd import Variable
 from torch.nn import functional as F
 from deepvoice3_pytorch.conv import Conv1d
 
@@ -36,7 +35,7 @@ def __test(kernel_size, dilation, T, B, C, causual=True):
     conv_online.bias.data.zero_()
 
     # (B, C, T)
-    bct = Variable(torch.zeros(B, C, T) + torch.arange(0, T))
+    bct = torch.zeros(B, C, T) + torch.arange(0, T).float()
     output_conv = conv(bct)
 
     # Remove future time stamps
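Besides dropping Variable, the test gains an explicit .float() cast: torch.arange can yield an integer dtype from integer endpoints, and PyTorch of this era did not auto-promote mixed-dtype arithmetic, so the cast keeps the addition float-on-float. In isolation:

```python
import torch

B, C, T = 2, 3, 5
# Explicit cast so both operands are float; early PyTorch raised a type
# error when adding an integer tensor to a float tensor.
bct = torch.zeros(B, C, T) + torch.arange(0, T).float()
print(bct.dtype)  # torch.float32
```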
