@lucidrains
I've seen e2-tts-pytorch; it is NAR text-to-speech and requires a separate duration predictor model. I want to use Transfusion to implement TTS instead. Is that feasible?
My implementation so far:
import torch.nn as nn
from vocos import Vocos
from transfusion_pytorch import Transfusion

vocos = Vocos.from_pretrained("charactr/vocos-mel-24khz")  # or whichever vocoder is used

mel_channels = 100   # number of mel bins produced by the vocos feature extractor
hidden_dim = 256

model = Transfusion(
    encoder = vocos.feature_extractor,   # waveform -> mel features
    decoder = vocos.decode,              # mel features -> waveform
    pre_post_transformer_enc_dec = (
        nn.Conv1d(mel_channels, hidden_dim, 3, 2, 1),                               # downsample mels into latents
        nn.ConvTranspose1d(hidden_dim, mel_channels, 3, 2, 1, output_padding = 1),  # upsample latents back to mels
    ),
    transformer = dict(
        dim = hidden_dim,
        depth = 8,
        dim_head = 64,
        heads = 16,
    ),
)
My training plan is:
1. First, train speech generation using only speech data.
2. Then, train on [text, speech] multimodal data.
The current problem is that the loss in the first step does not go down. Is there something missing? A minimal sketch of what I mean by the two stages is below.
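To be concrete, this is only an assumption about the interface: the interleaved-list input format is what I understand from the transfusion-pytorch README (long tensors treated as text tokens, float tensors as modality tokens), speech_dataset and paired_dataset are hypothetical placeholders for my data loaders, and the exact forward signature and tensor shapes should be checked against the repo examples rather than taken from this sketch.

from torch.optim import Adam

# model is the Transfusion instance constructed above;
# speech_dataset / paired_dataset are hypothetical iterables over my data
optimizer = Adam(model.parameters(), lr = 3e-4)

# step 1: speech-only pretraining, each sample is a single speech modality
for mel in speech_dataset:              # mel: float tensor, shape depends on the encoder config
    loss = model([[mel]])               # assumed interleaved-list input format from the README
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# step 2: multimodal training on interleaved [text, speech] sequences
for text_ids, mel in paired_dataset:    # text_ids: long tensor of token ids
    loss = model([[text_ids, mel]])
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()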