How is GPT being used? #264
-
Hey, thanks for the comment. At some point I restructured the document and folded what used to be appendix III into appendix II without updating the text; I've fixed that now. I think the statement you are referring to is how I used the AR activations to improve the performance of the diffusion model. This is described in more detail here. Note that in this document, I use "GPT" to refer to the model architecture, not the model developed by OpenAI. No pre-trained text encoders were used with Tortoise. I experimented with this briefly but found it to be detrimental, which makes sense: TTS systems care more about the phonetic interpretation of each text character than about the deeper meaning behind the words in a sentence. It would probably be beneficial to condition a model like Tortoise on both the character-level embeddings and the output of, for example, T5. There has been some success with this idea in recent txt2im research, where it has been shown to help these models spell better.
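
To make the conditioning scheme described above concrete, here is a minimal PyTorch sketch of the idea: the per-token hidden activations of a GPT-style AR model, rather than the output of a pre-trained text encoder, are fed to the diffusion denoiser via cross-attention. This is an illustrative sketch only, not the Tortoise implementation; all class names (`TinyARModel`, `ARConditionedDenoiser`), dimensions, and the wiring are hypothetical, causal masking and the diffusion timestep embedding are omitted for brevity.

```python
import torch
import torch.nn as nn


class TinyARModel(nn.Module):
    """Stand-in for a GPT-style autoregressive token model."""

    def __init__(self, vocab_size: int = 256, hidden_dim: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=4, batch_first=True
        )
        self.layers = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # Return per-token hidden activations, not token logits.
        return self.layers(self.embed(tokens))


class ARConditionedDenoiser(nn.Module):
    """Toy diffusion denoiser that cross-attends over AR activations."""

    def __init__(self, mel_dim: int = 80, hidden_dim: int = 64):
        super().__init__()
        self.in_proj = nn.Linear(mel_dim, hidden_dim)
        self.cross_attn = nn.MultiheadAttention(
            hidden_dim, num_heads=4, batch_first=True
        )
        self.out_proj = nn.Linear(hidden_dim, mel_dim)

    def forward(self, noisy_mel: torch.Tensor,
                ar_activations: torch.Tensor) -> torch.Tensor:
        h = self.in_proj(noisy_mel)
        # Each mel frame attends over the AR model's hidden states,
        # which serve as the conditioning signal.
        h, _ = self.cross_attn(query=h, key=ar_activations,
                               value=ar_activations)
        return self.out_proj(h)  # predicted noise / denoised mel


tokens = torch.randint(0, 256, (1, 32))   # dummy text tokens
noisy = torch.randn(1, 200, 80)           # dummy noisy mel frames
ar = TinyARModel()
denoiser = ARConditionedDenoiser()
pred = denoiser(noisy, ar(tokens))        # shape: (1, 200, 80)
```

The cross-attention wiring here is only one plausible way to consume the activations; the point is that the conditioning signal comes from the AR model's hidden states rather than from its output tokens or from a separate text encoder.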
-
Thank you for the work you have done and for sharing it with everyone! I have read through your paper draft; it mentions that you used GPT and that this was the biggest contribution, and that you would elaborate on it in appendix III, but there is no such appendix. Could you please briefly explain how GPT was used? Did you use GPT embeddings as your text encoder outputs? Which GPT model did you use exactly?