A Proposition for Voice Generation and Conversion in ComfyUI #2737

hydrusbeta · 2024-02-07T03:27:47Z

Hello,

I am the creator and maintainer of Hay Say, an interface for AI-powered Text-To-Speech and Voice Conversion:
Git: https://github.com/hydrusbeta/hay_say_ui
Live Server: https://haysay.ai/

A couple of months ago, I began working on a complete rewrite of Hay Say from scratch, with the goal of creating a REST API and a node editor interface so that users can create their own pipelines and mix and match components. However, a few days ago, I discovered ComfyUI and I was taken aback by the similarities between it and my vision for Hay Say 2.0. This has got me wondering whether I should just attempt to extend ComfyUI to work with voice AI models. I have a few questions/points of discussion:

1. Has adding voice capabilities for ComfyUI ever been discussed? I didn't spot anything in the discussions or issues tabs on this GitHub repo.
2. I've only started looking over the code for ComfyUI, so I have limited knowledge of its inner workings so far. If anyone more knowledgeable has input as to the feasibility of such a project, I'm all ears. In the meantime, I'll keep familiarizing myself with the codebase.
3. Lastly, if any maintainers of ComfyUI see this discussion, would you be receptive to me adding voice AI to ComfyUI itself, or would you prefer to keep it a separate project?

I haven't committed 100% to using ComfyUI yet. Any responses to this discussion topic (especially topic #3) will likely influence my decision as to whether I embrace ComfyUI or continue doing my own thing.

lengmiao · 2024-02-07T18:09:22Z

Good Idea. Simpler and more involved, perhaps with a simple audio type support

0 replies

Humanoidme · 2024-02-07T18:17:09Z

That would be interesting indeed. Does Hay Say allow the use of 3rd party TTS and VC services like ElevenLabs?


3 replies

hydrusbeta Feb 8, 2024
Author

Hay Say currently leverages open source TTS and VC solutions that are possible to install and run locally. I thought about adding an interface for one particular 3rd party service (15.ai) and accessing it through web service calls, but that particular service has been down for months so there's been no opportunity to explore that possibility. I haven't bothered to reach out to the developer of 15.ai yet anyways to see whether they'd be OK with it.

Technologically, it is certainly possible to integrate with 3rd party services over the web. It does introduce a couple of security considerations, especially in a public, non-local environment like https://haysay.ai. User credentials (like ElevenLabs' xi-api-key) must never be logged, and you'd need to worry about people making too many web service calls (either to DoS the 3rd party service using the Hay Say server as a proxy, or to get Hay Say blacklisted by the 3rd party service as a sort of DoS on Hay Say itself). Storing user credentials for 3rd party websites is too hazardous, because it would make haysay.ai a target for hackers, so I would probably not store them at all and instead require the user to enter credentials every time they open the site, which is a bit inconvenient. Storing credentials is safer on a local installation, but then you have to be absolutely certain not to mix up local and server configurations.
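
A rough Python sketch of two of the safeguards described above: scrubbing credentials before they reach the logs, and throttling outbound calls. The names `redact`, `RedactingFilter`, and `TokenBucket` are hypothetical and not part of Hay Say. The pattern only covers plain `key: value` and `key=value` text, which is an assumption about how such a header might show up in a log line.

```python
import logging
import re
import time

# Assumption: only the header named in the discussion is scrubbed here.
SENSITIVE_KEYS = ("xi-api-key",)
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in SENSITIVE_KEYS) + r")(\s*[:=]\s*)(\S+)",
    re.IGNORECASE,
)


def redact(text: str) -> str:
    """Replace the value that follows a sensitive key with a placeholder."""
    return _PATTERN.sub(r"\1\2[REDACTED]", text)


class RedactingFilter(logging.Filter):
    """Logging filter that redacts the fully formatted message."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Format first so secrets passed as %-style args are caught too.
        record.msg = redact(record.getMessage())
        record.args = ()
        return True


class TokenBucket:
    """Allow bursts of up to `capacity` calls, refilled at a steady rate."""

    def __init__(self, capacity: int, refill_per_second: float, now=time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self._now = now  # injectable clock, which makes the bucket testable
        self._tokens = float(capacity)
        self._last = now()

    def allow(self) -> bool:
        """Return True and spend one token if a call is permitted right now."""
        current = self._now()
        elapsed = current - self._last
        self._last = current
        self._tokens = min(self.capacity, self._tokens + elapsed * self.refill_per_second)
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False
```

On a public server, one bucket per user or per IP address (for example, kept in a dict) would be checked before each call is proxied to the 3rd party service, and the filter would be attached to every handler with `handler.addFilter(RedactingFilter())`.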

hydrusbeta Feb 8, 2024
Author

Hmmm. This gets me thinking. Is it possible to set up ComfyUI as a public server? If so, it would face the same sorts of security considerations.

comfyanonymous Feb 8, 2024
Maintainer

There should be no way to remotely exploit ComfyUI itself, but I don't recommend running it as a public server because anyone who has access can do things like queue very large workflows or cancel currently running ones.

comfyanonymous · 2024-02-08T06:16:49Z

Maintainer

I do want to add audio capabilities to ComfyUI eventually. Whether it should be added to the base or not depends on how much code can be shared between the audio and image models.

2 replies

hydrusbeta Feb 10, 2024
Author

Makes sense to me. I've decided to take a deep dive into the ComfyUI codebase to understand how it all works and I'll see if I can come up with something elegant that leverages the existing code well. I'll report my findings here and hopefully get back to you with a more detailed proposal if it looks promising!

tavyscrolls May 31, 2024

I've been working on a custom node and have run into the problem that there are just too many packages that represent audio data, and almost none of the audio nodes tend to work together. My current approach is to have a utility node that manipulates data types/structures so they can be passed through an any-type node into whatever custom node needs them (along with feature extraction for metadata/required inputs and a spectrogram option). So much could be simplified by setting a standard for audio data to be passed within Comfy. Otherwise, you're writing and loading files to disk before and after every extension change. My end goal is to be able to take an audio file, isolate speakers, extract text and voice, perform voice conversion, and blend the result with ambient and music tracks.
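
To illustrate what a shared in-memory audio standard might look like, here is a minimal Python sketch using only the standard library. The `AudioClip` type and the helpers `from_pcm16`, `to_pcm16`, and `mix` are hypothetical names, not an existing ComfyUI API; the choice of mono float samples in [-1.0, 1.0] plus a sample rate is an assumption made for the example.

```python
import struct
from dataclasses import dataclass
from itertools import zip_longest
from typing import Tuple


@dataclass(frozen=True)
class AudioClip:
    """Hypothetical container that nodes could pass to each other in memory."""
    samples: Tuple[float, ...]  # mono samples, nominally in [-1.0, 1.0]
    sample_rate: int            # samples per second (Hz)

    @property
    def duration_seconds(self) -> float:
        return len(self.samples) / self.sample_rate


def from_pcm16(raw: bytes, sample_rate: int) -> AudioClip:
    """Decode little-endian signed 16-bit mono PCM into an AudioClip."""
    count = len(raw) // 2
    ints = struct.unpack(f"<{count}h", raw[: count * 2])
    return AudioClip(tuple(i / 32768.0 for i in ints), sample_rate)


def to_pcm16(clip: AudioClip) -> bytes:
    """Encode an AudioClip as little-endian signed 16-bit mono PCM."""
    ints = [max(-32768, min(32767, round(s * 32768.0))) for s in clip.samples]
    return struct.pack(f"<{len(ints)}h", *ints)


def mix(a: AudioClip, b: AudioClip) -> AudioClip:
    """Sum two clips sample by sample, padding the shorter one with silence."""
    if a.sample_rate != b.sample_rate:
        raise ValueError("resample first: sample rates differ")
    summed = (x + y for x, y in zip_longest(a.samples, b.samples, fillvalue=0.0))
    # Clamp so the mixed signal stays inside the nominal range.
    return AudioClip(tuple(max(-1.0, min(1.0, s)) for s in summed), a.sample_rate)
```

With a convention like this, a voice conversion node and a music blending node written by different authors could exchange `AudioClip` values directly, and only the first and last nodes in a workflow would need to touch the disk.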
