🚀 Feature Description
Hi @eginhard. Hope you are keeping well!
It's erew123 from AllTalk.
Someone has pointed this out to me: https://www.astramind.ai/post/auralis and I think this is the GitHub repo: https://github.com/astramind-ai/Auralis
It's a little beyond my pay grade, but maybe it's of interest to the Coqui scripts. I don't know if you have seen this, or whether the author is posting on here with you, but I thought you might like to see it.
I fired their write-up into an AI for a quick "here is what they claim" summary:
The author claims to have optimized XTTS-v2, a text-to-speech model, making it faster, more resource-efficient, asynchronous, and safer for production environments. Here are the key points and the performance gains:
What They Did
- **Understanding the Code and Challenges:**
  - Overcame a lack of prior experience in audio tech.
  - Debugged and worked around outdated dependencies and repos.
- **Tokenizer Optimization:**
  - Replaced a custom tokenizer with a Hugging Face-compatible `FastPreTrainedTokenizer` (sketch below).
  - Improved token splitting logic to maintain audio quality while handling memory-efficient truncation.
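For anyone curious what that tokenizer swap might look like in practice, here is a minimal sketch that uses a Hugging Face fast (Rust-backed) tokenizer to split long text into token-budgeted chunks. The model name, the 400-token budget, and the `chunk_text` helper are my own illustrative choices, not Auralis' actual code:

```python
# Illustrative only: chunking text with a Hugging Face fast tokenizer.
# "gpt2" and the 400-token budget are stand-ins, not Auralis' real settings.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # returns the fast tokenizer

def chunk_text(text: str, max_tokens: int = 400) -> list[str]:
    """Split text into pieces that each fit within the model's token budget."""
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    pieces = [ids[i:i + max_tokens] for i in range(0, len(ids), max_tokens)]
    return [tokenizer.decode(p) for p in pieces]
```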
- **Model Reorganization:**
  - Refactored the original architecture, which used GPT-2-like models and a HiFi-GAN vocoder, to eliminate unnecessary computations during inference.
  - Optimized the HiFi-GAN component to use in-place operations, drastically reducing memory usage (sketch below).
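To illustrate what "in-place operations" buys you at inference time: in PyTorch, activations like LeakyReLU (which HiFi-GAN uses heavily) can overwrite their input buffer instead of allocating a fresh tensor. A generic sketch of the idea, not the actual Auralis diff:

```python
# Generic PyTorch illustration of in-place activations; not the Auralis code.
import torch
import torch.nn as nn

# Out-of-place: every activation call allocates a new output tensor.
block = nn.Sequential(nn.Conv1d(64, 64, 3, padding=1), nn.LeakyReLU(0.1))

# In-place: safe for pure inference, reuses the input buffer and saves memory.
block_inplace = nn.Sequential(
    nn.Conv1d(64, 64, 3, padding=1),
    nn.LeakyReLU(0.1, inplace=True),
)

with torch.inference_mode():
    out = block_inplace(torch.randn(1, 64, 1024))
```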
- **Integration of vLLM for GPT-2:**
  - Overcame challenges in adapting vLLM for multimodal GPT-2, including token cache management and continuous batching.
  - Addressed vLLM's limitations on repetition penalties and hidden state collection, customizing its behavior for audio-specific tasks (see the sketch below).
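I can only guess at the actual vLLM customization, but the repetition-penalty point is roughly this: as I understand it, vLLM's built-in penalty considers all tokens seen so far, whereas an audio token stream may want the penalty scoped differently. Here is a hand-rolled, generic illustration of a windowed penalty applied to raw logits; whether this resembles Auralis' patch is purely my assumption:

```python
# Hand-rolled illustration of a windowed repetition penalty; whether this
# resembles Auralis' actual vLLM customization is purely my assumption.
import torch

def windowed_repetition_penalty(
    logits: torch.Tensor,     # (vocab_size,) next-token logits
    generated: torch.Tensor,  # (seq_len,) token ids produced so far
    penalty: float = 1.3,
    window: int = 64,
) -> torch.Tensor:
    """Penalize only tokens that appeared in the most recent `window` steps."""
    recent = torch.unique(generated[-window:])
    scores = logits[recent]
    # CTRL-style rule: divide positive logits, multiply negative ones.
    logits[recent] = torch.where(scores > 0, scores / penalty, scores * penalty)
    return logits
```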
- **Asynchronous Execution:**
  - Made components non-blocking using `asyncio` (sketch below).
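The asyncio point is the easiest to picture: wrap the blocking synthesis call so it runs in a worker thread and the event loop stays responsive. A minimal sketch where `synthesize()` is a hypothetical stand-in for the model's blocking inference entry point:

```python
# Minimal non-blocking wrapper; `synthesize` is a hypothetical stand-in for
# the model's blocking inference entry point, not a real Auralis function.
import asyncio

def synthesize(text: str) -> bytes:
    # Imagine a blocking GPU inference call here.
    return b"\x00" * 16  # placeholder audio bytes

async def synthesize_async(text: str) -> bytes:
    # Off-load the blocking call to a thread so the event loop keeps serving.
    return await asyncio.to_thread(synthesize, text)

async def main() -> None:
    # Requests now interleave instead of queuing behind a single blocking call.
    clips = await asyncio.gather(*(synthesize_async(t) for t in ["Hello", "world"]))
    print(len(clips), "clips rendered")

asyncio.run(main())
```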
- **Optimized Workflow:**
  - Avoided redundant token and embedding calculations during iterative decoding (see the caching sketch below).
  - Adapted position ID tracking to align with unique conditioning inputs for multimodal tasks.
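On avoiding redundant embedding work: the usual trick is to compute the per-voice conditioning latents once and cache them, so only the text changes between requests. A sketch with hypothetical names and shapes, not Auralis' real API:

```python
# Hypothetical caching sketch; encode_reference_audio() and the shapes are
# illustrative stand-ins, not Auralis' real API.
from functools import lru_cache
import torch

def encode_reference_audio(voice_id: str) -> torch.Tensor:
    # Stand-in for the expensive conditioning-encoder forward pass.
    return torch.randn(1, 512)

@lru_cache(maxsize=32)
def conditioning_latents(voice_id: str) -> torch.Tensor:
    # First call per voice pays the encoder cost; later calls hit the cache.
    return encode_reference_audio(voice_id)
```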
Performance Gains
- **Speed:**
  - Leveraging vLLM and deduplicating computations significantly reduced inference time.
- **Resource Efficiency:**
  - Memory consumption was slashed by optimizing HiFi-GAN for inference.
  - Reduced overhead by restructuring the GPT-2 and conditioning modules.
- **Production Suitability:**
  - Ensured asynchronous, non-blocking execution for smoother integration into UI frameworks like Pulsar.
  - Increased safety by moving from `.pth` to safer formats and handling positional encoding appropriately (sketch below).
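The "`.pth` to safer formats" point most likely means something like safetensors; that's my assumption, since the post doesn't name the format. The migration is nearly a one-liner if the checkpoint is a flat tensor dict:

```python
# Sketch of a .pth -> safetensors migration; that safetensors is the target
# format is my assumption, and the filenames are placeholders.
import torch
from safetensors.torch import save_file, load_file

state_dict = torch.load("model.pth", map_location="cpu")  # pickle-based, can execute code
save_file(state_dict, "model.safetensors")  # assumes a flat {name: tensor} dict
restored = load_file("model.safetensors")   # loading cannot run arbitrary code
```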
- **Accessibility:**
  - Made the enhancements available to the open-source community for broader adoption.
The overall result is a production-ready, optimized XTTS-v2 that is significantly faster and more memory-efficient, with asynchronous capabilities enabling smoother integration into applications.
Thanks, erew123