🚀 Feature Description
Hi @eginhard. Hope you are keeping well!
It's erew123 from AllTalk.
Someone has pointed this out to me: https://www.astramind.ai/post/auralis and I think this is the GitHub repo: https://github.com/astramind-ai/Auralis
It's a little beyond my pay grade, but maybe it's of interest to the Coqui scripts. I don't know if you have seen this, or whether the author is posting on here with you, but I thought you might like to see it.
I fired their write-up into an AI for a quick "here is what they claim" summary:
The author claims to have optimized XTTS-v2, a text-to-speech model, making it faster, more resource-efficient, asynchronous, and safer for production environments. Here are the key points and the performance gains:
What They Did
- **Understanding the Code and Challenges:**
  - Overcame a lack of prior experience in audio tech.
  - Debugged and worked around outdated dependencies and repos.
- **Tokenizer Optimization:**
  - Replaced a custom tokenizer with a Hugging Face-compatible `FastPreTrainedTokenizer` (sketch below).
  - Improved token splitting logic to maintain audio quality while handling memory-efficient truncation.
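For anyone curious what that tokenizer swap might look like in practice, here is a minimal sketch that uses a Hugging Face fast (Rust-backed) tokenizer to split long text into token-budgeted chunks. The model name, the 400-token budget, and the `chunk_text` helper are my own illustrative choices, not Auralis' actual code:

```python
# Illustrative only: chunking text with a Hugging Face fast tokenizer.
# "gpt2" and the 400-token budget are stand-ins, not Auralis' real settings.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # returns the fast tokenizer

def chunk_text(text: str, max_tokens: int = 400) -> list[str]:
    """Split text into pieces that each fit within the model's token budget."""
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    pieces = [ids[i:i + max_tokens] for i in range(0, len(ids), max_tokens)]
    return [tokenizer.decode(p) for p in pieces]
```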
- **Model Reorganization:**
  - Refactored the original architecture, which used GPT-2-like models and a HiFi-GAN vocoder, to eliminate unnecessary computations during inference.
  - Optimized the HiFi-GAN component to use in-place operations, drastically reducing memory usage (sketch below).
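To illustrate what "in-place operations" buys you at inference time: in PyTorch, activations like LeakyReLU (which HiFi-GAN uses heavily) can overwrite their input buffer instead of allocating a fresh tensor. A generic sketch of the idea, not the actual Auralis diff:

```python
# Generic PyTorch illustration of in-place activations; not the Auralis code.
import torch
import torch.nn as nn

# Out-of-place: every activation call allocates a new output tensor.
block = nn.Sequential(nn.Conv1d(64, 64, 3, padding=1), nn.LeakyReLU(0.1))

# In-place: safe for pure inference, reuses the input buffer and saves memory.
block_inplace = nn.Sequential(
    nn.Conv1d(64, 64, 3, padding=1),
    nn.LeakyReLU(0.1, inplace=True),
)

with torch.inference_mode():
    out = block_inplace(torch.randn(1, 64, 1024))
```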
- **Integration of vLLM for GPT-2:**
  - Overcame challenges in adapting vLLM for multimodal GPT-2, including token cache management and continuous batching.
  - Addressed vLLM's limitations on repetition penalties and hidden state collection, customizing its behavior for audio-specific tasks (see the sketch below).
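I can only guess at the actual vLLM customization, but the repetition-penalty point is roughly this: as I understand it, vLLM's built-in penalty considers all tokens seen so far, whereas an audio token stream may want the penalty scoped differently. Here is a hand-rolled, generic illustration of a windowed penalty applied to raw logits; whether this resembles Auralis' patch is purely my assumption:

```python
# Hand-rolled illustration of a windowed repetition penalty; whether this
# resembles Auralis' actual vLLM customization is purely my assumption.
import torch

def windowed_repetition_penalty(
    logits: torch.Tensor,     # (vocab_size,) next-token logits
    generated: torch.Tensor,  # (seq_len,) token ids produced so far
    penalty: float = 1.3,
    window: int = 64,
) -> torch.Tensor:
    """Penalize only tokens that appeared in the most recent `window` steps."""
    recent = torch.unique(generated[-window:])
    scores = logits[recent]
    # CTRL-style rule: divide positive logits, multiply negative ones.
    logits[recent] = torch.where(scores > 0, scores / penalty, scores * penalty)
    return logits
```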
- **Asynchronous Execution:**
  - Made components non-blocking using `asyncio` (sketch below).
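The asyncio point is the easiest to picture: wrap the blocking synthesis call so it runs in a worker thread and the event loop stays responsive. A minimal sketch where `synthesize()` is a hypothetical stand-in for the model's blocking inference entry point:

```python
# Minimal non-blocking wrapper; `synthesize` is a hypothetical stand-in for
# the model's blocking inference entry point, not a real Auralis function.
import asyncio

def synthesize(text: str) -> bytes:
    # Imagine a blocking GPU inference call here.
    return b"\x00" * 16  # placeholder audio bytes

async def synthesize_async(text: str) -> bytes:
    # Off-load the blocking call to a thread so the event loop keeps serving.
    return await asyncio.to_thread(synthesize, text)

async def main() -> None:
    # Requests now interleave instead of queuing behind a single blocking call.
    clips = await asyncio.gather(*(synthesize_async(t) for t in ["Hello", "world"]))
    print(len(clips), "clips rendered")

asyncio.run(main())
```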
- **Optimized Workflow:**
  - Avoided redundant token and embedding calculations during iterative decoding (see the caching sketch below).
  - Adapted position ID tracking to align with unique conditioning inputs for multimodal tasks.
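On avoiding redundant embedding work: the usual trick is to compute the per-voice conditioning latents once and cache them, so only the text changes between requests. A sketch with hypothetical names and shapes, not Auralis' real API:

```python
# Hypothetical caching sketch; encode_reference_audio() and the shapes are
# illustrative stand-ins, not Auralis' real API.
from functools import lru_cache
import torch

def encode_reference_audio(voice_id: str) -> torch.Tensor:
    # Stand-in for the expensive conditioning-encoder forward pass.
    return torch.randn(1, 512)

@lru_cache(maxsize=32)
def conditioning_latents(voice_id: str) -> torch.Tensor:
    # First call per voice pays the encoder cost; later calls hit the cache.
    return encode_reference_audio(voice_id)
```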
Performance Gains
- **Speed:**
  - Leveraging vLLM and deduplicating computations significantly reduced inference time.
- **Resource Efficiency:**
  - Memory consumption was slashed by optimizing HiFi-GAN for inference.
  - Reduced overhead by restructuring the GPT-2 and conditioning modules.
- **Production Suitability:**
  - Ensured asynchronous, non-blocking execution for smoother integration into UI frameworks like Pulsar.
  - Increased safety by moving from `.pth` to safer formats and handling positional encoding appropriately (sketch below).
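The "`.pth` to safer formats" point most likely means something like safetensors; that's my assumption, since the post doesn't name the format. The migration is nearly a one-liner if the checkpoint is a flat tensor dict:

```python
# Sketch of a .pth -> safetensors migration; that safetensors is the target
# format is my assumption, and the filenames are placeholders.
import torch
from safetensors.torch import save_file, load_file

state_dict = torch.load("model.pth", map_location="cpu")  # pickle-based, can execute code
save_file(state_dict, "model.safetensors")  # assumes a flat {name: tensor} dict
restored = load_file("model.safetensors")   # loading cannot run arbitrary code
```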
- **Accessibility:**
  - Made the enhancements available to the open-source community for broader adoption.
The overall result is a production-ready, optimized XTTS-v2 that is significantly faster and more memory-efficient, with asynchronous capabilities enabling smoother integration into applications.
Thanks, erew123