Feature Request: Incremental TTS Streaming with Text Chunking #26
Conversation
Primary improvement:
- Implement proper text accumulation and segmentation across chunks
- Ensure complete sentences are detected and synthesized immediately
- Maintain sentence boundary detection state during streaming synthesis

Additional refinements:
- Fix duplicate AudioStop/SynthesizeStopped events by adding early return
- Add segmenter caching per language to improve performance
- Enhance error handling documentation and logging
- Add _truncate_for_log helper for consistent log formatting

Related to #22

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
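The commit above mentions per-language segmenter caching and a `_truncate_for_log` helper. A minimal sketch of what those two refinements could look like — the exact signatures in the PR may differ, and `_get_segmenter` is an assumed name:

```python
from functools import lru_cache

import pysbd


@lru_cache(maxsize=None)
def _get_segmenter(language: str) -> pysbd.Segmenter:
    # Constructing a pysbd.Segmenter has nontrivial cost, so reuse one
    # instance per language across the whole streaming session.
    return pysbd.Segmenter(language=language, clean=False)


def _truncate_for_log(text: str, limit: int = 80) -> str:
    # Keep log lines readable when streamed text chunks get long.
    return text if len(text) <= limit else text[:limit] + "..."
```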
…dary chunking
- Add mention of TTS streaming compatibility in overview section
- Add new objective about streaming compatibility bridging Wyoming and OpenAI protocols
- Update sequence diagram to show incremental synthesis with pysbd sentence detection
- Emphasize responsive audio delivery through intelligent text chunking

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Pull Request Overview
This pull request introduces incremental TTS streaming support to the Wyoming-OpenAI proxy by implementing intelligent text chunking at sentence boundaries. This enhancement enables responsive audio delivery for Wyoming clients even when the underlying OpenAI-compatible API doesn't support streaming text input.
Key Changes:
- Added incremental TTS streaming support using pySBD for sentence boundary detection to enable real-time audio delivery (see the sketch after this list)
- Introduced new configuration options for streaming models and chunking parameters
- Updated the Wyoming/OpenAI compatibility layer to support streaming and non-streaming TTS models separately
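As a rough illustration of the chunking idea — not the PR's actual code; `stream_sentences` and its parameters are made up for this sketch, and only pysbd's `Segmenter` API is assumed:

```python
import pysbd

segmenter = pysbd.Segmenter(language="en", clean=False)


def stream_sentences(text_chunks, min_words=3):
    """Accumulate streamed text and yield sentences as they complete."""
    buffer = ""
    for chunk in text_chunks:
        buffer += chunk
        sentences = segmenter.segment(buffer)
        if len(sentences) < 2:
            continue  # no complete sentence yet; keep accumulating
        # Everything but the last segment is complete; the tail may
        # still be growing, so it stays in the buffer.
        *complete, buffer = sentences
        pending = ""
        for sentence in complete:
            pending += sentence
            # Only emit once there is enough text to be worth a
            # synthesis round trip.
            if len(pending.split()) >= min_words:
                yield pending.strip()
                pending = ""
        buffer = pending + buffer
    if buffer.strip():
        yield buffer.strip()  # flush whatever remains at end of stream
```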
Reviewed Changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| src/wyoming_openai/handler.py | Implemented core streaming TTS logic with sentence detection and parallel audio synthesis (see the sketch after this table) |
| src/wyoming_openai/compatibility.py | Updated TTS voice creation to separate streaming/non-streaming models and support fallback behavior |
| src/wyoming_openai/main.py | Added command-line arguments for streaming configuration and enhanced logging output |
| pyproject.toml | Added pysbd dependency for sentence boundary detection |
| README.md | Updated documentation to describe streaming capabilities and new configuration options |
| tests/ | Updated test files to match new function signatures with streaming model parameters |
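On the handler side, the general shape is: request synthesis for each completed sentence immediately, so audio for earlier sentences can be forwarded while later ones are still being generated. A hypothetical asyncio sketch — `synthesize` and `send_audio` are placeholders, not wyoming_openai's real API:

```python
import asyncio


async def stream_tts(sentences, synthesize, send_audio):
    queue: asyncio.Queue = asyncio.Queue()

    async def producer():
        for sentence in sentences:
            # Kick off synthesis as soon as a sentence is complete, so
            # requests overlap instead of running strictly one by one.
            queue.put_nowait(asyncio.create_task(synthesize(sentence)))
        queue.put_nowait(None)  # sentinel: no more sentences

    async def consumer():
        while (task := await queue.get()) is not None:
            audio = await task       # await in order: audio stays sequential
            await send_audio(audio)  # forward to the Wyoming client

    await asyncio.gather(producer(), consumer())
```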
Will this be pushed to master, or will it be a separate variant? :)
The image ghcr.io/roryeckel/wyoming_openai:pr-26 is available to test and use as a preview until I am ready to send it to main. I am hesitant because I have only tested it a little, and there's been some progress regarding the official realtime API at OpenAI. I've also got some other projects taking up my time at the moment, but don't worry, I will get back to it in the near future.
My main concern is the collision of "streaming" versus "realtime". I bet we can just do TTS_REALTIME_MODELS or something once realtime is supported, so maybe my concern doesn't make sense.
Just tested TTS streaming with this PR in combination with Chatterbox and the "Chatterbox-TTS-Server" (see link below). It's working, and responses are streamed to Home Assistant. Finally I can use nice voices with relatively snappy speech replies. https://github.com/devnen/Chatterbox-TTS-Server

PS: Regarding realtime, I honestly don't know. All I think is that this kind of realtime would first need some kind of support from Home Assistant. Right now, streaming is the officially supported solution; I don't know what will happen in the future. For sure, having some kind of realtime conversation with Home Assistant would be awesome :D
Thanks for testing, I'm glad you enjoyed it. Chatterbox TTS server looks like a cool project; I could provide an example compose for it in the future as well.
@roryeckel I really like this work. I was wondering, as a future improvement, whether we could somehow configure settings per model instead of just globally.
A hacky workaround right now would be to set up multiple instances with different settings. Not elegant, but it should work.
Streaming doesn't happen by default; it has to be configured via TTS_STREAMING_MODELS, similar to what you have said. The README on this branch documents these new configuration options.
@roryeckel Yes, agreed. What I meant is that unless you do what @maglat mentioned (spin up more instances of wyoming_openai), you cannot have more granular settings. Say I have one model for a given set of languages and another for a different set. It would be nice to configure the streaming settings per model. One could even go as far as per language, which is actually what makes more sense (more than per model, I suppose). Maybe for many European languages the thresholds are the same, but they may vary for, say, Asian languages. Does that make sense?
TTS_STREAMING_MODELS expects a list of model names; you would simply exclude the models you don't want streaming and leave them in TTS_MODELS.
Yes, I understand. I am not talking about streaming vs non-streaming; I am talking about streaming specifically. Let me call the first parameter X and the second Y. I am saying that for streaming model 1, I'd like to be able to define X and Y differently from streaming model 2's X and Y. Makes sense?
But yes, for things beyond streaming, multiple instances are kind of the expectation at the moment.
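To make the per-model request above concrete, something like the following mapping is what is being asked for. This is purely hypothetical; none of these names exist in wyoming_openai today:

```python
# Hypothetical per-model overrides for the two streaming knobs
# (min words, max chars); anything not listed falls back to the
# global defaults.
STREAMING_OVERRIDES = {
    "tts-1": (5, 300),
    "kokoro": (3, 200),
}
GLOBAL_DEFAULTS = (4, 250)


def streaming_settings(model: str) -> tuple[int, int]:
    return STREAMING_OVERRIDES.get(model, GLOBAL_DEFAULTS)
```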
This pull request introduces incremental TTS (text-to-speech) streaming support by intelligently chunking text at sentence boundaries, improving responsiveness for Wyoming clients even when the underlying OpenAI-compatible API does not support streaming text input. The implementation adds configuration options for streaming models and chunking parameters, updates the Wyoming/OpenAI compatibility layer, and enhances documentation and tests to reflect these new capabilities.
Incremental TTS Streaming Support
- Streaming capabilities are described in the README.md and the system sequence diagram. [1] [2] [3]
- Added new options --tts-streaming-models, --tts-streaming-min-words, and --tts-streaming-max-chars for configuring which models support streaming and how text is chunked. These are documented in the README.md and implemented in the CLI parser (see the sketch below). [1] [2] [3]
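The flag names above come from the PR summary; how they are wired into the parser is assumed here (types, defaults, and help text are illustrative):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--tts-streaming-models", nargs="*", default=[],
    help="TTS models that should use incremental streaming")
parser.add_argument(
    "--tts-streaming-min-words", type=int, default=3,
    help="Minimum words to accumulate before synthesizing a chunk")
parser.add_argument(
    "--tts-streaming-max-chars", type=int, default=300,
    help="Maximum characters per synthesized text chunk")
args = parser.parse_args()
```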
Core Implementation and API Changes

- Updated compatibility.py to distinguish between streaming and non-streaming models, and to create separate Wyoming TTS programs for each. [1] [2]
Dependency and Testing Updates

- Added the pysbd library for sentence boundary detection, required for intelligent text chunking.

Resolves #22