
Conversation

@roryeckel (Owner)

This pull request introduces incremental TTS (text-to-speech) streaming support by intelligently chunking text at sentence boundaries, improving responsiveness for Wyoming clients even when the underlying OpenAI-compatible API does not support streaming text input. The implementation adds configuration options for streaming models and chunking parameters, updates the Wyoming/OpenAI compatibility layer, and enhances documentation and tests to reflect these new capabilities.

Incremental TTS Streaming Support

  • Added incremental TTS streaming compatibility by chunking text at sentence boundaries, enabling responsive audio delivery for Wyoming clients. This is highlighted in the README.md and the system sequence diagram.
  • Introduced new command-line arguments and environment variables: --tts-streaming-models, --tts-streaming-min-words, and --tts-streaming-max-chars for configuring which models support streaming and how text is chunked. These are documented in the README.md and implemented in the CLI parser. A rough sketch of how these thresholds could drive chunking follows this list.
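
As a rough illustration of how these two thresholds could drive sentence-boundary chunking (the function name, merging rules, and exact threshold semantics below are assumptions for illustration; only the pysbd dependency and the parameter names come from this PR):

```python
# Illustrative sketch only: split text at sentence boundaries, merging short
# sentences until a minimum word count is reached and flushing a chunk before
# it would exceed a character limit.
import pysbd

def chunk_sentences(text: str, min_words: int = 3, max_chars: int = 300,
                    language: str = "en") -> list[str]:
    segmenter = pysbd.Segmenter(language=language, clean=False)
    chunks: list[str] = []
    buffer = ""
    for sentence in segmenter.segment(text):
        candidate = (buffer + sentence) if buffer else sentence
        if buffer and len(candidate) > max_chars:
            chunks.append(buffer.strip())     # flush before exceeding max_chars
            candidate = sentence
        if len(candidate.split()) >= min_words:
            chunks.append(candidate.strip())  # long enough to synthesize now
            candidate = ""
        buffer = candidate
    if buffer.strip():
        chunks.append(buffer.strip())         # whatever remains at the end
    return chunks
```

With this sketch, a short opener like "Hi." is merged with the following sentence instead of being synthesized as a tiny fragment, while longer passages are flushed in pieces no larger than max_chars (unless a single sentence itself exceeds the limit).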

Core Implementation and API Changes

  • Updated the TTS program and voice creation logic in compatibility.py to distinguish between streaming and non-streaming models, and to create separate Wyoming TTS programs for each.
  • Modified the unified API and backend logic to support the new streaming model selection and fallback behavior, ensuring consistent ordering and deduplication of models (a small illustrative sketch follows this list).
  • Integrated the new streaming options into the main entrypoint and program creation flow, and enhanced logging to clearly indicate which voices support streaming.
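
One minimal way to get the consistent ordering and deduplication mentioned above (purely illustrative; the helper name and exact rules are not taken from the PR):

```python
def merge_model_lists(streaming_models: list[str], non_streaming_models: list[str]) -> list[str]:
    """Combine both model lists into one sequence with stable order and no duplicates.

    Illustrative helper; dict.fromkeys preserves first-seen order while dropping repeats.
    """
    return list(dict.fromkeys([*streaming_models, *non_streaming_models]))
```

For example, merge_model_lists(["tts-1"], ["tts-1", "tts-1-hd"]) returns ["tts-1", "tts-1-hd"].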

Dependency and Testing Updates

  • Added the pysbd library for sentence boundary detection, required for intelligent text chunking (a brief illustration follows this list).
  • Updated tests to reflect the new function signatures and streaming model parameters, ensuring coverage of both streaming and non-streaming TTS flows.
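
For context on the new dependency, pysbd avoids naive splitting on periods; a minimal example of its documented API:

```python
import pysbd

# Abbreviations such as "E." and "p." are not treated as sentence boundaries.
segmenter = pysbd.Segmenter(language="en", clean=False)
print(segmenter.segment("My name is Jonas E. Smith. Please turn to p. 55."))
# Two segments: the name sentence and the page-reference sentence.
```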

Resolves #22

roryeckel and others added 4 commits August 9, 2025 13:52
Primary improvement (a rough sketch follows this commit message):
- Implement proper text accumulation and segmentation across chunks
- Ensure complete sentences are detected and synthesized immediately
- Maintain sentence boundary detection state during streaming synthesis

Additional refinements:
- Fix duplicate AudioStop/SynthesizeStopped events by adding early return
- Add segmenter caching per language to improve performance
- Enhance error handling documentation and logging
- Add _truncate_for_log helper for consistent log formatting

Related to #22

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
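
The text accumulation and per-language segmenter caching described in the commit message above might look roughly like the sketch below (names such as _get_segmenter and feed_text are hypothetical and the buffering rule is an assumption; this is not the PR's handler code):

```python
# Illustrative sketch: accumulate streamed text, emit only complete sentences,
# keep the unfinished tail buffered, and cache one pysbd Segmenter per language.
import pysbd

_SEGMENTERS: dict[str, pysbd.Segmenter] = {}

def _get_segmenter(language: str) -> pysbd.Segmenter:
    # Creating a Segmenter is relatively expensive, so reuse one per language.
    if language not in _SEGMENTERS:
        _SEGMENTERS[language] = pysbd.Segmenter(language=language, clean=False)
    return _SEGMENTERS[language]

def feed_text(buffer: str, new_text: str, language: str = "en") -> tuple[list[str], str]:
    """Append new_text to buffer; return (complete sentences to synthesize, remaining tail)."""
    buffer += new_text
    segments = _get_segmenter(language).segment(buffer)
    if len(segments) <= 1:
        return [], buffer  # nothing is safely complete yet
    # The final segment may still be an unfinished sentence, so keep it buffered.
    return [s.strip() for s in segments[:-1]], segments[-1]
```

Feeding "Hello there. How are y" would yield ["Hello there."] for immediate synthesis while "How are y" stays buffered until more text arrives.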
…dary chunking

- Add mention of TTS streaming compatibility in overview section
- Add new objective about streaming compatibility bridging Wyoming and OpenAI protocols
- Update sequence diagram to show incremental synthesis with pysbd sentence detection
- Emphasize responsive audio delivery through intelligent text chunking

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
roryeckel linked an issue Aug 25, 2025 that may be closed by this pull request

roryeckel requested a review from Copilot August 26, 2025 02:16
Copilot AI (Contributor) left a comment


Pull Request Overview

This pull request introduces incremental TTS streaming support to the Wyoming-OpenAI proxy by implementing intelligent text chunking at sentence boundaries. This enhancement enables responsive audio delivery for Wyoming clients even when the underlying OpenAI-compatible API doesn't support streaming text input.

Key Changes:

  • Added incremental TTS streaming support using pySBD for sentence boundary detection to enable real-time audio delivery
  • Introduced new configuration options for streaming models and chunking parameters
  • Updated the Wyoming/OpenAI compatibility layer to support streaming and non-streaming TTS models separately

Reviewed Changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated no comments.

Summary of changes per file:

  • src/wyoming_openai/handler.py: Implemented core streaming TTS logic with sentence detection and parallel audio synthesis
  • src/wyoming_openai/compatibility.py: Updated TTS voice creation to separate streaming/non-streaming models and support fallback behavior
  • src/wyoming_openai/main.py: Added command-line arguments for streaming configuration and enhanced logging output
  • pyproject.toml: Added pysbd dependency for sentence boundary detection
  • README.md: Updated documentation to describe streaming capabilities and new configuration options
  • tests/: Updated test files to match new function signatures with streaming model parameters


@maglat

maglat commented Sep 5, 2025

Will this be pushed to master, or will it be a separate variant? :)

@roryeckel (Owner, Author)

Will this be pushed to master, or will it be a separate variant? :)

There is ghcr.io/roryeckel/wyoming_openai:pr-26, which will be available to test and use as a preview until I am ready to send it to main... I am hesitant because I have only tested it a little, and there has been some progress regarding the official realtime API at OpenAI. I've also got some other projects taking up my time at the moment, but don't worry, I will get back to it in the near future.
Testing is welcome.
New configuration variables were introduced: https://github.com/roryeckel/wyoming_openai/tree/22-feature-request-incremental-tts-streaming-with-text-chunking?tab=readme-ov-file#configuration-options

@roryeckel (Owner, Author)

My main concern is the collision of "streaming" versus "realtime". I bet we can just do TTS_REALTIME_MODELS or something once realtime is supported, so maybe my concern doesn't make sense.

@maglat

maglat commented Sep 6, 2025

Just tested TTS streaming with this PR in combination with Chatterbox and the "Chatterbox-TTS-Server" (see link below). It's working, and responses are streamed to Home Assistant. Finally I can use nice voices with relatively snappy speech replies.
Thank you very much!

https://github.com/devnen/Chatterbox-TTS-Server

PS: Regarding realtime, I honestly don't know. All I think is that this kind of realtime would first need some kind of support from Home Assistant. Right now, streaming is the solution that is officially supported. I don't know what will happen in the future. For sure, having some kind of realtime conversation with Home Assistant would be awesome :D

@roryeckel (Owner, Author)

Thanks for your testing, I'm glad you enjoyed it. Chatterbox TTS Server looks like a cool project; I could provide an example compose file for it in the future as well.

@imkira

imkira commented Sep 9, 2025

@roryeckel I really like this work. I was wondering, as a future improvement, if we could somehow configure settings per model instead of just globally.

@maglat

maglat commented Sep 9, 2025

@roryeckel I really like this work. I was wondering, as a future improvement, if we could somehow configure settings per model instead of just globally.

A hacky workaround right now would be to set up multiple instances with different settings. Not elegant, but it should work.

@roryeckel (Owner, Author)

@roryeckel I really like this work. I was wondering, as a future improvement, if we could somehow configure settings per model instead of just globally.

The streaming doesn't happen by default; it has to be configured via TTS_STREAMING_MODELS, similar to what you have said. The README on this branch documents these new configuration options.

@imkira

imkira commented Sep 9, 2025

The streaming doesn't happen by default

@roryeckel Yes, agreed. What I meant is that unless you do what @maglat mentioned (spin up more instances of wyoming_openai), you cannot have more granular settings. Say I have one model for a given set of languages and another for a different set of languages. It would be nice to configure the streaming settings per model. One could even go as far as per language, which is actually what makes more sense (more than per model, I suppose). For many European languages it's probably the same, but the thresholds may vary if we have, say, Asian languages. Does that make sense?

@roryeckel (Owner, Author)

The streaming doesn't happen by default

@roryeckel Yes, agreed. What I meant is that unless you do what @maglat mentioned (spin up more instances of wyoming_openai), you cannot have more granular settings. Say I have one model for a given set of languages and another for a different set of languages. It would be nice to configure the streaming settings per model. One could even go as far as per language, which is actually what makes more sense (more than per model, I suppose). For many European languages it's probably the same, but the thresholds may vary if we have, say, Asian languages. Does that make sense?

TTS_STREAMING_MODELS expects a list of model names; you would simply exclude the models you don't want streaming and leave them in TTS_MODELS.

@imkira

imkira commented Sep 9, 2025

TTS_STREAMING_MODELS expects a list of model names; you would simply exclude the models you don't want streaming and leave them in TTS_MODELS.

Yes, I understand. I am not talking about streaming vs non-streaming. I am talking about streaming specifically.

--tts-streaming-min-words, and --tts-streaming-max-chars

Let me call the first parameter X and the second Y.

I am saying that for some streaming model 1, we should be able to define X and Y differently from streaming model 2's X and Y.
Also, regardless of streaming model, we should be able to say "for language L I want to define X and Y differently from language M's X and Y".

Does that make sense?

@roryeckel (Owner, Author)

But yes, for things beyond streaming, multiple instances are kind of the expectation at the moment.

roryeckel merged commit a42abac into main Sep 10, 2025
5 checks passed

roryeckel deleted the 22-feature-request-incremental-tts-streaming-with-text-chunking branch September 10, 2025 03:27


Development

Successfully merging this pull request may close these issues.

Feature Request: Incremental TTS Streaming with Text Chunking
