Feature Request: Incremental TTS Streaming with Text Chunking #26
Conversation
Primary improvement:
- Implement proper text accumulation and segmentation across chunks
- Ensure complete sentences are detected and synthesized immediately
- Maintain sentence boundary detection state during streaming synthesis

Additional refinements:
- Fix duplicate AudioStop/SynthesizeStopped events by adding early return
- Add segmenter caching per language to improve performance
- Enhance error handling documentation and logging
- Add _truncate_for_log helper for consistent log formatting

Related to #22

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
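The commit above mentions per-language segmenter caching and a `_truncate_for_log` helper. A minimal sketch of what those two refinements could look like — the exact signatures in the PR may differ, and `_get_segmenter` is an assumed name:

```python
from functools import lru_cache

import pysbd


@lru_cache(maxsize=None)
def _get_segmenter(language: str) -> pysbd.Segmenter:
    # Constructing a pysbd.Segmenter has nontrivial cost, so reuse one
    # instance per language across the whole streaming session.
    return pysbd.Segmenter(language=language, clean=False)


def _truncate_for_log(text: str, limit: int = 80) -> str:
    # Keep log lines readable when streamed text chunks get long.
    return text if len(text) <= limit else text[:limit] + "..."
```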
…dary chunking
- Add mention of TTS streaming compatibility in overview section
- Add new objective about streaming compatibility bridging Wyoming and OpenAI protocols
- Update sequence diagram to show incremental synthesis with pysbd sentence detection
- Emphasize responsive audio delivery through intelligent text chunking

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Pull Request Overview
This pull request introduces incremental TTS streaming support to the Wyoming-OpenAI proxy by implementing intelligent text chunking at sentence boundaries. This enhancement enables responsive audio delivery for Wyoming clients even when the underlying OpenAI-compatible API doesn't support streaming text input.
Key Changes:
- Added incremental TTS streaming support using pySBD for sentence boundary detection to enable real-time audio delivery (see the sketch after this list)
- Introduced new configuration options for streaming models and chunking parameters
- Updated the Wyoming/OpenAI compatibility layer to support streaming and non-streaming TTS models separately
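As a rough illustration of the chunking idea — not the PR's actual code; `stream_sentences` and its parameters are made up for this sketch, and only pysbd's `Segmenter` API is assumed:

```python
import pysbd

segmenter = pysbd.Segmenter(language="en", clean=False)


def stream_sentences(text_chunks, min_words=3):
    """Accumulate streamed text and yield sentences as they complete."""
    buffer = ""
    for chunk in text_chunks:
        buffer += chunk
        sentences = segmenter.segment(buffer)
        if len(sentences) < 2:
            continue  # no complete sentence yet; keep accumulating
        # Everything but the last segment is complete; the tail may
        # still be growing, so it stays in the buffer.
        *complete, buffer = sentences
        pending = ""
        for sentence in complete:
            pending += sentence
            # Only emit once there is enough text to be worth a
            # synthesis round trip.
            if len(pending.split()) >= min_words:
                yield pending.strip()
                pending = ""
        buffer = pending + buffer
    if buffer.strip():
        yield buffer.strip()  # flush whatever remains at end of stream
```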
Reviewed Changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| src/wyoming_openai/handler.py | Implemented core streaming TTS logic with sentence detection and parallel audio synthesis (see the sketch after this table) |
| src/wyoming_openai/compatibility.py | Updated TTS voice creation to separate streaming/non-streaming models and support fallback behavior |
| src/wyoming_openai/main.py | Added command-line arguments for streaming configuration and enhanced logging output |
| pyproject.toml | Added pysbd dependency for sentence boundary detection |
| README.md | Updated documentation to describe streaming capabilities and new configuration options |
| tests/ | Updated test files to match new function signatures with streaming model parameters |
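On the handler side, the general shape is: request synthesis for each completed sentence immediately, so audio for earlier sentences can be forwarded while later ones are still being generated. A hypothetical asyncio sketch — `synthesize` and `send_audio` are placeholders, not wyoming_openai's real API:

```python
import asyncio


async def stream_tts(sentences, synthesize, send_audio):
    queue: asyncio.Queue = asyncio.Queue()

    async def producer():
        for sentence in sentences:
            # Kick off synthesis as soon as a sentence is complete, so
            # requests overlap instead of running strictly one by one.
            queue.put_nowait(asyncio.create_task(synthesize(sentence)))
        queue.put_nowait(None)  # sentinel: no more sentences

    async def consumer():
        while (task := await queue.get()) is not None:
            audio = await task       # await in order: audio stays sequential
            await send_audio(audio)  # forward to the Wyoming client

    await asyncio.gather(producer(), consumer())
```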
Will this be pushed to master, or will it be a separate variant? :)
The image ghcr.io/roryeckel/wyoming_openai:pr-26 is available to test and use as a preview until I am ready to send it to main. I am hesitant because I have only tested it a little, and there's been some progress regarding the official realtime API at OpenAI. I've also got some other projects taking up my time at the moment, but don't worry, I will get back to it in the near future.
My main concern is the collision of "streaming" versus "realtime". I bet we can just do TTS_REALTIME_MODELS or something once realtime is supported, so maybe my concern doesn't make sense.
Just tested TTS streaming with this PR in combination with Chatterbox and the "Chatterbox-TTS-Server" (see link below). It's working, and responses are streamed to Home Assistant. Finally I can use nice voices with relatively snappy speech replies. https://github.com/devnen/Chatterbox-TTS-Server

PS: Regarding realtime, I honestly don't know. All I think is that this kind of realtime would first need some kind of support from Home Assistant. Right now, streaming is the officially supported solution; I don't know what will happen in the future. For sure, having some kind of realtime conversation with Home Assistant would be awesome :D
Thanks for testing, I'm glad you enjoyed it. Chatterbox TTS server looks like a cool project; I could provide an example compose for it in the future as well.
@roryeckel I really like this work. I was wondering, as a future improvement, whether we could somehow configure settings per model instead of just globally.
A hacky workaround right now would be to set up multiple instances with different settings. Not elegant, but it should work.
Streaming doesn't happen by default; it has to be configured via TTS_STREAMING_MODELS, similar to what you have said. The README on this branch documents these new configuration options.
@roryeckel Yes, agreed. What I meant is that unless you do what @maglat mentioned (spin up more instances of wyoming_openai), you cannot have more granular settings. Say I have one model for a given set of languages and another for a different set. It would be nice to configure the streaming settings per model. One could even go as far as per language, which is actually what makes more sense (more than per model, I suppose). Maybe for many European languages the thresholds are the same, but they may vary for, say, Asian languages. Does that make sense?
TTS_STREAMING_MODELS expects a list of model names; you would simply exclude the models you don't want streaming and leave them in TTS_MODELS.
Yes, I understand. I am not talking about streaming vs non-streaming; I am talking about streaming specifically. Let me call the first parameter X and the second Y. I am saying that for streaming model 1, I'd like to be able to define X and Y differently from streaming model 2's X and Y. Makes sense?
But yes, for things beyond streaming, multiple instances are kind of the expectation at the moment.
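To make the per-model request above concrete, something like the following mapping is what is being asked for. This is purely hypothetical; none of these names exist in wyoming_openai today:

```python
# Hypothetical per-model overrides for the two streaming knobs
# (min words, max chars); anything not listed falls back to the
# global defaults.
STREAMING_OVERRIDES = {
    "tts-1": (5, 300),
    "kokoro": (3, 200),
}
GLOBAL_DEFAULTS = (4, 250)


def streaming_settings(model: str) -> tuple[int, int]:
    return STREAMING_OVERRIDES.get(model, GLOBAL_DEFAULTS)
```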
This pull request introduces incremental TTS (text-to-speech) streaming support by intelligently chunking text at sentence boundaries, improving responsiveness for Wyoming clients even when the underlying OpenAI-compatible API does not support streaming text input. The implementation adds configuration options for streaming models and chunking parameters, updates the Wyoming/OpenAI compatibility layer, and enhances documentation and tests to reflect these new capabilities.
Incremental TTS Streaming Support
- Streaming capabilities are described in the README.md and the system sequence diagram. [1] [2] [3]
- Added new options --tts-streaming-models, --tts-streaming-min-words, and --tts-streaming-max-chars for configuring which models support streaming and how text is chunked. These are documented in the README.md and implemented in the CLI parser (see the sketch below). [1] [2] [3]
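The flag names above come from the PR summary; how they are wired into the parser is assumed here (types, defaults, and help text are illustrative):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--tts-streaming-models", nargs="*", default=[],
    help="TTS models that should use incremental streaming")
parser.add_argument(
    "--tts-streaming-min-words", type=int, default=3,
    help="Minimum words to accumulate before synthesizing a chunk")
parser.add_argument(
    "--tts-streaming-max-chars", type=int, default=300,
    help="Maximum characters per synthesized text chunk")
args = parser.parse_args()
```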
Core Implementation and API Changes

- Updated compatibility.py to distinguish between streaming and non-streaming models, and to create separate Wyoming TTS programs for each. [1] [2]
Dependency and Testing Updates

- Added the pysbd library for sentence boundary detection, required for intelligent text chunking.

Resolves #22