
Commit a42abac

Merge pull request #26 from roryeckel/22-feature-request-incremental-tts-streaming-with-text-chunking
Feature Request: Incremental TTS Streaming with Text Chunking
2 parents 45088fd + 5c8603f commit a42abac

File tree: 9 files changed (+900, -131 lines)

README.md

Lines changed: 22 additions & 7 deletions
@@ -10,7 +10,7 @@ Note: This project is not affiliated with OpenAI or the Wyoming project.
 
 ## Overview
 
-This project introduces a [Wyoming](https://github.com/OHF-Voice/wyoming) server that connects to OpenAI-compatible endpoints of your choice. Like a proxy, it enables Wyoming clients such as the [Home Assistant Wyoming Integration](https://www.home-assistant.io/integrations/wyoming/) to use the transcription (Automatic Speech Recognition - ASR) and text-to-speech synthesis (TTS) capabilities of various OpenAI-compatible projects. By acting as a bridge between the Wyoming protocol and OpenAI, you can consolidate the resource usage on your server and extend the capabilities of Home Assistant.
+This project introduces a [Wyoming](https://github.com/OHF-Voice/wyoming) server that connects to OpenAI-compatible endpoints of your choice. Like a proxy, it enables Wyoming clients such as the [Home Assistant Wyoming Integration](https://www.home-assistant.io/integrations/wyoming/) to use the transcription (Automatic Speech Recognition - ASR) and text-to-speech synthesis (TTS) capabilities of various OpenAI-compatible projects. By acting as a bridge between the Wyoming protocol and OpenAI, you can consolidate the resource usage on your server and extend the capabilities of Home Assistant. The proxy now provides incremental TTS streaming compatibility by intelligently chunking text at sentence boundaries for responsive audio delivery.
 
 ## Featured Models
 
@@ -28,7 +28,8 @@ This project features a variety of examples for using cutting-edge models in bot
 2. **Service Consolidation**: Allow users of various programs to run inference on a single server without needing separate instances for each service.
    Example: Sharing TTS/STT services between [Open WebUI](#open-webui) and [Home Assistant](#usage-in-home-assistant).
 3. **Asynchronous Processing**: Enable efficient handling of multiple requests by supporting asynchronous processing of audio streams.
-4. **Simple Setup with Docker**: Provide a straightforward deployment process using [Docker and Docker Compose](#docker-recommended) for OpenAI and various popular open source projects.
+4. **Streaming Compatibility**: Bridge Wyoming's streaming TTS protocol with OpenAI-compatible APIs through intelligent sentence boundary chunking, enabling responsive incremental audio delivery even when the underlying API doesn't support streaming text input.
+5. **Simple Setup with Docker**: Provide a straightforward deployment process using [Docker and Docker Compose](#docker-recommended) for OpenAI and various popular open source projects.
 
 ## Terminology
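The "Streaming Compatibility" goal above relies on sentence boundary detection from the newly added pysbd dependency. As a rough illustration of the idea (the segmenter settings shown here are assumptions, not necessarily what the proxy configures internally):

```python
import pysbd

# Hypothetical segmenter settings for illustration; the proxy's actual options are not part of this diff.
segmenter = pysbd.Segmenter(language="en", clean=False)

buffered_text = "The weather is clear today. Winds are light from the"
segments = segmenter.segment(buffered_text)
# e.g. ["The weather is clear today. ", "Winds are light from the"]
# A streaming handler would synthesize the complete first sentence right away
# and keep the trailing fragment buffered until more text arrives.
```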

@@ -112,6 +113,7 @@ python -m wyoming_openai \
     --tts-openai-key YOUR_TTS_API_KEY_HERE \
     --tts-openai-url https://api.openai.com/v1 \
     --tts-models gpt-4o-mini-tts tts-1-hd tts-1 \
+    --tts-streaming-models tts-1 \
     --tts-voices alloy ash coral echo fable onyx nova sage shimmer \
     --tts-backend OPENAI \
     --tts-speed 1.0

@@ -142,6 +144,9 @@ In addition to using command-line arguments, you can configure the Wyoming OpenA
 | `--tts-backend` | `TTS_BACKEND` | None (autodetected) | Enable unofficial API feature sets. |
 | `--tts-speed` | `TTS_SPEED` | None (autodetected) | Speed of the TTS output (ranges from 0.25 to 4.0). |
 | `--tts-instructions` | `TTS_INSTRUCTIONS` | None | Optional instructions for TTS requests (Control the voice). |
+| `--tts-streaming-models` | `TTS_STREAMING_MODELS` | None | Space-separated list of TTS models to enable incremental streaming via pysbd text chunking (e.g. `tts-1`). |
+| `--tts-streaming-min-words` | `TTS_STREAMING_MIN_WORDS` | None | Minimum words per text chunk for incremental TTS streaming (optional). |
+| `--tts-streaming-max-chars` | `TTS_STREAMING_MAX_CHARS` | None | Maximum characters per text chunk for incremental TTS streaming (optional). |
 
 ## Docker (Recommended) [![Docker Image CI](https://github.com/roryeckel/wyoming-openai/actions/workflows/docker-image.yml/badge.svg)](https://github.com/roryeckel/wyoming-openai/actions/workflows/docker-image.yml)
 
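The two size limits above only bound how much text goes into each incremental synthesis request; the chunking policy itself lives in the proxy's streaming handler, which is not among the files shown here. A rough sketch of how such limits are typically applied to pysbd segments (the function and its exact rules are illustrative assumptions):

```python
def group_segments(segments: list[str], min_words: int | None, max_chars: int | None) -> list[str]:
    """Illustrative only: merge short sentences and cap chunk length before synthesis."""
    chunks: list[str] = []
    current = ""
    for segment in segments:
        candidate = f"{current} {segment}".strip() if current else segment
        if max_chars is not None and current and len(candidate) > max_chars:
            chunks.append(current)   # flush before exceeding the character cap
            current = segment
        else:
            current = candidate
        if min_words is None or len(current.split()) >= min_words:
            chunks.append(current)   # long enough to send as its own request
            current = ""
    if current:
        chunks.append(current)       # whatever remains at the end
    return chunks
```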
@@ -275,7 +280,7 @@ We follow specific tagging conventions for our Docker images. These tags help in
 
 - **`main`**: This tag points to the latest commit on the main code branch. It is suitable for users who want to experiment with the most up-to-date features and changes, but may include unstable or experimental code.
 
-- **`major.minor.patch version`**: Specific version tags (e.g., `0.3.6`) correspond to specific stable releases of the Wyoming OpenAI proxy server. These tags are ideal for users who need a consistent, reproducible environment and want to avoid breaking changes introduced in newer versions.
+- **`major.minor.patch version`**: Specific version tags (e.g., `0.3.7`) correspond to specific stable releases of the Wyoming OpenAI proxy server. These tags are ideal for users who need a consistent, reproducible environment and want to avoid breaking changes introduced in newer versions.
 
 - **`major.minor version`**: Tags that follow the `major.minor` format (e.g., `0.3`) represent a range of patch-level updates within the same minor version series. These tags are useful for users who want to stay updated with bug fixes and minor improvements without upgrading to a new major or minor version.
 
@@ -376,15 +381,25 @@ sequenceDiagram
         WY->>HA: AudioStop event
     else Streaming TTS (SynthesizeStart/Chunk/Stop)
         HA->>WY: SynthesizeStart event (voice config)
-        Note over WY: Initialize synthesis buffer
+        Note over WY: Initialize incremental synthesis<br/>with sentence boundary detection
+        WY->>HA: AudioStart event
         loop Sending text chunks
             HA->>WY: SynthesizeChunk events
-            Note over WY: Append to synthesis buffer
+            Note over WY: Accumulate text and detect<br/>complete sentences using pysbd
+            alt Complete sentences detected
+                loop For each complete sentence
+                    WY->>OAPI: Speech synthesis request
+                    loop While receiving audio data
+                        OAPI-->>WY: Audio stream chunks
+                        WY-->>HA: AudioChunk events (incremental)
+                    end
+                end
+            end
         end
         HA->>WY: SynthesizeStop event
-        Note over WY: No-op — OpenAI `/v1/audio/speech`<br/>does not support streaming text input
+        Note over WY: Process any remaining text<br/>and finalize synthesis
+        WY->>HA: AudioStop event
         WY->>HA: SynthesizeStopped event
-        Note over WY: Streaming flow is handled<br/>but not advertised in capabilities
     end
 ```
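In code terms, the updated streaming branch of the diagram corresponds roughly to the sketch below. It is a condensed illustration rather than the handler added by this PR: the pysbd calls mirror the diagram, while `_speak` stands in for the proxy's actual request to the OpenAI-compatible `/v1/audio/speech` endpoint and the resulting Wyoming `AudioChunk` events.

```python
import pysbd


class IncrementalSynthesizerSketch:
    """Buffer streamed text and synthesize each complete sentence as soon as it appears."""

    def __init__(self) -> None:
        self._segmenter = pysbd.Segmenter(language="en", clean=False)
        self._buffer = ""

    async def on_synthesize_chunk(self, text: str) -> None:
        # SynthesizeChunk: accumulate text, then flush every segment pysbd considers
        # complete, holding back the last one in case the sentence is still being written.
        self._buffer += text
        segments = self._segmenter.segment(self._buffer)
        if len(segments) > 1:
            for sentence in segments[:-1]:
                await self._speak(sentence)
            self._buffer = segments[-1]

    async def on_synthesize_stop(self) -> None:
        # SynthesizeStop: flush whatever text remains, after which AudioStop and
        # SynthesizeStopped are sent back to the client.
        if self._buffer.strip():
            await self._speak(self._buffer)
        self._buffer = ""

    async def _speak(self, sentence: str) -> None:
        # Placeholder: the real proxy requests speech audio for the sentence and relays
        # the returned stream to the Wyoming client as incremental AudioChunk events.
        ...
```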

pyproject.toml

Lines changed: 4 additions & 3 deletions
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 
 [project]
 name = "wyoming_openai"
-version = "0.3.6"
+version = "0.3.7"
 description = "OpenAI-Compatible Proxy Middleware for the Wyoming Protocol"
 authors = [
     { name = "Rory Eckel" }
@@ -21,8 +21,9 @@ classifiers = [
     "Operating System :: OS Independent",
 ]
 dependencies = [
-    "openai==1.98.0",
-    "wyoming==1.7.2"
+    "openai==1.107.0",
+    "wyoming==1.7.2",
+    "pysbd==0.3.4"
 ]
 
 [project.urls]

src/wyoming_openai/__main__.py

Lines changed: 44 additions & 7 deletions
@@ -139,6 +139,24 @@ async def main():
         default=os.getenv("TTS_INSTRUCTIONS", None),
         help="Optional instructions for TTS requests"
     )
+    parser.add_argument(
+        "--tts-streaming-models",
+        nargs="+",
+        default=os.getenv("TTS_STREAMING_MODELS", '').split(),
+        help="Space-separated list of TTS model names that support streaming synthesis (e.g. tts-1)"
+    )
+    parser.add_argument(
+        "--tts-streaming-min-words",
+        type=int,
+        default=int(os.getenv("TTS_STREAMING_MIN_WORDS")) if os.getenv("TTS_STREAMING_MIN_WORDS") else None,
+        help="Minimum words per chunk for streaming TTS (optional)"
+    )
+    parser.add_argument(
+        "--tts-streaming-max-chars",
+        type=int,
+        default=int(os.getenv("TTS_STREAMING_MAX_CHARS")) if os.getenv("TTS_STREAMING_MAX_CHARS") else None,
+        help="Maximum characters per chunk for streaming TTS (optional)"
+    )
 
     args = parser.parse_args()
 
@@ -179,12 +197,12 @@ async def main():
 
     if args.tts_voices:
         # If TTS_VOICES is set, use that
-        tts_voices = create_tts_voices(args.tts_models, args.tts_voices, args.tts_openai_url, args.languages)
+        tts_voices = create_tts_voices(args.tts_models, args.tts_streaming_models, args.tts_voices, args.tts_openai_url, args.languages)
     else:
-        # Otherwise, list supported voices via defaults
-        tts_voices = await tts_client.list_supported_voices(args.tts_models, args.languages)
+        # Otherwise, list supported voices via backend (with streaming fallback)
+        tts_voices = await tts_client.list_supported_voices(args.tts_models, args.tts_streaming_models, args.languages)
 
-    tts_programs = create_tts_programs(tts_voices)
+    tts_programs = create_tts_programs(tts_voices, tts_streaming_models=args.tts_streaming_models)
 
     # Ensure at least one model is specified
     if not asr_programs and not tts_programs:

@@ -218,8 +236,25 @@ async def main():
         _logger.warning("No ASR models specified")
 
     if tts_programs:
-        all_tts_voices = [voice for prog in tts_programs for voice in prog.voices]
-        _logger.info("*** TTS Voices ***\n%s", "\n".join(tts_voice_to_string(x) for x in all_tts_voices))
+        streaming_tts_voices_for_logging = []
+        non_streaming_tts_voices_for_logging = []
+
+        for prog in tts_programs:
+            for voice in prog.voices:
+                if getattr(prog, 'supports_synthesize_streaming', False):
+                    streaming_tts_voices_for_logging.append(voice)
+                else:
+                    non_streaming_tts_voices_for_logging.append(voice)
+
+        if streaming_tts_voices_for_logging:
+            _logger.info("*** Streaming TTS Voices ***\n%s", "\n".join(tts_voice_to_string(x) for x in streaming_tts_voices_for_logging))
+        else:
+            _logger.info("No Streaming TTS voices specified")
+
+        if non_streaming_tts_voices_for_logging:
+            _logger.info("*** Non-Streaming TTS Voices ***\n%s", "\n".join(tts_voice_to_string(x) for x in non_streaming_tts_voices_for_logging))
+        else:
+            _logger.info("No Non-Streaming TTS voices specified")
     else:
         _logger.warning("No TTS models specified")
 
@@ -238,7 +273,9 @@ async def main():
             stt_temperature=args.stt_temperature,
             tts_speed=args.tts_speed,
             tts_instructions=args.tts_instructions,
-            stt_prompt=args.stt_prompt
+            stt_prompt=args.stt_prompt,
+            tts_streaming_min_words=args.tts_streaming_min_words,
+            tts_streaming_max_chars=args.tts_streaming_max_chars
         )
     )
 
src/wyoming_openai/compatibility.py

Lines changed: 83 additions & 15 deletions
@@ -116,24 +116,44 @@ def create_asr_programs(
 
 def create_tts_voices(
     tts_models: list[str],
+    tts_streaming_models: list[str],
     tts_voices: list[str],
     tts_url: str,
     languages: list[str]
 ) -> list[TtsVoiceModel]:
     """
     Creates a list of TTS (Text-to-Speech) voice models in the Wyoming Protocol format.
+    Uses streaming models as fallback if regular models not specified (consistent with ASR behavior).
 
     Args:
         tts_models (list[str]): A list of TTS model identifiers.
+        tts_streaming_models (list[str]): A list of TTS streaming model identifiers.
         tts_voices (list[str]): A list of voice identifiers.
         tts_url (str): The URL for the TTS service attribution.
         languages (list[str]): A list of supported languages.
 
     Returns:
         list[TtsVoiceModel]: A list of Wyoming TtsVoiceModel instances.
     """
-    voices = []
+    # Create ordered list: streaming models first, then non-streaming, preserving natural order and deduplicating
+    # (same pattern as create_asr_programs)
+    seen = set()
+    ordered_models = []
+
+    # Add streaming models first
+    for model_name in tts_streaming_models:
+        if model_name not in seen:
+            ordered_models.append(model_name)
+            seen.add(model_name)
+
+    # Add non-streaming models
     for model_name in tts_models:
+        if model_name not in seen:
+            ordered_models.append(model_name)
+            seen.add(model_name)
+
+    voices = []
+    for model_name in ordered_models:
         for voice in tts_voices:
             voices.append(TtsVoiceModel(
                 name=voice,

@@ -149,32 +169,64 @@
             ))
     return voices
 
-def create_tts_programs(tts_voices: list[TtsVoiceModel]) -> list[TtsProgram]:
+def create_tts_programs(tts_voices: list[TtsVoiceModel], tts_streaming_models: list[str] = None) -> list[TtsProgram]:
     """
-    Create TTS programs from a list of voices.
+    Create TTS programs from a list of voices, separating voices based on streaming model support.
 
     Args:
         tts_voices (list[TtsVoiceModel]): A list of TTS voice models.
+        tts_streaming_models (list[str]): List of TTS model names that support streaming.
 
     Returns:
         list[TtsProgram]: A list of Wyoming TTS programs.
     """
     if not tts_voices:
         return []
 
-    return [
-        TtsProgram(
+    if tts_streaming_models is None:
+        tts_streaming_models = []
+
+    # Separate streaming and non-streaming voices based on their models
+    streaming_tts_voices = []
+    non_streaming_tts_voices = []
+
+    for voice in tts_voices:
+        if voice.model_name in tts_streaming_models:
+            streaming_tts_voices.append(voice)
+        else:
+            non_streaming_tts_voices.append(voice)
+
+    programs = []
+
+    if streaming_tts_voices:
+        programs.append(TtsProgram(
+            name="openai-streaming",
+            description="OpenAI (Streaming)",
+            attribution=Attribution(
+                name=ATTRIBUTION_NAME_PROGRAM_STREAMING,
+                url=ATTRIBUTION_URL,
+            ),
+            installed=True,
+            version=__version__,
+            voices=streaming_tts_voices,
+            supports_synthesize_streaming=True,
+        ))
+
+    if non_streaming_tts_voices:
+        programs.append(TtsProgram(
             name="openai",
-            description="OpenAI",
+            description="OpenAI (Non-Streaming)",
             attribution=Attribution(
                 name=ATTRIBUTION_NAME_PROGRAM,
                 url=ATTRIBUTION_URL,
             ),
             installed=True,
             version=__version__,
-            voices=tts_voices,
-        )
-    ]
+            voices=non_streaming_tts_voices,
+            supports_synthesize_streaming=False,
+        ))
+
+    return programs
 
 
 def create_info(asr_programs: list[AsrProgram], tts_programs: list[TtsProgram]) -> Info:
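For reference, the two functions above combine as follows. The argument values are invented, but the signatures, program names, and the streaming/non-streaming split come directly from this diff:

```python
from wyoming_openai.compatibility import create_tts_programs, create_tts_voices

# Hypothetical configuration: tts-1 is allowed to stream, tts-1-hd is not.
voices = create_tts_voices(
    tts_models=["tts-1-hd"],
    tts_streaming_models=["tts-1"],
    tts_voices=["alloy", "nova"],
    tts_url="https://api.openai.com/v1",
    languages=["en"],
)

programs = create_tts_programs(voices, tts_streaming_models=["tts-1"])
# Expected: one "openai-streaming" TtsProgram (supports_synthesize_streaming=True) with the
# tts-1 voices, and one "openai" TtsProgram (supports_synthesize_streaming=False) with the rest.
```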
@@ -404,16 +456,30 @@ async def _list_speaches_voices(self, model_name: str) -> list[str]:
 
     # Unified API
 
-    async def list_supported_voices(self, model_names: str | list[str], languages: list[str]) -> list[TtsVoiceModel]:
+    async def list_supported_voices(self, model_names: list[str], streaming_model_names: list[str], languages: list[str]) -> list[TtsVoiceModel]:
         """
-        Fetches the available voices via unofficial specs.
+        Fetches the available voices via unofficial specs with streaming model fallback (consistent with ASR behavior).
+        Uses streaming models if regular models not specified.
         Note: this is not the list of CONFIGURED voices.
         """
-        if isinstance(model_names, str):
-            model_names = [model_names]
+        # Use the same fallback pattern as create_asr_programs
+        seen = set()
+        ordered_models = []
 
-        tts_voice_models = []
+        # Add streaming models first
+        for model_name in streaming_model_names:
+            if model_name not in seen:
+                ordered_models.append(model_name)
+                seen.add(model_name)
+
+        # Add non-streaming models
         for model_name in model_names:
+            if model_name not in seen:
+                ordered_models.append(model_name)
+                seen.add(model_name)
+
+        tts_voice_models = []
+        for model_name in ordered_models:
             if self.backend == OpenAIBackend.OPENAI:
                 tts_voices = await self.list_openai_voices()
             elif self.backend == OpenAIBackend.SPEACHES:

@@ -426,15 +492,17 @@ async def list_supported_voices(self, model_names: str | list[str], languages: l
                 _LOGGER.warning("Unknown backend: %s", self.backend)
                 continue
 
-            # Create TTS voices in Wyoming Protocol format
+            # Create TTS voices in Wyoming Protocol format, preserving streaming model info
             tts_voice_models.extend(create_tts_voices(
                 tts_models=[model_name],
+                tts_streaming_models=streaming_model_names,
                 tts_voices=tts_voices,
                 tts_url=str(self.base_url),
                 languages=languages
             ))
         return tts_voice_models
 
+
     @classmethod
     def create_autodetected_factory(cls):
         """