27 changes: 21 additions & 6 deletions README.md
@@ -10,7 +10,7 @@ Note: This project is not affiliated with OpenAI or the Wyoming project.

## Overview

This project introduces a [Wyoming](https://github.com/OHF-Voice/wyoming) server that connects to OpenAI-compatible endpoints of your choice. Like a proxy, it enables Wyoming clients such as the [Home Assistant Wyoming Integration](https://www.home-assistant.io/integrations/wyoming/) to use the transcription (Automatic Speech Recognition - ASR) and text-to-speech synthesis (TTS) capabilities of various OpenAI-compatible projects. By acting as a bridge between the Wyoming protocol and OpenAI, you can consolidate the resource usage on your server and extend the capabilities of Home Assistant.
This project introduces a [Wyoming](https://github.com/OHF-Voice/wyoming) server that connects to OpenAI-compatible endpoints of your choice. Like a proxy, it enables Wyoming clients such as the [Home Assistant Wyoming Integration](https://www.home-assistant.io/integrations/wyoming/) to use the transcription (Automatic Speech Recognition - ASR) and text-to-speech synthesis (TTS) capabilities of various OpenAI-compatible projects. By acting as a bridge between the Wyoming protocol and OpenAI, you can consolidate the resource usage on your server and extend the capabilities of Home Assistant. The proxy now provides incremental TTS streaming compatibility by intelligently chunking text at sentence boundaries for responsive audio delivery.
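The sentence-boundary chunking approach can be sketched roughly as follows. This is an illustrative sketch, not the project's actual implementation: the real proxy uses pysbd for boundary detection, while a simple regex stands in here, and the `chunk_sentences` helper, its signature, and its default thresholds are all hypothetical.

```python
import re

def chunk_sentences(buffer: str, min_words: int = 3, max_chars: int = 200) -> tuple[list[str], str]:
    """Group complete sentences into synthesis-sized chunks.

    Returns (chunks ready to synthesize, leftover text to keep buffering).
    """
    # Naive sentence split on ., !, ? followed by whitespace (pysbd is more robust).
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', buffer) if s.strip()]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip() if current else sentence
        if current and len(candidate) > max_chars:
            chunks.append(current)   # flush before exceeding the size cap
            current = sentence
        else:
            current = candidate
        if len(current.split()) >= min_words:
            chunks.append(current)   # long enough to send as its own request
            current = ""
    return chunks, current
```

Text that is still too short (or not yet sentence-terminated) stays in the leftover buffer until more chunks arrive or synthesis stops.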

## Featured Models

@@ -28,7 +28,8 @@ This project features a variety of examples for using cutting-edge models in bot
2. **Service Consolidation**: Allow users of various programs to run inference on a single server without needing separate instances for each service.
Example: Sharing TTS/STT services between [Open WebUI](#open-webui) and [Home Assistant](#usage-in-home-assistant).
3. **Asynchronous Processing**: Enable efficient handling of multiple requests by supporting asynchronous processing of audio streams.
4. **Simple Setup with Docker**: Provide a straightforward deployment process using [Docker and Docker Compose](#docker-recommended) for OpenAI and various popular open source projects.
4. **Streaming Compatibility**: Bridge Wyoming's streaming TTS protocol with OpenAI-compatible APIs through intelligent sentence boundary chunking, enabling responsive incremental audio delivery even when the underlying API doesn't support streaming text input.
5. **Simple Setup with Docker**: Provide a straightforward deployment process using [Docker and Docker Compose](#docker-recommended) for OpenAI and various popular open source projects.

## Terminology

@@ -112,6 +113,7 @@ python -m wyoming_openai \
--tts-openai-key YOUR_TTS_API_KEY_HERE \
--tts-openai-url https://api.openai.com/v1 \
--tts-models gpt-4o-mini-tts tts-1-hd tts-1 \
--tts-streaming-models tts-1 \
--tts-voices alloy ash coral echo fable onyx nova sage shimmer \
--tts-backend OPENAI \
--tts-speed 1.0
@@ -142,6 +144,9 @@ In addition to using command-line arguments, you can configure the Wyoming OpenA
| `--tts-backend` | `TTS_BACKEND` | None (autodetected) | Enable unofficial API feature sets. |
| `--tts-speed` | `TTS_SPEED` | None (autodetected) | Speed of the TTS output (ranges from 0.25 to 4.0). |
| `--tts-instructions` | `TTS_INSTRUCTIONS` | None | Optional instructions for TTS requests (e.g. to control voice style). |
| `--tts-streaming-models` | `TTS_STREAMING_MODELS` | None | Space-separated list of TTS models to enable incremental streaming via text chunking (e.g. `tts-1`). |
| `--tts-streaming-min-words` | `TTS_STREAMING_MIN_WORDS` | None | Minimum words per text chunk for incremental TTS streaming (optional). |
| `--tts-streaming-max-chars` | `TTS_STREAMING_MAX_CHARS` | None | Maximum characters per text chunk for incremental TTS streaming (optional). |

## Docker (Recommended) [![Docker Image CI](https://github.com/roryeckel/wyoming-openai/actions/workflows/docker-image.yml/badge.svg)](https://github.com/roryeckel/wyoming-openai/actions/workflows/docker-image.yml)

@@ -376,15 +381,25 @@ sequenceDiagram
WY->>HA: AudioStop event
else Streaming TTS (SynthesizeStart/Chunk/Stop)
HA->>WY: SynthesizeStart event (voice config)
Note over WY: Initialize synthesis buffer
Note over WY: Initialize incremental synthesis<br/>with sentence boundary detection
WY->>HA: AudioStart event
loop Sending text chunks
HA->>WY: SynthesizeChunk events
Note over WY: Append to synthesis buffer
Note over WY: Accumulate text and detect<br/>complete sentences using pysbd
alt Complete sentences detected
loop For each complete sentence
WY->>OAPI: Speech synthesis request
loop While receiving audio data
OAPI-->>WY: Audio stream chunks
WY-->>HA: AudioChunk events (incremental)
end
end
end
end
HA->>WY: SynthesizeStop event
Note over WY: No-op — OpenAI `/v1/audio/speech`<br/>does not support streaming text input
Note over WY: Process any remaining text<br/>and finalize synthesis
WY->>HA: AudioStop event
WY->>HA: SynthesizeStopped event
Note over WY: Streaming flow is handled<br/>but not advertised in capabilities
end
```

3 changes: 2 additions & 1 deletion pyproject.toml
@@ -22,7 +22,8 @@ classifiers = [
]
dependencies = [
"openai==1.98.0",
"wyoming==1.7.2"
"wyoming==1.7.2",
"pysbd==0.3.4"
]

[project.urls]
51 changes: 44 additions & 7 deletions src/wyoming_openai/__main__.py
@@ -139,6 +139,24 @@ async def main():
default=os.getenv("TTS_INSTRUCTIONS", None),
help="Optional instructions for TTS requests"
)
parser.add_argument(
"--tts-streaming-models",
nargs="+",
default=os.getenv("TTS_STREAMING_MODELS", '').split(),
help="Space-separated list of TTS model names that support streaming synthesis (e.g. tts-1)"
)
parser.add_argument(
"--tts-streaming-min-words",
type=int,
default=int(os.getenv("TTS_STREAMING_MIN_WORDS")) if os.getenv("TTS_STREAMING_MIN_WORDS") else None,
help="Minimum words per chunk for streaming TTS (optional)"
)
parser.add_argument(
"--tts-streaming-max-chars",
type=int,
default=int(os.getenv("TTS_STREAMING_MAX_CHARS")) if os.getenv("TTS_STREAMING_MAX_CHARS") else None,
help="Maximum characters per chunk for streaming TTS (optional)"
)

args = parser.parse_args()

@@ -179,12 +197,12 @@ async def main():

if args.tts_voices:
# If TTS_VOICES is set, use that
tts_voices = create_tts_voices(args.tts_models, args.tts_voices, args.tts_openai_url, args.languages)
tts_voices = create_tts_voices(args.tts_models, args.tts_streaming_models, args.tts_voices, args.tts_openai_url, args.languages)
else:
# Otherwise, list supported voices via defaults
tts_voices = await tts_client.list_supported_voices(args.tts_models, args.languages)
# Otherwise, list supported voices via backend (with streaming fallback)
tts_voices = await tts_client.list_supported_voices(args.tts_models, args.tts_streaming_models, args.languages)

tts_programs = create_tts_programs(tts_voices)
tts_programs = create_tts_programs(tts_voices, tts_streaming_models=args.tts_streaming_models)

# Ensure at least one model is specified
if not asr_programs and not tts_programs:
@@ -218,8 +236,25 @@ async def main():
_logger.warning("No ASR models specified")

if tts_programs:
all_tts_voices = [voice for prog in tts_programs for voice in prog.voices]
_logger.info("*** TTS Voices ***\n%s", "\n".join(tts_voice_to_string(x) for x in all_tts_voices))
streaming_tts_voices_for_logging = []
non_streaming_tts_voices_for_logging = []

for prog in tts_programs:
for voice in prog.voices:
if getattr(prog, 'supports_synthesize_streaming', False):
streaming_tts_voices_for_logging.append(voice)
else:
non_streaming_tts_voices_for_logging.append(voice)

if streaming_tts_voices_for_logging:
_logger.info("*** Streaming TTS Voices ***\n%s", "\n".join(tts_voice_to_string(x) for x in streaming_tts_voices_for_logging))
else:
_logger.info("No Streaming TTS voices specified")

if non_streaming_tts_voices_for_logging:
_logger.info("*** Non-Streaming TTS Voices ***\n%s", "\n".join(tts_voice_to_string(x) for x in non_streaming_tts_voices_for_logging))
else:
_logger.info("No Non-Streaming TTS voices specified")
else:
_logger.warning("No TTS models specified")

@@ -238,7 +273,9 @@ async def main():
stt_temperature=args.stt_temperature,
tts_speed=args.tts_speed,
tts_instructions=args.tts_instructions,
stt_prompt=args.stt_prompt
stt_prompt=args.stt_prompt,
tts_streaming_min_words=args.tts_streaming_min_words,
tts_streaming_max_chars=args.tts_streaming_max_chars
)
)

98 changes: 83 additions & 15 deletions src/wyoming_openai/compatibility.py
@@ -116,24 +116,44 @@ def create_asr_programs(

def create_tts_voices(
tts_models: list[str],
tts_streaming_models: list[str],
tts_voices: list[str],
tts_url: str,
languages: list[str]
) -> list[TtsVoiceModel]:
"""
Creates a list of TTS (Text-to-Speech) voice models in the Wyoming Protocol format.
Falls back to streaming models when regular models are not specified (consistent with ASR behavior).

Args:
tts_models (list[str]): A list of TTS model identifiers.
tts_streaming_models (list[str]): A list of TTS streaming model identifiers.
tts_voices (list[str]): A list of voice identifiers.
tts_url (str): The URL for the TTS service attribution.
languages (list[str]): A list of supported languages.

Returns:
list[TtsVoiceModel]: A list of Wyoming TtsVoiceModel instances.
"""
voices = []
# Create ordered list: streaming models first, then non-streaming, preserving natural order and deduplicating
# (same pattern as create_asr_programs)
seen = set()
ordered_models = []

# Add streaming models first
for model_name in tts_streaming_models:
if model_name not in seen:
ordered_models.append(model_name)
seen.add(model_name)

# Add non-streaming models
for model_name in tts_models:
if model_name not in seen:
ordered_models.append(model_name)
seen.add(model_name)

voices = []
for model_name in ordered_models:
for voice in tts_voices:
voices.append(TtsVoiceModel(
name=voice,
@@ -149,32 +169,64 @@
))
return voices

def create_tts_programs(tts_voices: list[TtsVoiceModel]) -> list[TtsProgram]:
def create_tts_programs(tts_voices: list[TtsVoiceModel], tts_streaming_models: list[str] | None = None) -> list[TtsProgram]:
"""
Create TTS programs from a list of voices.
Create TTS programs from a list of voices, separating voices based on streaming model support.

Args:
tts_voices (list[TtsVoiceModel]): A list of TTS voice models.
tts_streaming_models (list[str] | None): Optional list of TTS model names that support streaming.

Returns:
list[TtsProgram]: A list of Wyoming TTS programs.
"""
if not tts_voices:
return []

return [
TtsProgram(
if tts_streaming_models is None:
tts_streaming_models = []

# Separate streaming and non-streaming voices based on their models
streaming_tts_voices = []
non_streaming_tts_voices = []

for voice in tts_voices:
if voice.model_name in tts_streaming_models:
streaming_tts_voices.append(voice)
else:
non_streaming_tts_voices.append(voice)

programs = []

if streaming_tts_voices:
programs.append(TtsProgram(
name="openai-streaming",
description="OpenAI (Streaming)",
attribution=Attribution(
name=ATTRIBUTION_NAME_PROGRAM_STREAMING,
url=ATTRIBUTION_URL,
),
installed=True,
version=__version__,
voices=streaming_tts_voices,
supports_synthesize_streaming=True,
))

if non_streaming_tts_voices:
programs.append(TtsProgram(
name="openai",
description="OpenAI",
description="OpenAI (Non-Streaming)",
attribution=Attribution(
name=ATTRIBUTION_NAME_PROGRAM,
url=ATTRIBUTION_URL,
),
installed=True,
version=__version__,
voices=tts_voices,
)
]
voices=non_streaming_tts_voices,
supports_synthesize_streaming=False,
))

return programs


def create_info(asr_programs: list[AsrProgram], tts_programs: list[TtsProgram]) -> Info:
@@ -404,16 +456,30 @@ async def _list_speaches_voices(self, model_name: str) -> list[str]:

# Unified API

async def list_supported_voices(self, model_names: str | list[str], languages: list[str]) -> list[TtsVoiceModel]:
async def list_supported_voices(self, model_names: list[str], streaming_model_names: list[str], languages: list[str]) -> list[TtsVoiceModel]:
"""
Fetches the available voices via unofficial specs.
Fetches the available voices via unofficial specs, falling back to streaming models when regular models are not specified (consistent with ASR behavior).
Note: this is not the list of CONFIGURED voices.
"""
if isinstance(model_names, str):
model_names = [model_names]
# Use the same fallback pattern as create_asr_programs
seen = set()
ordered_models = []

tts_voice_models = []
# Add streaming models first
for model_name in streaming_model_names:
if model_name not in seen:
ordered_models.append(model_name)
seen.add(model_name)

# Add non-streaming models
for model_name in model_names:
if model_name not in seen:
ordered_models.append(model_name)
seen.add(model_name)

tts_voice_models = []
for model_name in ordered_models:
if self.backend == OpenAIBackend.OPENAI:
tts_voices = await self.list_openai_voices()
elif self.backend == OpenAIBackend.SPEACHES:
Expand All @@ -426,15 +492,17 @@ async def list_supported_voices(self, model_names: str | list[str], languages: l
_LOGGER.warning("Unknown backend: %s", self.backend)
continue

# Create TTS voices in Wyoming Protocol format
# Create TTS voices in Wyoming Protocol format, preserving streaming model info
tts_voice_models.extend(create_tts_voices(
tts_models=[model_name],
tts_streaming_models=streaming_model_names,
tts_voices=tts_voices,
tts_url=str(self.base_url),
languages=languages
))
return tts_voice_models


@classmethod
def create_autodetected_factory(cls):
"""