27 changes: 21 additions & 6 deletions README.md
@@ -10,7 +10,7 @@ Note: This project is not affiliated with OpenAI or the Wyoming project.

## Overview

This project introduces a [Wyoming](https://github.com/OHF-Voice/wyoming) server that connects to OpenAI-compatible endpoints of your choice. Like a proxy, it enables Wyoming clients such as the [Home Assistant Wyoming Integration](https://www.home-assistant.io/integrations/wyoming/) to use the transcription (Automatic Speech Recognition - ASR) and text-to-speech synthesis (TTS) capabilities of various OpenAI-compatible projects. By acting as a bridge between the Wyoming protocol and OpenAI, you can consolidate the resource usage on your server and extend the capabilities of Home Assistant.
This project introduces a [Wyoming](https://github.com/OHF-Voice/wyoming) server that connects to OpenAI-compatible endpoints of your choice. Like a proxy, it enables Wyoming clients such as the [Home Assistant Wyoming Integration](https://www.home-assistant.io/integrations/wyoming/) to use the transcription (Automatic Speech Recognition - ASR) and text-to-speech synthesis (TTS) capabilities of various OpenAI-compatible projects. By acting as a bridge between the Wyoming protocol and OpenAI, you can consolidate the resource usage on your server and extend the capabilities of Home Assistant. The proxy now provides incremental TTS streaming compatibility by intelligently chunking text at sentence boundaries for responsive audio delivery.
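The sentence-boundary chunking approach can be sketched roughly as follows. This is an illustrative sketch, not the project's actual implementation: the real proxy uses pysbd for boundary detection, while a simple regex stands in here, and the `chunk_sentences` helper, its signature, and its default thresholds are all hypothetical.

```python
import re

def chunk_sentences(buffer: str, min_words: int = 3, max_chars: int = 200) -> tuple[list[str], str]:
    """Group complete sentences into synthesis-sized chunks.

    Returns (chunks ready to synthesize, leftover text to keep buffering).
    """
    # Naive sentence split on ., !, ? followed by whitespace (pysbd is more robust).
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', buffer) if s.strip()]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip() if current else sentence
        if current and len(candidate) > max_chars:
            chunks.append(current)   # flush before exceeding the size cap
            current = sentence
        else:
            current = candidate
        if len(current.split()) >= min_words:
            chunks.append(current)   # long enough to send as its own request
            current = ""
    return chunks, current
```

Text that is still too short (or not yet sentence-terminated) stays in the leftover buffer until more chunks arrive or synthesis stops.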

## Featured Models

@@ -28,7 +28,8 @@ This project features a variety of examples for using cutting-edge models in bot
2. **Service Consolidation**: Allow users of various programs to run inference on a single server without needing separate instances for each service.
Example: Sharing TTS/STT services between [Open WebUI](#open-webui) and [Home Assistant](#usage-in-home-assistant).
3. **Asynchronous Processing**: Enable efficient handling of multiple requests by supporting asynchronous processing of audio streams.
4. **Simple Setup with Docker**: Provide a straightforward deployment process using [Docker and Docker Compose](#docker-recommended) for OpenAI and various popular open source projects.
4. **Streaming Compatibility**: Bridge Wyoming's streaming TTS protocol with OpenAI-compatible APIs through intelligent sentence boundary chunking, enabling responsive incremental audio delivery even when the underlying API doesn't support streaming text input.
5. **Simple Setup with Docker**: Provide a straightforward deployment process using [Docker and Docker Compose](#docker-recommended) for OpenAI and various popular open source projects.

## Terminology

@@ -112,6 +113,7 @@ python -m wyoming_openai \
--tts-openai-key YOUR_TTS_API_KEY_HERE \
--tts-openai-url https://api.openai.com/v1 \
--tts-models gpt-4o-mini-tts tts-1-hd tts-1 \
--tts-streaming-models tts-1 \
--tts-voices alloy ash coral echo fable onyx nova sage shimmer \
--tts-backend OPENAI \
--tts-speed 1.0
@@ -142,6 +144,9 @@ In addition to using command-line arguments, you can configure the Wyoming OpenA
| `--tts-backend` | `TTS_BACKEND` | None (autodetected) | Enable unofficial API feature sets. |
| `--tts-speed` | `TTS_SPEED` | None (autodetected) | Speed of the TTS output (ranges from 0.25 to 4.0). |
| `--tts-instructions` | `TTS_INSTRUCTIONS` | None | Optional instructions for TTS requests (e.g. to control voice style). |
| `--tts-streaming-models` | `TTS_STREAMING_MODELS` | None | Space-separated list of TTS models to enable incremental streaming via text chunking (e.g. `tts-1`). |
| `--tts-streaming-min-words` | `TTS_STREAMING_MIN_WORDS` | None | Minimum words per text chunk for incremental TTS streaming (optional). |
| `--tts-streaming-max-chars` | `TTS_STREAMING_MAX_CHARS` | None | Maximum characters per text chunk for incremental TTS streaming (optional). |

## Docker (Recommended) [![Docker Image CI](https://github.com/roryeckel/wyoming-openai/actions/workflows/docker-image.yml/badge.svg)](https://github.com/roryeckel/wyoming-openai/actions/workflows/docker-image.yml)

@@ -376,15 +381,25 @@ sequenceDiagram
WY->>HA: AudioStop event
else Streaming TTS (SynthesizeStart/Chunk/Stop)
HA->>WY: SynthesizeStart event (voice config)
Note over WY: Initialize synthesis buffer
Note over WY: Initialize incremental synthesis<br/>with sentence boundary detection
WY->>HA: AudioStart event
loop Sending text chunks
HA->>WY: SynthesizeChunk events
Note over WY: Append to synthesis buffer
Note over WY: Accumulate text and detect<br/>complete sentences using pysbd
alt Complete sentences detected
loop For each complete sentence
WY->>OAPI: Speech synthesis request
loop While receiving audio data
OAPI-->>WY: Audio stream chunks
WY-->>HA: AudioChunk events (incremental)
end
end
end
end
HA->>WY: SynthesizeStop event
Note over WY: No-op — OpenAI `/v1/audio/speech`<br/>does not support streaming text input
Note over WY: Process any remaining text<br/>and finalize synthesis
WY->>HA: AudioStop event
WY->>HA: SynthesizeStopped event
Note over WY: Streaming flow is handled<br/>but not advertised in capabilities
end
```

3 changes: 2 additions & 1 deletion pyproject.toml
@@ -22,7 +22,8 @@ classifiers = [
]
dependencies = [
"openai==1.98.0",
"wyoming==1.7.2"
"wyoming==1.7.2",
"pysbd==0.3.4"
]

[project.urls]
51 changes: 44 additions & 7 deletions src/wyoming_openai/__main__.py
@@ -139,6 +139,24 @@ async def main():
default=os.getenv("TTS_INSTRUCTIONS", None),
help="Optional instructions for TTS requests"
)
parser.add_argument(
"--tts-streaming-models",
nargs="+",
default=os.getenv("TTS_STREAMING_MODELS", '').split(),
help="Space-separated list of TTS model names that support streaming synthesis (e.g. tts-1)"
)
parser.add_argument(
"--tts-streaming-min-words",
type=int,
default=int(os.getenv("TTS_STREAMING_MIN_WORDS")) if os.getenv("TTS_STREAMING_MIN_WORDS") else None,
help="Minimum words per chunk for streaming TTS (optional)"
)
parser.add_argument(
"--tts-streaming-max-chars",
type=int,
default=int(os.getenv("TTS_STREAMING_MAX_CHARS")) if os.getenv("TTS_STREAMING_MAX_CHARS") else None,
help="Maximum characters per chunk for streaming TTS (optional)"
)

args = parser.parse_args()

@@ -179,12 +197,12 @@ async def main():

if args.tts_voices:
# If TTS_VOICES is set, use that
tts_voices = create_tts_voices(args.tts_models, args.tts_voices, args.tts_openai_url, args.languages)
tts_voices = create_tts_voices(args.tts_models, args.tts_streaming_models, args.tts_voices, args.tts_openai_url, args.languages)
else:
# Otherwise, list supported voices via defaults
tts_voices = await tts_client.list_supported_voices(args.tts_models, args.languages)
# Otherwise, list supported voices via backend (with streaming fallback)
tts_voices = await tts_client.list_supported_voices(args.tts_models, args.tts_streaming_models, args.languages)

tts_programs = create_tts_programs(tts_voices)
tts_programs = create_tts_programs(tts_voices, tts_streaming_models=args.tts_streaming_models)

# Ensure at least one model is specified
if not asr_programs and not tts_programs:
@@ -218,8 +236,25 @@ async def main():
_logger.warning("No ASR models specified")

if tts_programs:
all_tts_voices = [voice for prog in tts_programs for voice in prog.voices]
_logger.info("*** TTS Voices ***\n%s", "\n".join(tts_voice_to_string(x) for x in all_tts_voices))
streaming_tts_voices_for_logging = []
non_streaming_tts_voices_for_logging = []

for prog in tts_programs:
for voice in prog.voices:
if getattr(prog, 'supports_synthesize_streaming', False):
streaming_tts_voices_for_logging.append(voice)
else:
non_streaming_tts_voices_for_logging.append(voice)

if streaming_tts_voices_for_logging:
_logger.info("*** Streaming TTS Voices ***\n%s", "\n".join(tts_voice_to_string(x) for x in streaming_tts_voices_for_logging))
else:
_logger.info("No Streaming TTS voices specified")

if non_streaming_tts_voices_for_logging:
_logger.info("*** Non-Streaming TTS Voices ***\n%s", "\n".join(tts_voice_to_string(x) for x in non_streaming_tts_voices_for_logging))
else:
_logger.info("No Non-Streaming TTS voices specified")
else:
_logger.warning("No TTS models specified")

@@ -238,7 +273,9 @@ async def main():
stt_temperature=args.stt_temperature,
tts_speed=args.tts_speed,
tts_instructions=args.tts_instructions,
stt_prompt=args.stt_prompt
stt_prompt=args.stt_prompt,
tts_streaming_min_words=args.tts_streaming_min_words,
tts_streaming_max_chars=args.tts_streaming_max_chars
)
)

98 changes: 83 additions & 15 deletions src/wyoming_openai/compatibility.py
@@ -116,24 +116,44 @@ def create_asr_programs(

def create_tts_voices(
tts_models: list[str],
tts_streaming_models: list[str],
tts_voices: list[str],
tts_url: str,
languages: list[str]
) -> list[TtsVoiceModel]:
"""
Creates a list of TTS (Text-to-Speech) voice models in the Wyoming Protocol format.
Falls back to streaming models when regular models are not specified (consistent with ASR behavior).

Args:
tts_models (list[str]): A list of TTS model identifiers.
tts_streaming_models (list[str]): A list of TTS streaming model identifiers.
tts_voices (list[str]): A list of voice identifiers.
tts_url (str): The URL for the TTS service attribution.
languages (list[str]): A list of supported languages.

Returns:
list[TtsVoiceModel]: A list of Wyoming TtsVoiceModel instances.
"""
voices = []
# Create ordered list: streaming models first, then non-streaming, preserving natural order and deduplicating
# (same pattern as create_asr_programs)
seen = set()
ordered_models = []

# Add streaming models first
for model_name in tts_streaming_models:
if model_name not in seen:
ordered_models.append(model_name)
seen.add(model_name)

# Add non-streaming models
for model_name in tts_models:
if model_name not in seen:
ordered_models.append(model_name)
seen.add(model_name)

voices = []
for model_name in ordered_models:
for voice in tts_voices:
voices.append(TtsVoiceModel(
name=voice,
@@ -149,32 +169,64 @@
))
return voices

def create_tts_programs(tts_voices: list[TtsVoiceModel]) -> list[TtsProgram]:
def create_tts_programs(tts_voices: list[TtsVoiceModel], tts_streaming_models: list[str] | None = None) -> list[TtsProgram]:
"""
Create TTS programs from a list of voices.
Create TTS programs from a list of voices, separating voices based on streaming model support.

Args:
tts_voices (list[TtsVoiceModel]): A list of TTS voice models.
tts_streaming_models (list[str] | None): Optional list of TTS model names that support streaming.

Returns:
list[TtsProgram]: A list of Wyoming TTS programs.
"""
if not tts_voices:
return []

return [
TtsProgram(
if tts_streaming_models is None:
tts_streaming_models = []

# Separate streaming and non-streaming voices based on their models
streaming_tts_voices = []
non_streaming_tts_voices = []

for voice in tts_voices:
if voice.model_name in tts_streaming_models:
streaming_tts_voices.append(voice)
else:
non_streaming_tts_voices.append(voice)

programs = []

if streaming_tts_voices:
programs.append(TtsProgram(
name="openai-streaming",
description="OpenAI (Streaming)",
attribution=Attribution(
name=ATTRIBUTION_NAME_PROGRAM_STREAMING,
url=ATTRIBUTION_URL,
),
installed=True,
version=__version__,
voices=streaming_tts_voices,
supports_synthesize_streaming=True,
))

if non_streaming_tts_voices:
programs.append(TtsProgram(
name="openai",
description="OpenAI",
description="OpenAI (Non-Streaming)",
attribution=Attribution(
name=ATTRIBUTION_NAME_PROGRAM,
url=ATTRIBUTION_URL,
),
installed=True,
version=__version__,
voices=tts_voices,
)
]
voices=non_streaming_tts_voices,
supports_synthesize_streaming=False,
))

return programs


def create_info(asr_programs: list[AsrProgram], tts_programs: list[TtsProgram]) -> Info:
@@ -404,16 +456,30 @@ async def _list_speaches_voices(self, model_name: str) -> list[str]:

# Unified API

async def list_supported_voices(self, model_names: str | list[str], languages: list[str]) -> list[TtsVoiceModel]:
async def list_supported_voices(self, model_names: list[str], streaming_model_names: list[str], languages: list[str]) -> list[TtsVoiceModel]:
"""
Fetches the available voices via unofficial specs.
Fetches the available voices via unofficial specs, falling back to streaming models when regular models are not specified (consistent with ASR behavior).
Note: this is not the list of CONFIGURED voices.
"""
if isinstance(model_names, str):
model_names = [model_names]
# Use the same fallback pattern as create_asr_programs
seen = set()
ordered_models = []

tts_voice_models = []
# Add streaming models first
for model_name in streaming_model_names:
if model_name not in seen:
ordered_models.append(model_name)
seen.add(model_name)

# Add non-streaming models
for model_name in model_names:
if model_name not in seen:
ordered_models.append(model_name)
seen.add(model_name)

tts_voice_models = []
for model_name in ordered_models:
if self.backend == OpenAIBackend.OPENAI:
tts_voices = await self.list_openai_voices()
elif self.backend == OpenAIBackend.SPEACHES:
Expand All @@ -426,15 +492,17 @@ async def list_supported_voices(self, model_names: str | list[str], languages: l
_LOGGER.warning("Unknown backend: %s", self.backend)
continue

# Create TTS voices in Wyoming Protocol format
# Create TTS voices in Wyoming Protocol format, preserving streaming model info
tts_voice_models.extend(create_tts_voices(
tts_models=[model_name],
tts_streaming_models=streaming_model_names,
tts_voices=tts_voices,
tts_url=str(self.base_url),
languages=languages
))
return tts_voice_models


@classmethod
def create_autodetected_factory(cls):
"""