A production-ready microservices platform for AI-powered speech processing
Features • Quick Start • Architecture • API Reference • Client Library
VoiceFlow is a scalable platform that provides two core AI services through a unified API:
- 🎙️ Voice-to-Text (V2T): Transcribe audio files to text using Whisper
- 🔊 Text-to-Speech (T2V): Convert text to natural-sounding speech
Built with a microservices architecture using FastAPI, Docker, and NVIDIA Triton Inference Server for high-performance AI model serving.
- 🎙️ Speech Recognition: High-accuracy audio transcription using Whisper
- 🔊 Speech Synthesis: Natural-sounding text-to-speech conversion
- 🏗️ Microservices Architecture: Scalable, containerized services
- ⚡ NVIDIA Triton: High-performance model inference server
- 📦 Object Storage: MinIO for efficient audio file management
- 🔄 Async Processing: Celery-based task queue with Redis
- 🎯 REST API: Simple, well-documented HTTP endpoints
- 🧹 Auto Cleanup: Automatic cleanup of temporary files
- 🎨 Web Interface: Built-in Gradio-based demo UI
- 🐍 Python Client: Feature-rich client library with sync/async support
- Docker and Docker Compose
- 8GB+ RAM (for AI models)
- NVIDIA GPU (optional, but recommended for better performance)
```bash
git clone https://github.com/Armaggheddon/VoiceFlow
cd VoiceFlow
docker compose up -d

# Check all services are running
docker compose ps

# Test the API
curl http://localhost:8000/health
```
Open your browser to http://localhost:7860 to access the web interface. See the Demo UI section for more details.
Transcribe audio:

```bash
curl -X POST http://localhost:8000/v1/transcribe \
  -F "audio_file=@sample.wav"
```

Synthesize speech:

```bash
curl -X POST http://localhost:8000/v1/synthesize \
  -F "text=Hello, this is VoiceFlow!"
```
Image: Microservices architecture diagram showing API Gateway, Orchestrator (Celery), STT/TTS services, Triton Server, MinIO, and Redis
| Service | Technology | Purpose |
|---|---|---|
| API Gateway | FastAPI + MinIO | Public REST API endpoints and file upload handling |
| Orchestrator | Redis + Celery | Workflow coordination and task management |
| STT Service | FastAPI + Triton + MinIO | Speech-to-text transcription using Whisper |
| TTS Service | FastAPI + Triton + MinIO | Text-to-speech synthesis |
| Inference Service | NVIDIA Triton | High-performance model serving |
| Demo UI | Gradio | Web-based user interface |
| Cleanup Worker | Python + Celery + MinIO | Automatic file cleanup |
- MinIO: S3-compatible object storage for audio files
- Redis: Message broker and result storage for Celery
- Docker: Containerization and orchestration
Transcription flow:

1. Client uploads audio file to API Gateway
2. File stored in MinIO, task queued in Orchestrator
3. STT Service downloads file, processes with Whisper via Triton
4. Transcription result stored in Redis
5. Client polls for result and receives text
Synthesis flow:

1. Client sends text to API Gateway
2. Task queued in Orchestrator
3. TTS Service generates audio via Triton, uploads to MinIO
4. Audio URL stored in Redis
5. Client polls for result and receives presigned download URL
To enable GPU acceleration:

- Install the NVIDIA Container Toolkit
- Restart the services:

```bash
docker compose down
docker compose up -d
```
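For GPU mode, the `deploy` section of the `inference-service` must be left active; it is the same block that the CPU-only configuration below comments out:

```yaml
inference-service:
  # ... build, environment, volumes, networks as in docker-compose.yaml ...
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: 1
            capabilities: [gpu]
```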
VoiceFlow also works without a GPU, though with reduced performance. To run in CPU-only mode, comment out the `deploy` section of the `inference-service` in `docker-compose.yaml`:
```yaml
inference-service:
  build:
    context: .
    dockerfile: ./services/inference-service/Dockerfile
  restart: unless-stopped
  environment:
    # Available whisper models:
    # - tiny   ~ 1GB RAM
    # - base   ~ 1GB RAM
    # - small  ~ 2GB RAM
    # - medium ~ 5GB RAM
    # - large  ~ 10GB RAM
    # - turbo  ~ 6GB RAM
    - WHISPER_MODEL_SIZE=small
  # deploy:
  #   resources:
  #     reservations:
  #       devices:
  #         - driver: nvidia
  #           count: 1
  #           capabilities: [gpu]
  volumes:
    - ./services/inference-service/model_repository:/model_repository
  networks:
    - voiceflow-net
```
Then start the services with the same `docker-compose.yaml` file:

```bash
# Use CPU-only configuration
docker compose up -d
```
Choose the Whisper model size by setting the `WHISPER_MODEL_SIZE` environment variable in the `inference-service` section of `docker-compose.yaml`, then apply it as shown after the list. Available options:

- `tiny` (~1 GB RAM)
- `base` (~1 GB RAM)
- `small` (~2 GB RAM)
- `medium` (~5 GB RAM)
- `large` (~10 GB RAM)
- `turbo` (~6 GB RAM, optimized for speed)
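After editing the value, recreate just the inference service; Docker Compose detects the changed environment and recreates the container:

```bash
# Reload the inference service with the new WHISPER_MODEL_SIZE
docker compose up -d inference-service
```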
- The model used for TTS is Chatterbox from Resemble AI, which supports multiple voices and languages and is optimized for high-quality speech synthesis.
Base URL: `http://localhost:8000`
`POST /v1/transcribe`: Transcribe an audio file to text.
Request:

```bash
curl -X POST http://localhost:8000/v1/transcribe \
  -F "audio_file=@audio.wav"
```

Response:

```json
{
  "task_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "PENDING"
}
```
`POST /v1/synthesize`: Convert text to speech.
Request:

```bash
curl -X POST http://localhost:8000/v1/synthesize \
  -F "text=Hello world"
```

Response:

```json
{
  "task_id": "550e8400-e29b-41d4-a716-446655440001",
  "status": "PENDING"
}
```
Get task result.
Transcription Result:

```json
{
  "task_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "SUCCESS",
  "transcribed_text": "Hello, this is the transcribed text",
  "audio_url": null
}
```

Synthesis Result:

```json
{
  "task_id": "550e8400-e29b-41d4-a716-446655440001",
  "status": "SUCCESS",
  "transcribed_text": null,
  "audio_url": "https://presigned-download-url"
}
```
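Clients are expected to poll this endpoint until the task leaves the `PENDING` state. Below is a minimal polling sketch in Python, assuming a `GET /v1/result/{task_id}` route; the exact path is an assumption here, so check the full API documentation and adjust it if it differs:

```python
import time

import requests

BASE_URL = "http://localhost:8000"

def wait_for_result(task_id: str, timeout: float = 120.0, interval: float = 1.0) -> dict:
    """Poll the result endpoint until the task completes or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        # NOTE: the result path below is an assumption, not confirmed by this README
        resp = requests.get(f"{BASE_URL}/v1/result/{task_id}", timeout=10)
        resp.raise_for_status()
        body = resp.json()
        if body["status"] != "PENDING":
            return body
        time.sleep(interval)
    raise TimeoutError(f"Task {task_id} did not finish within {timeout}s")

result = wait_for_result("550e8400-e29b-41d4-a716-446655440000")
print(result.get("transcribed_text") or result.get("audio_url"))
```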
For complete API documentation, see API_DOCUMENTATION.md.
VoiceFlow includes a comprehensive Python client library for easy integration:
```bash
cd client-library
pip install -e .
```
```python
from voiceflow import VoiceFlowClient

# Initialize client
client = VoiceFlowClient(base_url="http://localhost:8000")

# Transcribe audio
result = client.transcribe("audio.wav")
print(f"Transcription: {result.transcribed_text}")

# Synthesize speech
result = client.synthesize("Hello, world!")
print(f"Audio URL: {result.audio_url}")

# Download audio as numpy array
audio_array = client.synthesize("Hello!", output_format="numpy")
```
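Once a synthesis task succeeds, the returned `audio_url` is a presigned MinIO link that can be fetched like any HTTP resource. A small sketch continuing the example above with `requests` (the WAV extension is an assumption; the client library may also expose its own download helper):

```python
import requests

result = client.synthesize("Hello, world!")

# Fetch the audio from the presigned MinIO URL and save it locally
resp = requests.get(result.audio_url, timeout=60)
resp.raise_for_status()
with open("hello.wav", "wb") as f:
    f.write(resp.content)
```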
- 🔄 Sync & Async: Both synchronous and asynchronous interfaces
- 📝 Type Hints: Full type annotation support
- 🛡️ Error Handling: Comprehensive error handling
- ⏱️ Auto Polling: Built-in result polling with timeouts
- 🎵 Multiple Formats: Support for various audio output formats
```python
import asyncio

from voiceflow import AsyncVoiceFlowClient

async def main():
    async with AsyncVoiceFlowClient(base_url="http://localhost:8000") as client:
        # Concurrent processing
        tasks = [
            client.transcribe("audio1.wav"),
            client.transcribe("audio2.wav"),
            client.synthesize("Text to speech"),
        ]
        results = await asyncio.gather(*tasks)
        for result in results:
            print(result)

asyncio.run(main())
```
See the client library documentation for detailed examples and API reference.
VoiceFlow includes a built-in web interface accessible at http://localhost:7860.
Image: Demo UI showing both transcription and synthesis interfaces with file upload and audio playback
- 📁 File Upload: Drag-and-drop audio file upload
- 🎙️ Live Recording: Record audio directly in browser
- 🔊 Audio Playback: Play synthesized audio inline
- 📜 History: View previous transcriptions and syntheses
- ⚙️ Configuration: Adjust API settings and model parameters
```
voiceflow/
├── services/
│   ├── api-gateway/         # REST API endpoints
│   ├── orchestrator/        # Task coordination
│   ├── stt-service/         # Speech-to-text
│   ├── tts-service/         # Text-to-speech
│   ├── inference-service/   # NVIDIA Triton models
│   ├── demo-ui/             # Gradio web interface
│   └── cleanup-worker/      # File cleanup
├── client-library/          # Python client
├── shared/                  # Common models and utilities
├── data/                    # Storage volumes
└── docker-compose.yaml      # Service orchestration
```
- API Gateway: CPU-bound, scale horizontally
- STT/TTS Services: GPU-bound, scale based on GPU availability (see the scaling example after this list)
- Orchestrator: I/O-bound, scale based on queue depth
- Triton Server: Memory-bound, tune model batch sizes
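Adding replicas is a single Compose flag. For example, to run three STT workers (`stt-service` is assumed to match the service name in `docker-compose.yaml`; services that publish a fixed host port need a port range or load balancer before they can scale):

```bash
# Run three replicas of the speech-to-text worker
docker compose up -d --scale stt-service=3
```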
| Component | Minimum | Recommended |
|---|---|---|
| CPU | 4 cores | 8+ cores |
| RAM | 8 GB | 16+ GB |
| GPU | None | 8+ GB VRAM |
| Storage | 10 GB | 100+ GB SSD |
- Enable GPU acceleration for 5-10x performance improvement
- Tune batch sizes in Triton model configurations
- Configure connection pooling for high-throughput scenarios
- Use faster storage (SSD) for MinIO data volumes
- Scale horizontally by adding more service replicas
Contributions are welcome! Whether it's bug fixes, new features, or documentation improvements, feel free to open an issue or submit a pull request.
This project is licensed under the MIT License - see the LICENSE file for details.
- Chatterbox for speech synthesis
- OpenAI Whisper for speech recognition
- NVIDIA Triton for model serving
- FastAPI for API framework
- Celery for task management
- MinIO for object storage
- Redis for message brokering
- Gradio for the demo interface
Built with ❤️ for the AI community