Meet VoiceFlow πŸŽ™οΈπŸ”Š, your production-ready microservices platform for all things AI speech! It's designed to make high-performance voice processing a breeze, letting you effortlessly transcribe audio to text and convert text into natural-sounding speech. πŸš€

VoiceFlow Logo

License Python Docker

A production-ready microservices platform for AI-powered speech processing

Features β€’ Quick Start β€’ Architecture β€’ API Reference β€’ Client Library

πŸ“‹ Overview

VoiceFlow is a scalable platform that provides two core AI services through a unified API:

  • πŸŽ™οΈ Voice-to-Text (V2T): Transcribe audio files to text using Whisper
  • πŸ”Š Text-to-Speech (T2V): Convert text to natural-sounding speech

Built with a microservices architecture using FastAPI, Docker, and NVIDIA Triton Inference Server for high-performance AI model serving.

✨ Features

  • πŸŽ™οΈ Speech Recognition: High-accuracy audio transcription using Whisper
  • πŸ”Š Speech Synthesis: Natural-sounding text-to-speech conversion
  • πŸš€ Microservices Architecture: Scalable, containerized services
  • ⚑ NVIDIA Triton: High-performance model inference server
  • πŸ“¦ Object Storage: MinIO for efficient audio file management
  • πŸ”„ Async Processing: Celery-based task queue with Redis
  • 🎯 REST API: Simple, well-documented HTTP endpoints
  • 🧹 Auto Cleanup: Automatic cleanup of temporary files
  • 🎨 Web Interface: Built-in Gradio-based demo UI
  • πŸ“š Python Client: Feature-rich client library with sync/async support

πŸš€ Quick Start

πŸ“‹ Prerequisites

  • Docker and Docker Compose
  • 8GB+ RAM (for AI models)
  • NVIDIA GPU (optional, but recommended for better performance)

1. Clone and Start

git clone https://github.com/Armaggheddon/VoiceFlow
cd VoiceFlow
docker compose up -d

2. Verify Services

# Check all services are running
docker compose ps

# Test the API
curl http://localhost:8000/health

3. Access the Demo UI

Open your browser to http://localhost:7860 to access the web interface. See the Demo UI section for more details.

4. Try the API

Transcribe audio:

curl -X POST http://localhost:8000/v1/transcribe \
     -F "audio_file=@sample.wav"

Synthesize speech:

curl -X POST http://localhost:8000/v1/synthesize \
     -F "text=Hello, this is VoiceFlow!"

πŸ—οΈ Architecture

System Architecture Image: Microservices architecture diagram showing API Gateway, Orchestrator (Celery), STT/TTS services, Triton Server, MinIO, and Redis

πŸ”§ Core Components

Service           | Technology               | Purpose
API Gateway       | FastAPI + MinIO          | Public REST API endpoints and file upload handling
Orchestrator      | Redis + Celery           | Workflow coordination and task management
STT Service       | FastAPI + Triton + MinIO | Speech-to-text transcription using Whisper
TTS Service       | FastAPI + Triton + MinIO | Text-to-speech synthesis
Inference Service | NVIDIA Triton            | High-performance model serving
Demo UI           | Gradio                   | Web-based user interface
Cleanup Worker    | Python + Celery + MinIO  | Automatic file cleanup

πŸ—οΈ Infrastructure

  • MinIO: S3-compatible object storage for audio files
  • Redis: Message broker and result storage for Celery
  • Docker: Containerization and orchestration

πŸ”„ Request Flow

πŸŽ™οΈ Voice-to-Text (V2T)

  1. Client uploads audio file to API Gateway
  2. File stored in MinIO, task queued in Orchestrator
  3. STT Service downloads file, processes with Whisper via Triton
  4. Transcription result stored in Redis
  5. Client polls for result and receives text

πŸ”Š Text-to-Speech (T2V)

  1. Client sends text to API Gateway
  2. Task queued in Orchestrator
  3. TTS Service generates audio via Triton, uploads to MinIO
  4. Audio URL stored in Redis
  5. Client polls for result and receives presigned download URL
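
On the client side, both flows reduce to the same submit-then-poll pattern. Below is a minimal sketch using plain HTTP with the `requests` library; the endpoint paths and the `PENDING`/`SUCCESS` states come from the API Reference, while the `STARTED` state, polling interval, timeout, and the `VOICEFLOW_LIVE` guard are illustrative assumptions:

```python
import os
import time

import requests

BASE_URL = "http://localhost:8000"  # API Gateway

def poll_task(task_id: str, interval: float = 1.0, timeout: float = 120.0) -> dict:
    """Poll GET /v1/tasks/{task_id} until the task leaves the pending states."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        task = requests.get(f"{BASE_URL}/v1/tasks/{task_id}").json()
        if task["status"] not in ("PENDING", "STARTED"):  # STARTED is assumed
            return task
        time.sleep(interval)
    raise TimeoutError(f"task {task_id} did not finish within {timeout}s")

# Set VOICEFLOW_LIVE=1 to run this against a running stack.
if os.environ.get("VOICEFLOW_LIVE"):
    # V2T: upload an audio file, then poll for the transcription
    with open("sample.wav", "rb") as f:
        task = requests.post(f"{BASE_URL}/v1/transcribe",
                             files={"audio_file": f}).json()
    print(poll_task(task["task_id"])["transcribed_text"])

    # T2V: submit text, then poll for the presigned audio URL
    task = requests.post(f"{BASE_URL}/v1/synthesize",
                         data={"text": "Hello, this is VoiceFlow!"}).json()
    print(poll_task(task["task_id"])["audio_url"])
```

This is the same polling logic that the Python client library wraps behind `transcribe()` and `synthesize()`.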

βš™οΈ Configuration

GPU Support (Default configuration)

To enable GPU acceleration:

  1. Install NVIDIA Container Toolkit
  2. Restart services:
docker compose down
docker compose up -d

CPU-Only Mode

VoiceFlow works without GPU, though with reduced performance. To run in CPU-only mode, comment out the deploy section of the inference-service in docker-compose.yaml:

inference-service:
    build:
      context: .
      dockerfile: ./services/inference-service/Dockerfile
    restart: unless-stopped
    environment:
      # Available whisper models:
      # - tiny ~ 1GB RAM
      # - base ~ 1GB RAM
      # - small ~ 2GB RAM
      # - medium ~ 5GB RAM
      # - large ~ 10GB RAM
      # - turbo ~ 6GB RAM
      - WHISPER_MODEL_SIZE=small
    # deploy:
    #   resources:
    #     reservations:
    #       devices:
    #         - driver: nvidia
    #           count: 1
    #           capabilities: [gpu]
    volumes:
      - ./services/inference-service/model_repository:/model_repository
    networks:
      - voiceflow-net

Then, use the same docker-compose.yaml file to start the services:

# Use CPU-only configuration
docker compose up -d

πŸŽ›οΈ Model Customization

🎀 STT Models (Whisper)

Choose the Whisper model size by setting the WHISPER_MODEL_SIZE environment variable in the inference-service section of docker-compose.yaml. Available options include:

  • tiny (1GB RAM)
  • base (1GB RAM)
  • small (2GB RAM)
  • medium (5GB RAM)
  • large (10GB RAM)
  • turbo (6GB RAM, optimized for speed)

πŸ—£οΈ TTS Models (Chatterbox)

  • The model used for TTS is Chatterbox from Resemble AI, which supports multiple voices and languages and is optimized for high-quality speech synthesis.

πŸ“š API Reference

🌐 Base URL

http://localhost:8000

πŸ”— Endpoints

POST /v1/transcribe

Transcribe audio file to text.

Request:

curl -X POST http://localhost:8000/v1/transcribe \
     -F "audio_file=@audio.wav"

Response:

{
  "task_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "PENDING"
}

POST /v1/synthesize

Convert text to speech.

Request:

curl -X POST http://localhost:8000/v1/synthesize \
     -F "text=Hello world"

Response:

{
  "task_id": "550e8400-e29b-41d4-a716-446655440001", 
  "status": "PENDING"
}

GET /v1/tasks/{task_id}

Get task result.

Transcription Result:

{
  "task_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "SUCCESS",
  "transcribed_text": "Hello, this is the transcribed text",
  "audio_url": null
}

Synthesis Result:

{
  "task_id": "550e8400-e29b-41d4-a716-446655440001",
  "status": "SUCCESS", 
  "transcribed_text": null,
  "audio_url": "https://presigned-download-url"
}
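
The two result shapes above can be handled uniformly: check the status, then read whichever payload field is non-null. A small sketch of that pattern follows; the `is_done`/`extract_result` helper names and the `STARTED` state are illustrative assumptions, not part of the API:

```python
import os

import requests

def is_done(task: dict) -> bool:
    """True once a task has left the pending states (SUCCESS, or an error state)."""
    return task["status"] not in ("PENDING", "STARTED")  # STARTED is assumed

def extract_result(task: dict):
    """Return the populated payload: transcribed_text for V2T, audio_url for T2V."""
    if task["transcribed_text"] is not None:
        return task["transcribed_text"]
    return task["audio_url"]

# Set VOICEFLOW_LIVE=1 to fetch and save a finished synthesis result.
if os.environ.get("VOICEFLOW_LIVE"):
    task_id = "550e8400-e29b-41d4-a716-446655440001"  # from POST /v1/synthesize
    task = requests.get(f"http://localhost:8000/v1/tasks/{task_id}").json()
    if is_done(task) and task["audio_url"]:
        # audio_url is a presigned MinIO link, downloadable with any HTTP client
        with open("output.wav", "wb") as f:
            f.write(requests.get(task["audio_url"]).content)
```

Note that presigned URLs are typically time-limited, so download the audio promptly after the task succeeds.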

For complete API documentation, see API_DOCUMENTATION.md.

🐍 Client Library

VoiceFlow includes a comprehensive Python client library for easy integration:

πŸ“¦ Installation

cd client-library
pip install -e .

πŸ’‘ Quick Example

from voiceflow import VoiceFlowClient

# Initialize client
client = VoiceFlowClient(base_url="http://localhost:8000")

# Transcribe audio
result = client.transcribe("audio.wav")
print(f"Transcription: {result.transcribed_text}")

# Synthesize speech
result = client.synthesize("Hello, world!")
print(f"Audio URL: {result.audio_url}")

# Download audio as numpy array
audio_array = client.synthesize("Hello!", output_format="numpy")

✨ Features

  • πŸ”„ Sync & Async: Both synchronous and asynchronous interfaces
  • πŸ“ Type Hints: Full type annotation support
  • πŸ›‘οΈ Error Handling: Comprehensive error handling
  • ⏱️ Auto Polling: Built-in result polling with timeouts
  • 🎡 Multiple Formats: Support for various audio output formats

⚑ Async Usage

import asyncio
from voiceflow import AsyncVoiceFlowClient

async def main():
    async with AsyncVoiceFlowClient(base_url="http://localhost:8000") as client:
        # Concurrent processing
        tasks = [
            client.transcribe("audio1.wav"),
            client.transcribe("audio2.wav"),
            client.synthesize("Text to speech")
        ]
        results = await asyncio.gather(*tasks)
        
        for result in results:
            print(result)

asyncio.run(main())

See the client library documentation for detailed examples and API reference.

🎨 Demo UI

VoiceFlow includes a built-in web interface accessible at http://localhost:7860.

Images: Demo UI screenshots (AI chat, voice-to-text, text-to-voice, and history views) showing the transcription and synthesis interfaces with file upload and audio playback

✨ Features

  • πŸ“ File Upload: Drag-and-drop audio file upload
  • πŸŽ™οΈ Live Recording: Record audio directly in browser
  • πŸ”Š Audio Playback: Play synthesized audio inline
  • πŸ“‹ History: View previous transcriptions and syntheses
  • βš™οΈ Configuration: Adjust API settings and model parameters

πŸ› οΈ Development

πŸ“ Project Structure

voiceflow/
β”œβ”€β”€ services/
β”‚   β”œβ”€β”€ api-gateway/          # REST API endpoints
β”‚   β”œβ”€β”€ orchestrator/         # Task coordination
β”‚   β”œβ”€β”€ stt-service/          # Speech-to-text
β”‚   β”œβ”€β”€ tts-service/          # Text-to-speech
β”‚   β”œβ”€β”€ inference-service/    # NVIDIA Triton models
β”‚   β”œβ”€β”€ demo-ui/             # Gradio web interface
β”‚   └── cleanup-worker/      # File cleanup
β”œβ”€β”€ client-library/          # Python client
β”œβ”€β”€ shared/                  # Common models and utilities
β”œβ”€β”€ data/                    # Storage volumes
└── docker-compose.yaml      # Service orchestration

πŸš€ Performance Tuning

πŸ“ˆ Scaling Guidelines

  • API Gateway: CPU-bound, scale horizontally
  • STT/TTS Services: GPU-bound, scale based on GPU availability
  • Orchestrator: I/O-bound, scale based on queue depth
  • Triton Server: Memory-bound, tune model batch sizes

πŸ’Ύ Resource Requirements

Component | Minimum | Recommended
CPU       | 4 cores | 8+ cores
RAM       | 8 GB    | 16+ GB
GPU       | None    | 8+ GB VRAM
Storage   | 10 GB   | 100+ GB SSD

🎯 Optimization Tips

  1. Enable GPU acceleration for 5-10x performance improvement
  2. Tune batch sizes in Triton model configurations
  3. Configure connection pooling for high-throughput scenarios
  4. Use faster storage (SSD) for MinIO data volumes
  5. Scale horizontally by adding more service replicas

πŸ™Œ Contributing

Contributions are welcome! Whether it's bug fixes, new features, or documentation improvements, feel free to open an issue or submit a pull request.

πŸ“œ License

This project is licensed under the MIT License - see the LICENSE file for details.

πŸ™ Acknowledgments


Built with ❀️ for the AI community

⬆ Back to Top
