A robust backend server for audio processing, delivering high-accuracy transcription and speaker diarization.
Powered by Whisper for speech-to-text and Pyannote for speaker segmentation, wrapped in a clean, maintainable architecture based on Domain-Driven Design (DDD) and Hexagonal Architecture.
- High-Accuracy Transcription: Powered by OpenAI's Whisper models.
- Speaker Diarization: Identify who spoke when, using Pyannote models.
- Segmented Results: Provides speaker-separated transcriptions with precise timestamps.
- Asynchronous Workflow: Upload audio first, transcribe later using a `clip_id`.
- Clean Architecture: Follows DDD and Hexagonal (Ports & Adapters) principles for scalability and maintainability.
- Configurable Models: Easily switch between Whisper/Pyannote models via environment variables.
This project implements a strict Hexagonal Architecture (Ports & Adapters) with Domain-Driven Design:
| Layer | Responsibility | Key Components |
|---|---|---|
| Domain | Core business entities, interfaces (ports), and business rules | `AudioClip`, `SpeakerSegment`, `TranscriptionText`, `DiarizationPort`, `TranscriptionPort` |
| Application | Orchestrates use cases by combining domain logic | `TranscribeAudioUseCase`, `StoreAudioUseCase` |
| Adapters | Input/output adapters implementing domain ports | Input: FastAPI routers; Output: `ChunkedDiarizationService`, `WhisperTranscriptionService` |
| Infrastructure | Technical implementations and DI container | `DIContainer`, repository implementations, model providers |
Key architectural concepts implemented:
- Dependency Inversion: All dependencies flow inward toward the domain
- Dependency Injection: Services injected via FastAPI's dependency system
- Ports & Adapters: Clean separation through interfaces (ports) and implementations (adapters)
- Single Responsibility: Each component has exactly one reason to change
This structure enables:
- ✅ Testability: Mock any external system through port interfaces
- ✅ Maintainability: Change implementations without affecting business logic
- ✅ Flexibility: Swap out infrastructure components with minimal impact
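As a concrete illustration of the ports-and-adapters split, here is a minimal sketch in Python. The names `TranscriptionPort`, `WhisperTranscriptionService`, and `TranscribeAudioUseCase` come from the table above, but the method signatures are simplified assumptions rather than the project's actual interfaces:

```python
from typing import Protocol


# Domain layer: the port is a pure interface with no framework or model dependencies.
class TranscriptionPort(Protocol):
    def transcribe(self, audio_path: str) -> str:
        """Return the transcribed text for the audio file at audio_path."""
        ...


# Application layer: the use case depends only on the port, never on Whisper directly.
class TranscribeAudioUseCase:
    def __init__(self, transcriber: TranscriptionPort) -> None:
        self._transcriber = transcriber

    def execute(self, audio_path: str) -> str:
        return self._transcriber.transcribe(audio_path)


# Adapter layer: a Whisper-backed implementation satisfies the port; a fake used in
# tests can satisfy it just as well.
class WhisperTranscriptionService:
    def transcribe(self, audio_path: str) -> str:
        # The real adapter would load the Whisper model and run inference here.
        return "transcribed text"


# Infrastructure / composition root: wiring the adapter into the use case.
# Swapping the adapter requires no change to the domain or application layers.
use_case = TranscribeAudioUseCase(WhisperTranscriptionService())
print(use_case.execute("example.wav"))
```

In the running server this wiring happens through the `DIContainer` and FastAPI's dependency system rather than at module level.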
- Python 3.10+
- Poetry for dependency management
- A Hugging Face account and API Token (required for Pyannote models)
- Clone the repository:

  ```bash
  git clone https://github.com/Zhima-Mochi/whisper-v3-server.git
  cd whisper-v3-server
  ```

- Configure environment variables:

  ```bash
  cp .env.example .env
  ```

  Edit `.env` and add your Hugging Face token:

  ```
  HUGGINGFACE_AUTH_TOKEN=hf_YOUR_SECRET_TOKEN
  ```

- Install dependencies:

  ```bash
  poetry install
  ```

- Run the application:

  ```bash
  poetry run uvicorn app:app --reload --host 0.0.0.0 --port 8000
  ```

  ➔ API available at http://localhost:8000
- Build the image:

  ```bash
  docker build -t whisper-v3-server .
  ```

- Run the container:

  ```bash
  docker run -p 8000:8000 \
    -e HUGGINGFACE_AUTH_TOKEN=your_token_here \
    -v $(pwd)/audio_data:/tmp/whisper_v3_server_storage \
    --name whisper-v3-server \
    whisper-v3-server
  ```

  ➔ API available at http://localhost:8000
All endpoints are under `/api`.
| Method | Endpoint | Description |
|---|---|---|
| POST | `/api/audio` | Upload audio file and receive `clip_id` |
| GET | `/api/audio/{clip_id}` | Get information about a stored audio clip |
| DELETE | `/api/audio/{clip_id}` | Delete an audio clip and its transcription |
| Method | Endpoint | Description |
|---|---|---|
| POST | `/api/transcribe?clip_id={clip_id}` | Process audio with transcription & diarization |
| POST | `/api/transcribe/stream?clip_id={clip_id}` | Stream results as they're processed |
| GET | `/api/transcription/{clip_id}` | Get stored transcription results |
| GET | `/api/transcription/stream/{clip_id}` | Stream stored transcription results |
| DELETE | `/api/transcription/{clip_id}` | Delete transcription for a clip |
Upload Audio

```json
{
  "clip_id": "550e8400-e29b-41d4-a716-446655440000",
  "message": "File uploaded successfully. Use this clip_id with the /api/transcribe endpoint."
}
```

Transcribe Audio

```json
{
  "segments": [
    {
      "id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
      "audio_clip_id": "550e8400-e29b-41d4-a716-446655440000",
      "start": 0.0,
      "end": 2.5,
      "speaker_label": "SPEAKER_01",
      "text": "Hello, how are you today?"
    }
    // Additional segments...
  ]
}
```
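For a quick end-to-end check, a sketch of the workflow using the Python `requests` library is shown below. The multipart field name (`file`) and the shape of the stored-result response (assumed to match the transcribe response above) are assumptions; consult the running server's interactive docs for the exact parameters:

```python
import requests

BASE_URL = "http://localhost:8000/api"

# 1. Upload the audio file and obtain a clip_id.
#    The multipart field name "file" is an assumption about the upload endpoint.
with open("meeting.wav", "rb") as audio_file:
    resp = requests.post(f"{BASE_URL}/audio", files={"file": audio_file})
resp.raise_for_status()
clip_id = resp.json()["clip_id"]

# 2. Run transcription + diarization for that clip.
resp = requests.post(f"{BASE_URL}/transcribe", params={"clip_id": clip_id})
resp.raise_for_status()

# 3. Fetch the stored, speaker-separated segments.
resp = requests.get(f"{BASE_URL}/transcription/{clip_id}")
resp.raise_for_status()
for segment in resp.json()["segments"]:
    print(f"[{segment['start']:.1f}-{segment['end']:.1f}] "
          f"{segment['speaker_label']}: {segment['text']}")
```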
Set via `.env` or environment variables:
| Variable | Description | Default | Required |
|---|---|---|---|
| `HUGGINGFACE_AUTH_TOKEN` | Hugging Face token for Pyannote models | None | ✅ |
| `PYANNOTE_MODEL` | Model path for speaker diarization | `pyannote/speaker-diarization` | |
| `WHISPER_MODEL` | Model path for transcription | `openai/whisper-large-v3` | |
| `AUDIO_STORAGE_PATH` | Path to store uploaded audio | `/tmp/whisper_v3_server_storage` | |
| `TRANSCRIPTION_STORAGE_PATH` | Path to store transcription results | `/tmp/whisper_v3_server_storage/transcription_texts` | |
| `APP_HOST` | Host to bind the API server | `0.0.0.0` | |
| `APP_PORT` | Port to bind the API server | `8000` | |
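For example, a minimal `.env` needs only the token; the other lines below show how defaults from the table can be overridden (the values are illustrative, e.g. pointing `WHISPER_MODEL` at a smaller checkpoint):

```
HUGGINGFACE_AUTH_TOKEN=hf_YOUR_SECRET_TOKEN
WHISPER_MODEL=openai/whisper-small
AUDIO_STORAGE_PATH=/data/whisper_audio
APP_PORT=8000
```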
- API Framework: FastAPI
- Transcription: OpenAI Whisper
- Speaker Diarization: Pyannote Audio
- Dependency Management: Poetry
- Containerization: Docker
This project is licensed under the MIT License.
| Done | Priority | Code | Milestone | Purpose & Key Actions |
|---|---|---|---|---|
| ✔ | 1 | C-1 | Max out RTX 2060 single-GPU performance | Faster-Whisper small FP16 / int8_float16 → quantize first, then compare against baseline; implement singleton model |
| ⬜ | 2 | B-1 | WebSocket Streaming MVP | Add `/ws/stream`: 500 ms Opus frame → Whisper → `send_json`; 10 s ping/heartbeat |
| ⬜ | 3 | F-1 | Monitoring + Rate Limiting | Prometheus GPU/latency metrics, per-IP concurrency limit, timeout / 429 response |
| ⬜ | 4 | D-1 | Silero-VAD pre-segmentation | Silence > 600 ms → flush; 0.2 s overlap → save 20% GPU time |
| ⬜ | 5 | B-2 | HTTP/2 NDJSON Streaming | Change `/transcribe/stream` to `application/x-ndjson` + heartbeat lines |
| ⬜ | 6 | A-2 | Optional Diarization | Add `diarize=true/false` query param; skip Pyannote if not needed |
| ⬜ | 7 | C-2 | GPU↔CPU Pipeline | Whisper on GPU → `asyncio.Queue` → Pyannote on CPU; GPU can proceed immediately |
| ⬜ | 8 | H-1~4 | Dual-GPU management + Round-Robin | Scan with NVML, create a `ModelPool` per GPU, load-balanced GPU selection; support 2x 2060/3060 |
| ⬜ | 9 | A-1 | Single-step API | Add `/upload+transcribe` endpoint with webhook callback; simplify client usage |
| ⬜ | 10 | H-5~6 | Run Pyannote on GPU 2 / parallel pipeline | Load Pyannote on the idle second GPU; true parallel speaker diarization + transcription |
| ⬜ | 11 | D-2 | Incremental output algorithm | Only send "new words" to avoid flickering on the frontend |
| ⬜ | 12 | E-1 | Dual-model real-time + accuracy | Use tiny model for 0.5 s partials, small model for 30 s finals → overwrite result |
| ⬜ | 13 | H-7~8 | Batch inference & config-driven pipeline | Batch=4 under high concurrency; move thresholds to `.env` |
| ⬜ | 14 | F-2 | Opus-compressed streaming | Frontend sends `ogg/opus`, backend handles decoding |
| ⬜ | 15 | G-1~2 | Disconnection recovery / resume & multiprocessing | Support offset retransmission, `uvicorn --workers 2` + `CUDA_VISIBLE_DEVICES` |
| ⬜ | 16 | H-9~10 | Monitoring dashboard + Horizontal scaling | Grafana panels for concurrency / GPU heat; complete horizontal scaling |