Interpreto is a system for transcribing audio and video files. It provides an easy-to-use web interface for uploading media files and automatically generates transcriptions with timestamps. The modular architecture of the project, makes easy to add more functionality, like another interface (for example, a Telegram bot).
- Upload audio and video files for transcription
- Real-time transcription updates via server-sent events
- Display of transcription alongside media player
- Automatic subtitle generation in WebVTT format
- Language detection for transcripts
- Efficient audio processing with voice activity detection
Interpreto is built using a microservices architecture:
- Frontend: React-based web interface for uploading files and viewing transcriptions
- Backend API: FastAPI service that handles file uploads and job management
- Worker Service: Processes media files using Whisper models and voice detection
- Storage:
- MongoDB for storing file metadata and transcriptions
- MinIO for storing media files
- Redis for real-time communication between services
- Docker and Docker Compose
- Python 3.8+
- Node.js 14+
- CUDA-capable GPU (optional, for faster transcription)
- Clone the repository:
git clone https://github.com/mperalsapa/interpreto.git
cd interpreto
- Set up the environment variables (or use the defaults in the code):
MONGO_HOST=localhost
MONGO_PORT=27017
MONGO_USER=frontend-service
MONGO_PASS=frontend-service
MONGO_DB=media_service
# MinIO config
MINIO_HOST=localhost
MINIO_PORT=9000
MINIO_ACCESS_KEY=minioadmin
MINIO_SECRET_KEY=minioadmin
# Redis config
REDIS_HOST=localhost
REDIS_PORT=6379
- Start the services using Docker Compose:
docker-compose up -d
- Open the web application in your browser at http://localhost:8080
- Upload an audio or video file using the upload form
- Wait for the transcription to process - you'll see real-time updates
- View the transcription alongside the media player
- The transcription will be automatically displayed as subtitles
- Frontend: React
- Backend: FastAPI (Python)
- Speech Recognition: Whisper, faster-whisper
- Voice Activity Detection: Silero VAD
- Storage: MongoDB, MinIO, Redis
- Processing: PyTorch, torchaudio