A FastAPI-based REST API service for speech-to-text transcription using NVIDIA's parakeet-tdt-0.6b-v2 model. This API provides high-quality English speech recognition with automatic punctuation, capitalization, and accurate word-level timestamps.
- 🎤 High-Quality Transcription: Uses NVIDIA's 600M parameter parakeet-tdt-0.6b-v2 model
- ⏱️ Accurate Timestamps: Provides word-level timing information
- 📝 Multiple Output Formats: JSON response or SRT subtitle format
- 🔧 Automatic Audio Processing: Handles resampling and channel conversion
- 🚀 Long Audio Support: Optimized settings for audio longer than 8 minutes
- 📊 OpenAPI Compatible: Full Swagger/OpenAPI documentation
- 🛡️ Error Handling: Comprehensive error handling and validation
- Python 3.8+
- CUDA-compatible GPU (recommended) or CPU
- FFmpeg (for audio processing)
- Clone the repository
  ```bash
  git clone <your-repo-url>
  cd parakeet-tdt-0.6b-v2
  ```
- Install dependencies
  ```bash
  pip install -r requirements.txt
  ```
- Run the API server
  ```bash
  python app.py
  ```
The API will be available at http://localhost:8000
- Swagger UI: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
GET /health
Response:
```json
{
  "status": "healthy",
  "model_loaded": true,
  "device": "cuda"
}
```
POST /transcribe
Parameters:
- `file`: Audio file (multipart/form-data)

Supported formats: WAV, MP3, FLAC, OGG, MP4
Response:
```json
{
  "success": true,
  "segments": [
    {
      "start": 0.5,
      "end": 2.1,
      "text": "Hello, how are you today?"
    },
    {
      "start": 2.5,
      "end": 4.8,
      "text": "I'm doing great, thank you for asking."
    }
  ],
  "duration": 15.3,
  "message": "Transcription completed successfully"
}
```
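Since the service is built on FastAPI, the response above maps naturally onto Pydantic models. The following is only a sketch of what that schema could look like; the class and field names (`TranscriptionSegment`, `TranscriptionResponse`) are assumptions and may not match the definitions in `app.py`.

```python
from typing import List
from pydantic import BaseModel

# Hypothetical schema mirroring the documented JSON response; names are illustrative only.
class TranscriptionSegment(BaseModel):
    start: float  # segment start time in seconds
    end: float    # segment end time in seconds
    text: str     # transcribed text with punctuation and capitalization

class TranscriptionResponse(BaseModel):
    success: bool
    segments: List[TranscriptionSegment]
    duration: float  # total audio duration in seconds
    message: str
```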
POST /transcribe/srt
Parameters:
- `file`: Audio file (multipart/form-data)
Response: SRT file download
```
1
00:00:00,500 --> 00:00:02,100
Hello, how are you today?

2
00:00:02,500 --> 00:00:04,800
I'm doing great, thank you for asking.
```
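The SRT output is a direct rendering of the JSON segments. Below is a minimal conversion sketch; the helper names are illustrative and not necessarily the functions used in `app.py`.

```python
from typing import Dict, List

def format_srt_time(seconds: float) -> str:
    # SRT timestamps use HH:MM:SS,mmm with a comma before the milliseconds
    ms = int(round(seconds * 1000))
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1_000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"

def segments_to_srt(segments: List[Dict]) -> str:
    # Number each cue, format its time range, and separate cues with blank lines
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{format_srt_time(seg['start'])} --> {format_srt_time(seg['end'])}\n{seg['text']}\n"
        )
    return "\n".join(blocks)
```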
```python
import requests

# Health check
response = requests.get("http://localhost:8000/health")
print(response.json())

# Transcribe audio file
with open("audio.wav", "rb") as f:
    files = {"file": ("audio.wav", f, "audio/wav")}
    response = requests.post("http://localhost:8000/transcribe", files=files)

result = response.json()
if result["success"]:
    for segment in result["segments"]:
        print(f"[{segment['start']:.2f}s - {segment['end']:.2f}s]: {segment['text']}")
```
```bash
# Health check
curl -X GET "http://localhost:8000/health"

# Transcribe audio
curl -X POST "http://localhost:8000/transcribe" \
  -H "accept: application/json" \
  -F "file=@audio.wav"

# Get SRT subtitle file
curl -X POST "http://localhost:8000/transcribe/srt" \
  -F "file=@audio.wav" \
  --output subtitles.srt
```
```javascript
const FormData = require('form-data');
const fs = require('fs');
const axios = require('axios');

async function transcribeAudio(filePath) {
  const form = new FormData();
  form.append('file', fs.createReadStream(filePath));

  try {
    const response = await axios.post('http://localhost:8000/transcribe', form, {
      headers: form.getHeaders()
    });
    console.log('Transcription result:', response.data);
    return response.data;
  } catch (error) {
    console.error('Error:', error.response?.data || error.message);
  }
}

transcribeAudio('audio.wav');
```
- `CUDA_VISIBLE_DEVICES`: Specify which GPU to use (default: auto-detect)
- `MODEL_CACHE_DIR`: Directory to cache the model files
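As a rough illustration, these variables might be read at startup as shown below; the fallback cache path is an assumption, not necessarily what `app.py` uses.

```python
import os

# Illustrative only: how the configuration variables could be consumed at startup
gpu_ids = os.environ.get("CUDA_VISIBLE_DEVICES")           # e.g. "0" to pin to the first GPU
cache_dir = os.environ.get("MODEL_CACHE_DIR", "~/.cache")  # fallback path is hypothetical

print(f"Visible GPUs: {gpu_ids or 'auto-detect'}")
print(f"Model cache:  {os.path.expanduser(cache_dir)}")
```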
The API automatically:
- Detects available hardware (CUDA/CPU)
- Loads the parakeet-tdt-0.6b-v2 model on startup (see the loading sketch after this list)
- Applies optimized settings for long audio (>8 minutes)
- Handles memory cleanup after each request
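A minimal sketch of the startup loading step using the NeMo toolkit is shown below; the actual code in `app.py` may differ (error handling, long-audio tuning, and cache configuration are omitted).

```python
import torch
import nemo.collections.asr as nemo_asr

# Sketch only: pick a device and load the pretrained model once at startup
device = "cuda" if torch.cuda.is_available() else "cpu"
asr_model = nemo_asr.models.ASRModel.from_pretrained(
    model_name="nvidia/parakeet-tdt-0.6b-v2"
)
asr_model = asr_model.to(device)
asr_model.eval()
```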
- GPU: NVIDIA GPU with 4GB+ VRAM (recommended)
- CPU: Multi-core processor (fallback option)
- RAM: 8GB+ system memory
- Storage: 2GB+ for model cache
- Use GPU: Significantly faster than CPU processing
- Audio Format: WAV files typically process fastest
- File Size: For very long audio files (>3 hours), consider chunking (see the splitting sketch after this list)
- Concurrent Requests: API handles one request at a time to avoid memory issues
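One way to chunk a long recording before uploading is to let FFmpeg (already a requirement) split it into fixed-length segments. This is a client-side sketch; the 30-minute chunk length and file names are only examples.

```python
import subprocess

def split_audio(path: str, chunk_seconds: int = 1800, out_pattern: str = "chunk_%03d.wav") -> None:
    # Use FFmpeg's segment muxer to cut the file into fixed-length chunks without re-encoding
    subprocess.run(
        [
            "ffmpeg", "-i", path,
            "-f", "segment",
            "-segment_time", str(chunk_seconds),
            "-c", "copy",
            out_pattern,
        ],
        check=True,
    )

split_audio("long_recording.wav")
```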
The API provides detailed error messages for common issues; a client-side handling sketch follows this list:
- 400 Bad Request: Unsupported file format
- 413 Payload Too Large: File size exceeds limits
- 500 Internal Server Error: Processing or model errors
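On the client side these status codes can be handled explicitly. The sketch below assumes FastAPI's default error body with a `detail` field, which may differ from the actual responses.

```python
import requests

with open("audio.wav", "rb") as f:
    response = requests.post(
        "http://localhost:8000/transcribe",
        files={"file": ("audio.wav", f, "audio/wav")},
    )

if response.status_code == 400:
    print("Unsupported file format:", response.json().get("detail"))  # assumes a `detail` field
elif response.status_code == 413:
    print("File too large:", response.json().get("detail"))
elif response.status_code >= 500:
    print("Server-side processing error:", response.text)
else:
    print(response.json())
```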
```bash
# With auto-reload
uvicorn app:app --reload --host 0.0.0.0 --port 8000
```
```bash
# Test with sample audio
curl -X POST "http://localhost:8000/transcribe" \
  -F "file=@test_audio.wav"
```
This project uses the NVIDIA parakeet-tdt-0.6b-v2 model, which is available for both commercial and non-commercial use. Please refer to the model card for detailed licensing information.
- Model loading fails: Check CUDA installation and GPU memory (a quick check sketch follows this list)
- Audio processing errors: Ensure FFmpeg is installed
- Memory errors: Reduce concurrent requests or use CPU mode
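A quick environment check covering the first two points might look like this (a diagnostic sketch, not part of the API):

```python
import shutil
import torch

# Verify CUDA is usable and report available GPU memory
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"CUDA OK: {props.name}, {props.total_memory / 1e9:.1f} GB VRAM")
else:
    print("CUDA not available - the API will fall back to CPU")

# Verify FFmpeg is on PATH for audio decoding
print("FFmpeg:", shutil.which("ffmpeg") or "NOT FOUND - install FFmpeg")
```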
- Check the API documentation at `/docs`
- Review error messages in the server logs
- Ensure all dependencies are properly installed