A Python-based service for converting audio files into viseme timing sequences. This tool uses whisper.cpp for speech recognition and converts the output into viseme sequences suitable for facial animation.
- Clone the repository:

      git clone https://github.com/edmundman/Viseme_generator
      cd Viseme_generator

- Install the required dependencies:

      pip install -r requirements.txt
The service will automatically download and compile whisper.cpp and required models on first run.
Process an audio file directly:

    python viseme_processor.py input_audio.wav --output output.timing
Options:

- `--output`: Specify the output file path (optional)
- `--install-path`: Custom installation path for whisper.cpp (optional)
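To batch-process files from Python, the invocation above can be assembled programmatically. `build_command` is an illustrative helper, not part of the repository; it only reflects the flags documented in this README:

```python
def build_command(audio_path, output_path=None, install_path=None):
    """Assemble the viseme_processor.py CLI invocation documented above."""
    cmd = ["python", "viseme_processor.py", audio_path]
    if output_path:
        cmd += ["--output", output_path]      # optional output file path
    if install_path:
        cmd += ["--install-path", install_path]  # optional whisper.cpp location
    return cmd

print(build_command("input_audio.wav", output_path="output.timing"))
# → ['python', 'viseme_processor.py', 'input_audio.wav', '--output', 'output.timing']
```

Pass the resulting list to `subprocess.run(..., check=True)` to execute it.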
Start the FastAPI server:

    python vis_server.py
The server will start on `http://localhost:8000` by default.
- `POST /process/`
  - Upload a WAV file for processing
  - Returns JSON with viseme timing data
- `GET /health/`
  - Health check endpoint
  - Returns server status
Using curl:

    curl -X POST "http://localhost:8000/process/" \
      -H "accept: application/json" \
      -H "Content-Type: multipart/form-data" \
      -F "file=@your_audio.wav"
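The same request can be made from Python. This is a hypothetical client sketch, not part of the repository; it assumes the third-party `requests` package and a server running on the default address:

```python
import requests  # third-party: pip install requests

API_URL = "http://localhost:8000/process/"

def process_audio(path, url=API_URL):
    """Upload a WAV file to /process/ and return the parsed timing JSON."""
    with open(path, "rb") as f:
        response = requests.post(url, files={"file": (path, f, "audio/wav")})
    response.raise_for_status()  # surface HTTP errors early
    return response.json()
```

Usage: `events = process_audio("your_audio.wav")`.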
The service generates JSON timing data with the following structure:
    [
      {
        "time": 0,
        "type": "viseme",
        "value": "sil"
      },
      {
        "time": 100,
        "type": "word",
        "value": "hello",
        "start": 100,
        "end": 500
      },
      {
        "time": 100,
        "type": "viseme",
        "value": "h"
      }
      // ... more visemes
    ]
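As a sketch of consuming this output, the helper below (not part of the repository) derives per-viseme hold durations from the event list, assuming events are sorted by their millisecond `time` field and that the clip's end time is known:

```python
def viseme_durations(events, clip_end_ms):
    """Return (viseme, duration_ms) pairs: each viseme is held until the
    next viseme event starts, or until the end of the clip."""
    visemes = [e for e in events if e["type"] == "viseme"]  # skip word events
    pairs = zip(visemes, visemes[1:] + [{"time": clip_end_ms}])
    return [(cur["value"], nxt["time"] - cur["time"]) for cur, nxt in pairs]

timing = [
    {"time": 0, "type": "viseme", "value": "sil"},
    {"time": 100, "type": "word", "value": "hello", "start": 100, "end": 500},
    {"time": 100, "type": "viseme", "value": "h"},
]
print(viseme_durations(timing, 500))  # → [('sil', 100), ('h', 400)]
```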
The system uses the viseme mappings from Amazon Polly's UK English phoneme-to-viseme table: https://docs.aws.amazon.com/polly/latest/dg/ph-table-english-uk.html
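The shape of such a mapping can be sketched as a dictionary. The entries below are only a small illustrative subset chosen for phoneme groups that share a mouth shape; consult the linked table for the full, authoritative mapping:

```python
# Illustrative subset of a phoneme-to-viseme table (NOT the full mapping).
PHONEME_TO_VISEME = {
    "p": "p", "b": "p", "m": "p",  # bilabials share the closed-lips viseme
    "f": "f", "v": "f",            # labiodentals
    "s": "s", "z": "s",            # sibilants
}

def to_viseme(phoneme):
    """Look up a phoneme's viseme, falling back to silence if unknown."""
    return PHONEME_TO_VISEME.get(phoneme, "sil")

print(to_viseme("b"))  # → p
```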