Description
It seems the Silero VAD project is a very good candidate for being integrated directly in whisper.cpp: https://github.com/snakers4/silero-vad
The model appears to be very lightweight and can be used to extract voice activity timestamps. The user can then use these timestamps to run the core whisper.cpp functionality only on the speech portions of the audio, skipping non-speech segments.
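To illustrate the idea, here is a minimal sketch of how VAD timestamps could be used to drop non-speech audio before transcription. The `SpeechSegment` struct and the `keep_speech` function are hypothetical names made up for this example and are not part of whisper.cpp or Silero VAD. The sketch assumes the segments are given in seconds, sorted, and non-overlapping.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical type: one speech interval reported by a VAD, in seconds.
struct SpeechSegment {
    double t0; // start time
    double t1; // end time
};

// Hypothetical helper: copy only the samples that fall inside the speech
// segments. Segments are assumed to be sorted and non-overlapping.
std::vector<float> keep_speech(const std::vector<float> & samples,
                               const std::vector<SpeechSegment> & segments,
                               int sample_rate) {
    std::vector<float> out;
    const size_t n = samples.size();
    for (const auto & seg : segments) {
        // Convert times to sample indices and clamp them to the buffer bounds.
        const size_t i0 = std::min(n, (size_t) std::max(0.0, seg.t0 * sample_rate));
        const size_t i1 = std::min(n, (size_t) std::max(0.0, seg.t1 * sample_rate));
        if (i1 > i0) {
            out.insert(out.end(), samples.begin() + i0, samples.begin() + i1);
        }
    }
    return out;
}
```

The resulting buffer could then be passed to the usual transcription path. Note that concatenating segments shifts the time axis, so an actual integration would also need to map the output timestamps back to the original audio.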
The Silero VAD model data is stored in the ONNX format. From a quick search, I cannot find where the actual model tensors and architecture are defined, so we'll have to do some reverse engineering to convert this to a GGUF file. The only information I was able to find is that the model uses multi-head attention (MHA) and a short-time Fourier transform (STFT), which shouldn't be a problem to implement within whisper.cpp.
Having an integrated VAD solution in the project would simplify the work of third-party projects using whisper.cpp. It is also likely to improve the notoriously bad performance of the Whisper Large v3 models.