whisper : add Silero VAD built-in support

The Silero VAD project looks like a very good candidate for direct integration into `whisper.cpp`:

https://github.com/snakers4/silero-vad

The model appears to be very lightweight and can be used to extract voice activity timestamps. The user can then pass only the detected speech segments to the core `whisper.cpp` functionality, skipping the non-speech audio.
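To illustrate the idea, here is a minimal sketch of how VAD timestamps could be used to drop non-speech audio before transcription. The `SpeechSegment` type and the `keep_speech` helper are hypothetical names made up for this example; they are not part of `whisper.cpp` or Silero VAD, and the segments are assumed to be given as sample indices into a mono float PCM buffer.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical type: one speech interval reported by a VAD, in sample indices.
struct SpeechSegment {
    std::size_t start; // first speech sample (inclusive)
    std::size_t end;   // one past the last speech sample (exclusive)
};

// Copy only the speech intervals out of a mono PCM buffer.
// Bounds that fall outside the buffer are clamped; empty intervals are skipped.
std::vector<float> keep_speech(const std::vector<float> & pcm,
                               const std::vector<SpeechSegment> & segments) {
    std::vector<float> out;
    for (const auto & seg : segments) {
        const std::size_t s = std::min(seg.start, pcm.size());
        const std::size_t e = std::min(seg.end,   pcm.size());
        if (s < e) {
            out.insert(out.end(),
                       pcm.begin() + static_cast<std::ptrdiff_t>(s),
                       pcm.begin() + static_cast<std::ptrdiff_t>(e));
        }
    }
    return out;
}
```

The filtered buffer could then be handed to `whisper_full()` in place of the original samples. Note that the resulting timestamps would refer to the filtered audio, so an integrated solution would likely need to map them back to the original timeline.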

The Silero VAD model data is stored in the [ONNX format](https://github.com/snakers4/silero-vad/tree/master/src/silero_vad/data). From a quick search, I cannot find where the actual model tensors and architecture are defined, so we'll have to do some reverse engineering to convert the model to a GGUF file. The only information I was able to find is that the model [uses multi-head attention (MHA) and a short-time Fourier transform (STFT)](https://thegradient.pub/one-voice-detector-to-rule-them-all/), which shouldn't be a problem to implement within `whisper.cpp`.
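For reference, the STFT part boils down to computing a windowed DFT over short frames of audio. The sketch below computes the magnitude spectrum of a single frame with a naive O(n²) DFT and a Hann window. The function name `frame_magnitudes` is hypothetical, and the window type and frame size are assumptions made for illustration; the actual parameters used by Silero VAD would have to be recovered from the ONNX model.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Magnitude spectrum of one audio frame: Hann window followed by a naive DFT.
// Returns n/2 + 1 bins (DC up to Nyquist) for a frame of n real samples.
std::vector<float> frame_magnitudes(const std::vector<float> & frame) {
    const std::size_t n  = frame.size();
    const double      pi = 3.14159265358979323846;

    std::vector<float> mag(n / 2 + 1, 0.0f);
    for (std::size_t k = 0; k < mag.size(); ++k) {
        double re = 0.0;
        double im = 0.0;
        for (std::size_t t = 0; t < n; ++t) {
            // Periodic Hann window weight for sample t.
            const double w = 0.5 - 0.5 * std::cos(2.0 * pi * t / n);
            const double a = 2.0 * pi * k * t / n;
            re += w * frame[t] * std::cos(a);
            im -= w * frame[t] * std::sin(a);
        }
        mag[k] = static_cast<float>(std::sqrt(re * re + im * im));
    }
    return mag;
}
```

A full STFT would apply this to overlapping frames taken at a fixed hop size, and a real implementation would use an FFT instead of the quadratic loop.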

Having an integrated VAD solution in the project would simplify the work of 3rd-party projects using `whisper.cpp`. It is also likely to improve the notoriously bad performance of the Whisper Large v3 models.

whisper : add Silero VAD built-in support #3003