whisper : add Silero VAD built-in support #3003

Closed
@ggerganov

The Silero VAD project seems to be a very good candidate for direct integration into whisper.cpp:

https://github.com/snakers4/silero-vad

The model appears to be very lightweight and can be used to extract voice activity timestamps. The user can then run the core whisper.cpp functionality only on the detected speech, skipping non-speech segments.
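As a rough illustration of how such timestamps could be consumed, here is a minimal sketch that keeps only the speech regions of an audio buffer. The `SpeechSegment` type and the `collect_speech` helper are hypothetical names made up for this example and are not part of whisper.cpp or Silero VAD. The sketch assumes mono float PCM at 16 kHz (the sample rate whisper.cpp works with) and segments that are sorted and non-overlapping.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical type: one speech region reported by a VAD, in seconds.
struct SpeechSegment {
    float t0; // start time (s)
    float t1; // end time (s)
};

// Hypothetical helper: copy only the samples inside the speech segments
// into a single contiguous buffer.
// Assumes mono float PCM and segments sorted by start time, no overlaps.
std::vector<float> collect_speech(const std::vector<float> & pcm,
                                  const std::vector<SpeechSegment> & segments,
                                  int sample_rate = 16000) {
    std::vector<float> out;
    for (const auto & seg : segments) {
        if (seg.t1 <= seg.t0) {
            continue; // skip empty or inverted segments
        }
        // Convert seconds to sample indices and clamp them to the buffer.
        const size_t i0 = std::min(pcm.size(), (size_t) std::max(0.0f, seg.t0 * sample_rate));
        const size_t i1 = std::min(pcm.size(), (size_t) std::max(0.0f, seg.t1 * sample_rate));
        out.insert(out.end(),
                   pcm.begin() + (std::ptrdiff_t) i0,
                   pcm.begin() + (std::ptrdiff_t) i1);
    }
    return out;
}
```

The resulting buffer could then be passed to the usual transcription call in place of the full recording. One trade-off of concatenating segments this way is that timestamps produced afterwards are relative to the shortened audio, so a real integration would likely need to map them back to the original timeline.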

The Silero VAD model is distributed in the ONNX format. From a quick search, I could not find where the actual model tensors and architecture are defined, so we will have to do some reverse engineering to convert it to a GGUF file. The only information I was able to find is that the model uses multi-head attention (MHA) and a short-time Fourier transform (STFT), neither of which should be a problem to implement within whisper.cpp.

Having an integrated VAD solution in the project would simplify the work of 3rd-party projects using whisper.cpp. It is also likely to improve the notoriously bad performance of the Whisper Large v3 models.
