whisper : add Silero VAD built-in support #3003
Comments
I have been using the Silero VAD with whisper.cpp by making a script to manage files manually.
I used the VAD in Go using onnxruntime, you can see the interface here https://github.com/streamer45/silero-vad-go/blob/master/speech/ort_bridge.h
Thanks. The goal is to not depend on ONNX or python and to natively run the VAD using
Just for your information and so we don't duplicate efforts, I've started working on this issue (if an issue is assigned it is most often being worked on). I hope to be able to share some progress soon. Sorry about the late response, I have gathered some notes about the progress and I hope to have a pull request for this next week (short week here due to Easter).
Thanks. Waiting for it.
This commit adds support for Voice Activity Detection (VAD). This is currently a work in progress and is not yet fully functional.

A silero-vad model can be converted using:
```console
(venv) $ pip install silero-vad
(venv) $ python models/convert-silero-vad-to-ggml.py --output models/silero.bin
Saving GGML Silero-VAD model to models/silero-v5.1.2-ggml.bin
```
There is a test that exercises the VAD support in isolation:
```console
$ cmake --build build --target test-vad && \
    ctest -R ^test-vad$ --test-dir build --output-on-failure -VV
```
And one that tests VAD in combination with whisper_full:
```console
$ cmake --build build --target test-vad-full && \
    ctest -R test-vad-full --test-dir build --output-on-failure -VV
```
Resolves: ggml-org#3003
This commit adds support for Voice Activity Detection (VAD). This is currently a work in progress and is not yet fully functional.

A silero-vad model can be converted using:
```console
(venv) $ pip install silero-vad
(venv) $ python models/convert-silero-vad-to-ggml.py --output models/silero.bin
Saving GGML Silero-VAD model to models/silero-v5.1.2-ggml.bin
```
There is a test that exercises the VAD support in isolation:
```console
$ cmake --build build --target test-vad && \
    ctest -R ^test-vad$ --test-dir build -C Debug --output-on-failure -VV
```
And one that tests VAD in combination with whisper_full:
```console
$ cmake --build build --target test-vad-full && \
    ctest -R test-vad-full --test-dir build -C Debug --output-on-failure -VV
```
Resolves: ggml-org#3003
This commit adds support for Voice Activity Detection (VAD). When enabled, this feature processes the audio input and detects speech segments. This information is then used to reduce the number of samples that need to be processed by whisper_full.

This initial support is based on the Silero VAD model, which needs to be converted to GGML format:
```console
(venv) $ pip install silero-vad
(venv) $ python models/convert-silero-vad-to-ggml.py --output models/silero.bin
Saving GGML Silero-VAD model to models/silero-v5.1.2-ggml.bin
```
There is a test that exercises the VAD support in isolation:
```console
$ cmake --build build --target test-vad && \
    ctest -R ^test-vad$ --test-dir build -C Debug --output-on-failure -VV
```
And one that tests VAD in combination with whisper_full:
```console
$ cmake --build build --target test-vad-full && \
    ctest -R test-vad-full --test-dir build -C Debug --output-on-failure -VV
```
Resolves: ggml-org#3003
It seems the Silero VAD project is a very good candidate for being integrated directly in `whisper.cpp`: https://github.com/snakers4/silero-vad

The model appears to be very lightweight and can be used to extract voice activity timestamps that the user can then use to process with the core `whisper.cpp` functionality, avoiding non-speech segments.

The model data of Silero VAD is stored in ONNX data format. From a quick search, I cannot find where the actual model tensors and architecture are defined, so we'll have to do some reverse engineering to convert this to a GGUF file. The only information I was able to find is that the model uses a multi-head attention (MHA) + short-time Fourier transform, which shouldn't be a problem to implement within `whisper.cpp`.

Having an integrated VAD solution in the project would simplify the work of 3rd-party projects using `whisper.cpp`. It is also likely to improve the notoriously bad performance of the Whisper Large v3 models.
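
To make the idea of using voice activity timestamps to skip non-speech audio concrete, here is a minimal sketch in Python. It is purely illustrative and is not the Silero VAD or `whisper.cpp` API: the function names `speech_segments` and `keep_speech`, the 0.5 threshold, and the 512-sample frame size are all assumptions chosen for the example, and the per-frame speech probabilities are assumed to come from some VAD model.

```python
from typing import List, Sequence, Tuple


def speech_segments(probs: Sequence[float],
                    threshold: float = 0.5,   # assumed cut-off, not a library default
                    frame_size: int = 512     # assumed samples per VAD frame
                    ) -> List[Tuple[int, int]]:
    """Turn per-frame speech probabilities into (start, end) sample ranges.

    `end` is exclusive. Consecutive frames at or above the threshold are
    merged into a single segment.
    """
    segments: List[Tuple[int, int]] = []
    start = None
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i * frame_size                    # speech begins
        elif p < threshold and start is not None:
            segments.append((start, i * frame_size))  # speech ends
            start = None
    if start is not None:                             # speech runs to the end
        segments.append((start, len(probs) * frame_size))
    return segments


def keep_speech(samples: Sequence[float],
                segments: Sequence[Tuple[int, int]]) -> List[float]:
    """Concatenate only the samples that fall inside speech segments."""
    out: List[float] = []
    for start, end in segments:
        out.extend(samples[start:end])
    return out
```

The reduced sample buffer returned by the hypothetical `keep_speech` is what would then be handed to the transcription step. A real integration would also need to map timestamps in the reduced audio back to the original timeline, which this sketch omits.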