vad : add initial Voice Activity Detection (VAD) support #3065
Are there plans to add vad support for
I think it would be nice to get an initial version merged first, as this PR is quite large as it is. I can then start looking at adding support to the server, and hopefully during that time people can start trying this out and see what works and what does not. I'm adding the remaining options to whisper-cli now, and after that this should be ready for review.
This commit adds support for Voice Activity Detection (VAD). When enabled, this feature will process the audio input and detect speech segments. This information is then used to reduce the number of samples that need to be processed by whisper_full.

This initial support is based on the Silero VAD model, which needs to be converted to GGML format:
```console
$ (venv) pip install silero-vad
$ (venv) python models/convert-silero-vad-to-ggml.py --output models/silero.bin
Saving GGML Silero-VAD model to models/silero-v5.1.2-ggml.bin
```
There is a test that tests the VAD support in isolation:
```console
$ cmake --build build --target test-vad && \
  ctest -R ^test-vad$ --test-dir build -C Debug --output-on-failure -VV
```
And one that tests VAD in combination with whisper_full:
```console
$ cmake --build build --target test-vad-full && \
  ctest -R test-vad-full --test-dir build -C Debug --output-on-failure -VV
```
Resolves: ggml-org#3003
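To illustrate the idea of reducing the samples passed to whisper_full, here is a minimal sketch in Python. Note the hedge: it uses a naive per-window energy threshold as a stand-in for the Silero model's speech probabilities, and the function name `detect_speech_segments` is illustrative, not part of this PR.

```python
# Simplified sketch of VAD-based sample reduction.
# ASSUMPTION: a naive energy threshold stands in for the Silero VAD
# model's per-window speech probability; the real PR runs the converted
# GGML Silero model, not this heuristic.

def detect_speech_segments(samples, window_size=512, threshold=0.5):
    """Return (start, end) sample ranges whose mean absolute
    amplitude meets or exceeds the threshold."""
    segments = []
    start = None
    for i in range(0, len(samples), window_size):
        window = samples[i:i + window_size]
        energy = sum(abs(s) for s in window) / len(window)
        if energy >= threshold:
            if start is None:
                start = i          # speech begins at this window
        elif start is not None:
            segments.append((start, i))  # speech ended before this window
            start = None
    if start is not None:
        segments.append((start, len(samples)))  # speech ran to the end
    return segments
```

Only the returned segments would then be handed to whisper_full, which is where the reduction in processed samples comes from.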
Example of the format:
```console
$ ./build/bin/whisper-cli --help

usage: ./build/bin/whisper-cli [options] file0 file1 ...
supported audio formats: flac, mp3, ogg, wav

options:
  -h,        --help                          [default] show this help message and exit
  ...

Voice Activity Detection (VAD) options:
  -v,        --vad                           [false  ] enable Voice Activity Detection (VAD)
  -vm FNAME, --vad-model FNAME               [       ] VAD model path
  -vt N,     --vad-threshold N               [0.50   ] VAD threshold for speech recognition
  -vs N,     --vad_window_size_samples N     [512    ] VAD window size
  -vspd N,   --vad_min_speech_duration_ms N  [250    ] VAD min speech duration
  -vsd N,    --vad_min_silence_duration_ms N [100    ] VAD min silence duration
  -vmsd N,   --vad_max_speech_duration_s N   [FLT_MAX] VAD max speech duration
  -vp N,     --vad_speech_pad_ms N           [30     ] VAD speech padding
  -vo N,     --vad_samples_overlap N         [0.10   ] VAD samples overlap size
```
The main reason for the separate VAD options section is that the VAD options are longer and made the rest of the help output look a little ugly.
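As a rough illustration of how the duration-related options above could shape raw detections, here is a hedged Python sketch. The function and parameter names mirror the CLI flags but are purely illustrative; this is not the PR's implementation, and the 16 kHz sample rate is an assumption (it is whisper's expected input rate).

```python
# ASSUMPTION: 16 kHz mono input, whisper's expected sample rate.
SAMPLE_RATE = 16000

def ms_to_samples(ms):
    return int(ms * SAMPLE_RATE / 1000)

def refine_segments(segments, min_speech_duration_ms=250,
                    min_silence_duration_ms=100, speech_pad_ms=30,
                    total_samples=None):
    """Illustrative post-processing of raw (start, end) sample ranges,
    named after the CLI flags; not the PR's actual code."""
    # 1. merge segments separated by less than the minimum silence
    merged = []
    for start, end in segments:
        if merged and start - merged[-1][1] < ms_to_samples(min_silence_duration_ms):
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    # 2. drop segments shorter than the minimum speech duration
    min_speech = ms_to_samples(min_speech_duration_ms)
    kept = [(s, e) for s, e in merged if e - s >= min_speech]
    # 3. pad each kept segment on both sides, clamped to the audio
    pad = ms_to_samples(speech_pad_ms)
    padded = []
    for s, e in kept:
        s = max(0, s - pad)
        e = e + pad if total_samples is None else min(total_samples, e + pad)
        padded.append((s, e))
    return padded
```

With the defaults above, two detections separated by less than 100 ms of silence are merged, anything shorter than 250 ms of speech is dropped, and survivors are padded by 30 ms on each side.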
This commit adds a job to the CI pipeline to test the VAD model. It only tests the VAD model in isolation; that is, it does not test whisper_full.
I am doing some initial testing using long audio and I am wondering if we can somehow align the output timestamps with the original audio. Right now, I think the audio that is cut out is not taken into account, so the final timestamps are not aligned with the input audio, which makes it a bit difficult to evaluate the results.
Ah yes, currently only the samples that are detected to contain speech are passed to