vad : add initial Voice Activity Detection (VAD) support #3065

Open · danbev wants to merge 3 commits into master from vad

Conversation

@danbev (Collaborator) commented Apr 22, 2025

This commit adds support for Voice Activity Detection (VAD). When enabled, this feature processes the audio input and detects speech segments. This information is then used to reduce the number of samples that whisper_full needs to process.

This initial support is based on the Silero VAD model, which needs to be converted to GGML format:

```console
(venv) $ pip install silero-vad
(venv) $ python models/convert-silero-vad-to-ggml.py --output models/silero.bin
Saving GGML Silero-VAD model to models/silero-v5.1.2-ggml.bin
```

There is a test that tests the VAD support in isolation:

```console
$ cmake --build build --target test-vad && \
    ctest -R ^test-vad$ --test-dir build -C Debug --output-on-failure -VV
```

And one that tests VAD in combination with whisper_full:

```console
$ cmake --build build --target test-vad-full && \
    ctest -R test-vad-full --test-dir build -C Debug --output-on-failure -VV
```

Resolves: #3003
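
For illustration, here is a minimal sketch of how VAD could be enabled when calling whisper_full directly instead of going through whisper-cli. The `vad`, `vad_model_path` and `vad_params` field names are assumptions based on this PR's CLI options, not a confirmed API:

```cpp
// Hypothetical sketch only: the vad/vad_model_path/vad_params fields are
// assumed from this PR's CLI options and may differ from the actual API.
#include "whisper.h"

int transcribe_with_vad(struct whisper_context * ctx, const float * pcm, int n_samples) {
    struct whisper_full_params params = whisper_full_default_params(WHISPER_SAMPLING_BEAM_SEARCH);

    params.vad            = true;                            // enable VAD pre-processing
    params.vad_model_path = "models/silero-v5.1.2-ggml.bin"; // converted Silero model

    // Tuning knobs mirroring the CLI defaults shown in this PR.
    params.vad_params.threshold               = 0.50f; // speech probability threshold
    params.vad_params.min_speech_duration_ms  = 250;   // drop shorter detections
    params.vad_params.min_silence_duration_ms = 100;   // gap required to split segments
    params.vad_params.speech_pad_ms           = 30;    // padding added around segments

    // whisper_full then only processes the detected speech segments.
    return whisper_full(ctx, params, pcm, n_samples);
}
```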


whisper-cli example output

```console
$ ./build/bin/whisper-cli -m models/ggml-base.en.bin -f samples/jfk.wav --vad --vad-model models/for-tests-silero-v5.1.2-ggml.bin
whisper_init_from_file_with_params_no_state: loading model from 'models/ggml-base.en.bin'
whisper_init_with_params_no_state: use gpu    = 1
whisper_init_with_params_no_state: flash attn = 0
whisper_init_with_params_no_state: gpu_device = 0
whisper_init_with_params_no_state: dtw        = 0
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (12th Gen Intel(R) Core(TM) i7-1260P)
whisper_init_with_params_no_state: devices    = 1
whisper_init_with_params_no_state: backends   = 1
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 2 (base)
whisper_model_load: adding 1607 extra tokens
whisper_model_load: n_langs       = 99
whisper_model_load:          CPU total size =   147.37 MB
whisper_model_load: model size    =  147.37 MB
whisper_backend_init_gpu: no GPU found
whisper_init_state: kv self size  =    6.29 MB
whisper_init_state: kv cross size =   18.87 MB
whisper_init_state: kv pad  size  =    3.15 MB
whisper_init_state: compute buffer (conv)   =   16.26 MB
whisper_init_state: compute buffer (encode) =   85.86 MB
whisper_init_state: compute buffer (cross)  =    4.65 MB
whisper_init_state: compute buffer (decode) =   96.35 MB

system_info: n_threads = 4 / 16 | WHISPER : COREML = 0 | OPENVINO = 0 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | 

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, 5 beams + best of 5, lang = en, task = transcribe, timestamps = 1 ...

whisper_full_with_state: VAD is enabled, processing speech segments only
whisper_vad_init_from_file_with_params_no_state: loading VAD model from 'models/for-tests-silero-v5.1.2-ggml.bin'
whisper_vad_init_from_file_with_params_no_state: n_encoder_layers = 4
whisper_vad_init_from_file_with_params_no_state: encoder_in_channels[0] = 129
whisper_vad_init_from_file_with_params_no_state: encoder_in_channels[1] = 128
whisper_vad_init_from_file_with_params_no_state: encoder_in_channels[2] = 64
whisper_vad_init_from_file_with_params_no_state: encoder_in_channels[3] = 64
whisper_vad_init_from_file_with_params_no_state: encoder_out_channels[0] = 128
whisper_vad_init_from_file_with_params_no_state: encoder_out_channels[1] = 64
whisper_vad_init_from_file_with_params_no_state: encoder_out_channels[2] = 64
whisper_vad_init_from_file_with_params_no_state: encoder_out_channels[3] = 128
whisper_vad_init_from_file_with_params_no_state: kernel_sizes[0] = 3
whisper_vad_init_from_file_with_params_no_state: kernel_sizes[1] = 3
whisper_vad_init_from_file_with_params_no_state: kernel_sizes[2] = 3
whisper_vad_init_from_file_with_params_no_state: kernel_sizes[3] = 3
whisper_vad_init_from_file_with_params_no_state: lstm_input_size = 128
whisper_vad_init_from_file_with_params_no_state: lstm_hidden_size = 128
whisper_vad_init_from_file_with_params_no_state: final_conv_in = 128
whisper_vad_init_from_file_with_params_no_state: final_conv_out = 1
whisper_vad_init_from_file_with_params_no_state:          CPU total size =     0.88 MB
whisper_vad_init_from_file_with_params_no_state: model size    =    0.88 MB
whisper_backend_init_gpu: no GPU found
whisper_vad_build_graph: Building VAD graph
whisper_vad_build_graph: stft output shape = [4, 129, 1]
whisper_vad_build_encoder_layer: building encoder layer
whisper_vad_build_graph: encoder output shape = [1, 128, 1]
whisper_vad_build_lstm_layer: building LSTM layer
whisper_vad_build_lstm_layer: hidden dimension = 128
whisper_vad_build_graph: lstm output shape = [128, 1, 1]
whisper_vad_init_state: compute buffer (VAD)   =    1.59 MB
whisper_vad_detect_speech_timestamps: detecting speech timestamps in 176000 samples
whisper_vad_detect_speech: detecting speech in 176000 samples
whisper_vad_detect_speech: n_chunks: 344
whisper_vad_build_graph: Building VAD graph
whisper_vad_build_graph: stft output shape = [4, 129, 1]
whisper_vad_build_encoder_layer: building encoder layer
whisper_vad_build_graph: encoder output shape = [1, 128, 1]
whisper_vad_build_lstm_layer: building LSTM layer
whisper_vad_build_lstm_layer: hidden dimension = 128
whisper_vad_build_graph: lstm output shape = [128, 1, 1]
whisper_vad_detect_speech: props size: 344
whisper_vad_detect_speech: chunk_len: 384 < n_window: 512
whisper_vad_detect_speech: finished processing 176000 samples
whisper_vad_timestamps_from_probs: detecting speech timestamps using 344 probabilities
whisper_vad_timestamps_from_probs: Merged 0 adjacent segments, now have 5 segments
whisper_vad_timestamps_from_probs: Final speech segments after filtering: 5
whisper_vad_timestamps_from_probs: VAD segment 0: start = 0.29, end = 2.21 (duration: 1.92)
whisper_vad_timestamps_from_probs: VAD segment 1: start = 3.30, end = 3.77 (duration: 0.48)
whisper_vad_timestamps_from_probs: VAD segment 2: start = 4.00, end = 4.35 (duration: 0.35)
whisper_vad_timestamps_from_probs: VAD segment 3: start = 5.38, end = 7.65 (duration: 2.27)
whisper_vad_timestamps_from_probs: VAD segment 4: start = 8.16, end = 10.59 (duration: 2.43)
whisper_full_with_state: detected 5 speech segments
whisper_full_with_state: Including segment 0: 0.29 - 2.31 (duration: 2.02)
whisper_full_with_state: Including segment 1: 3.30 - 3.87 (duration: 0.58)
whisper_full_with_state: Including segment 2: 4.00 - 4.45 (duration: 0.45)
whisper_full_with_state: Including segment 3: 5.38 - 7.75 (duration: 2.37)
whisper_full_with_state: Including segment 4: 8.16 - 10.59 (duration: 2.43)
whisper_full_with_state: total duration of speech segments: 7.84 seconds
whisper_full_with_state: Reduced audio from 176000 to 131778 samples (25.1% reduction)

[00:00:00.000 --> 00:00:08.140]   And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.


whisper_print_timings:     load time =   115.30 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =    33.69 ms
whisper_print_timings:   sample time =   225.10 ms /   140 runs (     1.61 ms per run)
whisper_print_timings:   encode time =  9677.55 ms /     1 runs (  9677.55 ms per run)
whisper_print_timings:   decode time =    56.80 ms /     4 runs (    14.20 ms per run)
whisper_print_timings:   batchd time =  1573.50 ms /   132 runs (    11.92 ms per run)
whisper_print_timings:   prompt time =     0.00 ms /     1 runs (     0.00 ms per run)
whisper_print_timings:    total time = 11899.06 ms
```
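
The segment lines above show the post-processing in action: each audio window gets a speech probability, windows above the threshold are merged into segments, short segments and short silence gaps are filtered out, and the result is padded before being handed to whisper_full. A simplified, illustrative sketch of that probability-to-timestamp step (not the exact implementation in this PR), using the CLI defaults:

```cpp
#include <vector>

struct vad_segment { float start; float end; }; // in seconds

// Illustrative only: converts per-window speech probabilities into padded
// speech segments; the real implementation also handles max speech
// duration, sample overlap, etc.
std::vector<vad_segment> probs_to_segments(const std::vector<float> & probs) {
    const float threshold     = 0.50f;             // --vad-threshold default
    const float window_s      = 512.0f / 16000.0f; // 512-sample window at 16 kHz
    const float min_speech_s  = 0.250f;            // min speech duration
    const float min_silence_s = 0.100f;            // min silence duration
    const float pad_s         = 0.030f;            // speech padding

    std::vector<vad_segment> segments;
    bool  in_speech = false;
    float start     = 0.0f;

    for (size_t i = 0; i <= probs.size(); ++i) {
        const bool speech = i < probs.size() && probs[i] > threshold;
        if (speech && !in_speech) {
            in_speech = true;
            start     = i * window_s;
        } else if (!speech && in_speech) {
            in_speech = false;
            const float end = i * window_s;
            // Merge with the previous segment if the silence gap is too short.
            if (!segments.empty() && start - segments.back().end < min_silence_s) {
                segments.back().end = end;
            } else {
                segments.push_back({start, end});
            }
        }
    }

    // Drop segments that are too short and pad the survivors.
    std::vector<vad_segment> result;
    for (const auto & s : segments) {
        if (s.end - s.start >= min_speech_s) {
            result.push_back({s.start > pad_s ? s.start - pad_s : 0.0f, s.end + pad_s});
        }
    }
    return result;
}
```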

@danbev force-pushed the vad branch 3 times, most recently from 5758650 to 9f0ed3d on April 25, 2025
@tannisroot commented

Are there plans to add VAD support for the server, or is this a goal after the PR is merged?

@danbev (Collaborator, Author) commented Apr 26, 2025

> Are there plans to add VAD support for the server, or is this a goal after the PR is merged?

I think it would be nice to get an initial version merged first, as this PR is quite large as it is. I can then start looking at adding support to the server, and hopefully in the meantime people can start trying this out and see what works and what does not.

I'm adding the remaining options to whisper-cli now, and after that this should be ready for review.

danbev added 3 commits April 28, 2025 16:17
This commit adds support for Voice Activity Detection (VAD). When enabled,
this feature processes the audio input and detects speech segments.
This information is then used to reduce the number of samples that
whisper_full needs to process.

This initial support is based on the Silero VAD model, which needs to
be converted to GGML format:
```console
(venv) $ pip install silero-vad
(venv) $ python models/convert-silero-vad-to-ggml.py --output models/silero.bin
Saving GGML Silero-VAD model to models/silero-v5.1.2-ggml.bin
```

There is a test that tests the VAD support in isolation:
```console
$ cmake --build build --target test-vad && \
    ctest -R ^test-vad$ --test-dir build -C Debug --output-on-failure -VV
```

And one that tests VAD in combination with whisper_full:
```console
$ cmake --build build --target test-vad-full && \
    ctest -R test-vad-full --test-dir build -C Debug --output-on-failure -VV
```

Resolves: ggml-org#3003

Example of format:
```console
$ ./build/bin/whisper-cli --help

usage: ./build/bin/whisper-cli [options] file0 file1 ...
supported audio formats: flac, mp3, ogg, wav

options:
  -h,        --help              [default] show this help message and exit
  ...

Voice Activity Detection (VAD) options:
  -v,        --vad                           [false  ] enable Voice Activity Detection (VAD)
  -vm FNAME, --vad-model FNAME               [       ] VAD model path
  -vt N,     --vad-threshold N               [0.50   ] VAD threshold for speech recognition
  -vs N,     --vad_window_size_samples     N [512    ] VAD window size
  -vspd N,   --vad_min_speech_duration_ms  N [250    ] VAD min speech duration
  -vsd N,    --vad_min_silence_duration_ms N [100    ] VAD min silence duration
  -vmsd N,   --vad_max_speech_duration_s   N [FLT_MAX] VAD max speech duration
  -vp N,     --vad_speech_pad_ms           N [30     ] VAD speech padding
  -vo N,     --vad_samples_overlap         N [0.10   ] VAD samples overlap size
```
The main reason for the separate VAD options section is that the VAD
options are longer and made the rest look a little ugly.
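
As a usage example, the flags above can be combined to tighten or loosen detection, e.g. raising the threshold and widening the padding (illustrative values):

```console
$ ./build/bin/whisper-cli -m models/ggml-base.en.bin -f samples/jfk.wav \
    --vad --vad-model models/silero-v5.1.2-ggml.bin \
    --vad-threshold 0.6 --vad_speech_pad_ms 50
```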

This commit adds a job to the CI pipeline to test the VAD model. This
will only test the VAD model in isolation; that is, it does not test
whisper_full.
@danbev marked this pull request as ready for review on April 28, 2025
Successfully merging this pull request may close these issues.

whisper : add Silero VAD built-in support