vad : add initial Voice Activity Detection (VAD) support #3065

Open · danbev wants to merge 3 commits into master from vad

Conversation

@danbev (Collaborator) commented Apr 22, 2025

This commit adds support for Voice Activity Detection (VAD). When enabled, this feature processes the audio input and detects speech segments. This information is then used to reduce the number of samples that whisper_full needs to process.

This initial support is based on the Silero VAD model, which needs to be converted to GGML format:

```console
(venv) $ pip install silero-vad
(venv) $ python models/convert-silero-vad-to-ggml.py --output models/silero.bin
Saving GGML Silero-VAD model to models/silero-v5.1.2-ggml.bin
```

There is a test that tests the VAD support in isolation:

```console
$ cmake --build build --target test-vad && \
    ctest -R ^test-vad$ --test-dir build -C Debug --output-on-failure -VV
```

And one that tests VAD in combination with whisper_full:

```console
$ cmake --build build --target test-vad-full && \
    ctest -R test-vad-full --test-dir build -C Debug --output-on-failure -VV
```

Resolves: #3003
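
For illustration, here is a minimal sketch of how VAD could be enabled when calling whisper_full directly instead of going through whisper-cli. The `vad`, `vad_model_path` and `vad_params` field names are assumptions based on this PR's CLI options, not a confirmed API:

```cpp
// Hypothetical sketch only: the vad/vad_model_path/vad_params fields are
// assumed from this PR's CLI options and may differ from the actual API.
#include "whisper.h"

int transcribe_with_vad(struct whisper_context * ctx, const float * pcm, int n_samples) {
    struct whisper_full_params params = whisper_full_default_params(WHISPER_SAMPLING_BEAM_SEARCH);

    params.vad            = true;                            // enable VAD pre-processing
    params.vad_model_path = "models/silero-v5.1.2-ggml.bin"; // converted Silero model

    // Tuning knobs mirroring the CLI defaults shown in this PR.
    params.vad_params.threshold               = 0.50f; // speech probability threshold
    params.vad_params.min_speech_duration_ms  = 250;   // drop shorter detections
    params.vad_params.min_silence_duration_ms = 100;   // gap required to split segments
    params.vad_params.speech_pad_ms           = 30;    // padding added around segments

    // whisper_full then only processes the detected speech segments.
    return whisper_full(ctx, params, pcm, n_samples);
}
```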


whisper-cli example output

```console
$ ./build/bin/whisper-cli -m models/ggml-base.en.bin -f samples/jfk.wav --vad --vad-model models/for-tests-silero-v5.1.2-ggml.bin
whisper_init_from_file_with_params_no_state: loading model from 'models/ggml-base.en.bin'
whisper_init_with_params_no_state: use gpu    = 1
whisper_init_with_params_no_state: flash attn = 0
whisper_init_with_params_no_state: gpu_device = 0
whisper_init_with_params_no_state: dtw        = 0
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (12th Gen Intel(R) Core(TM) i7-1260P)
whisper_init_with_params_no_state: devices    = 1
whisper_init_with_params_no_state: backends   = 1
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 2 (base)
whisper_model_load: adding 1607 extra tokens
whisper_model_load: n_langs       = 99
whisper_model_load:          CPU total size =   147.37 MB
whisper_model_load: model size    =  147.37 MB
whisper_backend_init_gpu: no GPU found
whisper_init_state: kv self size  =    6.29 MB
whisper_init_state: kv cross size =   18.87 MB
whisper_init_state: kv pad  size  =    3.15 MB
whisper_init_state: compute buffer (conv)   =   16.26 MB
whisper_init_state: compute buffer (encode) =   85.86 MB
whisper_init_state: compute buffer (cross)  =    4.65 MB
whisper_init_state: compute buffer (decode) =   96.35 MB

system_info: n_threads = 4 / 16 | WHISPER : COREML = 0 | OPENVINO = 0 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | 

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, 5 beams + best of 5, lang = en, task = transcribe, timestamps = 1 ...

whisper_full_with_state: VAD is enabled, processing speech segments only
whisper_vad_init_from_file_with_params_no_state: loading VAD model from 'models/for-tests-silero-v5.1.2-ggml.bin'
whisper_vad_init_from_file_with_params_no_state: n_encoder_layers = 4
whisper_vad_init_from_file_with_params_no_state: encoder_in_channels[0] = 129
whisper_vad_init_from_file_with_params_no_state: encoder_in_channels[1] = 128
whisper_vad_init_from_file_with_params_no_state: encoder_in_channels[2] = 64
whisper_vad_init_from_file_with_params_no_state: encoder_in_channels[3] = 64
whisper_vad_init_from_file_with_params_no_state: encoder_out_channels[0] = 128
whisper_vad_init_from_file_with_params_no_state: encoder_out_channels[1] = 64
whisper_vad_init_from_file_with_params_no_state: encoder_out_channels[2] = 64
whisper_vad_init_from_file_with_params_no_state: encoder_out_channels[3] = 128
whisper_vad_init_from_file_with_params_no_state: kernel_sizes[0] = 3
whisper_vad_init_from_file_with_params_no_state: kernel_sizes[1] = 3
whisper_vad_init_from_file_with_params_no_state: kernel_sizes[2] = 3
whisper_vad_init_from_file_with_params_no_state: kernel_sizes[3] = 3
whisper_vad_init_from_file_with_params_no_state: lstm_input_size = 128
whisper_vad_init_from_file_with_params_no_state: lstm_hidden_size = 128
whisper_vad_init_from_file_with_params_no_state: final_conv_in = 128
whisper_vad_init_from_file_with_params_no_state: final_conv_out = 1
whisper_vad_init_from_file_with_params_no_state:          CPU total size =     0.88 MB
whisper_vad_init_from_file_with_params_no_state: model size    =    0.88 MB
whisper_backend_init_gpu: no GPU found
whisper_vad_build_graph: Building VAD graph
whisper_vad_build_graph: stft output shape = [4, 129, 1]
whisper_vad_build_encoder_layer: building encoder layer
whisper_vad_build_graph: encoder output shape = [1, 128, 1]
whisper_vad_build_lstm_layer: building LSTM layer
whisper_vad_build_lstm_layer: hidden dimension = 128
whisper_vad_build_graph: lstm output shape = [128, 1, 1]
whisper_vad_init_state: compute buffer (VAD)   =    1.59 MB
whisper_vad_detect_speech_timestamps: detecting speech timestamps in 176000 samples
whisper_vad_detect_speech: detecting speech in 176000 samples
whisper_vad_detect_speech: n_chunks: 344
whisper_vad_build_graph: Building VAD graph
whisper_vad_build_graph: stft output shape = [4, 129, 1]
whisper_vad_build_encoder_layer: building encoder layer
whisper_vad_build_graph: encoder output shape = [1, 128, 1]
whisper_vad_build_lstm_layer: building LSTM layer
whisper_vad_build_lstm_layer: hidden dimension = 128
whisper_vad_build_graph: lstm output shape = [128, 1, 1]
whisper_vad_detect_speech: props size: 344
whisper_vad_detect_speech: chunk_len: 384 < n_window: 512
whisper_vad_detect_speech: finished processing 176000 samples
whisper_vad_timestamps_from_probs: detecting speech timestamps using 344 probabilities
whisper_vad_timestamps_from_probs: Merged 0 adjacent segments, now have 5 segments
whisper_vad_timestamps_from_probs: Final speech segments after filtering: 5
whisper_vad_timestamps_from_probs: VAD segment 0: start = 0.29, end = 2.21 (duration: 1.92)
whisper_vad_timestamps_from_probs: VAD segment 1: start = 3.30, end = 3.77 (duration: 0.48)
whisper_vad_timestamps_from_probs: VAD segment 2: start = 4.00, end = 4.35 (duration: 0.35)
whisper_vad_timestamps_from_probs: VAD segment 3: start = 5.38, end = 7.65 (duration: 2.27)
whisper_vad_timestamps_from_probs: VAD segment 4: start = 8.16, end = 10.59 (duration: 2.43)
whisper_full_with_state: detected 5 speech segments
whisper_full_with_state: Including segment 0: 0.29 - 2.31 (duration: 2.02)
whisper_full_with_state: Including segment 1: 3.30 - 3.87 (duration: 0.58)
whisper_full_with_state: Including segment 2: 4.00 - 4.45 (duration: 0.45)
whisper_full_with_state: Including segment 3: 5.38 - 7.75 (duration: 2.37)
whisper_full_with_state: Including segment 4: 8.16 - 10.59 (duration: 2.43)
whisper_full_with_state: total duration of speech segments: 7.84 seconds
whisper_full_with_state: Reduced audio from 176000 to 131778 samples (25.1% reduction)

[00:00:00.000 --> 00:00:08.140]   And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.


whisper_print_timings:     load time =   115.30 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =    33.69 ms
whisper_print_timings:   sample time =   225.10 ms /   140 runs (     1.61 ms per run)
whisper_print_timings:   encode time =  9677.55 ms /     1 runs (  9677.55 ms per run)
whisper_print_timings:   decode time =    56.80 ms /     4 runs (    14.20 ms per run)
whisper_print_timings:   batchd time =  1573.50 ms /   132 runs (    11.92 ms per run)
whisper_print_timings:   prompt time =     0.00 ms /     1 runs (     0.00 ms per run)
whisper_print_timings:    total time = 11899.06 ms
```
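
The segment lines above show the post-processing in action: each audio window gets a speech probability, windows above the threshold are merged into segments, short segments and short silence gaps are filtered out, and the result is padded before being handed to whisper_full. A simplified, illustrative sketch of that probability-to-timestamp step (not the exact implementation in this PR), using the CLI defaults:

```cpp
#include <vector>

struct vad_segment { float start; float end; }; // in seconds

// Illustrative only: converts per-window speech probabilities into padded
// speech segments; the real implementation also handles max speech
// duration, sample overlap, etc.
std::vector<vad_segment> probs_to_segments(const std::vector<float> & probs) {
    const float threshold     = 0.50f;             // --vad-threshold default
    const float window_s      = 512.0f / 16000.0f; // 512-sample window at 16 kHz
    const float min_speech_s  = 0.250f;            // min speech duration
    const float min_silence_s = 0.100f;            // min silence duration
    const float pad_s         = 0.030f;            // speech padding

    std::vector<vad_segment> segments;
    bool  in_speech = false;
    float start     = 0.0f;

    for (size_t i = 0; i <= probs.size(); ++i) {
        const bool speech = i < probs.size() && probs[i] > threshold;
        if (speech && !in_speech) {
            in_speech = true;
            start     = i * window_s;
        } else if (!speech && in_speech) {
            in_speech = false;
            const float end = i * window_s;
            // Merge with the previous segment if the silence gap is too short.
            if (!segments.empty() && start - segments.back().end < min_silence_s) {
                segments.back().end = end;
            } else {
                segments.push_back({start, end});
            }
        }
    }

    // Drop segments that are too short and pad the survivors.
    std::vector<vad_segment> result;
    for (const auto & s : segments) {
        if (s.end - s.start >= min_speech_s) {
            result.push_back({s.start > pad_s ? s.start - pad_s : 0.0f, s.end + pad_s});
        }
    }
    return result;
}
```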

@danbev force-pushed the vad branch 3 times, most recently from 5758650 to 9f0ed3d on April 25, 2025
@tannisroot commented

Are there plans to add VAD support for the server, or is this a goal after the PR is merged?

@danbev (Collaborator, Author) commented Apr 26, 2025

> Are there plans to add VAD support for the server, or is this a goal after the PR is merged?

I think it would be nice to get an initial version merged first, as this PR is quite large as it is. I can then start looking at adding support to the server, and hopefully in the meantime people can start trying this out and see what works and what does not.

I'm adding the remaining options to whisper-cli now, and after that this should be ready for review.

danbev added 3 commits April 28, 2025 16:17
This commit adds support for Voice Activity Detection (VAD). When enabled,
this feature processes the audio input and detects speech segments.
This information is then used to reduce the number of samples that
whisper_full needs to process.

This initial support is based on the Silero VAD model, which needs to
be converted to GGML format:
```console
(venv) $ pip install silero-vad
(venv) $ python models/convert-silero-vad-to-ggml.py --output models/silero.bin
Saving GGML Silero-VAD model to models/silero-v5.1.2-ggml.bin
```

There is a test that tests the VAD support in isolation:
```console
$ cmake --build build --target test-vad && \
    ctest -R ^test-vad$ --test-dir build -C Debug --output-on-failure -VV
```

And one that tests VAD in combination with whisper_full:
```console
$ cmake --build build --target test-vad-full && \
    ctest -R test-vad-full --test-dir build -C Debug --output-on-failure -VV
```

Resolves: ggml-org#3003

Example of format:
```console
$ ./build/bin/whisper-cli --help

usage: ./build/bin/whisper-cli [options] file0 file1 ...
supported audio formats: flac, mp3, ogg, wav

options:
  -h,        --help              [default] show this help message and exit
  ...

Voice Activity Detection (VAD) options:
  -v,        --vad                           [false  ] enable Voice Activity Detection (VAD)
  -vm FNAME, --vad-model FNAME               [       ] VAD model path
  -vt N,     --vad-threshold N               [0.50   ] VAD threshold for speech recognition
  -vs N,     --vad_window_size_samples     N [512    ] VAD window size
  -vspd N,   --vad_min_speech_duration_ms  N [250    ] VAD min speech duration
  -vsd N,    --vad_min_silence_duration_ms N [100    ] VAD min silence duration
  -vmsd N,   --vad_max_speech_duration_s   N [FLT_MAX] VAD max speech duration
  -vp N,     --vad_speech_pad_ms           N [30     ] VAD speech padding
  -vo N,     --vad_samples_overlap         N [0.10   ] VAD samples overlap size
```
The main reason for the separate VAD options section is that the VAD
options are longer and made the rest look a little ugly.
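
As a usage example, the flags above can be combined to tighten or loosen detection, e.g. raising the threshold and widening the padding (illustrative values):

```console
$ ./build/bin/whisper-cli -m models/ggml-base.en.bin -f samples/jfk.wav \
    --vad --vad-model models/silero-v5.1.2-ggml.bin \
    --vad-threshold 0.6 --vad_speech_pad_ms 50
```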

This commit adds a job to the CI pipeline to test the VAD model. This
will only test the VAD model in isolation; that is, it does not test
whisper_full.
@danbev marked this pull request as ready for review on April 28, 2025
Successfully merging this pull request may close these issues.

whisper : add Silero VAD built-in support