whisper : add Silero VAD built-in support #3003

Open
ggerganov opened this issue Apr 4, 2025 · 5 comments · May be fixed by #3065
Labels
enhancement New feature or request good first issue Good for newcomers roadmap Part of a roadmap project

Comments

@ggerganov
Member

It seems the Silero VAD project is a very good candidate for being integrated directly in whisper.cpp:

https://github.com/snakers4/silero-vad

The model appears to be very lightweight and can be used to extract voice activity timestamps, which the user can then use to process only the speech segments with the core whisper.cpp functionality, skipping the non-speech audio.
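
As a rough illustration of what "voice activity timestamps" could look like, the sketch below groups per-frame speech probabilities into `(start, end)` times in seconds. The function name, the 0.5 threshold, the 512-sample frame size and the 16 kHz sample rate are all assumptions made for this example; none of them are taken from Silero VAD or whisper.cpp.

```python
from typing import List, Tuple


def probs_to_segments(
    probs: List[float],
    threshold: float = 0.5,    # assumed cut-off, not a documented Silero default
    frame_samples: int = 512,  # assumed number of audio samples per VAD frame
    sample_rate: int = 16000,  # assumed sample rate in Hz
) -> List[Tuple[float, float]]:
    """Group consecutive frames with prob >= threshold into (start_s, end_s) segments."""
    frame_s = frame_samples / sample_rate  # duration of one frame in seconds
    segments: List[Tuple[float, float]] = []
    start = None  # index of the first frame of the current speech run, if any
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i  # a speech run begins at this frame
        elif p < threshold and start is not None:
            segments.append((start * frame_s, i * frame_s))  # the run ended before frame i
            start = None
    if start is not None:
        # The audio ended while still inside a speech run.
        segments.append((start * frame_s, len(probs) * frame_s))
    return segments
```

A real integration would likely also merge segments separated by short pauses and pad the segment edges; this sketch leaves that out.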

The model data of Silero VAD is stored in ONNX data format. From a quick search, I cannot find where the actual model tensors and architecture are defined, so we'll have to do some reverse engineering to convert this to a GGUF file. The only information I was able to find is that the model uses a multi-head attention (MHA) + short-time Fourier transform, which shouldn't be a problem to implement within whisper.cpp.
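
For readers unfamiliar with the short-time Fourier transform mentioned above, here is a minimal pure-Python sketch: the signal is cut into overlapping frames, each frame is multiplied by a Hann window, and a discrete Fourier transform is taken per frame. The function name, frame size and hop length are assumptions for illustration and are not the parameters Silero VAD uses; the naive O(N²) DFT is chosen for readability, not speed.

```python
import cmath
import math
from typing import List


def stft_magnitudes(samples: List[float], frame_size: int = 256, hop: int = 128) -> List[List[float]]:
    """Return one magnitude spectrum (frame_size // 2 + 1 bins) per Hann-windowed frame."""
    # Periodic Hann window, computed once and reused for every frame.
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * n / frame_size) for n in range(frame_size)]
    n_bins = frame_size // 2 + 1  # non-negative frequencies only (real-valued input)
    frames: List[List[float]] = []
    # Only full frames are processed; trailing samples that do not fill a frame are ignored.
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_size)]
        spectrum = []
        for k in range(n_bins):
            # Naive DFT of the windowed frame at frequency bin k.
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_size) for n in range(frame_size))
            spectrum.append(abs(acc))
        frames.append(spectrum)
    return frames
```

A production implementation would use an FFT instead of the nested loops.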

Having an integrated VAD solution in the project would simplify the work of 3rd-party projects using whisper.cpp. It is also likely to improve the notoriously bad performance of the Whisper Large v3 models.

@ggerganov ggerganov added enhancement New feature or request good first issue Good for newcomers roadmap Part of a roadmap project labels Apr 4, 2025
@JRWSP

JRWSP commented Apr 4, 2025

I have been using the Silero VAD with whisper.cpp by making a script to manage files manually.
Don't know if this is what you need.

https://github.com/JRWSP/SileroVAD_for_Whisper-cpp

@tbarbugli

I used the VAD in Go via onnxruntime; you can see the interface here: https://github.com/streamer45/silero-vad-go/blob/master/speech/ort_bridge.h

@ggerganov
Member Author

Thanks. The goal is to not depend on ONNX or python and to natively run the VAD using ggml.

@ggerganov ggerganov moved this from Todo to In Progress in whisper.cpp : roadmap Apr 4, 2025
@danbev
Collaborator

danbev commented Apr 8, 2025

I want to try it.

Just for your information and so we don't duplicate efforts, I've started working on this issue (if an issue is assigned it is most often being worked on). I hope to be able to share some progress soon.

Sorry about the late response, I have gathered some notes about the progress and I hope to have a pull request for this next week (short week here due to Easter).

@basemkhirat

Thanks. Waiting for it.

danbev added a commit to danbev/whisper.cpp that referenced this issue Apr 22, 2025
This commit adds support for Voice Activity Detection (VAD). This is
currently a work in progress and is not yet fully functional.

A silero-vad model can be converted using:
```console
(venv) $ pip install silero-vad
(venv) $ python models/convert-silero-vad-to-ggml.py --output models/silero.bin
 Saving GGML Silero-VAD model to models/silero-v5.1.2-ggml.bin
```

And there is a test that tests the VAD support in isolation:
```console
$ cmake --build build --target test-vad && \
    ctest -R ^test-vad$ --test-dir build --output-on-failure -VV
```

And one that tests VAD in combination with whisper_full:
```console
$ cmake --build build --target test-vad-full && \
    ctest -R test-vad-full --test-dir build --output-on-failure -VV
```

Resolves: ggml-org#3003
danbev added a commit to danbev/whisper.cpp that referenced this issue Apr 23, 2025
danbev added a commit to danbev/whisper.cpp that referenced this issue Apr 25, 2025
This commit adds support for Voice Activity Detection (VAD). This is
currently a work in progress and is not yet fully functional.

A silero-vad model can be converted using:
```console
(venv) $ pip install silero-vad
(venv) $ python models/convert-silero-vad-to-ggml.py --output models/silero.bin
 Saving GGML Silero-VAD model to models/silero-v5.1.2-ggml.bin
```

And there is a test that tests the VAD support in isolation:
```console
$ cmake --build build --target test-vad && \
    ctest -R ^test-vad$ --test-dir build -C Debug --output-on-failure -VV
```

And one that tests VAD in combination with whisper_full:
```console
$ cmake --build build --target test-vad-full && \
    ctest -R test-vad-full --test-dir build -C Debug --output-on-failure -VV
```

Resolves: ggml-org#3003
danbev added a commit to danbev/whisper.cpp that referenced this issue Apr 28, 2025
This commit adds support for Voice Activity Detection (VAD). When enabled,
this feature will process the audio input and detect speech segments.
This information is then used to reduce the number of samples that need
to be processed by whisper_full.

This initial support is based on the Silero VAD model which needs to
be converted to GGML format:
```console
(venv) $ pip install silero-vad
(venv) $ python models/convert-silero-vad-to-ggml.py --output models/silero.bin
 Saving GGML Silero-VAD model to models/silero-v5.1.2-ggml.bin
```

There is a test that tests the VAD support in isolation:
```console
$ cmake --build build --target test-vad && \
    ctest -R ^test-vad$ --test-dir build -C Debug --output-on-failure -VV
```

And one that tests VAD in combination with whisper_full:
```console
$ cmake --build build --target test-vad-full && \
    ctest -R test-vad-full --test-dir build -C Debug --output-on-failure -VV
```

Resolves: ggml-org#3003
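
The idea in the commit message above, using detected speech segments to reduce the number of samples handed to `whisper_full`, can be sketched as follows. This is a minimal illustration, not the code from the commit: the function name, the `(start_s, end_s)` segment representation and the 16 kHz default are assumptions.

```python
from typing import List, Sequence, Tuple


def keep_speech_samples(
    samples: Sequence[float],
    segments: List[Tuple[float, float]],  # assumed (start_s, end_s) pairs in seconds
    sample_rate: int = 16000,             # assumed sample rate in Hz
) -> List[float]:
    """Concatenate only the samples that fall inside the given speech segments."""
    kept: List[float] = []
    for start_s, end_s in segments:
        # Convert times to sample indices and clamp them to the buffer bounds.
        first = max(0, int(round(start_s * sample_rate)))
        last = min(len(samples), int(round(end_s * sample_rate)))
        if first < last:
            kept.extend(samples[first:last])
    return kept
```

Because the filtered buffer is shorter than the original audio, timestamps produced from it would need to be mapped back to the original timeline if accurate timings matter; that step is not shown here.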
Projects
Status: In Progress

5 participants