feat: support vad for addon.node #3301

Merged
merged 1 commit on Jul 2, 2025

84 changes: 81 additions & 3 deletions examples/addon.node/README.md
@@ -1,8 +1,10 @@
# whisper.cpp Node.js addon

This is an addon demo that can **run whisper model inference in `node` and `electron` environments**, based on [cmake-js](https://github.com/cmake-js/cmake-js).
It can be used as a reference for integrating whisper.cpp into other Node.js projects.

This addon now supports **Voice Activity Detection (VAD)** for improved transcription performance.

## Install

```shell
@@ -26,12 +28,88 @@ For Electron addon and cmake-js options, see [cmake-js](https://github.com/cmake-js/cmake-js).
## Run
### Basic Usage
```shell
cd examples/addon.node
node index.js --language='language' --model='model-path' --fname_inp='file-path'
```
Because this is a simple demo, only the parameters above are set on the command line; the full list of supported parameters is given below.
### VAD (Voice Activity Detection) Usage

Run the VAD example with performance comparison:

```shell
node vad-example.js
```
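
The comparison in `vad-example.js` follows the same idea as the sketch below: transcribe the same file once with VAD disabled and once with VAD enabled, then compare wall-clock time and segment counts. This is only an illustrative sketch using the addon API documented later in this README, not the exact contents of `vad-example.js`, and it assumes the base model, the Silero VAD model, and `samples/jfk.wav` have already been downloaded.

```javascript
const path = require("path");
const { promisify } = require("util");
const { whisper } = require(path.join(__dirname, "../../build/Release/addon.node"));

const whisperAsync = promisify(whisper);

// Shared parameters; paths assume the default whisper.cpp repository layout.
const base = {
  language: "en",
  model: path.join(__dirname, "../../models/ggml-base.en.bin"),
  fname_inp: path.join(__dirname, "../../samples/jfk.wav"),
  no_prints: true
};

async function timedRun(label, params) {
  const start = Date.now();
  const result = await whisperAsync(params);
  console.log(`${label}: ${Date.now() - start} ms, ${result.transcription.length} segments`);
}

(async () => {
  // Plain transcription, no VAD.
  await timedRun("without VAD", { ...base, vad: false });

  // Same file with VAD enabled (requires the downloaded Silero model).
  await timedRun("with VAD", {
    ...base,
    vad: true,
    vad_model: path.join(__dirname, "../../models/ggml-silero-v5.1.2.bin"),
    vad_threshold: 0.5
  });
})();
```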

## Voice Activity Detection (VAD) Support

VAD can significantly improve transcription performance by only processing speech segments, which is especially beneficial for audio files with long periods of silence.

### VAD Model Setup

Before using VAD, download a VAD model:

```shell
# From the whisper.cpp root directory
./models/download-vad-model.sh silero-v5.1.2
```

### VAD Parameters

All VAD parameters are optional and have sensible defaults:

- `vad`: Enable VAD (default: false)
- `vad_model`: Path to VAD model file (required when VAD enabled)
- `vad_threshold`: Speech detection threshold 0.0-1.0 (default: 0.5)
- `vad_min_speech_duration_ms`: Min speech duration in ms (default: 250)
- `vad_min_silence_duration_ms`: Min silence duration in ms (default: 100)
- `vad_max_speech_duration_s`: Max speech duration in seconds (default: FLT_MAX)
- `vad_speech_pad_ms`: Speech padding in ms (default: 30)
- `vad_samples_overlap`: Sample overlap 0.0-1.0 (default: 0.1)

### JavaScript API Example

```javascript
const path = require("path");
const { whisper } = require(path.join(__dirname, "../../build/Release/addon.node"));
const { promisify } = require("util");

const whisperAsync = promisify(whisper);

// With VAD enabled
const vadParams = {
  language: "en",
  model: path.join(__dirname, "../../models/ggml-base.en.bin"),
  fname_inp: path.join(__dirname, "../../samples/jfk.wav"),
  vad: true,
  vad_model: path.join(__dirname, "../../models/ggml-silero-v5.1.2.bin"),
  vad_threshold: 0.5,
  progress_callback: (progress) => console.log(`Progress: ${progress}%`)
};

whisperAsync(vadParams).then(result => console.log(result));
```
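
The resolved result is an object whose `transcription` field is an array of segments; the tests added in this PR read each segment as a `[startTime, endTime, text]` triple. Assuming that layout, the full text can be assembled like this:

```javascript
// Assumes the same `whisperAsync` and `vadParams` as in the example above.
whisperAsync(vadParams).then((result) => {
  // Each segment is read as [startTime, endTime, text] in this PR's tests.
  const text = result.transcription.map((segment) => segment[2]).join(" ");
  console.log(text);
});
```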

## Supported Parameters

Both the traditional whisper.cpp parameters and the new VAD parameters can be passed from the Node.js side (a usage sketch follows this list):

- `language`: Language code (e.g., "en", "es", "fr")
- `model`: Path to whisper model file
- `fname_inp`: Path to input audio file
- `use_gpu`: Enable GPU acceleration (default: true)
- `flash_attn`: Enable flash attention (default: false)
- `no_prints`: Disable console output (default: false)
- `no_timestamps`: Disable timestamps (default: false)
- `detect_language`: Auto-detect language (default: false)
- `audio_ctx`: Audio context size (default: 0)
- `max_len`: Maximum segment length (default: 0)
- `max_context`: Maximum context size (default: -1)
- `prompt`: Initial prompt for decoder
- `comma_in_time`: Use comma in timestamps (default: true)
- `print_progress`: Print progress info (default: false)
- `progress_callback`: Progress callback function
- VAD parameters (see above section)
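
As a quick reference, here is a sketch that combines several of the non-VAD parameters above in a single call. The values are examples only, and the model/sample paths assume the default whisper.cpp layout.

```javascript
const path = require("path");
const { promisify } = require("util");
const { whisper } = require(path.join(__dirname, "../../build/Release/addon.node"));

const whisperAsync = promisify(whisper);

// Example values only; adjust paths and options to your setup.
const params = {
  language: "en",
  model: path.join(__dirname, "../../models/ggml-base.en.bin"),
  fname_inp: path.join(__dirname, "../../samples/jfk.wav"),
  use_gpu: true,
  flash_attn: false,
  no_prints: true,
  no_timestamps: false,
  audio_ctx: 0,
  max_len: 0,
  max_context: -1,
  prompt: "",
  comma_in_time: true,
  print_progress: false,
  progress_callback: (progress) => console.log(`Progress: ${progress}%`)
};

whisperAsync(params).then((result) => {
  console.log(result.transcription);
});
```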
144 changes: 119 additions & 25 deletions examples/addon.node/__test__/whisper.spec.js
@@ -1,39 +1,133 @@
const { join } = require('path');
const { whisper } = require('../../../build/Release/addon.node');
const { promisify } = require('util');

const whisperAsync = promisify(whisper);

const commonParams = {
  language: 'en',
  model: join(__dirname, '../../../models/ggml-base.en.bin'),
  fname_inp: join(__dirname, '../../../samples/jfk.wav'),
  use_gpu: true,
  flash_attn: false,
  no_prints: true,
  comma_in_time: false,
  translate: true,
  no_timestamps: false,
  detect_language: false,
  audio_ctx: 0,
  max_len: 0
};

describe("Run whisper.node", () => {
test("it should receive a non-empty value", async () => {
let result = await whisperAsync(whisperParamsMock);
console.log(result);
describe('Whisper.cpp Node.js addon with VAD support', () => {
test('Basic whisper transcription without VAD', async () => {
const params = {
...commonParams,
vad: false
};

expect(result['transcription'].length).toBeGreaterThan(0);
}, 10000);
const result = await whisperAsync(params);

expect(typeof result).toBe('object');
expect(Array.isArray(result.transcription)).toBe(true);
expect(result.transcription.length).toBeGreaterThan(0);

// Check that we got some transcription text
const text = result.transcription.map(segment => segment[2]).join(' ');
expect(text.length).toBeGreaterThan(0);
expect(text.toLowerCase()).toContain('ask not');
}, 30000);

  test('VAD parameters validation', async () => {
    // Test with invalid VAD model - should return empty transcription
    const invalidParams = {
      ...commonParams,
      vad: true,
      vad_model: 'non-existent-model.bin',
      vad_threshold: 0.5
    };

    // This should handle the error gracefully and return empty transcription
    const result = await whisperAsync(invalidParams);
    expect(typeof result).toBe('object');
    expect(Array.isArray(result.transcription)).toBe(true);
    // When VAD model doesn't exist, it should return empty transcription
    expect(result.transcription.length).toBe(0);
  }, 10000);

  test('VAD parameter parsing', async () => {
    // Test that VAD parameters are properly parsed (even if VAD model doesn't exist)
    const vadParams = {
      ...commonParams,
      vad: false, // Disabled so no model required
      vad_threshold: 0.7,
      vad_min_speech_duration_ms: 300,
      vad_min_silence_duration_ms: 150,
      vad_max_speech_duration_s: 45.0,
      vad_speech_pad_ms: 50,
      vad_samples_overlap: 0.15
    };

    const result = await whisperAsync(vadParams);

    expect(typeof result).toBe('object');
    expect(Array.isArray(result.transcription)).toBe(true);
  }, 30000);

  test('Progress callback with VAD disabled', async () => {
    let progressCalled = false;
    let lastProgress = 0;

    const params = {
      ...commonParams,
      vad: false,
      progress_callback: (progress) => {
        progressCalled = true;
        lastProgress = progress;
        expect(progress).toBeGreaterThanOrEqual(0);
        expect(progress).toBeLessThanOrEqual(100);
      }
    };

    const result = await whisperAsync(params);

    expect(progressCalled).toBe(true);
    expect(lastProgress).toBe(100);
    expect(typeof result).toBe('object');
  }, 30000);

  test('Language detection without VAD', async () => {
    const params = {
      ...commonParams,
      vad: false,
      detect_language: true,
      language: 'auto'
    };

    const result = await whisperAsync(params);

    expect(typeof result).toBe('object');
    expect(typeof result.language).toBe('string');
    expect(result.language.length).toBeGreaterThan(0);
  }, 30000);

  test('Basic transcription with all VAD parameters set', async () => {
    // Test with VAD disabled but all parameters set to ensure no crashes
    const params = {
      ...commonParams,
      vad: false, // Disabled so it works without VAD model
      vad_model: '', // Empty model path
      vad_threshold: 0.6,
      vad_min_speech_duration_ms: 200,
      vad_min_silence_duration_ms: 80,
      vad_max_speech_duration_s: 25.0,
      vad_speech_pad_ms: 40,
      vad_samples_overlap: 0.08
    };

    const result = await whisperAsync(params);

    expect(typeof result).toBe('object');
    expect(Array.isArray(result.transcription)).toBe(true);
    expect(result.transcription.length).toBeGreaterThan(0);
  }, 30000);
});
