-
Notifications
You must be signed in to change notification settings - Fork 116
feat: use vad #867
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feat: use vad #867
Conversation
WalkthroughThe changes replace the previous RMS-based and pluggable predictor-based audio chunking logic with a new approach that uses the Changes
Sequence Diagram(s)sequenceDiagram
participant Client
participant Server
participant Chunker
participant VAD
Client->>Server: Send audio stream
Server->>Chunker: Pass audio stream
Chunker->>VAD: Analyze audio samples for speech
VAD-->>Chunker: Speech/silence labels
Chunker->>Server: Return speech-only chunks
Server-->>Client: Stream transcription results
Assessment against linked issues
Poem
Note ⚡️ AI Code Reviews for VS Code, Cursor, WindsurfCodeRabbit now has a plugin for VS Code, Cursor and Windsurf. This brings AI code reviews directly in the code editor. Each commit is reviewed immediately, finding bugs before the PR is raised. Seamless context handoff to your AI code agent ensures that you can easily incorporate review feedback. ✨ Finishing Touches
🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🔭 Outside diff range comments (1)
crates/chunker/src/stream.rs (1)
48-50
: 🛠️ Refactor suggestion
as_stream()
is rebuilt on every poll – this can replay the source
let stream = this.source.as_stream();
followed bypin!(stream)
runs on eachpoll_next
call, which
creates a brand-new stream every time.
Ifas_stream()
yields a fresh iterator starting from the beginning (which is typical), the chunker will
loop forever on the first samples or duplicate audio.Store the pinned stream once inside the struct instead:
-pub struct ChunkStream<S: AsyncSource + Unpin> { +pub struct ChunkStream<S: AsyncSource + Unpin, St> { source: S, - vad: VoiceActivityDetector, + stream: Pin<Box<St>>, vad: VoiceActivityDetector, @@ - Self { - source, - vad, - buffer: Vec::new(), - max_duration, - } + let stream = Box::pin(source.as_stream()); + Self { source, stream, vad, buffer: Vec::new(), max_duration }
🧹 Nitpick comments (5)
plugins/local-stt/src/server.rs (1)
149-157
: VAD configuration looks good but could be more robust.The implementation correctly configures the VoiceActivityDetector with appropriate parameters for speech recognition. The sample rate of 16kHz and chunk size of 512 samples are reasonable values for voice detection.
However, using
expect()
for error handling could be improved for production code.Consider handling potential errors more gracefully:
- let vad = VoiceActivityDetector::builder() - .sample_rate(sample_rate) - .chunk_size(512usize) - .build() - .expect("vad config is valid"); + let vad = match VoiceActivityDetector::builder() + .sample_rate(sample_rate) + .chunk_size(512usize) + .build() { + Ok(vad) => vad, + Err(e) => { + tracing::error!("Failed to build VAD: {}", e); + // Either return an error or fall back to default parameters + return; + } + };Additionally, consider making these parameters configurable rather than hardcoded, which would make the system more adaptable to different audio conditions.
crates/chunker/src/lib.rs (2)
9-15
: ReturnChunkStream
without consuming the VAD instanceThe
chunks
method currently takesvad
by value, which moves the detector into theChunkStream
.
That prevents re-using the same configured detector for issuing multiple independent chunk streams from the
same audio source (e.g. different max-duration settings), and also means the caller loses access to the VAD
for any other purpose.If the underlying type implements
Clone
(it does) and isSend + Sync
safe, consider taking the argument
by&VoiceActivityDetector
and cloning internally, or alternatively byimpl Into<VoiceActivityDetector>
to give the caller more flexibility.
48-56
: Avoid panicking when the temp directory does not yet exist
std::fs::remove_dir_all
returnsErrorKind::NotFound
when the directory is missing.
Unwrapping on that error makes the test unnecessarily brittle.- std::fs::remove_dir_all("tmp/english_1").unwrap(); + let _ = std::fs::remove_dir_all("tmp/english_1");(or check for the error explicitly).
crates/chunker/src/stream.rs (2)
79-85
: Duplicated logic – extract helper for final chunk creationThe three identical blocks that convert
speech
into aSamplesBuffer
could be factored into a small helper
to reduce repetition and future maintenance overhead.
90-105
: Expose threshold & padding as parameters
filter_speech_chunks
hard-codes a 0.75 probability threshold and 3-frame padding.
Different environments may require tuning. Consider propagating these values from the public API or from the
builder ofChunkStream
.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
Cargo.lock
is excluded by!**/*.lock
📒 Files selected for processing (8)
Cargo.toml
(1 hunks)crates/chunker/Cargo.toml
(1 hunks)crates/chunker/src/error.rs
(0 hunks)crates/chunker/src/lib.rs
(3 hunks)crates/chunker/src/predictor.rs
(0 hunks)crates/chunker/src/stream.rs
(3 hunks)plugins/local-stt/Cargo.toml
(1 hunks)plugins/local-stt/src/server.rs
(2 hunks)
💤 Files with no reviewable changes (2)
- crates/chunker/src/error.rs
- crates/chunker/src/predictor.rs
🧰 Additional context used
🧬 Code Graph Analysis (1)
crates/chunker/src/lib.rs (1)
crates/chunker/src/stream.rs (1)
new
(20-27)
🔇 Additional comments (5)
Cargo.toml (1)
196-196
: Added voice_activity_detector dependency looks good.The addition of the voice_activity_detector crate as a workspace dependency aligns perfectly with the PR objective to replace the previous VAD implementation with the Silero VAD model.
crates/chunker/Cargo.toml (1)
19-19
: Appropriate dependency replacement.The change to use the voice_activity_detector crate instead of hypr-vad is consistent with the PR objective to integrate the Silero VAD model for improved audio chunking.
plugins/local-stt/Cargo.toml (1)
53-53
: Consistent dependency addition.Adding voice_activity_detector as a dependency to the local-stt plugin ensures it can use the new VAD implementation directly, which aligns with the changes made in the server.rs file.
plugins/local-stt/src/server.rs (1)
24-24
: Appropriate import for the new VAD implementation.The import of VoiceActivityDetector aligns with the PR objective to use the Silero VAD model.
crates/chunker/src/lib.rs (1)
40-45
: Handle possible sample-rate mismatch between VAD and audioThe VAD is hard-coded to
sample_rate(16_000)
, while theaudio_source
’s real sample-rate is whatever is
encoded inenglish_1::AUDIO_PATH
.
If these differ, the detector’s result quality (and therefore chunking accuracy) will be severely degraded.- let vad = VoiceActivityDetector::builder() - .sample_rate(16000) + let vad = VoiceActivityDetector::builder() + .sample_rate(audio_source.sample_rate())Either assert that the source really is 16 kHz, down-sample the input, or build the VAD with the run-time
value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Nitpick comments (1)
crates/chunker/src/lib.rs (1)
48-49
: Error handling in tests could be improvedThe call to
remove_dir_all
will panic if the directory doesn't exist, which could cause test failures in clean environments.Consider handling the case where the directory doesn't exist:
-std::fs::remove_dir_all("tmp/english_1").unwrap(); +let _ = std::fs::remove_dir_all("tmp/english_1"); // Ignore errors if directory doesn't exist
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
Cargo.lock
is excluded by!**/*.lock
📒 Files selected for processing (8)
Cargo.toml
(1 hunks)crates/chunker/Cargo.toml
(1 hunks)crates/chunker/src/error.rs
(0 hunks)crates/chunker/src/lib.rs
(3 hunks)crates/chunker/src/predictor.rs
(0 hunks)crates/chunker/src/stream.rs
(3 hunks)plugins/local-stt/Cargo.toml
(1 hunks)plugins/local-stt/src/server.rs
(2 hunks)
💤 Files with no reviewable changes (2)
- crates/chunker/src/error.rs
- crates/chunker/src/predictor.rs
🚧 Files skipped from review as they are similar to previous changes (4)
- Cargo.toml
- plugins/local-stt/Cargo.toml
- crates/chunker/Cargo.toml
- plugins/local-stt/src/server.rs
🧰 Additional context used
🧬 Code Graph Analysis (1)
crates/chunker/src/lib.rs (1)
crates/chunker/src/stream.rs (1)
new
(20-27)
🔇 Additional comments (6)
crates/chunker/src/lib.rs (3)
7-7
: Architectural change: New import for VoiceActivityDetectorThis import reflects the transition from the previous generic
Predictor
-based approach to a concreteVoiceActivityDetector
implementation, which aligns with the PR objective of using the voice_activity_detector crate for audio chunking.
10-15
: API change: From generic predictor to concrete VoiceActivityDetectorThe
ChunkerExt
trait'schunks
method has been modified to accept a concreteVoiceActivityDetector
instead of a genericPredictor
type parameter. This change:
- Simplifies the API by making the chunking method more specific
- Reduces flexibility (no longer supports pluggable predictors)
- Aligns with the PR objective of using VAD for chunking
This is a breaking change that clients of this API will need to update.
40-45
: Proper VAD configuration for testThe VAD configuration uses appropriate parameters for the test:
- Sample rate (16000) matches the test WAV spec
- Chunk size (512) is a reasonable value for audio processing
The configuration and initialization looks good.
crates/chunker/src/stream.rs (3)
10-10
: New VAD imports aligned with architectural changeThe import of
IteratorExt
provides extension methods for iterators that enable voice activity detection functionality, aligning with the PR's goal to use the voice_activity_detector crate.
12-17
: Struct refactoring: Removed generic Predictor type parameterThe
ChunkStream
struct has been simplified by removing the genericPredictor
type parameter and replacing it with a concreteVoiceActivityDetector
field. This change is consistent with the architectural shift in lib.rs.
56-59
: Previous issue fixed: Stream no longer terminates on empty speechThe previous implementation had an issue where empty speech would terminate the stream prematurely by returning
Poll::Ready(None)
. The new implementation correctly returnsSome(buffer)
even if the speech is empty, allowing the stream to continue processing. This addresses the issue mentioned in the previous review.Also applies to: 62-65, 71-73
// helper function to filter speech chunks | ||
fn filter_speech_chunks<D: IntoIterator<Item = f32>>( | ||
vad: &mut VoiceActivityDetector, | ||
data: D, | ||
) -> Vec<f32> { | ||
data.into_iter() | ||
.label(vad, 0.75, 3) | ||
.filter_map(|label| { | ||
if label.is_speech() { | ||
Some(label.into_iter()) | ||
} else { | ||
None | ||
} | ||
}) | ||
.flatten() | ||
.collect() | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Improve the speech filtering implementation
The new helper function has several areas for improvement:
- Hardcoded parameters (0.75 threshold, 3 window size) should be configurable
- No error handling for potential VAD failures
- The empty speech case could return
None
to allow callers to skip empty chunks
Consider refactoring to:
// helper function to filter speech chunks
fn filter_speech_chunks<D: IntoIterator<Item = f32>>(
vad: &mut VoiceActivityDetector,
data: D,
+ threshold: f32,
+ window_size: usize,
) -> Vec<f32> {
let speech = data.into_iter()
- .label(vad, 0.75, 3)
+ .label(vad, threshold, window_size)
.filter_map(|label| {
if label.is_speech() {
Some(label.into_iter())
} else {
None
}
})
.flatten()
.collect();
+ speech
}
Also, consider updating the calling code to handle empty speech vectors appropriately.
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In crates/chunker/src/stream.rs between lines 77 and 93, the
filter_speech_chunks function uses hardcoded parameters for the VAD threshold
and window size, lacks error handling for VAD failures, and always returns a Vec
even if empty. Refactor the function to accept threshold and window size as
parameters to make it configurable, add error handling to manage potential VAD
failures gracefully, and change the return type to Option<Vec<f32>> so it
returns None when no speech chunks are detected. Also, update any calling code
to handle the Option return type properly, skipping processing when None is
returned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
♻️ Duplicate comments (2)
crates/chunker/src/stream.rs (2)
63-71
: Consider skipping instead of terminating when buffer is emptyWhen the source stream ends with leftover data but no speech is detected, the current implementation returns
Poll::Ready(None)
, which signals the end of the stream. This might prematurely terminate processing if there's more data to come.Replace the early-return with a different approach to avoid terminating the stream prematurely:
Poll::Ready(None) if !this.buffer.is_empty() => { let data = std::mem::take(&mut this.buffer); let speech = filter_speech_chunks(&mut this.vad, data); - if speech.is_empty() { - return Poll::Ready(None); - } else { + if !speech.is_empty() { return Poll::Ready(Some(SamplesBuffer::new(1, sample_rate, speech))); } + // No speech detected in the final buffer, just signal end of stream + return Poll::Ready(None); }
87-103
: 🛠️ Refactor suggestionMake VAD parameters configurable
The
filter_speech_chunks
function uses hardcoded values for the threshold (0.75) and window size (3) in the VAD processing. These should be configurable parameters to allow tuning for different audio conditions and use cases.// helper function to filter speech chunks fn filter_speech_chunks<D: IntoIterator<Item = f32>>( vad: &mut VoiceActivityDetector, data: D, + threshold: f32, + window_size: usize, ) -> Vec<f32> { data.into_iter() - .label(vad, 0.75, 3) + .label(vad, threshold, window_size) .filter_map(|label| { if label.is_speech() { Some(label.into_iter()) } else { None } }) .flatten() .collect() }Then update the call sites to provide appropriate values:
- let speech = filter_speech_chunks(&mut this.vad, data); + let speech = filter_speech_chunks(&mut this.vad, data, 0.75, 3);
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
Cargo.lock
is excluded by!**/*.lock
📒 Files selected for processing (8)
Cargo.toml
(1 hunks)crates/chunker/Cargo.toml
(1 hunks)crates/chunker/src/error.rs
(0 hunks)crates/chunker/src/lib.rs
(3 hunks)crates/chunker/src/predictor.rs
(0 hunks)crates/chunker/src/stream.rs
(3 hunks)plugins/local-stt/Cargo.toml
(1 hunks)plugins/local-stt/src/server.rs
(2 hunks)
💤 Files with no reviewable changes (2)
- crates/chunker/src/error.rs
- crates/chunker/src/predictor.rs
🚧 Files skipped from review as they are similar to previous changes (5)
- Cargo.toml
- crates/chunker/Cargo.toml
- plugins/local-stt/Cargo.toml
- plugins/local-stt/src/server.rs
- crates/chunker/src/lib.rs
⏰ Context from checks skipped due to timeout of 90000ms (3)
- GitHub Check: ci (windows, windows-latest)
- GitHub Check: ci
- GitHub Check: ci (macos, macos-latest)
🔇 Additional comments (2)
crates/chunker/src/stream.rs (2)
10-10
: LGTM: Well-implemented VAD integrationThe changes to replace the generic
Predictor
with a concreteVoiceActivityDetector
are implemented cleanly. The structure modifications and import changes are done consistently across the file.Also applies to: 12-14, 19-23
55-61
: Properly handling empty speech chunksThe implementation now correctly handles empty speech chunks by checking
if !speech.is_empty()
before returning a buffer, which addresses the issue mentioned in previous reviews.
let data = this.buffer.drain(0..max_samples); | ||
let speech = filter_speech_chunks(&mut this.vad, data); | ||
if speech.is_empty() { | ||
Poll::Pending | ||
} else { | ||
Poll::Ready(Some(SamplesBuffer::new(1, sample_rate, speech))) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Improve handling of empty speech during max duration flush
When draining the buffer due to max duration with no speech detected, the current implementation returns Poll::Pending
. This might cause the stream to stall if there's more audio to process.
Consider continuing to poll rather than returning Poll::Pending
when no speech is detected:
let data = this.buffer.drain(0..max_samples);
let speech = filter_speech_chunks(&mut this.vad, data);
if speech.is_empty() {
- Poll::Pending
+ // Continue polling instead of waiting
+ cx.waker().wake_by_ref();
+ Poll::Pending
} else {
Poll::Ready(Some(SamplesBuffer::new(1, sample_rate, speech)))
}
Alternatively, consider collecting more data before returning to the caller:
let data = this.buffer.drain(0..max_samples);
let speech = filter_speech_chunks(&mut this.vad, data);
if speech.is_empty() {
- Poll::Pending
+ // Continue with the next loop iteration to collect more data
+ continue;
} else {
Poll::Ready(Some(SamplesBuffer::new(1, sample_rate, speech)))
}
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
let data = this.buffer.drain(0..max_samples); | |
let speech = filter_speech_chunks(&mut this.vad, data); | |
if speech.is_empty() { | |
Poll::Pending | |
} else { | |
Poll::Ready(Some(SamplesBuffer::new(1, sample_rate, speech))) | |
} | |
let data = this.buffer.drain(0..max_samples); | |
let speech = filter_speech_chunks(&mut this.vad, data); | |
if speech.is_empty() { | |
// Continue polling instead of waiting | |
cx.waker().wake_by_ref(); | |
Poll::Pending | |
} else { | |
Poll::Ready(Some(SamplesBuffer::new(1, sample_rate, speech))) | |
} |
🤖 Prompt for AI Agents
In crates/chunker/src/stream.rs around lines 77 to 83, the code returns
Poll::Pending when no speech is detected after draining the buffer, which can
cause the stream to stall. Instead of returning Poll::Pending, modify the logic
to continue polling for more data or handle the empty speech case by either
skipping the current chunk and polling again or buffering more data before
returning. This ensures the stream does not stall and continues processing
incoming audio.
try to resolve #857
Since the crate voice_activity_detector has already nicely wrapped silero based vad, I thought we could just use it, and I tried to write this PR, it introduces changes to the
chunker
crate and related components, replacing the oldPredictor
-based approach with a newvad
implementation for processing audio streams. unit testtest_chunker
to generate the wav file that sounds good.Summary by CodeRabbit
New Features
Refactor
Chores