[Question] How to introduce VAD to solve the problem of hallucinations

Background:
I've noticed that when an audio file contains silent or non-speech segments, Whisper tends to generate hallucinated text. The hallucinations appear not only in the silent or non-speech segments themselves but also seem to degrade the transcription of the normal speech that follows.

Inquiry:
Given that this is an inherent issue with Whisper, I'd like to know whether it is feasible to add a voice activity detection (VAD) step, or a similar strategy, to Whisper-turbo. I am aware that projects such as WhisperX take this kind of approach, and it seems to mitigate the issue effectively.
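
To make the question concrete, here is a minimal sketch of the kind of preprocessing I have in mind. It is only an assumption on my side: a plain energy-threshold VAD in pure Python, not the VAD used by WhisperX or anything that exists in this project. The function names, the 30 ms frame size, and the 0.02 threshold are all hypothetical, and samples are assumed to be mono floats in [-1, 1].

```python
import math


def frame_rms(samples, frame_len):
    """Return the RMS energy of each non-overlapping, full-length frame."""
    rms = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms.append(math.sqrt(sum(x * x for x in frame) / frame_len))
    return rms


def speech_segments(samples, sample_rate=16000, frame_ms=30,
                    threshold=0.02, min_gap_frames=10):
    """Return (start_sec, end_sec) spans whose frame energy exceeds threshold.

    Silent gaps shorter than min_gap_frames are bridged so that short pauses
    inside a sentence do not split it. All defaults are illustrative guesses.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    flags = [energy > threshold for energy in frame_rms(samples, frame_len)]

    segments = []   # (first_frame, end_frame_exclusive)
    start = None    # first frame of the segment currently open, if any
    silence = 0     # consecutive silent frames seen inside the open segment
    for i, is_speech in enumerate(flags):
        if is_speech:
            if start is None:
                start = i
            silence = 0
        elif start is not None:
            silence += 1
            if silence >= min_gap_frames:
                # Close the segment right after its last speech frame.
                segments.append((start, i - silence + 1))
                start = None
                silence = 0
    if start is not None:
        # Audio ended while a segment was open; drop trailing silent frames.
        segments.append((start, len(flags) - silence))

    frame_sec = frame_len / sample_rate
    return [(s * frame_sec, e * frame_sec) for s, e in segments]
```

The idea would be to transcribe only the returned spans (and offset the timestamps back afterwards), so that the model never sees the long silent stretches. A real solution would presumably need a trained VAD model, since an energy threshold cannot tell speech from music or noise.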

Thank you for your time and the incredible work on this project.
