-
Deciding which implementation is better based on a single example is hard.
AFAIK, the sampling strategy in |
-
Hi,
I've just started playing with this tech, and my audio samples are episodes of "Yes, Minister", a BBC sitcom from the 1980s that I rather love. The audio is rather clean: just about everything is indoors, with no background noise and no one speaking over anyone else.
I have a 12-thread Ryzen 2600 CPU and no GPU. I've used the medium.en model with beam=5, and I've not fiddled with any other options for the different implementations I've tried.
I tested whisper, whisper-cpp and whisper-faster. Whisper-faster ran at roughly realtime; whisper-cpp at roughly 2x realtime. I neglected to time whisper.
Whisper and whisper-faster were fair enough: some simplified and Americanized language, and both had problems catching things said in the background.
Whisper-cpp produced astonishingly good subs (for my sample): all the words, no Americanizations, and text for the (significant) sound from TVs, plus "[phone rings]" and "[music]".
Having read the paper and looked around, I still feel quite confused about how the same model could produce such different results. Is it all down to the inference engine? Or are there runtime parameters I could experiment with?
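For reference, here is roughly how I invoked each front end, in case the defaults are the culprit. The file and model paths are placeholders for my local copies, and the flag names are to the best of my knowledge; check each tool's `--help` before copying:

```shell
# OpenAI whisper (reference Python implementation)
whisper episode.wav --model medium.en --beam_size 5

# whisper.cpp (ggml model path is a placeholder for a local download)
./main -m models/ggml-medium.en.bin -f episode.wav --beam-size 5

# faster-whisper, which is a Python library rather than a CLI
python -c "
from faster_whisper import WhisperModel
model = WhisperModel('medium.en', device='cpu', compute_type='int8')
segments, info = model.transcribe('episode.wav', beam_size=5)
for seg in segments:
    print(f'[{seg.start:.2f} -> {seg.end:.2f}] {seg.text}')
"
```

As far as I can tell, the implementations do not share defaults for the options I *didn't* set (temperature fallback, best-of sampling, conditioning on previous text, and so on), so even with identical weights and beam size the decoders may behave differently. Corrections welcome if I've misread any of the option names.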