I posted the original question on the Riva forum (https://forums.developer.nvidia.com/t/riva-en-us-when-using-lm-interim-results-with-stability-change-drop-already-predicted-but-less-stable-words/234888) last November, since I know the NeMo GitHub is not the right place for Riva-related questions. But as I still haven't received any useful information, I'm posting it as a Q&A discussion here as well, in case anyone can shed some light on this.
In Riva, at least, the flashlight decoder returns intermediate results split by stability (in my case 0.1 and 0.9); I understand that low stability indicates the transcript can still change a lot. What bothers me is that when observing the entire intermediate result ([text with stability 0.9] [text with stability 0.1]), one would expect words from the start of the 0.1-stability portion to be removed from it and appended to the end of the 0.9-stability portion.
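For concreteness, this is roughly how I assemble the displayed interim transcript from the streaming responses (a minimal sketch, assuming the standard `stability`, `is_final`, and `alternatives` fields on Riva's `StreamingRecognitionResult`, and that the partial results within one response arrive ordered from most to least stable):

```python
def assemble_interim(response):
    """Join the partial results of one StreamingRecognizeResponse into the
    text shown to the user: [stability ~0.9 portion] [stability ~0.1 portion]."""
    parts = []
    for result in response.results:
        if result.is_final or not result.alternatives:
            continue  # final segments are committed separately
        parts.append(result.alternatives[0].transcript.strip())
    # The low-stability tail is simply appended, so any change in it shows up
    # immediately as a "jump" in the displayed transcript.
    return " ".join(parts)
```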
What happens in reality is that the word in question (even when it was already correct) often first disappears or changes completely before reappearing in its correct form. This creates a very unpleasant discontinuity in the intermediate transcript, especially since the 0.1-stability portion is not short (approx. 2 s). During longer speech there is a missing/changing word approximately 2 s into the speech.
Any suggestions on how to resolve this issue? I assume this has something to do with the decoder parameters.
Would a "Cache-aware Streaming Conformer" model help? Or another model? And how should these parameters be set for that type of model?
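In case it matters, this is how I would first sanity-check a cache-aware streaming model in NeMo (a hedged sketch, not something I have verified; the pretrained model name is one of NVIDIA's published cache-aware FastConformer checkpoints and may need to be adjusted, and real low-latency streaming would go through NeMo's cache-aware streaming inference example rather than offline `transcribe()`):

```python
import nemo.collections.asr as nemo_asr

# Sketch only: load one of the published cache-aware streaming checkpoints
# (name assumed here; check NGC / Hugging Face for the exact identifier).
asr_model = nemo_asr.models.ASRModel.from_pretrained(
    model_name="stt_en_fastconformer_hybrid_large_streaming_multi"
)

# Offline sanity check on a single file; for actual streaming with limited
# lookahead, see NeMo's cache-aware streaming example script
# (speech_to_text_cache_aware_streaming_infer.py).
print(asr_model.transcribe(["sample.wav"]))
```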