Replies: 1 comment
-
Please see the comment from Dan. CC @danpovey
-
Hello!
I created a phoneme recognition model using icefall's LibriSpeech training recipe.
I used a lexicon.txt that maps words to phonemes, and modified the token-processing code to produce phoneme indices.
When I decoded the trained model in sherpa-onnx, I confirmed that the recognition performance was very good.
However, the timestamps did not seem to match. (I used the CTC model of Zipformer2.)
In the past, when I used a token-based LibriSpeech model to generate an HLG containing only "ICE CREAM" and decoded with it, I confirmed that the timestamps matched relatively well.
Since blank was not being learned clearly, I also tried training with a !SIL symbol inserted before and after each training transcript.
(I mapped the blank symbol to !SIL.)
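The transcript modification described above can be sketched as follows (a minimal illustration; `wrap_with_sil` is a hypothetical helper, not icefall code):

```python
# Sketch: insert an optional-silence symbol before and after each
# training transcript, as described above. Hypothetical helper.

def wrap_with_sil(text: str, sil: str = "!SIL") -> str:
    """Return the transcript with the silence symbol prepended and appended."""
    return f"{sil} {text} {sil}"

print(wrap_with_sil("ICE CREAM"))  # !SIL ICE CREAM !SIL
```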
Also, since I thought phonemes should be recognized at a finer granularity than tokens,
I changed output_downsampling_factor to 1 and set subsampling_factor to 2 inside Zipformer so that the encoder output would not be halved.
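Under these settings, the mapping from encoder output frames to timestamps can be sketched as below (assuming the standard 10 ms feature frame shift; `frame_to_seconds` is a hypothetical helper, and the factor values follow the description above):

```python
# Sketch: convert an encoder output frame index to seconds, assuming a
# 10 ms input feature frame shift (a common fbank default) and the
# downsampling factors described above. Hypothetical helper, not icefall code.

FRAME_SHIFT_S = 0.01  # 10 ms per input feature frame (assumption)

def frame_to_seconds(frame_index: int,
                     subsampling_factor: int = 2,
                     output_downsampling_factor: int = 1) -> float:
    """Map an encoder output frame index to a timestamp in seconds."""
    total_factor = subsampling_factor * output_downsampling_factor
    return frame_index * total_factor * FRAME_SHIFT_S

# With subsampling_factor=2 and output_downsampling_factor=1,
# each encoder output frame covers 20 ms of audio.
print(frame_to_seconds(50))  # frame 50 -> 1.0 s
```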
However, neither method improved the timestamps.
The phoneme output from CTC and the phoneme output from HLG are the same.
Is there a way to output timestamps that clearly match the beginnings and ends of words, as Kaldi does?
Please advise.
Thank you.
In the picture below, the first recognition result shows the phonemes and timing information after HLG decoding, and the second shows the phonemes and timing information from zipformer-CTC alone.
Please refer to it.
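For reference, the timestamps from the zipformer-CTC path correspond roughly to the standard greedy CTC extraction, where each token's start time is the first frame at which its (non-blank, non-repeated) argmax symbol appears. A minimal sketch, with made-up token ids and an assumed 20 ms encoder frame duration (this is a generic illustration, not the sherpa-onnx implementation):

```python
# Sketch: greedy CTC decoding with per-token start times. Generic
# illustration, not the sherpa-onnx implementation. Assumes a 20 ms
# encoder frame duration.

def ctc_greedy_with_times(argmax_ids, blank_id=0, frame_dur_s=0.02):
    """Collapse a per-frame argmax sequence; return (token_id, start_time) pairs.

    argmax_ids: per-frame best token ids from the CTC head.
    frame_dur_s: duration of one encoder output frame in seconds.
    """
    result = []
    prev = blank_id
    for frame, tok in enumerate(argmax_ids):
        # Emit a token only when it is non-blank and differs from the
        # previous frame's symbol (standard CTC collapsing rule).
        if tok != blank_id and tok != prev:
            result.append((tok, frame * frame_dur_s))
        prev = tok
    return result

# Example: blank=0; token 5 first appears at frame 2, token 7 at frame 5.
frames = [0, 0, 5, 5, 0, 7, 7, 7, 0]
print(ctc_greedy_with_times(frames))
```

Note that with this rule a token's reported time is where its argmax spike begins, which with CTC's peaky distributions often lags or leads the true acoustic onset; that is one common source of the mismatch described above.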