Replies: 1 comment
-
Please see the comment from Dan. CC @danpovey
-
Hello!
I created a phoneme recognition model using icefall's LibriSpeech training recipe.
I used a lexicon.txt that maps words to phonemes, and modified the token-processing code to produce phoneme indices.
When I decoded the trained model in sherpa-onnx, I confirmed that the recognition performance was very good.
However, the timestamps did not seem to match. (I used the CTC model of Zipformer2.)
In the past, when I used a token-based LibriSpeech model to generate an HLG containing only "ICE CREAM" and decoded with it, I confirmed that the timestamps matched relatively well.
Since blank was not being learned clearly, I also tried training with a !SIL symbol inserted before and after each training transcript.
(I mapped the blank symbol to !SIL.)
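The transcript modification described above can be sketched as follows (a minimal illustration; `wrap_with_sil` is a hypothetical helper, not icefall code):

```python
# Sketch: insert an optional-silence symbol before and after each
# training transcript, as described above. Hypothetical helper.

def wrap_with_sil(text: str, sil: str = "!SIL") -> str:
    """Return the transcript with the silence symbol prepended and appended."""
    return f"{sil} {text} {sil}"

print(wrap_with_sil("ICE CREAM"))  # !SIL ICE CREAM !SIL
```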
Also, since I thought phonemes should be recognized at a finer granularity than tokens,
I changed output_downsampling_factor to 1 and set subsampling_factor to 2 inside Zipformer so that the encoder output would not be halved.
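Under these settings, the mapping from encoder output frames to timestamps can be sketched as below (assuming the standard 10 ms feature frame shift; `frame_to_seconds` is a hypothetical helper, and the factor values follow the description above):

```python
# Sketch: convert an encoder output frame index to seconds, assuming a
# 10 ms input feature frame shift (a common fbank default) and the
# downsampling factors described above. Hypothetical helper, not icefall code.

FRAME_SHIFT_S = 0.01  # 10 ms per input feature frame (assumption)

def frame_to_seconds(frame_index: int,
                     subsampling_factor: int = 2,
                     output_downsampling_factor: int = 1) -> float:
    """Map an encoder output frame index to a timestamp in seconds."""
    total_factor = subsampling_factor * output_downsampling_factor
    return frame_index * total_factor * FRAME_SHIFT_S

# With subsampling_factor=2 and output_downsampling_factor=1,
# each encoder output frame covers 20 ms of audio.
print(frame_to_seconds(50))  # frame 50 -> 1.0 s
```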
However, neither method improved the timestamps.
The phoneme output from CTC and the phoneme output from HLG are the same.
Is there a way to output timestamps that clearly match the beginnings and ends of words, as Kaldi does?
Please advise.
Thank you.
In the picture below, the first recognition result shows the phonemes and timing information after HLG decoding, and the second shows the phonemes and timing information from zipformer-CTC alone.
Please refer to it.
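For reference, the timestamps from the zipformer-CTC path correspond roughly to the standard greedy CTC extraction, where each token's start time is the first frame at which its (non-blank, non-repeated) argmax symbol appears. A minimal sketch, with made-up token ids and an assumed 20 ms encoder frame duration (this is a generic illustration, not the sherpa-onnx implementation):

```python
# Sketch: greedy CTC decoding with per-token start times. Generic
# illustration, not the sherpa-onnx implementation. Assumes a 20 ms
# encoder frame duration.

def ctc_greedy_with_times(argmax_ids, blank_id=0, frame_dur_s=0.02):
    """Collapse a per-frame argmax sequence; return (token_id, start_time) pairs.

    argmax_ids: per-frame best token ids from the CTC head.
    frame_dur_s: duration of one encoder output frame in seconds.
    """
    result = []
    prev = blank_id
    for frame, tok in enumerate(argmax_ids):
        # Emit a token only when it is non-blank and differs from the
        # previous frame's symbol (standard CTC collapsing rule).
        if tok != blank_id and tok != prev:
            result.append((tok, frame * frame_dur_s))
        prev = tok
    return result

# Example: blank=0; token 5 first appears at frame 2, token 7 at frame 5.
frames = [0, 0, 5, 5, 0, 7, 7, 7, 0]
print(ctc_greedy_with_times(frames))
```

Note that with this rule a token's reported time is where its argmax spike begins, which with CTC's peaky distributions often lags or leads the true acoustic onset; that is one common source of the mismatch described above.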