[HELP NEEDED]: Evaluating different methods of getting Word and Utterance (sentence) level Timestamps from NeMo models #5170

ishansharma1320 · 2022-10-14T17:00:19Z

Hi,

Thanks for reading this message.

I have recently been experimenting with NeMo to extract word-level and utterance-level timestamps for transcribed text.

Based on the available tutorials and discussion/PR threads, I gathered the following methods:

Word Level Timestamps

1. Getting word timestamps by following #4342 and this comment at issue 4629.
   Here, the start and end offsets of each word (from the word attribute of the hypothesis object produced by inference on the audio) are multiplied by the model stride and preprocessor.window_stride to get times in seconds.
   This is available for both CTC and RNN-T based architectures.

2. Getting word timestamps by following ASR_withSpeakerDiarization.ipynb.
   This is only available for the following models:
   stt_en_quartznet15x5,
   stt_en_citrinet*,
   stt_en_conformer_ctc_large

3. Using Montreal Forced Aligner, as mentioned in Discussion #2657
   (Forced Alignment #2657 (reply in thread)).
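To make the offset-scaling method concrete, here is a minimal sketch of the arithmetic only. It assumes the word offsets come as dictionaries with the keys `word`, `start_offset` and `end_offset`, and that the caller already knows the model's total downsampling stride and the preprocessor's `window_stride` in seconds. The function name, the key names and the example values (stride 4, window stride 0.01 s) are illustrative assumptions, not something confirmed for a specific NeMo version or model.

```python
def word_offsets_to_seconds(word_offsets, window_stride, model_stride):
    """Convert frame offsets to seconds.

    word_offsets : list of dicts with 'word', 'start_offset', 'end_offset'
                   (key names are an assumption for this sketch)
    window_stride: preprocessor hop size in seconds, e.g. 0.01
    model_stride : total time downsampling of the encoder, e.g. 4
    """
    # One encoder output step covers window_stride * model_stride seconds.
    seconds_per_step = window_stride * model_stride
    timestamps = []
    for item in word_offsets:
        timestamps.append(
            {
                "word": item["word"],
                "start": round(item["start_offset"] * seconds_per_step, 3),
                "end": round(item["end_offset"] * seconds_per_step, 3),
            }
        )
    return timestamps


# Hypothetical offsets, made up for illustration only.
example_offsets = [
    {"word": "hello", "start_offset": 0, "end_offset": 3},
    {"word": "world", "start_offset": 5, "end_offset": 9},
]
example_times = word_offsets_to_seconds(
    example_offsets, window_stride=0.01, model_stride=4
)
for entry in example_times:
    print(entry)
```

The stride differs between architectures, so the value passed as `model_stride` has to match the model that produced the offsets; using the wrong one scales every timestamp by a constant factor.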

Utterance (sentence) Level Timestamps

Utterance-level timestamps can also be obtained by converting the RTTM file produced by speaker diarization inference into an input manifest file, which is then used for speech transcription further down the pipeline.
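A rough sketch of that RTTM-to-manifest conversion is below. It assumes standard RTTM `SPEAKER` lines (start time in the fourth field, duration in the fifth, speaker label in the eighth) and a JSON-lines manifest with `audio_filepath`, `offset`, `duration` and `text` fields. The `speaker` field, the `"-"` placeholder text, the function name and the sample file names are assumptions made for this illustration.

```python
import json


def rttm_to_manifest_entries(rttm_text, audio_filepath):
    """Turn RTTM SPEAKER segments into one manifest entry per segment."""
    entries = []
    for line in rttm_text.splitlines():
        fields = line.split()
        # Skip blank lines and anything that is not a SPEAKER record.
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue
        entries.append(
            {
                "audio_filepath": audio_filepath,
                "offset": float(fields[3]),    # segment start, seconds
                "duration": float(fields[4]),  # segment length, seconds
                "speaker": fields[7],          # extra field, assumed harmless
                "text": "-",                   # placeholder, no transcript yet
            }
        )
    return entries


# Made-up RTTM content for illustration only.
sample_rttm = """\
SPEAKER session1 1 0.50 2.25 <NA> <NA> speaker_0 <NA> <NA>
SPEAKER session1 1 3.10 1.40 <NA> <NA> speaker_1 <NA> <NA>
"""

entries = rttm_to_manifest_entries(sample_rttm, "session1.wav")
manifest_lines = [json.dumps(entry) for entry in entries]
print("\n".join(manifest_lines))
```

Each manifest line then describes one diarized segment, so the utterance start is `offset` and the utterance end is `offset + duration`.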

Based on the above, I have the following questions:

1. Of the three methods above, which gives the most accurate word-level timestamps and can be used with both CTC and RNN-T architectures? Are there other approaches that are more accurate?
2. For utterance-level timestamps, is the approach above correct, and are there other approaches that are more accurate?

Any advice or insights on this would be highly appreciated.

Thanks again,
Ishan

tango4j · 2022-11-22T23:29:27Z

Collaborator

On the dataset we tested, stt_en_conformer_ctc_large showed the lowest error for ASR with diarization when run with the default settings.
