Question: was fp16 used during training/inference (e.g. via model.half())? That might cause the attention activation matrix to blow up after fine-tuning (though it is odd that the base model, trained in fp32, can handle it). Next, can you check what happens when you enlarge your chunked attention window beyond (166, 165)? Also, how was the model trained, and what was the WER on the held-out set before and after fine-tuning? Perhaps the base model performed well enough on the held-out set, but fine-tuning overfit to the training domain, causing a WER increase of several percent absolute in the limited-context setting (or, again, the limited context at inference was insufficient, given that the model saw unlimited context at training time).
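To make that fp16 check concrete, something along these lines could work (a rough sketch only: the checkpoint name, audio paths, and reference transcripts are placeholders, and both transcribe()'s return type and its behavior under .half() vary across NeMo versions):

```python
import torch
from nemo.collections.asr.models import ASRModel
from nemo.collections.asr.metrics.wer import word_error_rate

# Placeholder held-out sample: a few audio files plus their reference transcripts.
audio_files = ["heldout_001.wav", "heldout_002.wav"]
references = ["reference transcript one", "reference transcript two"]

# Use the pretrained checkpoint or your fine-tuned .nemo file here.
model = ASRModel.from_pretrained("stt_en_conformer_ctc_medium")
model.eval()


def to_text(hyps):
    # Newer NeMo versions return Hypothesis objects; older ones return strings.
    return [h.text if hasattr(h, "text") else h for h in hyps]


# fp32 baseline.
model = model.float()
hyps_fp32 = to_text(model.transcribe(audio_files))
print("fp32 WER:", word_error_rate(hypotheses=hyps_fp32, references=references))

# Same model cast to fp16: if the attention activations overflow, the fp16
# hypotheses (and WER) should degrade noticeably relative to fp32.
if torch.cuda.is_available():
    model = model.half().cuda()
    hyps_fp16 = to_text(model.transcribe(audio_files))
    print("fp16 WER:", word_error_rate(hypotheses=hyps_fp16, references=references))
```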
First of all, thank you for the great work releasing all these models, code, and tutorials!
In my tests, I'm able to use medium-ctc-conformer to transcribe long audio offline in one go (e.g. a 10-minute audio file) on CPU (GPU fails with out-of-memory), with a very good word error rate. I didn't use any chunking:
`att_context_size=[-1, -1], att_context_style='regular'`
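For reference, this is roughly how that run is set up (a minimal sketch; the checkpoint name and the audio path are placeholders for what I actually use):

```python
from nemo.collections.asr.models import ASRModel

# Load the medium Conformer-CTC checkpoint on CPU (GPU runs out of memory
# for very long files with full-context attention).
model = ASRModel.from_pretrained("stt_en_conformer_ctc_medium", map_location="cpu")
model.eval()

# The pretrained model already uses full-context attention
# (att_context_size=[-1, -1], att_context_style='regular'), so a ~10-minute
# file can be transcribed in a single pass without any chunking.
hypotheses = model.transcribe(["long_audio_10min.wav"], batch_size=1)
print(hypotheses[0])
```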
This was a very nice surprise overall, given the existing streaming machinery for transcribing streaming and offline long audio files.
However, when I fine-tuned this Conformer on several hundred hours of data (it converges nicely on the held-out validation set), I observe the following:
With `att_context_style='chunked_limited'` and (for example) `att_context_size=[166, 165]` (during inference only, with the same fine-tuned model), I get a reasonable word error rate, but somewhat worse than with the pretrained Conformer (by several percent, absolute). Fine-tuning is done on chunks of at most 20 seconds.
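In case it helps, this is roughly how the attention context is switched for inference (a sketch only; it assumes the fine-tuned model is saved as a .nemo file, that overriding encoder.att_context_style / encoder.att_context_size through override_config_path is valid for this architecture in the installed NeMo version, and that the file names are placeholders):

```python
from omegaconf import OmegaConf, open_dict
from nemo.collections.asr.models import ASRModel

NEMO_PATH = "finetuned_conformer.nemo"  # placeholder path to the fine-tuned model

# Pull the config stored inside the checkpoint, switch the encoder to chunked
# limited-context attention, and restore the model with the modified config.
cfg = ASRModel.restore_from(NEMO_PATH, return_config=True)
with open_dict(cfg):
    cfg.encoder.att_context_style = "chunked_limited"
    cfg.encoder.att_context_size = [166, 165]
OmegaConf.save(cfg, "override.yaml")

model = ASRModel.restore_from(
    NEMO_PATH,
    override_config_path="override.yaml",
    map_location="cpu",
)
model.eval()
hypotheses = model.transcribe(["long_audio_10min.wav"])
```

As far as I understand, the limited context only changes the attention masking rather than the weights, so the fine-tuned weights load unchanged.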
Questions: