Question: was fp16 used during training/inference (e.g. via model.half())? That might cause the attention activation matrix to blow up after fine-tuning (though it is odd that the base model, trained in fp32, can handle it). Next, can you check what happens when you enlarge your chunked attention window beyond (166, 165)? Also, how was the model trained, and what was the WER on the held-out set before and after fine-tuning? Perhaps the base model performed well enough on the held-out set, but fine-tuning overfit to the training domain, causing a WER increase of several percent absolute in the limited-context setting (or, again, the limited context at inference was insufficient, given that the model saw unlimited context at training time).
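To make that fp16 check concrete, something along these lines could work (a rough sketch only: the checkpoint name, audio paths, and reference transcripts are placeholders, and both transcribe()'s return type and its behavior under .half() vary across NeMo versions):

```python
import torch
from nemo.collections.asr.models import ASRModel
from nemo.collections.asr.metrics.wer import word_error_rate

# Placeholder held-out sample: a few audio files plus their reference transcripts.
audio_files = ["heldout_001.wav", "heldout_002.wav"]
references = ["reference transcript one", "reference transcript two"]

# Use the pretrained checkpoint or your fine-tuned .nemo file here.
model = ASRModel.from_pretrained("stt_en_conformer_ctc_medium")
model.eval()


def to_text(hyps):
    # Newer NeMo versions return Hypothesis objects; older ones return strings.
    return [h.text if hasattr(h, "text") else h for h in hyps]


# fp32 baseline.
model = model.float()
hyps_fp32 = to_text(model.transcribe(audio_files))
print("fp32 WER:", word_error_rate(hypotheses=hyps_fp32, references=references))

# Same model cast to fp16: if the attention activations overflow, the fp16
# hypotheses (and WER) should degrade noticeably relative to fp32.
if torch.cuda.is_available():
    model = model.half().cuda()
    hyps_fp16 = to_text(model.transcribe(audio_files))
    print("fp16 WER:", word_error_rate(hypotheses=hyps_fp16, references=references))
```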
First of all, thank you for the great work releasing all these models, code, and tutorials!
In my tests, I'm able to use medium-ctc-conformer to transcribe long audio offline in one go (e.g. a 10-minute audio file) on CPU (GPU fails with out-of-memory), with a very good word error rate. I didn't use any chunking:
`att_context_size=[-1, -1], att_context_style='regular'`
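For reference, this is roughly how that run is set up (a minimal sketch; the checkpoint name and the audio path are placeholders for what I actually use):

```python
from nemo.collections.asr.models import ASRModel

# Load the medium Conformer-CTC checkpoint on CPU (GPU runs out of memory
# for very long files with full-context attention).
model = ASRModel.from_pretrained("stt_en_conformer_ctc_medium", map_location="cpu")
model.eval()

# The pretrained model already uses full-context attention
# (att_context_size=[-1, -1], att_context_style='regular'), so a ~10-minute
# file can be transcribed in a single pass without any chunking.
hypotheses = model.transcribe(["long_audio_10min.wav"], batch_size=1)
print(hypotheses[0])
```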
This was a very nice surprise overall, given the existing streaming machinery for transcribing streaming and offline long audio files.
However, when I fine-tuned this Conformer on several hundred hours of data (it converges nicely on the held-out validation set), I observe the following:
With `att_context_style='chunked_limited'` and (for example) `att_context_size=[166, 165]` (during inference only, with the same fine-tuned model), I get a reasonable word error rate, but somewhat worse than with the pretrained Conformer (by several percent, absolute). Fine-tuning is done on chunks of at most 20 seconds.
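In case it helps, this is roughly how the attention context is switched for inference (a sketch only; it assumes the fine-tuned model is saved as a .nemo file, that overriding encoder.att_context_style / encoder.att_context_size through override_config_path is valid for this architecture in the installed NeMo version, and that the file names are placeholders):

```python
from omegaconf import OmegaConf, open_dict
from nemo.collections.asr.models import ASRModel

NEMO_PATH = "finetuned_conformer.nemo"  # placeholder path to the fine-tuned model

# Pull the config stored inside the checkpoint, switch the encoder to chunked
# limited-context attention, and restore the model with the modified config.
cfg = ASRModel.restore_from(NEMO_PATH, return_config=True)
with open_dict(cfg):
    cfg.encoder.att_context_style = "chunked_limited"
    cfg.encoder.att_context_size = [166, 165]
OmegaConf.save(cfg, "override.yaml")

model = ASRModel.restore_from(
    NEMO_PATH,
    override_config_path="override.yaml",
    map_location="cpu",
)
model.eval()
hypotheses = model.transcribe(["long_audio_10min.wav"])
```

As far as I understand, the limited context only changes the attention masking rather than the weights, so the fine-tuned weights load unchanged.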
Questions: