Finetuning a causal model from a non-causal checkpoint #14597
pointersbad started this conversation in General
Hello!
First off, thanks for the awesome work; NeMo is a great tool for research and for setting up experiments, and I enjoy using it very much.
I want to create a causal ASR model for Kyrgyz and, because of data scarcity, I thought it would be a good idea to start from stt_kk_ru_fastconformer_hybrid_large, as it was trained on ~1k hours of Kazakh, which is very close to my target language. I changed the config to make the model causal with limited lookahead, then loaded the weights where possible (the pre_encoder weights loaded only partially due to a size mismatch after the non-causal -> causal convolution conversion). After training the model on 32 hours of Kyrgyz data, I get a validation WER of around 35%, while the original model's WER on the MCV test set is ~15%.
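For reference, here is roughly the procedure as a minimal sketch (not my exact code). The encoder fields `att_context_style`, `att_context_size`, `conv_context_size`, and `causal_downsampling` follow NeMo's cache-aware streaming FastConformer configs; the context-size values below are placeholders, and rebuilding the model from the restored config may need tokenizer paths adjusted:

```python
# Sketch: turn a non-causal NeMo checkpoint into a causal/limited-lookahead
# model and copy over every weight whose shape still matches.
import copy

import nemo.collections.asr as nemo_asr
from omegaconf import open_dict

# 1) Load the offline (non-causal) pretrained hybrid model.
pretrained = nemo_asr.models.EncDecHybridRNNTCTCBPEModel.from_pretrained(
    model_name="stt_kk_ru_fastconformer_hybrid_large"
)

# 2) Build a causal config with limited right context (lookahead).
#    Field names follow NeMo's streaming FastConformer configs;
#    the [left, right] sizes here are placeholders, not tuned values.
cfg = copy.deepcopy(pretrained.cfg)
with open_dict(cfg):
    cfg.encoder.att_context_style = "chunked_limited"  # limited lookahead
    cfg.encoder.att_context_size = [70, 13]            # [left, right] context
    cfg.encoder.conv_context_size = "causal"           # causal depthwise convs
    cfg.encoder.causal_downsampling = True             # causal pre_encoder convs

causal_model = nemo_asr.models.EncDecHybridRNNTCTCBPEModel(cfg=cfg)

# 3) Partial load: skip tensors whose shapes changed (e.g. the pre_encoder
#    convolutions after the non-causal -> causal switch).
src = pretrained.state_dict()
dst = causal_model.state_dict()
matched = {k: v for k, v in src.items() if k in dst and v.shape == dst[k].shape}
skipped = sorted(set(src) - set(matched))
causal_model.load_state_dict(matched, strict=False)
print(f"copied {len(matched)} tensors, skipped {len(skipped)}: {skipped[:5]} ...")
```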
Now I'm wondering: is this an expected outcome, given that the offline and streaming tasks differ and the attention layers can only adapt so much on 32 hours, or did I miss something during training? Has anyone run experiments like this, and what was your experience?