Replies: 2 comments 5 replies
-
After digging through the code, it actually seems to me like fuse_loss_wer is not using teacher forcing (which is what I was hoping). It seems to only reuse the encoder outputs from the loss computation. But it would still be good to get confirmation of this.
-
There's no teacher forcing in Transducers. A Google search turns up plenty of examples of the difference between BPTT (backpropagation through time) and teacher forcing for LSTMs. Simply passing ground truth into an LSTM means its gradient is calculated via BPTT; you have to take extra steps to actually perform teacher forcing, and there's quite a bit of literature on this for LSTMs in machine translation. Encoder and decoder outputs are both used to compute the joint, which is then used for the loss. Inference, however, is done autoregressively using only the encoder states. You can look at the RNNTDecoding code and the greedy or beam decoding logic to see how audio features alone are used during inference, with the prediction network (Prednet) run autoregressively from the blank state as init.
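To make the inference side of this concrete, here is a minimal toy sketch of greedy transducer decoding. The networks (`pred_net`, the joint projection) are stand-in assumptions, not NeMo code; the point is the control flow: only encoder states come from the audio, and the prediction network is advanced autoregressively from the blank/init state using the model's own emitted tokens.

```python
import numpy as np

BLANK = 0
VOCAB = 4  # blank + 3 symbols

rng = np.random.default_rng(0)
W_joint = rng.normal(size=(2, VOCAB))  # toy joint: logits = [enc_t, pred_u] @ W

def pred_net(token, state):
    # Toy prediction network: state is a running scalar summary of history.
    new_state = 0.5 * state + token
    return np.tanh(new_state), new_state

def greedy_decode(enc_states, max_symbols=3):
    """Greedy transducer decoding sketch: the prediction network sees only
    previously *decoded* tokens, never the ground-truth transcript."""
    hyp, state = [], 0.0
    pred_out, _ = pred_net(BLANK, 0.0)      # init from the blank state
    for enc_t in enc_states:                # advance through time frames
        for _ in range(max_symbols):        # allow several symbols per frame
            logits = np.array([enc_t, pred_out]) @ W_joint
            k = int(np.argmax(logits))
            if k == BLANK:                  # blank -> move to next frame
                break
            hyp.append(k)                   # non-blank: emit, update pred net
            pred_out, state = pred_net(k, state)
    return hyp

hyp = greedy_decode(rng.normal(size=5))
```

A hypothesis decoded this way is what an honest (non-teacher-forced) WER would be scored against.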
-
I'm reading through the explanation of the fused loss/WER step for transducers here. Based on the explanation (and as is typical for efficiency reasons), it seems like the loss calculation is using teacher forcing -- feeding ground-truth labels into the prediction network rather than the actually decoded tokens. That's fine for the loss during training, but what about WER? It also seems to be using teacher forcing, which seems unusual to me -- is it? From my understanding, teacher forcing would give an artificially low WER.
Maybe the WER is calculated with real autoregressive decoding in the validation step? Anyone know?
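For reference, here is a minimal sketch of what the loss side of a fused step might look like. This is an assumption for illustration, not the NeMo implementation: the full ground-truth label sequence is pushed through a toy prediction network in one parallel pass, and every (frame, label-position) pair is combined by a toy joint into the T x (U+1) x V lattice that a transducer loss consumes. The encoder outputs `enc` are the part that could be reused for a separate WER/decoding pass.

```python
import numpy as np

T, U, V, H = 4, 3, 5, 8           # frames, target length, vocab, hidden size
rng = np.random.default_rng(0)

enc = rng.normal(size=(T, H))     # encoder outputs (shared with any WER pass)
labels = np.array([2, 4, 1])      # ground-truth targets fed to the pred net

# Toy prediction network: embed each ground-truth label and prepend a
# blank/start output, giving U+1 positions along the prediction axis.
embed = rng.normal(size=(V, H))
pred = np.vstack([np.zeros(H), embed[labels]])      # (U+1, H)

# Toy joint: broadcast-add encoder and prediction states, project to logits.
W = rng.normal(size=(H, V))
joint = (enc[:, None, :] + pred[None, :, :]) @ W    # (T, U+1, V) lattice
```

Note the contrast with decoding: here the label axis is the transcript itself, computed in parallel, whereas WER requires running the prediction network autoregressively on decoded tokens.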