Replies: 2 comments 5 replies
-
After digging through the code, it actually seems to me like fuse_loss_wer is not using teacher forcing (which is what I was hoping). It seems to only reuse the encoder outputs from the loss computation. But it would still be good to get confirmation of this.
-
There's no teacher forcing in Transducers. A Google search turns up plenty of examples of the difference between BPTT (backpropagation through time) and teacher forcing for LSTMs. Simply passing ground truth into an LSTM means its gradient is calculated via BPTT; you have to take extra steps to actually perform teacher forcing, and there's quite a bit of literature on this for LSTMs in machine translation. Encoder and decoder outputs are both used to compute the joint, which is then used for the loss. Inference, however, is done autoregressively using only the encoder states. You can look at the RNNTDecoding code and the greedy or beam decoding logic to see how audio features alone are used during inference, with the prediction network (Prednet) run autoregressively from the blank state as init.
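To make the inference side of this concrete, here is a minimal toy sketch of greedy transducer decoding. The networks (`pred_net`, the joint projection) are stand-in assumptions, not NeMo code; the point is the control flow: only encoder states come from the audio, and the prediction network is advanced autoregressively from the blank/init state using the model's own emitted tokens.

```python
import numpy as np

BLANK = 0
VOCAB = 4  # blank + 3 symbols

rng = np.random.default_rng(0)
W_joint = rng.normal(size=(2, VOCAB))  # toy joint: logits = [enc_t, pred_u] @ W

def pred_net(token, state):
    # Toy prediction network: state is a running scalar summary of history.
    new_state = 0.5 * state + token
    return np.tanh(new_state), new_state

def greedy_decode(enc_states, max_symbols=3):
    """Greedy transducer decoding sketch: the prediction network sees only
    previously *decoded* tokens, never the ground-truth transcript."""
    hyp, state = [], 0.0
    pred_out, _ = pred_net(BLANK, 0.0)      # init from the blank state
    for enc_t in enc_states:                # advance through time frames
        for _ in range(max_symbols):        # allow several symbols per frame
            logits = np.array([enc_t, pred_out]) @ W_joint
            k = int(np.argmax(logits))
            if k == BLANK:                  # blank -> move to next frame
                break
            hyp.append(k)                   # non-blank: emit, update pred net
            pred_out, state = pred_net(k, state)
    return hyp

hyp = greedy_decode(rng.normal(size=5))
```

A hypothesis decoded this way is what an honest (non-teacher-forced) WER would be scored against.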
-
I'm reading through the explanation of the fused loss/WER step for transducers here. Based on the explanation (and as is typical for efficiency reasons), it seems like the loss calculation is using teacher forcing -- feeding ground-truth labels into the prediction network rather than the actually decoded tokens. That's fine for the loss during training, but what about WER? It also seems to be using teacher forcing, which seems unusual to me -- is it? From my understanding, teacher forcing would give an artificially low WER.
Maybe the WER is calculated with real autoregressive decoding in the validation step? Anyone know?
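For reference, here is a minimal sketch of what the loss side of a fused step might look like. This is an assumption for illustration, not the NeMo implementation: the full ground-truth label sequence is pushed through a toy prediction network in one parallel pass, and every (frame, label-position) pair is combined by a toy joint into the T x (U+1) x V lattice that a transducer loss consumes. The encoder outputs `enc` are the part that could be reused for a separate WER/decoding pass.

```python
import numpy as np

T, U, V, H = 4, 3, 5, 8           # frames, target length, vocab, hidden size
rng = np.random.default_rng(0)

enc = rng.normal(size=(T, H))     # encoder outputs (shared with any WER pass)
labels = np.array([2, 4, 1])      # ground-truth targets fed to the pred net

# Toy prediction network: embed each ground-truth label and prepend a
# blank/start output, giving U+1 positions along the prediction axis.
embed = rng.normal(size=(V, H))
pred = np.vstack([np.zeros(H), embed[labels]])      # (U+1, H)

# Toy joint: broadcast-add encoder and prediction states, project to logits.
W = rng.normal(size=(H, V))
joint = (enc[:, None, :] + pred[None, :, :]) @ W    # (T, U+1, V) lattice
```

Note the contrast with decoding: here the label axis is the transcript itself, computed in parallel, whereas WER requires running the prediction network autoregressively on decoded tokens.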