-
These models use RNNTDecoder classes for the prediction network, which internally use LSTMs (often single-layer by default). This is consistent with most papers I've seen on this, but does anyone know why transformers tend not to be used in prediction networks instead? They seem to be replacing LSTMs everywhere else. Also, is a transformer-based prediction network supported in NeMo? Thank you!
-
RNNT was designed to be a streaming architecture for ASR. During inference, no matter how long the sequence of previously emitted labels is, the model predicts P(y_(t,u) | x_t, y_(u-1)), where y is the label, x is the audio features, t is the acoustic timestep, and u is the label timestep.

Since the LSTM depends only on y_(u-1) (conditioned on the LSTM state) to predict y_u (ignoring x_t for this discussion), it is extremely efficient for audio inference, and can be applied directly to thousands of timesteps over an hour-long audio. This would be intractable for an attention decoder, because it requires all previous labels y_(u-1) .. y_0 to predict y_u, and incurs quadratic cost over the label timesteps u.

Do note that this also takes into account that RNNT itself is a monotonic loss, and that speech recognition usually does not depend on the entire label sequence to predict the current step (unlike translation or summarization tasks in NLP). In ASR, even with a Conformer using global attention, the current token mostly depends on only the last 5-10 label tokens.

In fact, research has gone in the other direction, claiming that even the LSTM state is not very important, and instead using just a simple single-token or K-token lookback embedding buffer (without any recurrent state). We call this the StatelessRNNTDecoder in NeMo, and surprisingly find that most of the time it gets WER quite comparable to the LSTM decoder. We just prefer the LSTM decoder because it is commonly used in the literature and has the potential to do better if the model needs to learn stateful dependencies.

Finally, attention has been used as a decoder in AED (attention encoder-decoder) architectures (see ESPnet models). It does very well, but is not streamable (at least not without plenty of masking tricks, buffers, etc.).
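To make the per-step cost contrast concrete, here is a minimal PyTorch sketch (hypothetical class names, not NeMo's actual implementations) of the three prediction-network styles described above: an LSTM step with O(1) cost per label, a stateless K-token lookback embedding, and a self-attention step whose cost grows with the label history. The mean-pooling in the stateless variant is just one illustrative way to merge the context embeddings, not necessarily what StatelessRNNTDecoder does internally.

```python
import torch
import torch.nn as nn

VOCAB, HIDDEN = 1024, 320

class LSTMPredNet(nn.Module):
    # Each step consumes only y_(u-1) plus a fixed-size (h, c) state,
    # so the cost per emitted label is O(1) regardless of u.
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, HIDDEN)
        self.lstm = nn.LSTM(HIDDEN, HIDDEN, num_layers=1, batch_first=True)

    def step(self, y_prev, state=None):          # y_prev: (batch,)
        g, state = self.lstm(self.embed(y_prev).unsqueeze(1), state)
        return g.squeeze(1), state               # state size is constant in u

class StatelessPredNet(nn.Module):
    # No recurrent state at all: just embed the last K labels kept in a
    # small lookback buffer and merge them.
    def __init__(self, k=2):
        super().__init__()
        self.k = k
        self.embed = nn.Embedding(VOCAB, HIDDEN)

    def step(self, y_last_k):                    # y_last_k: (batch, K)
        return self.embed(y_last_k).mean(dim=1)  # (batch, HIDDEN)

class AttnPredNet(nn.Module):
    # Step u must attend over the entire history y_0 .. y_(u-1), so the
    # per-step cost grows linearly in u and total decoding cost is O(U^2).
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, HIDDEN)
        self.attn = nn.MultiheadAttention(HIDDEN, num_heads=4, batch_first=True)

    def step(self, y_all):                       # y_all: (batch, u) full history
        h = self.embed(y_all)
        out, _ = self.attn(h[:, -1:], h, h)      # query = current position only
        return out.squeeze(1)

# Tiny smoke test: one decoding step for each style.
y_prev = torch.randint(VOCAB, (4,))
g, st = LSTMPredNet().step(y_prev)                        # O(1) per step
g2 = StatelessPredNet(k=2).step(torch.randint(VOCAB, (4, 2)))
g3 = AttnPredNet().step(torch.randint(VOCAB, (4, 17)))    # cost grows with history
```

The point of the sketch is only the asymptotics: the first two `step` calls touch a fixed amount of state no matter how many labels have been emitted, while the attention step must be handed (and re-process) the full label history at every u.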
-
As a counterpoint to the fact that a stateless decoder with a 2-label lookback can match the LSTM decoder: the one case where attention might benefit is end-to-end speech translation, where you actually do need global context in order to do proper inference. Then again, there are other ways to instill global information into RNNT for speech translation, as shown by a few papers from Microsoft.