-
These models use RNNTDecoder classes for the prediction network, which internally use LSTMs (often single-layer by default). This is consistent with most papers I've seen on this, but does anyone know why transformers tend not to be used in prediction networks instead? They seem to be replacing LSTMs everywhere else. Also, is a transformer-based prediction network supported in NeMo? Thank you!
-
RNNT was designed to be a streaming architecture for ASR. During inference, no matter how long the sequence of previously emitted labels is, the model predicts P(y_(t,u) | x_t, y_(u-1)), where y is the label, x is the audio features, t is the acoustic timestep, and u is the label timestep.

Since the LSTM depends only on y_(u-1) (conditioned on the LSTM state) to predict y_u (ignoring x_t for this discussion), it is extremely efficient for audio inference, and can be applied directly to thousands of timesteps over an hour-long audio. This would be intractable for an attention decoder, because it requires all previous labels y_(u-1) .. y_0 to predict y_u, and incurs quadratic cost over the label timesteps u.

Do note that this also takes into account that RNNT itself is a monotonic loss, and that speech recognition usually does not depend on the entire label sequence to predict the current step (unlike translation or summarization tasks in NLP). In ASR, even with a Conformer using global attention, the current token mostly depends on only the last 5-10 label tokens.

In fact, research has gone in the other direction, claiming that even the LSTM state is not very important, and instead using just a simple single-token or K-token lookback embedding buffer (without any recurrent state). We call this the StatelessRNNTDecoder in NeMo, and surprisingly find that most of the time it gets WER quite comparable to the LSTM decoder. We just prefer the LSTM decoder because it is commonly used in the literature and has the potential to do better if the model needs to learn stateful dependencies.

Finally, attention has been used as a decoder in AED (attention encoder-decoder) architectures (see ESPnet models). It does very well, but is not streamable (at least not without plenty of masking tricks, buffers, etc.).
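To make the per-step cost contrast concrete, here is a minimal PyTorch sketch (hypothetical class names, not NeMo's actual implementations) of the three prediction-network styles described above: an LSTM step with O(1) cost per label, a stateless K-token lookback embedding, and a self-attention step whose cost grows with the label history. The mean-pooling in the stateless variant is just one illustrative way to merge the context embeddings, not necessarily what StatelessRNNTDecoder does internally.

```python
import torch
import torch.nn as nn

VOCAB, HIDDEN = 1024, 320

class LSTMPredNet(nn.Module):
    # Each step consumes only y_(u-1) plus a fixed-size (h, c) state,
    # so the cost per emitted label is O(1) regardless of u.
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, HIDDEN)
        self.lstm = nn.LSTM(HIDDEN, HIDDEN, num_layers=1, batch_first=True)

    def step(self, y_prev, state=None):          # y_prev: (batch,)
        g, state = self.lstm(self.embed(y_prev).unsqueeze(1), state)
        return g.squeeze(1), state               # state size is constant in u

class StatelessPredNet(nn.Module):
    # No recurrent state at all: just embed the last K labels kept in a
    # small lookback buffer and merge them.
    def __init__(self, k=2):
        super().__init__()
        self.k = k
        self.embed = nn.Embedding(VOCAB, HIDDEN)

    def step(self, y_last_k):                    # y_last_k: (batch, K)
        return self.embed(y_last_k).mean(dim=1)  # (batch, HIDDEN)

class AttnPredNet(nn.Module):
    # Step u must attend over the entire history y_0 .. y_(u-1), so the
    # per-step cost grows linearly in u and total decoding cost is O(U^2).
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, HIDDEN)
        self.attn = nn.MultiheadAttention(HIDDEN, num_heads=4, batch_first=True)

    def step(self, y_all):                       # y_all: (batch, u) full history
        h = self.embed(y_all)
        out, _ = self.attn(h[:, -1:], h, h)      # query = current position only
        return out.squeeze(1)

# Tiny smoke test: one decoding step for each style.
y_prev = torch.randint(VOCAB, (4,))
g, st = LSTMPredNet().step(y_prev)                        # O(1) per step
g2 = StatelessPredNet(k=2).step(torch.randint(VOCAB, (4, 2)))
g3 = AttnPredNet().step(torch.randint(VOCAB, (4, 17)))    # cost grows with history
```

The point of the sketch is only the asymptotics: the first two `step` calls touch a fixed amount of state no matter how many labels have been emitted, while the attention step must be handed (and re-process) the full label history at every u.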
-
As a counterpoint to the fact that a stateless decoder with a 2-label lookback can match the LSTM decoder: the one case where attention might benefit is end-to-end speech translation, where you actually do need global context in order to do proper inference. Then again, there are other ways to instill global information into RNNT for speech translation, as shown by a few papers from Microsoft.