RNNT was designed to be a streaming architecture for ASR. During inference, no matter how long the sequence of labels emitted previously, the output is P(y_(t,u) | x_t, y_(u-1)), where y is the label, x is the audio features, t is the acoustic timestep, and u is the label timestep.

Since the LSTM depends only on y_(u-1) (conditioned on the LSTM state) to predict y_u (ignoring x_t for this discussion), it is extremely efficient for audio inference, and can be applied directly to thousands of timesteps over an hour-long audio recording.
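To make the constant-cost property concrete, here is a toy sketch of a prediction-network step (a single tanh recurrence stands in for a real LSTM cell; all sizes and weight names are made up for illustration). The point is that each step consumes only y_(u-1) and the carried state, so per-step work does not grow with the number of labels already emitted:

```python
import numpy as np

rng = np.random.default_rng(0)

H, V = 8, 5  # hidden size, vocab size (toy values)

# Toy recurrence parameters (stand-ins for a real LSTM cell's weights).
W_y = rng.normal(size=(H, V))   # embeds the previous label y_(u-1)
W_h = rng.normal(size=(H, H))   # recurrence on the hidden state
W_o = rng.normal(size=(V, H))   # projects state to label logits

def predictor_step(y_prev, h_prev):
    """One prediction-network step: cost is O(H^2), independent of
    how many labels were emitted before -- only y_(u-1) and the
    carried state h_prev are consumed."""
    y_onehot = np.eye(V)[y_prev]
    h = np.tanh(W_y @ y_onehot + W_h @ h_prev)
    logits = W_o @ h
    return logits, h

# Emit labels one at a time; per-step work stays constant even after
# thousands of steps, which is what makes streaming RNNT tractable.
h = np.zeros(H)
y = 0  # start label
for _ in range(1000):
    logits, h = predictor_step(y, h)
    y = int(np.argmax(logits))
```

In a real RNNT these logits would be combined with the encoder output for frame x_t in the joint network; the sketch only shows the label-side recurrence.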

This would be intractable for an attention decoder, because it requires all previous labels y_(u-1) .. y_0 to predict y_u, incurring quadratic total cost over the label timesteps u.
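A quick back-of-the-envelope sketch of that cost difference (purely illustrative operation counts, not a real decoder):

```python
def attention_decode_cost(U):
    """Attention decoder: step u must attend over all u previously
    emitted labels, so decoding U labels costs O(U^2) total."""
    return sum(u for u in range(1, U + 1))

def rnnt_predictor_cost(U):
    """RNNT prediction network: each step touches only y_(u-1) plus
    the carried state, so decoding U labels costs O(U) total."""
    return U

U = 1000
print(attention_decode_cost(U))  # 500500 -> grows quadratically
print(rnnt_predictor_cost(U))    # 1000   -> grows linearly
```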

Do note that this is also taking in…

Answer selected by maxeduc