May I ask which audio encoders, video encoders, and corresponding decoders are used in the model? The paper does not seem to mention them.