Which encoders and decoders are used for each data modality?

May I ask which audio encoders, video encoders, and corresponding decoders are used in the model? The paper does not seem to mention them.