Why does `pad_or_trim` use 1000 rather than 3000 in `transcribe_audio`? `mel = pad_or_trim(mel, 1000).to(model.device).to(dtype)`
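
For context, here is a rough sketch of what the two lengths would mean in seconds. It assumes the usual Whisper audio settings (16 kHz sample rate, hop length of 160 samples per mel frame); those values and the helper name `frames_to_seconds` are my assumptions, not something taken from this repo's code.

```python
# Assumed Whisper-style audio settings (not confirmed for this repo).
SAMPLE_RATE = 16000  # audio samples per second
HOP_LENGTH = 160     # audio samples between consecutive mel frames


def frames_to_seconds(n_frames: int,
                      sample_rate: int = SAMPLE_RATE,
                      hop_length: int = HOP_LENGTH) -> float:
    """Convert a number of mel frames to the audio duration it covers."""
    return n_frames * hop_length / sample_rate


print(frames_to_seconds(1000))  # 10.0 seconds
print(frames_to_seconds(3000))  # 30.0 seconds
```

If those settings apply here, 3000 frames would match a 30-second window, while 1000 frames would cover only 10 seconds of audio.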