Mismatch between text and speech content in the English dataset used in the paper

Hello, when training on the VoiceAssistant-400K data, do you use the discrete SNAC tokens it provides to synthesize speech? When I decode those tokens with SNAC, they appear to be incomplete: the synthesized speech is shorter than the answer text. Did you apply any speech-text alignment processing during training?
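
For reference, a rough way to flag such samples without decoding any audio is to compare the duration implied by the token count with a duration estimated from the answer text. This is only a sketch: the function names are hypothetical, and the defaults (7 tokens per coarse frame, 2048 samples per frame at 24 kHz, about 2.5 spoken words per second) are assumptions about the 24 kHz SNAC layout and typical speaking rate, not values confirmed by the dataset or the paper.

```python
def snac_duration_seconds(num_tokens, tokens_per_frame=7,
                          samples_per_frame=2048, sample_rate=24000):
    """Estimate audio duration from a flattened SNAC token count.

    All defaults are assumptions about the token layout; adjust them
    to match how the dataset actually stores its tokens.
    """
    frames = num_tokens // tokens_per_frame  # ignore a trailing partial frame
    return frames * samples_per_frame / sample_rate


def looks_truncated(num_tokens, answer_text,
                    words_per_second=2.5, tolerance=0.7):
    """Return True if the audio seems much shorter than the text implies.

    words_per_second and tolerance are heuristic guesses, not measured values.
    """
    expected = len(answer_text.split()) / words_per_second
    actual = snac_duration_seconds(num_tokens)
    return actual < tolerance * expected


if __name__ == "__main__":
    # Hypothetical sample: 70 tokens paired with a 50-word answer.
    print(looks_truncated(70, "word " * 50))
```

Samples flagged this way could then be decoded and listened to, to confirm whether the speech really stops before the end of the answer text.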