Is there a way to get much better results with CTC segmentation in French? #5839
-
Hi, first of all, thanks for publishing all those pretrained models and for writing detailed information on how TTS / ASR / models work. These docs are very useful to me for learning AI.

I need to align text to audio to generate a French dataset for TTS training. So I followed the [CTC segmentation](https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/tools/CTC_Segmentation_Tutorial.ipyn) tutorial, but the resulting segments are all way off. Please note that for most segments the alignment confidence is < -2 (some are -0.05, ...), but even at those confidence levels no segment is correctly aligned with the text.

If it works correctly with Spanish (see the examples in the Colab notebook), it should work with French, shouldn't it? So what can I do to get much better results? Here is how I run the CTC segmentation:
Thanks in advance for your advice.
Replies: 3 comments 12 replies
-
As with any trained model, the alignment is rarely going to be perfect. Conformer models can usually give better alignment, but since you have already tried that, I'm not sure what else can be tried. Are you sure your text is punctuated and is not missing anything that is spoken? Fillers like "uhm", "uh", or "hmm", if absent from the text, can all affect the alignment calculation. @ekmb, any other advice?
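Since alignment is rarely perfect, segments are often screened by their confidence score before being used. Below is a minimal sketch of that idea, not NeMo's own code: the `Segment` class, the `split_by_score` helper, and the sample values are all hypothetical, and the threshold of -2 is simply the value mentioned in the question. It assumes scores are log-scale, with values closer to 0 meaning higher confidence. Note that this only discards low-confidence segments; it would not fix a case where even high-scoring segments are misaligned.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical container for one aligned segment."""
    start: float  # segment start, in seconds
    end: float    # segment end, in seconds
    score: float  # alignment confidence (assumed: closer to 0 is better)
    text: str


def split_by_score(segments, threshold=-2.0):
    """Split segments into (kept, dropped) around a score threshold."""
    kept = [s for s in segments if s.score > threshold]
    dropped = [s for s in segments if s.score <= threshold]
    return kept, dropped


# Made-up sample data, for illustration only.
segments = [
    Segment(0.0, 2.4, -0.05, "bonjour à tous"),
    Segment(2.4, 5.1, -3.7, "ceci est un exemple"),
    Segment(5.1, 7.0, -1.2, "merci"),
]

kept, dropped = split_by_score(segments)
for s in kept:
    print(f"{s.start:.1f}-{s.end:.1f} ({s.score}): {s.text}")
```

Listening to a handful of the kept segments is still the most reliable check that the scores mean what you expect for your data.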
-
FYI @erastorgueva-nv, maybe some of the questions here can be used to improve NFA.
-
Hi @Ca-ressemble-a-du-fake, CTC segmentation takes care of punctuated text: it processes and normalizes it, and makes sure that only symbols supported by the model are used for the alignment, so that shouldn't be an issue. I've tried fr_citrinet on [the first audio from here](https://librivox.org/compilation-de-poemes-012-by-various/), and it worked well.
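To illustrate what "only symbols supported by the model" can mean in practice, here is a rough sketch. It is not the NeMo implementation: the `restrict_to_vocabulary` helper and the character set are hypothetical, and it assumes a character-level vocabulary, whereas real models may use subword tokenizers.

```python
import re


def restrict_to_vocabulary(text, vocabulary, replacement=" "):
    """Lowercase text and replace characters outside the vocabulary."""
    text = text.lower()
    cleaned = "".join(ch if ch in vocabulary else replacement for ch in text)
    # Collapse the runs of whitespace left behind by removed symbols.
    return re.sub(r"\s+", " ", cleaned).strip()


# Hypothetical French character vocabulary, for illustration only.
vocab = set("abcdefghijklmnopqrstuvwxyzàâçéèêëîïôùûüÿœ' ")

normalized = restrict_to_vocabulary("Bonjour, ça va ? Très bien !", vocab)
print(normalized)  # bonjour ça va très bien
```

Comparing the text before and after such a step is a quick way to see whether accented characters or other French-specific symbols are being dropped.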
A few questions:
/home/caraduf/Tests/Nemo/output/processed/
after resampling --cut_prefix=3? does it sound ok?
CTC segmentation can struggle with segment starts and ends if the speaker talks too fast, but that is usually an issue of a few milliseconds, not seconds.
If you add
…