Is there a way to get much better results with CTC segmentation in French ? #5839

Ca-ressemble-a-du-fake · 2023-01-23T14:42:52Z

Hi,

First of all, thanks for publishing all those pretrained models and writing detailed information on how TTS / ASR / Model ... work. These docs are very useful to me for learning AI.

I need to align text to audio to generate a French dataset for TTS training. So I followed [CTC segmentation](https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/tools/CTC_Segmentation_Tutorial.ipyn), but the resulting segments are all way off!
I tried shortening and widening the WINDOW size and using different CTC models (stt_fr_conformer_ctc_large and stt_fr_citrinet_1024_gamma_0_25), with the same incorrect alignments (neither the beginning nor the end is correct).

Please note that for most segments the alignment confidence is < -2 (some are -0.05, ...), but even with those confidence values no segment is correctly aligned with the text.

If it works correctly with Spanish (see the examples in the Colab notebook), it should work with French, shouldn't it?

So what can I do to get much better results? Here is how I run the CTC segmentation:

python3 /home/caraduf/Nemo/NeMo/tools/ctc_segmentation/scripts/prepare_data.py \
--in_text=/home/caraduf/Tests/Nemo/input \
--output_dir=/home/caraduf/Tests/Nemo/output/processed/ \
--language='other' \
--cut_prefix=3 \
--model=stt_fr_conformer_ctc_large \
--audio_dir=/home/caraduf/Tests/Nemo/input

python3 /home/caraduf/Nemo/NeMo/tools/ctc_segmentation/scripts/run_ctc_segmentation.py \
--output_dir=/home/caraduf/Tests/Nemo/output \
--data=/home/caraduf/Tests/Nemo/output/processed \
--model=stt_fr_conformer_ctc_large \
--window_len=8000

Thanks in advance for your advice

Answered by ekmb

titu1994 · 2023-01-23T19:55:58Z

titu1994
Jan 23, 2023
Maintainer

Like any trained model, alignment is rarely going to be perfect. Usually Conformer models can give better alignment, but since you have already tried that, I'm not sure what else can be tried.

Are you sure your text is punctuated and is not missing text that is spoken? (Things like "uhm", "uh", "hmm", if not present in the text, can all affect the alignment calculation.)

@ekmb any other advice?

10 replies

titu1994 Jan 24, 2023
Maintainer

The ground truth must be provided. That's why it's forced alignment.

titu1994 Jan 24, 2023
Maintainer

Also, the most likely alignment isn't necessarily the most correct acoustic alignment. Nothing in the loss specifies how it must align sequences, only that it should minimize the path to the most likely alignment. What is most likely under the loss function is not necessarily the 1:1 alignment that humans may want.

Ca-ressemble-a-du-fake Jan 24, 2023
Author

I've just tried out NFA on the first sentence and it was nearly perfect. Weird, right? I used it with Citrinet (as I did with the CTC segmentation tool) and provided the sentence with punctuation (I don't know if the CTC segmentation tool actually uses "with_punct" or "with_punctuation_normalized").

If I compare with CTC segmentation, I had the following timestamps: "Start 0.07993884202453988 End 5.475810678680982", whereas with NFA I get something correct: "Start 0.08 End 6.80 (+ duration of 19.52)". Listening in Audacity, it sounds perfect.

But how should I read the duration provided by the NFA tool? The manual says "duration in samples", but then the values should be integers. For example, I get "0.08 0.08" for the first word and "6.80 19.52" for the last word.

Edit: quoting from the CTC segmentation notebook:

The .txt file without punctuation contains preprocessed text phrases that we're going to align within the audio file. Here, we split the text into sentences. Each line should contain a text snippet for alignment.

So this is why the CTC segmentation did not work for me: it should use the text with punctuation, shouldn't it?

titu1994 Jan 24, 2023
Maintainer

NeMo Forced Aligner and CTC segmentation use the model's underlying tokenizer for the alignment process. NeMo models are not trained with punctuation as of now (this will change soon), so you must reformat your text to contain only lower-case letters, the apostrophe, and perhaps the minimal set of French vocab tokens. Only then will alignment be successful. Otherwise you'll get values off by some duration.
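
The reformatting described above could look roughly like the following. This is only a sketch: the allowed character set is an assumption (a given model's vocabulary may differ), the function name is hypothetical, and this is not the normalization code that NeMo itself runs.

```python
import re
import unicodedata

# Assumed character set: lowercase ASCII letters, common French accented
# letters and ligatures, the apostrophe, and the space. Check the actual
# vocabulary of the model you use; it may differ.
ALLOWED = "a-zàâäçéèêëîïôöùûüÿœæ' "


def normalize_for_alignment(text: str) -> str:
    """Lowercase the text and keep only characters from ALLOWED."""
    # Compose accents so that "e" + combining accent becomes a single "é".
    text = unicodedata.normalize("NFC", text).lower()
    # Map the typographic apostrophe to the plain ASCII one.
    text = text.replace("\u2019", "'")
    # Everything outside the allowed set (punctuation, digits, ...) becomes a space.
    text = re.sub(f"[^{ALLOWED}]", " ", text)
    # Collapse runs of whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    print(normalize_for_alignment("Bonjour, l\u2019été !"))
```

Note that digits are simply dropped in this sketch; numbers that are actually spoken in the audio would need to be spelled out in the text beforehand.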

Another thing about NeMo Forced Aligner is that its resolution is highly dependent on the stride of the model. Citrinet has 8x stride, so you'll get timestamps at 80 ms intervals. Conformer has 4x stride, so you'll get 40 ms offsets. To compute the offset, multiply the stride (4 or 8) by 10 ms.
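
As a small illustration of that arithmetic (the 10 ms base step and the stride values are taken from the comment above; the function names are hypothetical and not part of NeMo):

```python
BASE_FRAME_MS = 10  # base step quoted above, in milliseconds


def timestamp_resolution_ms(stride: int) -> int:
    """Time covered by one model output frame, in milliseconds."""
    return stride * BASE_FRAME_MS


def frame_to_seconds(frame_index: int, stride: int) -> float:
    """Convert an output frame index to a time offset in seconds."""
    return frame_index * timestamp_resolution_ms(stride) / 1000.0


if __name__ == "__main__":
    for name, stride in (("Citrinet", 8), ("Conformer", 4)):
        print(f"{name}: {timestamp_resolution_ms(stride)} ms per output frame")
```

With an 8x stride, every timestamp is a multiple of 0.08 s, which is consistent with values such as 0.08 and 6.80 reported earlier in this thread for Citrinet.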

Ca-ressemble-a-du-fake Jan 24, 2023
Author

That's weird, because I gave NFA text with punctuation and the result was awesome. I will now poke around with those two tools and more samples. I'll try to figure out what those durations mean. Thanks again for your advice.

titu1994 · 2023-01-24T07:30:47Z

titu1994
Jan 24, 2023
Maintainer

FYI @erastorgueva-nv, maybe some questions here can be used to improve NFA.

0 replies

ekmb · 2023-01-24T17:16:57Z

ekmb
Jan 24, 2023
Collaborator

Hi @Ca-ressemble-a-du-fake, CTC segmentation takes care of punctuated text: it processes and normalizes it, and makes sure that only symbols supported by the model are used for the alignment, so it shouldn't be an issue. I've tried fr_citrinet on [the first audio from here](https://librivox.org/compilation-de-poemes-012-by-various/), and it worked well.

A few questions:

How fast is the speech in your audio?
Have you checked the processed audio in /home/caraduf/Tests/Nemo/output/processed/ after resampling with --cut_prefix=3? Does it sound OK?

CTC segmentation could struggle with start/end if the speaker talks too fast, but it is usually an issue of a few milliseconds, not a difference of seconds.
If you add fr [here](https://github.com/NVIDIA/NeMo/blob/main/tools/ctc_segmentation/scripts/prepare_data.py#L50), it will run a basic text normalization, but even with "other" it should work.

2 replies

Ca-ressemble-a-du-fake Jan 24, 2023
Author

Thank you @ekmb for your reply. The speech speed is normal (not fast at all). I did not listen to the resampled audio since it was already 16 kHz, but this is still a good idea; I will check it. I will also try the audio file you tested. The text seems correctly normalized (all lower case, no punctuation present, and the other files look alike, with the same punctuation in both of them).

I'll keep you posted !

Ca-ressemble-a-du-fake Jan 25, 2023
Author

@ekmb I tried the LibriVox example you used and ... silly me! I used --cut_prefix=3 as shown in the first tutorial; it does not appear in the updated tutorial.

That's why everything was shifted by 3 seconds! Now that I have removed this option, the segmentation works great!
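
To illustrate the shift: assuming --cut_prefix trims that many seconds from the start of each audio file before alignment (an assumption based on the behaviour described here), timestamps computed on the trimmed audio are off by the same amount when checked against the original recording. A minimal sketch, with a hypothetical helper name:

```python
from typing import Tuple


def to_original_timeline(start: float, end: float, cut_prefix: float) -> Tuple[float, float]:
    """Map a segment found in the trimmed audio back onto the untrimmed file.

    cut_prefix is the number of seconds assumed to have been removed from
    the beginning of the audio before alignment.
    """
    return start + cut_prefix, end + cut_prefix


if __name__ == "__main__":
    # A segment at 0.5-2.0 s in the trimmed audio, with 3 s cut from the start.
    print(to_original_timeline(0.5, 2.0, 3.0))
```

If the original recording has no leading material to skip, dropping the option, as done here, avoids the offset altogether.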

So now it is solved. Thank you for guiding me through this!

Replies: 3 comments · 12 replies
