Handling Multi Pronunciation Words in ASR #3656

ahkarami · 2022-02-12T18:13:08Z

Hi,
Many languages have words with more than one pronunciation. Take the English word "twenty": some speakers pronounce the second "t" and others drop it. In other cases, the formal written form of a word differs from how it is actually spoken (e.g., "wanna", "gonna"). What is the appropriate way to handle such words in the training and test data sets?
One simple solution is to write every word in its formal written form, regardless of how it was actually pronounced. I examined the LibriSpeech test-clean set and found that all words there are written in their formal form (e.g., "twenty", "want to").
My other question: won't writing every word in its formal form hurt the acoustic model?
Best

Answered by titu1994

titu1994 · 2022-02-13T08:20:35Z

titu1994
Feb 13, 2022
Maintainer

So, in these cases what is the appropriate way to handle these words in our training/test data set?

You would leave them as is most of the time. Acoustically, these are common speech patterns, and the model should learn that the text should be represented as such. Of course, pronunciation depends on the dataset: optimizing for just American English and expecting the model to work with, say, British or Scottish English is unrealistic. The model will only be robust to accents it was trained on.

We train on some 12,000 hours of speech, and a lot of it is not formal speech. We normalize numerics and remove punctuation and capitalization, but keep the rest, and it seems to work quite well.
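
As a rough illustration of the kind of normalization described above (spelling out numerics, removing punctuation and capitalization), here is a minimal Python sketch. It is not the pipeline actually used for the 12,000-hour dataset: the helper names `spell_number` and `normalize_transcript` are hypothetical, and it only handles plain integers from 0 to 999 (no decimals, ordinals, or thousands separators).

```python
import re

_ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
         "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
         "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
_TENS = ["", "", "twenty", "thirty", "forty", "fifty",
         "sixty", "seventy", "eighty", "ninety"]


def spell_number(n: int) -> str:
    """Spell out an integer in 0..999 as words (hypothetical helper)."""
    if not 0 <= n <= 999:
        raise ValueError("only 0..999 is supported in this sketch")
    if n < 20:
        return _ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return _TENS[tens] if ones == 0 else _TENS[tens] + " " + _ONES[ones]
    hundreds, rest = divmod(n, 100)
    head = _ONES[hundreds] + " hundred"
    return head if rest == 0 else head + " " + spell_number(rest)


def normalize_transcript(text: str) -> str:
    """Spell out short digit runs, lowercase, and strip punctuation."""
    def _repl(match: re.Match) -> str:
        digits = match.group(0)
        # Assumption: digit runs longer than 3 characters are left as digits.
        return spell_number(int(digits)) if len(digits) <= 3 else digits

    text = re.sub(r"\d+", _repl, text)
    text = text.lower()
    # Keep letters, digits, apostrophes and spaces; drop everything else.
    text = re.sub(r"[^a-z0-9' ]+", " ", text)
    # Collapse the repeated spaces left behind by removed punctuation.
    return " ".join(text.split())
```

For example, `normalize_transcript("I want 20 apples, OK?")` yields `i want twenty apples ok`. Everything else in the transcript, including colloquial words, is kept as is.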

3 replies

ahkarami Feb 13, 2022
Author

@titu1994
Thank you very much for your complete and useful explanation.
I think I understand your answer, but let me ask my question a little more precisely to make sure I have understood your point correctly.
Which written form do you think is best in the human-annotated text for a word that is conceptually a single word, such as "twenty", whose formal spelling sometimes differs from how it is actually pronounced? I see three options:
1- Write it formally and uniformly everywhere [e.g., "twenty" throughout the data set].
2- Write it for each audio file as it is heard [e.g., "twenty" in some cases and "tweny" in others].
3- Use the more frequent form everywhere. For example, if the word is pronounced colloquially 7 times and formally 3 times in the data set, write the colloquial form everywhere.
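
To make option 3 concrete, here is a minimal Python sketch of a majority-vote rewrite. It assumes the transcripts were first written as heard, and that someone has already grouped the spellings that belong to the same conceptual word; the function names and the `variant_groups` structure are hypothetical.

```python
from collections import Counter


def majority_forms(transcripts, variant_groups):
    """Map every variant in a group to the group's most frequent spelling.

    variant_groups is a list of tuples, e.g. [("twenty", "tweny")].
    Ties are broken in favor of the variant listed first in the group.
    """
    counts = Counter(word for line in transcripts for word in line.split())
    mapping = {}
    for group in variant_groups:
        winner = max(group, key=lambda w: counts[w])
        for variant in group:
            mapping[variant] = winner
    return mapping


def apply_mapping(transcripts, mapping):
    """Rewrite each transcript word by word using the mapping."""
    return [" ".join(mapping.get(w, w) for w in line.split())
            for line in transcripts]


# 7 colloquial and 3 formal occurrences, as in the example above.
transcripts = ["tweny dollars"] * 7 + ["twenty dollars"] * 3
mapping = majority_forms(transcripts, [("twenty", "tweny")])
unified = apply_mapping(transcripts, mapping)
```

With these counts, every transcript ends up as `tweny dollars`, which shows the downside of this option: the output is no longer formal written text.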

My personal view is that, to reduce model errors, a word that is conceptually a single word but is sometimes pronounced differently, or has a colloquial form that differs from its written form, should be written in one form only throughout the data set. I also think the formal form is the better choice, because it removes the post-processing overhead of converting the model output to formal written text.

I would appreciate your opinion on this. Also, in the 12,000-hour data set you mentioned, on which your model was trained, how are such words written? Are they all in formal form or not? From my review of LibriSpeech, all such words are written in their formal form, even when the pronunciation differs from it. I have attached a sample of the word "twenty" from LibriSpeech: the annotated text says "twenty", but the speaker actually said "tweny" (audio file dev-clean_1215).
twenty--dev-clean_1215.zip
Thank you very much for your help.
Best

titu1994 Feb 14, 2022
Maintainer

Follow option (1) as much as possible: a consistent text format everywhere. The quality of your ground truth will dictate the quality of your model's transcriptions. The noisier the ground truth, the harder it is for ASR to get acceptable results.
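
Option (1) could be applied to existing transcripts with a simple lookup table, sketched below in Python. The table contents and the name `to_formal` are illustrative assumptions, not part of any toolkit; the sketch also assumes the transcript is already lowercased and free of punctuation.

```python
# Hypothetical table mapping colloquial spellings to their formal form.
CANONICAL_FORMS = {
    "tweny": "twenty",
    "wanna": "want to",
    "gonna": "going to",
}


def to_formal(transcript: str, canonical=CANONICAL_FORMS) -> str:
    """Replace each known colloquial word with its formal spelling."""
    return " ".join(canonical.get(word, word) for word in transcript.split())
```

For instance, `to_formal("i wanna buy tweny apples")` gives `i want to buy twenty apples`. A fixed table like this only covers the variants someone has listed, so it complements consistent annotation guidelines rather than replacing them.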

Formal text transcriptions have standard guidelines and well-defined rules, and automated grammar-correction software can further fix misspellings to bring the text into formal format. This is harder to automate for colloquial speech fragments, so formal is always preferred.

However, in reality it is exceptionally difficult to obtain only formally transcribed sources. Invariably there will be mistakes in the ground truth, and other kinds of text will be part of the ASR training data. This will sadly impact ASR accuracy, but overall the impact should be minor if the model is trained to convergence.

For the 12k-hour dataset, we don't assert that all of it is formal English speech. Models trained on it have not shown any significant signs of incorrect transcriptions, though.

ahkarami Feb 14, 2022
Author

Thanks for your great answer. I agree with your opinions. Your answer helped me a lot.
Best
