How do I use the exported FastPitch ONNX model to generate a spectrogram from raw text using onnxruntime? #4130
@borisfom, could you please have a look at this? Thanks.
In my case, I succeeded with the same solution that @godspirit00 explained. But when I try to make the speech more solemn, as explained in this tutorial, it does not work correctly with ONNX. Here is the code of interest:

```python
with torch.no_grad():
    spec_norm, audio_norm, durs_norm_pred, pitch_norm_pred = str_to_audio(input_string)
    # Let's try to make the speech more solemn
    # Let's deamplify the pitch and shift the pitch down by 75% of 1 standard deviation
    pitch_sol = pitch_norm_pred * 0.75 - 0.75
    # FastPitch tends to raise the pitch before "loss", which sounds inappropriate. Let's just remove that pitch raise
    pitch_sol[0][-5] += 0.2
    # Now let's pass our new pitch to FastPitch with a 90% pacing to slow it down
    spec_sol, audio_sol, durs_sol_pred, _ = str_to_audio(input_string, pitch=pitch_sol, pace=0.9)
```

As you can see, a first inference is run to obtain pitch_norm_pred; that array is then modified and passed back as input to obtain a solemn voice. With the ONNX model, the result sounds as if the speech runs out of air from the third word of each sentence onward. If I run the same experiment with the original FastPitch model, it works fine, as explained in the tutorial. Is it possible that the pitch input for ONNX should be unnormalized? If that is the case, how do I do that?
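One way to test the unnormalized-pitch hypothesis is to undo the normalization with the dataset's pitch statistics before handing the pitch to the ONNX session. The sketch below is an assumption-laden illustration, not a confirmed fix: the config attribute names (`pitch_mean`, `pitch_std`), the ONNX input names (`"text"`, `"pitch"`, `"pace"`), the idea that the export expects pitch in Hz, and the `spec_model` variable name from the tutorial are all guesses that need checking against your own export (`sess.get_inputs()` lists the real names).

```python
# Minimal sketch, NOT a confirmed fix: it assumes the exported graph takes
# per-token pitch in Hz and inputs named "text", "pitch", "pace".
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("fastpitch.onnx", providers=["CPUExecutionProvider"])
print([i.name for i in sess.get_inputs()])  # verify the real input names/shapes first

# spec_model is the NeMo FastPitchModel from the tutorial (name assumed);
# pitch_sol and input_string come from the snippet above.
pitch_mean = float(spec_model.cfg.pitch_mean)  # assumed config attribute, dataset F0 mean in Hz
pitch_std = float(spec_model.cfg.pitch_std)    # assumed config attribute, dataset F0 std in Hz
pitch_hz = pitch_sol.cpu().numpy().astype(np.float32) * pitch_std + pitch_mean  # undo normalization

tokens = spec_model.parse(input_string).cpu().numpy().astype(np.int64)
spec_sol_onnx = sess.run(
    None,
    {
        "text": tokens,                                         # [1, n_tokens], int64
        "pitch": pitch_hz,                                      # [1, n_tokens], float32
        "pace": np.full_like(pitch_hz, 0.9, dtype=np.float32),  # 90% pace, per token
    },
)[0]  # first output is assumed to be the mel spectrogram
```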
Hello,
I managed to export a FastPitch model I trained to ONNX using export.py, according to #4100.
My question is: how do I use the exported FastPitch ONNX model to generate a spectrogram from raw text using onnxruntime? Since I intend to run the model on a Windows environment with no GPU, I chose onnxruntime instead of nemo2riva.
The code I used is:
I am not sure if it is correct. Also, the text input needs to be int64 instead of string, so it looks like the raw text has to be parsed and tokenized first. In NeMo, that is done by the model itself. So how can I do the same with the ONNX model?
Thanks in advance.
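A common pattern for this is to keep the NeMo checkpoint around purely for its text front end (parsing and tokenizing the raw string into int64 IDs) and hand the resulting token array to onnxruntime. The sketch below is a rough outline under several assumptions, not the confirmed solution from this thread: it assumes the model was exported with NeMo's export.py so the graph takes `"text"`, `"pitch"` and `"pace"` inputs (check `sess.get_inputs()` for the real names and shapes), that the first output is the mel spectrogram, and that the file names are placeholders for your own artifacts.

```python
# Rough sketch: tokenize with the NeMo model, run spectrogram generation in onnxruntime.
# Input/output names and shapes are assumptions -- inspect the session to confirm them.
import numpy as np
import onnxruntime as ort
from nemo.collections.tts.models import FastPitchModel

# Load the original checkpoint only for its parser/tokenizer; CPU is fine on Windows.
nemo_model = FastPitchModel.restore_from("fastpitch.nemo", map_location="cpu")
nemo_model.eval()

sess = ort.InferenceSession("fastpitch.onnx", providers=["CPUExecutionProvider"])
print([(i.name, i.shape, i.type) for i in sess.get_inputs()])  # verify before running

text = "Hello world, this is a test."
tokens = nemo_model.parse(text).cpu().numpy().astype(np.int64)  # shape [1, n_tokens]

outputs = sess.run(
    None,
    {
        "text": tokens,
        # Zero pitch and a pace of 1.0 are assumed "neutral" values here;
        # check how your export actually consumes these inputs.
        "pitch": np.zeros(tokens.shape, dtype=np.float32),
        "pace": np.ones(tokens.shape, dtype=np.float32),
    },
)
spectrogram = outputs[0]  # assumed to be the mel spectrogram, e.g. [1, n_mels, n_frames]
print(spectrogram.shape)
```

The spectrogram still has to be passed through a vocoder (for example a HiFi-GAN exported to ONNX the same way) to obtain audio; that step is outside this sketch.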