Clarification on Frame classification models #6975
-
Hello, I have played a bit with the frame VAD model, trained and tested some models, and found the performance very good. Now I am trying to see how I could run inference at frame level, but I am finding that giving the model exactly the expected amount of audio (the length of one frame, which is 0.02 s as in the example, i.e. 160 samples for an 8 kHz file) does not work, whereas a longer segment gives me multiple outputs. A sketch of what I am running is below; this is the error I am getting when giving exactly one frame:
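(Minimal sketch of my setup; the checkpoint path is a placeholder, and I am assuming the usual NeMo `input_signal`/`input_signal_length` forward signature here.)

```python
import torch
from nemo.collections.asr.models import EncDecFrameClassificationModel

# Placeholder path to my trained frame VAD checkpoint
model = EncDecFrameClassificationModel.restore_from("my_frame_vad.nemo")
model.eval()

sample_rate = 8000
frame_len = int(0.02 * sample_rate)   # 160 samples = one 0.02 s frame

audio = torch.randn(1, frame_len)     # exactly one frame of audio
audio_len = torch.tensor([frame_len])

with torch.no_grad():
    # This call fails when the input is a single 160-sample frame
    logits = model(input_signal=audio, input_signal_length=audio_len)
```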
If I give a longer segment (320 samples, so 0.04 s), I am getting 3 frame outputs:

It is not very clear to me why this is happening. I noticed that in the main frame VAD inference code, all the frames of a file are processed at once. My questions are then:
-
@stevehuang52 could you have a look, please?
-
Hi @pehonnet , thanks for your questions~

1. The input to the model shouldn't be shorter than `n_fft/2`, which is 256 in our default config. The reason lies in how the STFT is calculated: before the STFT is actually computed, the input is zero-padded with `n_fft/2` samples on each side, where `n_fft` is the final size of the window that the STFT will be performed on. If the input is shorter than `n_fft/2`, the right padding will be included in the STFT calculation for the first several audio sample points, which is not desired. Since 160 samples (in your case) is shorter than 256, the error message is raised. Please refer to STFT and librosa for details; see also the sketch after this list.
2. No, since the actual context (receptive field) of the conv layers is different in these two situations. Because the effective receptive field of the convolutional model is about 1.x~2 seconds, it's recommended that the input be long enough so that the model uses less zero-padding.
3. Same reason as above: it's recommended to use a longer input over very short ones, since short inputs may lead to worse performance.
4. Due to potential padding, there can be one additional frame output from the model.
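To make the `n_fft/2` constraint and the frame counting concrete, here is a minimal sketch using plain `torch.stft` with its default `center=True` padding (the exact NeMo preprocessor settings may differ, so treat the numbers as illustrative):

```python
import torch

n_fft = 512       # so n_fft / 2 = 256, as in the default config above
hop_length = 160  # one 0.02 s frame at 8 kHz per hop
window = torch.hann_window(n_fft)

# With center=True, the signal is padded by n_fft // 2 = 256 samples on
# each side before the STFT, so the number of output frames is:
#   1 + (num_samples + 2 * (n_fft // 2) - n_fft) // hop_length
def expected_frames(num_samples: int) -> int:
    padded = num_samples + 2 * (n_fft // 2)
    return 1 + (padded - n_fft) // hop_length

# 160 samples < 256: the per-side padding is larger than the signal
# itself, so torch.stft raises an error, analogous to the model's error.
try:
    torch.stft(torch.randn(160), n_fft, hop_length, window=window,
               return_complex=True)
except RuntimeError as e:
    print("160 samples:", e)

# 320 samples >= 256: this works, and the padding adds an extra frame:
# 320 samples is only 2 hops, yet we get 1 + 320 // 160 = 3 frames.
spec = torch.stft(torch.randn(320), n_fft, hop_length, window=window,
                  return_complex=True)
print("320 samples ->", spec.shape[-1], "frames")  # 3
print("expected    ->", expected_frames(320))      # 3
```

This also matches points 1 and 4 above: an input shorter than `n_fft/2` cannot even be padded, and a valid input yields one more frame than `num_samples / hop_length` alone would suggest.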