Clarification on Frame classification models #6975
-
Hello, I have played a bit with the frame VAD model, trained and tested some models, and found the performance very good. Now I am trying to see how I could run inference at frame level, but I am finding that giving the model exactly the expected amount of audio (the length of one frame, which is 0.02 s as in the example, i.e. 160 samples for an 8 kHz file) does not work, whereas a longer segment gives me multiple outputs. A sketch of what I am running is below; this is the error I am getting when giving exactly one frame:
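(Minimal sketch of my setup; the checkpoint path is a placeholder, and I am assuming the usual NeMo `input_signal`/`input_signal_length` forward signature here.)

```python
import torch
from nemo.collections.asr.models import EncDecFrameClassificationModel

# Placeholder path to my trained frame VAD checkpoint
model = EncDecFrameClassificationModel.restore_from("my_frame_vad.nemo")
model.eval()

sample_rate = 8000
frame_len = int(0.02 * sample_rate)   # 160 samples = one 0.02 s frame

audio = torch.randn(1, frame_len)     # exactly one frame of audio
audio_len = torch.tensor([frame_len])

with torch.no_grad():
    # This call fails when the input is a single 160-sample frame
    logits = model(input_signal=audio, input_signal_length=audio_len)
```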
If I give a longer segment (320 samples, so 0.04 s), I am getting 3 frame outputs:

It is not very clear to me why this is happening. I noticed that in the main frame VAD inference code, all the frames of a file are processed at once. My questions are then:
-
@stevehuang52 could you have a look, please?
-
Hi @pehonnet , thanks for your questions~

1. The input to the model shouldn't be shorter than `n_fft/2`, which is 256 in our default config. The reason lies in how the STFT is calculated: before the STFT is actually computed, the input is zero-padded with `n_fft/2` samples on each side, where `n_fft` is the final size of the window that the STFT will be performed on. If the input is shorter than `n_fft/2`, the right padding will be included in the STFT calculation for the first several audio sample points, which is not desired. Since 160 samples (in your case) is shorter than 256, the error message is raised. Please refer to STFT and librosa for details; see also the sketch after this list.
2. No, since the actual context (receptive field) of the conv layers is different in these two situations. Because the effective receptive field of the convolutional model is about 1.x~2 seconds, it's recommended that the input be long enough so that the model uses less zero-padding.
3. Same reason as above: it's recommended to use a longer input over very short ones, since short inputs may lead to worse performance.
4. Due to potential padding, there can be one additional frame output from the model.
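To make the `n_fft/2` constraint and the frame counting concrete, here is a minimal sketch using plain `torch.stft` with its default `center=True` padding (the exact NeMo preprocessor settings may differ, so treat the numbers as illustrative):

```python
import torch

n_fft = 512       # so n_fft / 2 = 256, as in the default config above
hop_length = 160  # one 0.02 s frame at 8 kHz per hop
window = torch.hann_window(n_fft)

# With center=True, the signal is padded by n_fft // 2 = 256 samples on
# each side before the STFT, so the number of output frames is:
#   1 + (num_samples + 2 * (n_fft // 2) - n_fft) // hop_length
def expected_frames(num_samples: int) -> int:
    padded = num_samples + 2 * (n_fft // 2)
    return 1 + (padded - n_fft) // hop_length

# 160 samples < 256: the per-side padding is larger than the signal
# itself, so torch.stft raises an error, analogous to the model's error.
try:
    torch.stft(torch.randn(160), n_fft, hop_length, window=window,
               return_complex=True)
except RuntimeError as e:
    print("160 samples:", e)

# 320 samples >= 256: this works, and the padding adds an extra frame:
# 320 samples is only 2 hops, yet we get 1 + 320 // 160 = 3 frames.
spec = torch.stft(torch.randn(320), n_fft, hop_length, window=window,
                  return_complex=True)
print("320 samples ->", spec.shape[-1], "frames")  # 3
print("expected    ->", expected_frames(320))      # 3
```

This also matches points 1 and 4 above: an input shorter than `n_fft/2` cannot even be padded, and a valid input yields one more frame than `num_samples / hop_length` alone would suggest.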