How to interpret pyannote results #1454

PaulSZH95 · 2023-09-03T09:12:52Z

Hi, I am using pyannote for voice activity detection. All the code I use is the same as in the voice_activity_detection notebook from the tutorials.

My question:
I observe that the Inference class gives a probability for each ~17 ms frame of the audio.

However, inference.py in the pyannote repo sets the default duration of each chunk to 2 seconds, so self.step = 0.1 * duration gives 0.2 seconds.

How can a sliding window of length 2 s with a 0.2 s step produce a prediction for every ~17 ms frame of the audio?
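
To make the two time scales in this question concrete, here is a minimal sketch of the arithmetic. It assumes the model emits one score per output frame for every chunk, so the 0.2 s step only controls how far consecutive chunks are shifted, not the output resolution. The function names and the frame step constant are illustrative, not part of pyannote, and the exact frame count per chunk depends on the model architecture.

```python
import numpy as np

CHUNK_DURATION = 2.0    # seconds, default chunk duration quoted above
STEP_RATIO = 0.1        # step = 0.1 * duration, as quoted above
FRAME_STEP = 0.016875   # seconds per output frame (assumed, about 17 ms)


def chunk_starts(audio_duration, chunk_duration=CHUNK_DURATION, step_ratio=STEP_RATIO):
    """Start times (in seconds) of the sliding chunks that fit in the audio."""
    step = step_ratio * chunk_duration
    # small epsilon guards against floating-point round-off before flooring
    num_chunks = int(np.floor((audio_duration - chunk_duration) / step + 1e-9)) + 1
    return np.arange(num_chunks) * step


def frames_per_chunk(chunk_duration=CHUNK_DURATION, frame_step=FRAME_STEP):
    """Approximate number of frame-level scores produced for one chunk."""
    return int(chunk_duration / frame_step)


starts = chunk_starts(10.0)
print(len(starts), "chunks, shifted by", starts[1] - starts[0], "s")
print(frames_per_chunk(), "frames per chunk (approximate)")
```

Under these assumptions, most frames of the audio are covered by several overlapping chunks, which is why some aggregation step is needed afterwards.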

I have taken a look at the inference.py script and couldn't quite figure out the missing link.

Many thanks for any help.

hbredin · 2023-09-03T09:16:21Z

Maintainer

https://herve.niderb.fr/posts/2022-10-23-One-speaker-segmentation-model-to-rule-them-all.html

PaulSZH95 Sep 3, 2023
Author

Hi Mr Hervé Bredin,

I understand that the output resolution is 0.016875 s per frame; what I was hoping to understand is how this is achieved.

From what I understand, the model is trained with 2-second chunks. After reading your blog, I now understand that the step for training is 16 ms. So my current understanding is that during training, step by step, the first 2 seconds are used to predict activity for the first 16 ms of the chunk, then the next 2-second window for the next 16 ms, and so on. Please correct me if I am wrong.

However, when using the Inference class to get probabilities from a trained model, as recommended in the VAD tutorial, the step was 0.2 s. I tried to read and understand the inference script but couldn't make much progress.

My understanding is that the audio is split into chunks with a sliding window, and then a calculation is applied to the relevant chunks to map a probability to each period of time.

What confuses me is this part of the script:
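
One common way to do this kind of mapping is overlap-add averaging: each chunk contributes its per-frame scores to a shared timeline, and every frame is divided by the number of chunks that covered it. The sketch below is a plain NumPy illustration of that idea under the assumption that each chunk yields one score per frame; `overlap_add` and its parameters are hypothetical names, and this is not a copy of the aggregation code in inference.py.

```python
import numpy as np


def overlap_add(chunk_scores, chunk_starts, frame_step, total_frames):
    """Average per-frame scores of overlapping chunks onto one timeline.

    chunk_scores: list of 1-D arrays, one score per frame of each chunk
    chunk_starts: start time of each chunk, in seconds
    frame_step:   duration of one output frame, in seconds
    total_frames: number of frames in the aggregated output
    """
    summed = np.zeros(total_frames)
    counts = np.zeros(total_frames)
    for scores, start in zip(chunk_scores, chunk_starts):
        first = int(round(start / frame_step))           # first frame covered by this chunk
        last = min(first + len(scores), total_frames)    # clip at the end of the timeline
        summed[first:last] += scores[: last - first]
        counts[first:last] += 1
    # frames not covered by any chunk keep a score of 0
    return summed / np.maximum(counts, 1)


# Two 4-frame chunks overlapping on frames 2 and 3 (frame_step of 1 s for readability)
chunks = [np.full(4, 1.0), np.full(4, 3.0)]
aggregated = overlap_add(chunks, chunk_starts=[0.0, 2.0], frame_step=1.0, total_frames=6)
print(aggregated)
```

In this toy example the two overlapping frames receive the mean of both chunks' scores, while the frames seen by a single chunk keep that chunk's score.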

    # step between consecutive chunks
    step = step or (
        0.1 * self.duration if self.warm_up[0] == 0.0 else self.warm_up[0]
    )

Could I check whether Inference does further chunking within each 2 s chunk, or whether step is overridden, defaulting to model.introspection.step instead?

At a high level, if I were to do visualization using the Inference class, is it safe for me to claim that each 2-second chunk is used to predict the probability of speech per 16 ms frame? I ask because it would be difficult for me to identify the areas of an input that the model focuses on if this assumption turns out to be untrue.

Could you direct me to relevant papers or blog posts if you have covered this before? That would help me narrow down the articles that are relevant.

I have read the paper on SincNet interpretability, the paper on the release of the pyannote.audio library, and the end-to-end speaker-aware segmentation papers. I have also looked through the notebooks on training a model and on the VAD model.

Many thanks, and I appreciate your time.
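
For reference, the quoted expression can be isolated as a small standalone function to see which value it produces. This is only a restatement of the snippet above with hypothetical parameter names in place of the `self` attributes; it says nothing about how the rest of inference.py uses the result. Per the comment in the snippet, the value is the step between consecutive chunks.

```python
def default_step(duration, warm_up=(0.0, 0.0), step=None):
    """Mirror of the quoted expression: step between consecutive chunks, in seconds.

    duration: chunk duration in seconds (stands in for self.duration)
    warm_up:  (left, right) warm-up values (stands in for self.warm_up)
    step:     explicit step; if falsy, the default rule applies
    """
    if step:
        return step
    # no left warm-up: 10% of the chunk duration; otherwise the left warm-up itself
    return 0.1 * duration if warm_up[0] == 0.0 else warm_up[0]


print(default_step(2.0))                       # default rule, no warm-up
print(default_step(2.0, warm_up=(0.5, 0.5)))   # left warm-up is used as the step
print(default_step(2.0, step=1.0))             # explicit step wins
```

With a 2 s duration and no warm-up this gives the 0.2 s figure mentioned in the original question.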
