My VAD performance is terrible! #6700

great-goblin · 2023-05-23T00:17:32Z

great-goblin
May 23, 2023

I'm trying to get a VAD model to work for diarization inference, but the VAD model alone has atrocious performance. I tried following along using the parameters given in this tutorial, but it had very high error rates (95% diarization ER, 80% miss). I've had much better success with using a basic ELAN silence recognizer script, but I'm not experienced enough with the NeMo toolkit to know how to substitute in that 'correct' VAD segmentation. I'm just trying to get the basic System VAD (marblenet) -> segmentation -> embedding extractor -> clustering -> neural diarizer (MSDD) to work.

This links to a graphic I made to show how VAD is currently doing with my .wav fle (red line is ideal segmentation).

The .wav file I'm looking at has been resampled to 16kHz, and features a pilot and air traffic controller speaking to each other. To give a sense of the back-and-forth, this is the .rttm:

SPEAKER WTAN4 1 4.020 3.1800000000000006 <NA> <NA> A <NA>
SPEAKER WTAN4 1 7.700 1.4999999999999991 <NA> <NA> B <NA>
SPEAKER WTAN4 1 11.340 2.2799999999999994 <NA> <NA> B <NA>
SPEAKER WTAN4 1 16.080 6.780000000000001 <NA> <NA> A <NA>
SPEAKER WTAN4 1 25.340 0.8900000000000006 <NA> <NA> B <NA>
SPEAKER WTAN4 1 26.400 4.16 <NA> <NA> A <NA>
SPEAKER WTAN4 1 31.000 4.185000000000002 <NA> <NA> B <NA>
SPEAKER WTAN4 1 35.380 4.559999999999995 <NA> <NA> A <NA>
SPEAKER WTAN4 1 40.480 3.1200000000000045 <NA> <NA> B <NA>

I start by initializing and importing everything per the tutorial linked above. My .rttm's look fine in Annotation format, no issues or any other issues (warnings aside).
I load the diar_infer_telephonic yaml into Omega and modify the following settings. Mostly I followed the tutorial, except I've made more modifications to the diarizer parameters in an attempt to make the VAD look better.

MODEL_CONFIG = os.path.join(data_dir,'diar_infer_telephonic.yaml')
config = OmegaConf.load(MODEL_CONFIG)
config.diarizer.manifest_filepath = 'data/input_manifest.json'
config.diarizer.out_dir = output_dir # Directory to store intermediate files and prediction outputs
pretrained_speaker_model = 'titanet_large'
config.diarizer.speaker_embeddings.model_path = pretrained_speaker_model
config.diarizer.speaker_embeddings.parameters.window_length_in_sec = [1.5,1.25,1.0,0.75,0.5] 
config.diarizer.speaker_embeddings.parameters.shift_length_in_sec = [0.75,0.625,0.5,0.375,0.1] 
config.diarizer.speaker_embeddings.parameters.multiscale_weights= [1,1,1,1,1] 
config.diarizer.oracle_vad = True # ----> ORACLE VAD 
config.diarizer.clustering.parameters.oracle_num_speakers = False
config.diarizer.msdd_model.model_path = 'diar_msdd_telephonic' # Telephonic speaker diarization model 
config.diarizer.msdd_model.parameters.sigmoid_threshold = [0.7, 1.0] # Evaluate with T=0.7 and T=1.0


pretrained_vad = 'vad_multilingual_marblenet'

config.num_workers = 1 # Workaround for multiprocessing hanging with ipython issue 
output_dir = os.path.join(ROOT, 'outputs')
config.diarizer.manifest_filepath = 'data/input_manifest.json'
config.diarizer.out_dir = output_dir #Directory to store intermediate files and prediction outputs
config.diarizer.speaker_embeddings.model_path = pretrained_speaker_model
config.diarizer.oracle_vad = False # compute VAD provided with model_path to vad config
config.diarizer.clustering.parameters.oracle_num_speakers=False

#VAD parameters
config.diarizer.vad.model_path = pretrained_vad
config.diarizer.vad.parameters.window_length_in_sec = 0.5
config.diarizer.vad.parameters.shift_length_in_sec = 0.03
config.diarizer.vad.parameters.smoothing = 'median'
config.diarizer.vad.parameters.overlap = 0.9
config.diarizer.vad.parameters.onset = 0.05
config.diarizer.vad.parameters.offset = 0.7
config.diarizer.vad.parameters.pad_onset = 0.05
config.diarizer.vad.parameters.pad_offset = -0.09
config.diarizer.vad.parameters.min_duration_on = 0.15
config.diarizer.vad.parameters.min_duration_off = 0.05

Then, I run the diarizer and visualize the results.

from nemo.collections.asr.models import ClusteringDiarizer
sd_model = ClusteringDiarizer(cfg=config)
sd_model.diarize()
from nemo.collections.asr.parts.utils.vad_utils import plot

if config.diarizer.vad.parameters.smoothing:
    vad_output_filepath = f'{output_dir}/vad_outputs/overlap_smoothing_output_median_{config.diarizer.vad.parameters.overlap}/{filename}.{config.diarizer.vad.parameters.smoothing}'
else:
    vad_output_filepath = f'{output_dir}/vad_outputs/{filename}.frame'

#verify same lengths:
from helper_debuglengths import compare_lengths
from helper_outputResults import save_vad_to_csv
save_vad_to_csv(OmegaConf.to_yaml(config.diarizer.vad.parameters))
compare_lengths(an4_audio,vad_output_filepath)
plot(
    an4_audio,
    vad_output_filepath, 
    None,
    per_args = config.diarizer.vad.parameters, #threshold
    ) 
# Save the plot as an image
plt.savefig(f'vad_plot{config_name}.png')

I've played a lot with the VAD parameters in step 2 above, but I'm finding it frustrating that a simple ELAN script from two decades ago does a better job at VAD than this sophisticated model. I'm confident it's my own inexperience with this model, so please tell me how I can improve! Ultimately I'd like to actually diarize this model, and if I can figure out how, somehow fine-tune it on my own data, but any suggestions at all would be welcome!

Answered by jeremy110

May 23, 2023

If you want to substitute Nemo's VAD, you can follow these steps:

Using your VAD generate manifest
config.diarizer.vad.model_path = None
config.diarizer.vad.external_vad_manifest = /path/to/your/manifest

The rest of the settings are the same.

View full answer

jeremy110 · 2023-05-23T01:31:22Z

jeremy110
May 23, 2023

If you want to substitute Nemo's VAD, you can follow these steps:

Using your VAD generate manifest
config.diarizer.vad.model_path = None
config.diarizer.vad.external_vad_manifest = /path/to/your/manifest

The rest of the settings are the same.

2 replies

great-goblin May 23, 2023
Author

Thanks Jeremy!

I'm curious how to format my External VAD manifest. Assuming my external VAD produces something like this:

0.0000 9.6900 speech
11.4600 1.4300 speech
14.7000 0.5200 speech
15.9600 6.0500 speech
22.0800 1.4600 speech
24.8400 7.9100 speech
33.3000 11.1200 speech

... how should I load that into the manifest?

meta = {
    'audio_filepath': wtan16_audio, 
    'offset': 0, 
    'duration': duration, 
    'label': 'infer', 
    'text': '-', 
    'num_speakers': None, 
    'rttm_filepath': None, 
    'uem_filepath' : 'nemo_code/data/wtan16_.uem',
    'uniq_id' : 'wtan16',
    'vad_segment_speech??': 'nemo_code/data/vad_output_wtan16.txt'??
}

Digging in the files produced by System VAD, I see this file vad_out.json. Is this the right manifest format?

{"audio_filepath": "/path/to/wav", "offset": 0.0, "duration": 9.69, "label": "UNK", "uniq_id": "wtan16"}
{"audio_filepath": "/path/to/wav", "offset": 11.46, "duration": 1.43, "label": "UNK", "uniq_id": "wtan16"}
{"audio_filepath": "/path/to/wav", "offset": 14.7, "duration": 0.52, "label": "UNK", "uniq_id": "wtan16"}
{"audio_filepath": "/path/to/wav", "offset": 15.96, "duration": 6.05, "label": "UNK", "uniq_id": "wtan16"}
{"audio_filepath": "/path/to/wav", "offset": 22.08, "duration": 1.46, "label": "UNK", "uniq_id": "wtan16"}
{"audio_filepath": "/path/to/wav", "offset": 24.84, "duration": 7.91, "label": "UNK", "uniq_id": "wtan16"}
{"audio_filepath": "/path/to/wav", "offset": 33.3, "duration": 11.12, "label": "UNK", "uniq_id": "wtan16"}

Thanks for your help!

jeremy110 May 24, 2023

Hi @great-goblin
Yes, you need convert your VAD produces to manifest format.
You can refer to the following code.
speech_timestamps is array which store start time and end time, you can apply to your VAD produces.
and config.diarizer.vad.external_vad_manifest = '/workspace/files/vad_outputs/vad_out.json'

with open('/workspace/files/vad_outputs/vad_out.json', 'w') as fp:
    for i in speech_timestamps:
        offset = float(i['start'])
        duration = float(i['end']) - offset
        uniq_id = audio_name
        vad_data={
            'audio_filepath': audio_path,
            'offset': offset,
            'duration': duration,
            'label':'UNK',
            'uniq_id': uniq_id
        }
        json.dump(vad_data, fp)
        fp.write('\n')

and manifest

meta = {
    'audio_filepath': audio_path, 
    'offset': 0, 
    'duration':None, 
    'label': 'infer', 
    'text': '-', 
    'num_speakers': None, 
    'rttm_filepath': None,
    'uem_filepath' : None
}
with open('./input_manifest.json','w') as fp:
    json.dump(meta, fp)
    fp.write('\n')

Hope it helps.

fayejf · 2023-05-23T04:17:37Z

fayejf
May 23, 2023
Collaborator

@great-goblin Thanks for your interest and patience.
If you could share an audio sample, it would be easier for us to investigate the poor prediction of it.

First, though the VAD model has achieved good performance on several benchmarks, no model is perfect especially when it comes to the environment/domain it hasn't seen during training such as a pilot and air traffic controller speaking to each other. To boost performance, you could fine-tune the model on your own data.

I would recommend to try 0.63s for window_length_in_sec which the model was originally trained on. Though as indicated below figure, smaller window_length_in_sec could detect more pauses, the information contained in each segment to VAD model is reduced and could lead to bad performance. That also aligns with your observation that 0.5 is giving much better pred (green line) than 0.15. In telephonic yaml, we set it to 0.15 because we tuned this parameter on some telephonic data and wanted to detect immediate speaker turn, but this may not be suitable for your case. Besides from finetuning the model, if you have a small dev set, you could possibly tune those parameters with this script.

The lower one looks not that bad to me actually, the false alarm of the red boxes could be removed by increasing min_duration_on or change of onset and offset.

This model is trained on segments, so intrinsically the model could not output very tight boundary as annotated in your plot. We are releasing a new VAD model (PRs are under review) which is a frame-level model (output prediction for each frame directly instead of shifting the segments) and trained on more data. And it has better performance than vad_multilingual_marblenet.

Hope it helps!

4 replies

great-goblin May 23, 2023
Author

Thanks Faye! I think I'll just use an external VAD that seems to work generally better for my data in the pipeline.

carlfm01 May 29, 2023

Hello @fayejf ¿How to fine tune this new VAD Model? I see it is based on frames but no mention to the data required or a guide to train it, thanks.

stevehuang52 Jun 6, 2023
Collaborator

Hi @carlfm01, for the new Frame-VAD model, you can prepare a training manifest with the following fields ["audio_filepath", "offset", "duration", "label"] are required. An example of a manifest file is:

{"audio_filepath": "/path/to/audio_file1", "offset": 0, "duration": 10000,  "label": "0 1 0 0 1"}
{"audio_filepath": "/path/to/audio_file2", "offset": 0, "duration": 10000,  "text": "0 0 0 1 1 1 1 0 0"}

For example, using the default model config, if you have a 1s audio file, you'll need to have 50 frame labels in the manifest entry like "0 0 0 0 1 1 0 1 .... 0 1".
However, shorter label strings are also supported for smaller file sizes. For example, you can prepare the label in 40ms frame, and the model will properly repeat the label for each 20ms frame.

The inference example can be found here.

carlfm01 Jun 7, 2023

@stevehuang52 Thanks!

My VAD performance is terrible! #6700

Uh oh!

Uh oh!

great-goblin May 23, 2023

Replies: 2 comments · 6 replies

Uh oh!

jeremy110 May 23, 2023

Uh oh!

Uh oh!

great-goblin May 23, 2023 Author

Uh oh!

Uh oh!

jeremy110 May 24, 2023

Uh oh!

fayejf May 23, 2023 Collaborator

Uh oh!

great-goblin May 23, 2023 Author

Uh oh!

carlfm01 May 29, 2023

Uh oh!

stevehuang52 Jun 6, 2023 Collaborator

Uh oh!

carlfm01 Jun 7, 2023

great-goblin
May 23, 2023

Replies: 2 comments 6 replies

jeremy110
May 23, 2023

great-goblin May 23, 2023
Author

fayejf
May 23, 2023
Collaborator

great-goblin May 23, 2023
Author

stevehuang52 Jun 6, 2023
Collaborator