My VAD performance is terrible! #6700
-
I'm trying to get a VAD model to work for diarization inference, but the VAD model alone has atrocious performance. I tried following the parameters given in this tutorial, but I got very high error rates (95% diarization ER, 80% miss). I've had much better success with a basic ELAN silence-recognizer script, but I'm not experienced enough with the NeMo toolkit to know how to substitute in that 'correct' VAD segmentation. I'm just trying to get the basic pipeline working: system VAD (MarbleNet) -> segmentation -> embedding extractor -> clustering -> neural diarizer (MSDD).

This links to a graphic I made showing how VAD is currently doing on my .wav file (the red line is the ideal segmentation). The .wav file I'm looking at has been resampled to 16 kHz and features a pilot and an air traffic controller speaking to each other. To give a sense of the back-and-forth, this is the .rttm:
```python
import os

import matplotlib.pyplot as plt
from omegaconf import OmegaConf

# Step 1: load the base inference config (oracle-VAD tutorial settings)
MODEL_CONFIG = os.path.join(data_dir, 'diar_infer_telephonic.yaml')
config = OmegaConf.load(MODEL_CONFIG)

output_dir = os.path.join(ROOT, 'outputs')
config.diarizer.manifest_filepath = 'data/input_manifest.json'
config.diarizer.out_dir = output_dir  # directory for intermediate files and prediction outputs

pretrained_speaker_model = 'titanet_large'
config.diarizer.speaker_embeddings.model_path = pretrained_speaker_model
config.diarizer.speaker_embeddings.parameters.window_length_in_sec = [1.5, 1.25, 1.0, 0.75, 0.5]
config.diarizer.speaker_embeddings.parameters.shift_length_in_sec = [0.75, 0.625, 0.5, 0.375, 0.1]
config.diarizer.speaker_embeddings.parameters.multiscale_weights = [1, 1, 1, 1, 1]
config.diarizer.oracle_vad = True  # ----> ORACLE VAD
config.diarizer.clustering.parameters.oracle_num_speakers = False
config.diarizer.msdd_model.model_path = 'diar_msdd_telephonic'  # telephonic speaker diarization model
config.diarizer.msdd_model.parameters.sigmoid_threshold = [0.7, 1.0]  # evaluate with T=0.7 and T=1.0

# Step 2: switch to system VAD (MarbleNet) instead of oracle VAD
pretrained_vad = 'vad_multilingual_marblenet'
config.num_workers = 1  # workaround for multiprocessing hanging under ipython
config.diarizer.manifest_filepath = 'data/input_manifest.json'
config.diarizer.out_dir = output_dir  # directory for intermediate files and prediction outputs
config.diarizer.speaker_embeddings.model_path = pretrained_speaker_model
config.diarizer.oracle_vad = False  # compute VAD with the model given in the vad config
config.diarizer.clustering.parameters.oracle_num_speakers = False

# VAD parameters
config.diarizer.vad.model_path = pretrained_vad
config.diarizer.vad.parameters.window_length_in_sec = 0.5
config.diarizer.vad.parameters.shift_length_in_sec = 0.03
config.diarizer.vad.parameters.smoothing = 'median'
config.diarizer.vad.parameters.overlap = 0.9
config.diarizer.vad.parameters.onset = 0.05
config.diarizer.vad.parameters.offset = 0.7
config.diarizer.vad.parameters.pad_onset = 0.05
config.diarizer.vad.parameters.pad_offset = -0.09
config.diarizer.vad.parameters.min_duration_on = 0.15
config.diarizer.vad.parameters.min_duration_off = 0.05

# Step 3: run diarization
from nemo.collections.asr.models import ClusteringDiarizer

sd_model = ClusteringDiarizer(cfg=config)
sd_model.diarize()

# Step 4: plot the VAD output
from nemo.collections.asr.parts.utils.vad_utils import plot

if config.diarizer.vad.parameters.smoothing:
    vad_output_filepath = f'{output_dir}/vad_outputs/overlap_smoothing_output_median_{config.diarizer.vad.parameters.overlap}/(unknown).{config.diarizer.vad.parameters.smoothing}'
else:
    vad_output_filepath = f'{output_dir}/vad_outputs/(unknown).frame'

# verify the lengths match (the helper_* modules are my own scripts)
from helper_debuglengths import compare_lengths
from helper_outputResults import save_vad_to_csv

save_vad_to_csv(OmegaConf.to_yaml(config.diarizer.vad.parameters))
compare_lengths(an4_audio, vad_output_filepath)

plot(
    an4_audio,
    vad_output_filepath,
    None,
    per_args=config.diarizer.vad.parameters,  # threshold
)

# save the plot as an image
plt.savefig(f'vad_plot{config_name}.png')
```

I've played a lot with the VAD parameters in step 2 above, but I'm finding it frustrating that a simple ELAN script from two decades ago does a better job at VAD than this sophisticated model. I'm confident it's my own inexperience with this model, so please tell me how I can improve! Ultimately I'd like to actually diarize this audio and, if I can figure out how, fine-tune the model on my own data, but any suggestions at all would be welcome!
-
If you want to substitute NeMo's VAD, you can follow these steps:

The rest of the settings are the same.
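A sketch of one common way to do this, under the assumption that the diarizer's oracle-VAD path will use an RTTM file referenced from the manifest via `rttm_filepath` (the file names and segment times below are illustrative, and you should check them against the NeMo docs):

```python
import json

# Convert external VAD segments (e.g. from an ELAN silence recognizer)
# into an RTTM file, then reference it from the NeMo input manifest so
# that, with oracle_vad=True, the diarizer uses this segmentation
# instead of running MarbleNet.

def segments_to_rttm(segments, uri):
    """segments: list of (start_sec, end_sec) speech regions."""
    lines = []
    for start, end in segments:
        # RTTM: type uri channel onset duration <NA> <NA> label <NA> <NA>
        lines.append(
            f"SPEAKER {uri} 1 {start:.3f} {end - start:.3f} <NA> <NA> speech <NA> <NA>"
        )
    return "\n".join(lines)

segments = [(0.50, 2.10), (3.25, 5.00)]          # illustrative speech regions
rttm = segments_to_rttm(segments, "atc_recording")
with open("external_vad.rttm", "w") as f:
    f.write(rttm + "\n")

# One manifest entry per audio file, pointing at the external RTTM.
manifest_entry = {
    "audio_filepath": "atc_recording.wav",
    "offset": 0, "duration": None, "label": "infer", "text": "-",
    "rttm_filepath": "external_vad.rttm", "uem_filepath": None,
}
with open("input_manifest.json", "w") as f:
    f.write(json.dumps(manifest_entry) + "\n")
```

With that manifest in place, setting `config.diarizer.oracle_vad = True` (as in the tutorial's oracle-VAD step) should make the pipeline skip the system VAD entirely.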
-
@great-goblin Thanks for your interest and patience. First, though the VAD model has achieved good performance on several benchmarks, no model is perfect, especially in an environment/domain it hasn't seen during training, such as a pilot and an air traffic controller speaking to each other. To boost performance, you could fine-tune the model on your own data. I would recommend trying 0.63s for …

[screenshot of VAD output]

The lower one actually doesn't look that bad to me; the false alarms in the red boxes could be removed by increasing …

This model is trained on segments, so intrinsically it cannot output boundaries as tight as those annotated in your plot. We are releasing a new VAD model (PRs are under review) that is a frame-level model (it outputs a prediction for each frame directly instead of shifting segments) and is trained on more data, and it has better performance than …

Hope it helps!
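To make the segment-level point concrete: each window receives one score, and a frame inherits the scores of every window that covers it, so even a perfectly sharp speech onset gets smeared over roughly one window length. A toy pure-Python sketch of that effect (an illustration only, not NeMo's code; the crude "any speech in window" scoring is an assumption made for clarity):

```python
# Toy illustration of why segment-level VAD blurs boundaries: a window
# is scored as a whole, and a frame's score is the mean over all windows
# covering it, so silence frames near a speech onset pick up nonzero
# scores. Frame-level models avoid this by scoring each frame directly.

def frame_scores_from_windows(frame_is_speech, win=5, shift=1):
    n = len(frame_is_speech)
    # crude segment model: a window scores 1.0 if it contains any speech
    window_scores = {
        s: (1.0 if any(frame_is_speech[s:s + win]) else 0.0)
        for s in range(0, n - win + 1, shift)
    }
    scores = []
    for i in range(n):
        covering = [v for s, v in window_scores.items() if s <= i < s + win]
        scores.append(sum(covering) / len(covering))
    return scores

truth = [0] * 5 + [1] * 5  # ground truth: onset exactly at frame 5
scores = frame_scores_from_windows(truth)
print([round(x, 2) for x in scores])
# → [0.0, 0.5, 0.67, 0.75, 0.8, 1.0, 1.0, 1.0, 1.0, 1.0]
# frames 1-4 are true silence but already score 0.5-0.8
```

With a 0.5 s window, the same smearing spans half a second of audio, which matches the loose boundaries visible in the plots above.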