
Rethinking Leveraging Pre-Trained Multi-Layer Representations for Speaker Verification (Interspeech 2025)

We share the Python implementation of our paper here.

The figure above depicts approaches for leveraging the multi-layer hidden states of pre-trained speech networks for speaker verification; our method corresponds to (c). The figure below compares layer-wise utilization, with WavLM-Base+ as the frontend, between the widely used SUPERB strategy, shown as (b) above, and our proposal (LAP).
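For reference, the SUPERB-style utilization in (b) amounts to a learnable weighted sum over the frontend's hidden states. The sketch below is a generic PyTorch illustration under assumed tensor shapes and module names, not the code of this repository; LAP, approach (c), replaces this aggregation with the formulation described in the paper.

# Minimal sketch of the SUPERB-style weighted sum (approach (b)); shapes and names are assumptions.
import torch
import torch.nn as nn

class WeightedLayerSum(nn.Module):
    def __init__(self, num_layers: int):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_layers))  # one learnable scalar per hidden layer

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (num_layers, batch, time, dim), e.g. the stacked WavLM-Base+ outputs
        w = torch.softmax(self.weights, dim=0)
        return torch.einsum("l,lbtd->btd", w, hidden_states)  # aggregated (batch, time, dim)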

Environment support & Python requirements

The implementation targets Ubuntu with Python and PyTorch.

Use requirements.txt to install the Python dependencies.
The Ubuntu python3-soundfile package and the conda-forge ffmpeg package are also required for downloading and preprocessing the datasets; install them as follows:

$ pip install -r requirements.txt
$ apt-get install python3-soundfile
$ conda install -c conda-forge ffmpeg

Dataset Preparation

Follow the dataprep.sh files in the VoxCeleb, MUSAN, and RIRs directories under /data to download and preprocess the datasets.

  • VoxCeleb 1 & 2 [1,2]
  • MUSAN [3]
  • Room Impulse Response and Noise Database (RIRs) [4]

[1]   A. Nagrani, et al., “VoxCeleb: A large scale speaker identification dataset,” in Proc. Interspeech, 2017.
[2]   J. S. Chung, et al., “VoxCeleb2: Deep speaker recognition,” in Proc. Interspeech, 2018.
[3]   D. Snyder, et al., “MUSAN: A Music, Speech, and Noise Corpus,” arXiv, 2015.
[4]   T. Ko, et al., “A study on data augmentation of reverberant speech for robust speech recognition,” in Proc. ICASSP, 2017.

Run Experiments

Log files, model weights, and configurations will be saved under /res directory.

  • The output folder is named in the local-YYYYMMDD-HHmmss format by default (a naming sketch follows this list).
  • To use neptune.ai logging, set your neptune configuration in /src/config/neptune.yaml and add --neptune on the command line.
    The experiment ID created in your neptune.ai project is then used as the name of the output directory.
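As a hypothetical illustration (not taken from the repository), the default directory name can be reproduced like this:

# Sketch of the default local-YYYYMMDD-HHmmss output directory name (illustration only).
from datetime import datetime

exp_id = "local-" + datetime.now().strftime("%Y%m%d-%H%M%S")
print(exp_id)  # e.g. local-20250705-134205

The resulting directory name is what you later pass back through --evaluation_id to load saved model weights.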

This framework supports a six-phase training/evaluation process.
If you start from any of phases 2-6, you must pass the --evaluation_id argument to load model weights.

  1. Pre-training the speaker network (backend) from scratch
    This cold-start stage is hooked by the --train_frozen argument given at the command line.
    The optimizer updates only the backend while the frontend remains frozen.
# Example
~/src$ CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py --train_frozen
  2. Joint fine-tuning of the frontend and backend networks
    The second stage is hooked by the --train_finetune argument given at the command line.
# Example
~/src$ CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py --train_finetune --evaluation_id 'EXP_ID'
  3. Large-margin fine-tuning
    The third stage is hooked by the --train_lmft argument.
# Example
~/src$ CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py --train_lmft --evaluation_id 'EXP_ID'
  4. Naive evaluation (--naive_evaluation)
    Cosine-similarity scoring with the training-set speaker-embedding mean subtracted.

  5. Adaptive score normalization (--score_normalize)
    Produces a normalized score for each verification trial given a set of cohort speakers.

  6. Quality-aware score calibration (--score_calibrate)
    We implement a linear QMF model that considers speech durations, embedding norms, and embedding variances.
    A conceptual sketch of these three evaluation phases follows the example below.

# Example of evaluation phases applied in one go.
~/src$ CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py --naive_evaluation --score_normalize --score_calibrate --evaluation_id 'EXPID'
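Conceptually, phases 4-6 can be summarized as in the NumPy sketch below. It relies on assumed conventions (mean subtraction for the naive score, top-k adaptive s-norm, a linear QMF with learned weights) and hypothetical function names; it is not the repository's implementation.

# Conceptual sketch of the three evaluation phases; names and conventions are assumptions.
import numpy as np

def cosine_score(e1, e2, train_mean):
    # Phase 4: cosine similarity after subtracting the training-set mean speaker embedding.
    e1, e2 = e1 - train_mean, e2 - train_mean
    return float(e1 @ e2 / (np.linalg.norm(e1) * np.linalg.norm(e2) + 1e-8))

def adaptive_snorm(raw_score, enroll_cohort_scores, test_cohort_scores, top_k=400):
    # Phase 5: normalize the raw trial score with statistics of the top-k cohort scores of each side.
    e = np.sort(enroll_cohort_scores)[-top_k:]
    t = np.sort(test_cohort_scores)[-top_k:]
    return 0.5 * ((raw_score - e.mean()) / (e.std() + 1e-8)
                  + (raw_score - t.mean()) / (t.std() + 1e-8))

def qmf_calibrate(score, quality_features, weights, bias):
    # Phase 6: linear QMF over quality measures such as durations, embedding norms, and variances.
    return weights[0] * score + float(np.dot(weights[1:], quality_features)) + bias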

General options

~/src$ python main.py -h

argument hooks:
  --train_frozen
  --train_finetune
  --train_lmft
  --naive_evaluation
  --score_normalize
  --score_calibrate

optional arguments:
  -h, --help                      show this help message and exit
  --quick_check                   quick check of the running experiment after modification; set to True if given
  --neptune                       log the experiment with the neptune logger; set to True if given
  --description   DESCRIPTION     user parameter for labeling a specific version; defaults to "Untitled"
  --evaluation_id EVALUATION_ID   name of the previous output directory from which to load model weights

keyword arguments:
  --kwargs KWARGS            dynamically overrides any of the hyperparameters defined in /src/config/*.yaml
  (e.g.) --kwargs "--batch_size 1024 --layer_aggregation lap --speaker_network astp --frontend_cfg microsoft/wavlm-base-plus"
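The override string is a plain sequence of "--key value" pairs. As a hypothetical illustration of how such a string could be merged into the YAML defaults (the actual parser in main.py may differ, and the config path is an assumption):

# Hypothetical sketch of merging a "--key value" override string into a YAML config.
import shlex
import yaml

def apply_overrides(config_path, kwargs_string):
    with open(config_path) as f:
        config = yaml.safe_load(f)                       # load the defaults from /src/config/*.yaml
    tokens = shlex.split(kwargs_string)                  # e.g. "--batch_size 128 --n_head 12"
    for key, value in zip(tokens[0::2], tokens[1::2]):
        config[key.lstrip("-")] = yaml.safe_load(value)  # YAML parsing yields int/float/str typing
    return config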

Comprehensive usage examples

# This creates six neptune experiments ("EXPID-00", "EXPID-01", "EXPID-02", ...), one per phase argument passed.
~/src$ CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py \
        --train_frozen --train_finetune --train_lmft --naive_evaluation --score_normalize --score_calibrate \
        --description "one in a row" --kwargs "--ncpu 16 --n_head 12 --frontend_cfg microsoft/wavlm-base-plus" --neptune;

~/src$ CUDA_VISIBLE_DEVICES=0,1 python main.py --train_frozen --train_finetune \
        --description "if you start from --train_frozen phase, no need to pass --evaluation_id" --kwargs "--batch_size 128";

~/src$ CUDA_VISIBLE_DEVICES=2,3 python main.py --score_normalize --score_calibrate --evaluation_id "EXPID-00" \
        --description "evaluation example" --kwargs "--cohort_size 400" --neptune;

Citation

J. S. Kim, et al., “Rethinking Leveraging Pre-Trained Multi-Layer Representations for Speaker Verification,” in Proc. Interspeech, 2025.

To appear in Interspeech 2025

License

This repository is released under the MIT license.

