Rethinking Leveraging Pre-Trained Multi-Layer Representations for Speaker Verification (Interspeech 2025)
We share the Python implementation of our paper here.
The figure above depicts approaches for leveraging multi-layer hidden states of pre-trained speech networks for speaker verification; our methodology corresponds to (c). The figure below compares layer-wise utilization, with WavLM-Base+ as the frontend, between the well-known SUPERB strategy, shown as (b) above, and our proposal (LAP).
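For reference, the SUPERB-style strategy (b) aggregates all hidden layers through a learnable softmax-weighted sum. A minimal sketch, assuming stacked hidden states and illustrative names (not this repository's code):

```python
# Rough sketch of the SUPERB-style weighted sum, approach (b) above.
# Shapes and names are illustrative assumptions, not this repository's code.
import torch
import torch.nn as nn

class WeightedLayerSum(nn.Module):
    """Learnable softmax-weighted sum over all hidden layers of a frontend."""

    def __init__(self, num_layers: int):
        super().__init__()
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (num_layers, batch, time, dim),
        # e.g., torch.stack(frontend_outputs.hidden_states)
        w = torch.softmax(self.layer_weights, dim=0)
        return (w.view(-1, 1, 1, 1) * hidden_states).sum(dim=0)
```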
- We recommend visiting Previous Versions (v1.12.0) for the PyTorch installation, including torchaudio==0.12.0.
Use the requirements.txt file to install the rest of the Python dependencies.
The Ubuntu soundfile and conda ffmpeg packages are required for downloading and preprocessing the datasets; you can install everything as follows:
$ pip install -r requirements.txt
$ apt-get install python3-soundfile
$ conda install -c conda-forge ffmpeg
Follow the dataprep.sh files under /data/VoxCeleb, /data/MUSAN, and /data/RIRs to download and preprocess the datasets.
- VoxCeleb 1 & 2 [1,2]
- MUSAN [3]
- Room Impulse Response and Noise Database (RIRs) [4]
[1] A. Nagrani, et al., “VoxCeleb: A large scale speaker identification dataset,” in Proc. Interspeech, 2017.
[2] J. S. Chung, et al., “VoxCeleb2: Deep speaker recognition,” in Proc. Interspeech, 2018.
[3] D. Snyder, et al., “MUSAN: A Music, Speech, and Noise Corpus,” arXiv, 2015.
[4] T. Ko, et al., “A study on data augmentation of reverberant speech for robust speech recognition,” in Proc. ICASSP, 2017.
Log files, model weights, and configurations will be saved under the /res directory.
- The output folder will be created in the `local-YYYYMMDD-HHmmss` format by default.
- To use neptune.ai logging, set your Neptune configuration at /src/config/neptune.yaml and add `--neptune` to the command line. The experiment ID created at your neptune.ai project will be the name of the output directory.
This framework supports a six-phase model training/evaluation process. If you start from any of phases 2-6, you must pass the `--evaluation_id` argument to load model weights.
- Pre-training the speaker network (backend) from scratch
This cold-start stage is hooked by the `--train_frozen` argument given at the command line. The optimizer only updates the backend while the frontend remains frozen; see the sketch after the example below.
# Example
~/src$ CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py --train_frozen
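Conceptually, this phase freezes the frontend parameters and optimizes only the backend. A minimal sketch with hypothetical `frontend`/`backend` modules (not the repository's training code):

```python
# Minimal sketch of the cold-start phase, assuming hypothetical `frontend`
# and `backend` modules; not the repository's actual training code.
import torch

def build_frozen_phase_optimizer(frontend, backend, lr=1e-3):
    for p in frontend.parameters():
        p.requires_grad = False  # keep the pre-trained frontend frozen
    frontend.eval()              # also freeze dropout/normalization statistics
    return torch.optim.Adam(backend.parameters(), lr=lr)
```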
- Joint fine-tuning of frontend and backend networks
The second stage is hooked by the `--train_finetune` argument given at the command line; see the sketch after the example below.
# Example
~/src$ CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py --train_finetune --evaluation_id 'EXP_ID'
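Here both networks are updated. A minimal sketch under the same hypothetical modules, where the split learning rates are an illustrative assumption:

```python
# Minimal sketch of joint fine-tuning, assuming the same hypothetical
# `frontend`/`backend` modules; the split learning rates are illustrative.
import torch

def build_finetune_optimizer(frontend, backend, lr_frontend=1e-5, lr_backend=1e-4):
    for p in frontend.parameters():
        p.requires_grad = True   # unfreeze the pre-trained frontend
    return torch.optim.Adam([
        {"params": frontend.parameters(), "lr": lr_frontend},
        {"params": backend.parameters(), "lr": lr_backend},
    ])
```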
- Large-margin fine-tuning
The third stage is hooked by the `--train_lmft` argument given at the command line; a sketch of the angular-margin idea follows the example below.
# Example
~/src$ CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py --train_lmft --evaluation_id 'EXP_ID'
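For intuition, large-margin fine-tuning enlarges the angular margin of the classification loss. Below is a sketch of a plain AAM-softmax logit adjustment (the repository's actual loss is `AAMsoftmax_IntertopK_Subcenter`; the margin and scale values here are illustrative only):

```python
# Sketch of the plain AAM-softmax logit adjustment behind large-margin
# fine-tuning; the repository's loss is AAMsoftmax_IntertopK_Subcenter,
# and the margin/scale values here are illustrative only.
import torch
import torch.nn.functional as F

def aam_logits(embeddings, class_weights, labels, margin=0.4, scale=30.0):
    # cosine similarities between L2-normalized embeddings and class weights
    cos = F.linear(F.normalize(embeddings), F.normalize(class_weights))
    theta = torch.acos(cos.clamp(-1.0 + 1e-7, 1.0 - 1e-7))
    target = F.one_hot(labels, num_classes=class_weights.size(0)).bool()
    # add the angular margin only to each sample's target class
    logits = torch.where(target, torch.cos(theta + margin), cos)
    return scale * logits  # feed to cross-entropy
```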
- Naive evaluation (`--naive_evaluation`)
Supports cosine-similarity measurement with subtraction of the training speaker-embedding mean vector.
- Adaptive score normalization (`--score_normalize`)
Produces the normalized score of a verification trial given cohort speakers; see the sketch below.
- Quality-aware score calibration (`--score_calibrate`)
We implement a linear QMF model that considers speech durations, embedding norms, and the variance of embeddings; a sketch follows the example below.
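Adaptive s-norm can be sketched roughly as follows; the cohort handling and top-k selection are assumptions, not this repository's implementation:

```python
# Rough sketch of adaptive s-norm; cohort-score shapes and top-k selection
# are assumptions, not the repository's implementation.
import torch

def adaptive_snorm(score, enroll_cohort, test_cohort, top_k=400):
    # enroll_cohort/test_cohort: cosine scores of each trial side
    # against the cohort speakers, shape (num_cohort,)
    e = enroll_cohort.topk(top_k).values   # cohort speakers closest to enroll
    t = test_cohort.topk(top_k).values     # cohort speakers closest to test
    z = (score - e.mean()) / e.std()
    s = (score - t.mean()) / t.std()
    return 0.5 * (z + s)
```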
# Example of evaluation phases applied in one go.
~/src$ CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py --naive_evaluation --score_normalize --score_calibrate --evaluation_id 'EXPID'
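Likewise, a linear QMF calibrator over quality features might look like the sketch below (the feature set and the logistic-regression fit are illustrative assumptions):

```python
# Sketch of a linear QMF calibrator; the feature set and logistic-regression
# fit are illustrative assumptions, not the repository code.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_qmf(raw_scores, quality_feats, labels):
    # quality_feats: (num_trials, num_quality), e.g., log speech durations,
    # embedding norms, and embedding variances for both trial sides
    X = np.column_stack([raw_scores, quality_feats])
    return LogisticRegression().fit(X, labels)   # labels: 1 target, 0 non-target

def calibrate(qmf, raw_scores, quality_feats):
    X = np.column_stack([raw_scores, quality_feats])
    return qmf.decision_function(X)              # calibrated log-odds scores
```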
General options
~/src$ python main.py -h
argument hooks:
--train_frozen
--train_finetune
--train_lmft
--naive_evaluation
--score_normalize
--score_calibrate
optional arguments:
-h, --help show this help message and exit
--quick_check quick check of the running experiment after a modification; set as True if given
--neptune log the experiment with the Neptune logger; set as True if given
--description DESCRIPTION user parameter for specifying a certain version; defaults to "Untitled"
--evaluation_id EVALUATION_ID name of a previous output directory from which to load model weights
keyword arguments:
--kwargs KWARGS dynamically modifies any of the hyperparameters defined at /src/config/*.yaml
(e.g.) "--batch_size 1024 --layer_aggregation lap --speaker_network astp --frontend_cfg microsoft/wavlm-base-plus"
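To illustrate, such override strings could be merged into a YAML config roughly as follows; this is a sketch under assumed behavior, not the framework's actual parser:

```python
# Sketch of how "--key value" override strings could be merged into the
# YAML configs; assumed behavior, not the framework's actual parser.
import yaml

def apply_overrides(cfg_path, kwargs_str):
    with open(cfg_path) as f:
        cfg = yaml.safe_load(f)
    tokens = kwargs_str.split()
    for key, val in zip(tokens[::2], tokens[1::2]):
        # reuse the YAML parser so "1024" -> int and "lap" -> str
        cfg[key.lstrip("-")] = yaml.safe_load(val)
    return cfg

# e.g., apply_overrides("/src/config/your_config.yaml",  # hypothetical path
#                       "--batch_size 1024 --layer_aggregation lap")
```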
Comprehensive usage examples
# This will create six Neptune experiments ("EXPID-00", "EXPID-01", "EXPID-02", ...), one per phase argument passed.
~/src$ CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py \
--train_frozen --train_finetune --train_lmft --naive_evaluation --score_normalize --score_calibrate \
--description "one in a row" --kwargs "--ncpu 16 --n_head 12 --frontend_cfg microsoft/wavlm-base-plus" --neptune;
~/src$ CUDA_VISIBLE_DEVICES=0,1 python main.py --train_frozen --train_finetune \
--description "if you start from --train_frozen phase, no need to pass --evaluation_id" --kwargs "--batch_size 128";
~/src$ CUDA_VISIBLE_DEVICES=2,3 python main.py --score_normalize --score_calibrate --evaluation_id "EXPID-00" \
--description "evaluation example" --kwargs "--cohort_size 400" --neptune;
J. S. Kim, et al., “Rethinking Leveraging Pre-Trained Multi-Layer Representations for Speaker Verification,” in Proc. Interspeech, 2025.
To appear in Interspeech 2025
This repository is released under the MIT license.
Thanks to:
- https://github.com/clovaai/voxceleb_trainer: referred to for the data preparation code and adopted for the implementation of the evaluation metrics (/src/utils/metrics.py).
- https://github.com/wenet-e2e/wespeaker: adopted for the training loss `class AAMsoftmax_IntertopK_Subcenter` (/src/loss.py), with slight modifications.
- https://github.com/katsura-jp/pytorch-cosine-annealing-with-warmup: adopted for the learning-rate scheduler `class CosineAnnealingWarmupRestarts` (/src/utils/scheduler.py).
- https://github.com/espnet: for the implementation of the speaker augmentation (/src/utils/dataset.py).
- https://github.com/SeungjunNah/DeepDeblur-PyTorch: for the customized distributed sampler used in the evaluation process, `class DistributedEvalSampler` (/src/utils/sampler.py).
- https://github.com/lawlict/ECAPA-TDNN: for the implementation of the speaker model `class ECAPA_TDNN` (/src/modules/speaker_networks/ecapa_tdnn.py).
- https://github.com/JunyiPeng00/SLT22_MultiHead-Factorized-Attentive-Pooling: for the speaker model `class MHFA` (/src/modules/speaker_networks/mhfa.py), with some modifications to fit this framework.