Rethinking Leveraging Pre-Trained Multi-Layer Representations for Speaker Verification (Interspeech 2025)
We share the Python implementation of our paper here.
The figure above depicts approaches for leveraging multi-layer hidden states of pre-trained speech networks for speaker verification; our methodology corresponds to (c). The figure below compares layer-wise utilization, with WavLM-Base+ as the frontend, between the well-known SUPERB strategy, shown as (b) above, and our proposal (LAP).
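For reference, the SUPERB-style strategy (b) aggregates all hidden layers through a learnable softmax-weighted sum. A minimal sketch, assuming stacked hidden states and illustrative names (not this repository's code):

```python
# Rough sketch of the SUPERB-style weighted sum, approach (b) above.
# Shapes and names are illustrative assumptions, not this repository's code.
import torch
import torch.nn as nn

class WeightedLayerSum(nn.Module):
    """Learnable softmax-weighted sum over all hidden layers of a frontend."""

    def __init__(self, num_layers: int):
        super().__init__()
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (num_layers, batch, time, dim),
        # e.g., torch.stack(frontend_outputs.hidden_states)
        w = torch.softmax(self.layer_weights, dim=0)
        return (w.view(-1, 1, 1, 1) * hidden_states).sum(dim=0)
```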
- We recommend visiting Previous Versions (v1.12.0) for the PyTorch installation, including torchaudio==0.12.0.
Use the requirements.txt file to install the rest of the Python dependencies.
The Ubuntu soundfile and conda ffmpeg packages are required for downloading and preprocessing the datasets; you can install everything as follows:
$ pip install -r requirements.txt
$ apt-get install python3-soundfile
$ conda install -c conda-forge ffmpeg
Follow the dataprep.sh files under /data/VoxCeleb, /data/MUSAN, and /data/RIRs to download and preprocess the datasets.
- VoxCeleb 1 & 2 [1,2]
- MUSAN [3]
- Room Impulse Response and Noise Database (RIRs) [4]
[1] A. Nagrani, et al., “VoxCeleb: A large scale speaker identification dataset,” in Proc. Interspeech, 2017.
[2] J. S. Chung, et al., “VoxCeleb2: Deep speaker recognition,” in Proc. Interspeech, 2018.
[3] D. Snyder, et al., “MUSAN: A Music, Speech, and Noise Corpus,” arXiv, 2015.
[4] T. Ko, et al., “A study on data augmentation of reverberant speech for robust speech recognition,” in Proc. ICASSP, 2017.
Log files, model weights, and configurations will be saved under the /res directory.
- The output folder will be created in the `local-YYYYMMDD-HHmmss` format by default.
- To use neptune.ai logging, set your Neptune configuration at /src/config/neptune.yaml and add `--neptune` to the command line. The experiment ID created at your neptune.ai project will be the name of the output directory.
This framework supports a six-phase model training/evaluation process. If you start from any of phases 2-6, you must pass the `--evaluation_id` argument to load model weights.
- Pre-training the speaker network (backend) from scratch
This cold-start stage is hooked by the `--train_frozen` argument given at the command line. The optimizer only updates the backend while the frontend remains frozen; see the sketch after the example below.
# Example
~/src$ CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py --train_frozen
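Conceptually, this phase freezes the frontend parameters and optimizes only the backend. A minimal sketch with hypothetical `frontend`/`backend` modules (not the repository's training code):

```python
# Minimal sketch of the cold-start phase, assuming hypothetical `frontend`
# and `backend` modules; not the repository's actual training code.
import torch

def build_frozen_phase_optimizer(frontend, backend, lr=1e-3):
    for p in frontend.parameters():
        p.requires_grad = False  # keep the pre-trained frontend frozen
    frontend.eval()              # also freeze dropout/normalization statistics
    return torch.optim.Adam(backend.parameters(), lr=lr)
```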
- Joint fine-tuning of frontend and backend networks
The second stage is hooked by the `--train_finetune` argument given at the command line; see the sketch after the example below.
# Example
~/src$ CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py --train_finetune --evaluation_id 'EXP_ID'
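Here both networks are updated. A minimal sketch under the same hypothetical modules, where the split learning rates are an illustrative assumption:

```python
# Minimal sketch of joint fine-tuning, assuming the same hypothetical
# `frontend`/`backend` modules; the split learning rates are illustrative.
import torch

def build_finetune_optimizer(frontend, backend, lr_frontend=1e-5, lr_backend=1e-4):
    for p in frontend.parameters():
        p.requires_grad = True   # unfreeze the pre-trained frontend
    return torch.optim.Adam([
        {"params": frontend.parameters(), "lr": lr_frontend},
        {"params": backend.parameters(), "lr": lr_backend},
    ])
```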
- Large-margin fine-tuning
The third stage is hooked by the `--train_lmft` argument given at the command line; a sketch of the angular-margin idea follows the example below.
# Example
~/src$ CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py --train_lmft --evaluation_id 'EXP_ID'
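For intuition, large-margin fine-tuning enlarges the angular margin of the classification loss. Below is a sketch of a plain AAM-softmax logit adjustment (the repository's actual loss is `AAMsoftmax_IntertopK_Subcenter`; the margin and scale values here are illustrative only):

```python
# Sketch of the plain AAM-softmax logit adjustment behind large-margin
# fine-tuning; the repository's loss is AAMsoftmax_IntertopK_Subcenter,
# and the margin/scale values here are illustrative only.
import torch
import torch.nn.functional as F

def aam_logits(embeddings, class_weights, labels, margin=0.4, scale=30.0):
    # cosine similarities between L2-normalized embeddings and class weights
    cos = F.linear(F.normalize(embeddings), F.normalize(class_weights))
    theta = torch.acos(cos.clamp(-1.0 + 1e-7, 1.0 - 1e-7))
    target = F.one_hot(labels, num_classes=class_weights.size(0)).bool()
    # add the angular margin only to each sample's target class
    logits = torch.where(target, torch.cos(theta + margin), cos)
    return scale * logits  # feed to cross-entropy
```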
- Naive evaluation (`--naive_evaluation`)
Supports cosine-similarity measurement with subtraction of the training speaker-embedding mean vector.
- Adaptive score normalization (`--score_normalize`)
Produces the normalized score of a verification trial given cohort speakers; see the sketch below.
- Quality-aware score calibration (`--score_calibrate`)
We implement a linear QMF model that considers speech durations, embedding norms, and the variance of embeddings; a sketch follows the example below.
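Adaptive s-norm can be sketched roughly as follows; the cohort handling and top-k selection are assumptions, not this repository's implementation:

```python
# Rough sketch of adaptive s-norm; cohort-score shapes and top-k selection
# are assumptions, not the repository's implementation.
import torch

def adaptive_snorm(score, enroll_cohort, test_cohort, top_k=400):
    # enroll_cohort/test_cohort: cosine scores of each trial side
    # against the cohort speakers, shape (num_cohort,)
    e = enroll_cohort.topk(top_k).values   # cohort speakers closest to enroll
    t = test_cohort.topk(top_k).values     # cohort speakers closest to test
    z = (score - e.mean()) / e.std()
    s = (score - t.mean()) / t.std()
    return 0.5 * (z + s)
```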
# Example of evaluation phases applied in one go.
~/src$ CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py --naive_evaluation --score_normalize --score_calibrate --evaluation_id 'EXPID'
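Likewise, a linear QMF calibrator over quality features might look like the sketch below (the feature set and the logistic-regression fit are illustrative assumptions):

```python
# Sketch of a linear QMF calibrator; the feature set and logistic-regression
# fit are illustrative assumptions, not the repository code.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_qmf(raw_scores, quality_feats, labels):
    # quality_feats: (num_trials, num_quality), e.g., log speech durations,
    # embedding norms, and embedding variances for both trial sides
    X = np.column_stack([raw_scores, quality_feats])
    return LogisticRegression().fit(X, labels)   # labels: 1 target, 0 non-target

def calibrate(qmf, raw_scores, quality_feats):
    X = np.column_stack([raw_scores, quality_feats])
    return qmf.decision_function(X)              # calibrated log-odds scores
```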
General options
~/src$ python main.py -h
argument hooks:
--train_frozen
--train_finetune
--train_lmft
--naive_evaluation
--score_normalize
--score_calibrate
optional arguments:
-h, --help show this help message and exit
--quick_check quick check of the running experiment after a modification; set as True if given
--neptune log the experiment with the Neptune logger; set as True if given
--description DESCRIPTION user parameter for specifying a certain version; defaults to "Untitled"
--evaluation_id EVALUATION_ID name of a previous output directory from which to load model weights
keyword arguments:
--kwargs KWARGS dynamically modifies any of the hyperparameters defined at /src/config/*.yaml
(e.g.) "--batch_size 1024 --layer_aggregation lap --speaker_network astp --frontend_cfg microsoft/wavlm-base-plus"
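To illustrate, such override strings could be merged into a YAML config roughly as follows; this is a sketch under assumed behavior, not the framework's actual parser:

```python
# Sketch of how "--key value" override strings could be merged into the
# YAML configs; assumed behavior, not the framework's actual parser.
import yaml

def apply_overrides(cfg_path, kwargs_str):
    with open(cfg_path) as f:
        cfg = yaml.safe_load(f)
    tokens = kwargs_str.split()
    for key, val in zip(tokens[::2], tokens[1::2]):
        # reuse the YAML parser so "1024" -> int and "lap" -> str
        cfg[key.lstrip("-")] = yaml.safe_load(val)
    return cfg

# e.g., apply_overrides("/src/config/your_config.yaml",  # hypothetical path
#                       "--batch_size 1024 --layer_aggregation lap")
```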
Comprehensive usage examples
# This will create six Neptune experiments ("EXPID-00", "EXPID-01", "EXPID-02", ...), one per phase argument passed.
~/src$ CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py \
--train_frozen --train_finetune --train_lmft --naive_evaluation --score_normalize --score_calibrate \
--description "one in a row" --kwargs "--ncpu 16 --n_head 12 --frontend_cfg microsoft/wavlm-base-plus" --neptune;
~/src$ CUDA_VISIBLE_DEVICES=0,1 python main.py --train_frozen --train_finetune \
--description "if you start from --train_frozen phase, no need to pass --evaluation_id" --kwargs "--batch_size 128";
~/src$ CUDA_VISIBLE_DEVICES=2,3 python main.py --score_normalize --score_calibrate --evaluation_id "EXPID-00" \
--description "evaluation example" --kwargs "--cohort_size 400" --neptune;
J. S. Kim, et al., “Rethinking Leveraging Pre-Trained Multi-Layer Representations for Speaker Verification,” in Proc. Interspeech, 2025.
To appear in Interspeech 2025
This repository is released under the MIT license.
Thanks to:
- https://github.com/clovaai/voxceleb_trainer: referred to for the data preparation code and adopted for the implementation of the evaluation metrics (/src/utils/metrics.py).
- https://github.com/wenet-e2e/wespeaker: adopted for the training loss `class AAMsoftmax_IntertopK_Subcenter` (/src/loss.py), with slight modifications.
- https://github.com/katsura-jp/pytorch-cosine-annealing-with-warmup: adopted for the learning-rate scheduler `class CosineAnnealingWarmupRestarts` (/src/utils/scheduler.py).
- https://github.com/espnet: for the implementation of the speaker augmentation (/src/utils/dataset.py).
- https://github.com/SeungjunNah/DeepDeblur-PyTorch: for the customized distributed sampler used in the evaluation process, `class DistributedEvalSampler` (/src/utils/sampler.py).
- https://github.com/lawlict/ECAPA-TDNN: for the implementation of the speaker model `class ECAPA_TDNN` (/src/modules/speaker_networks/ecapa_tdnn.py).
- https://github.com/JunyiPeng00/SLT22_MultiHead-Factorized-Attentive-Pooling: for the speaker model `class MHFA` (/src/modules/speaker_networks/mhfa.py), with some modifications to fit this framework.