Spoofed Speech Attribution

This repository extends the 'AASIST: Audio Anti-Spoofing using Integrated Spectro-Temporal Graph Attention Networks' [1] model to predict attributes that characterize spoofed speech. The approach introduces a bank of probabilistic detectors trained to identify specific features associated with selected spoofing techniques, yielding a comprehensive attribute-based representation of each audio sample. This representation is then analyzed with a decision tree model to enable accurate spoofed speech detection and detailed explanations of the model's decisions. The experiments use the logical access (LA) scenario of the ASVspoof 2019 dataset [2].

Figure: Complete implementation workflow of the proposed architecture for explainable spoofed speech detection. Phase I demonstrates the extraction of embeddings using the AASIST model and the subsequent processing of these embeddings through a bank of seven probabilistic feature detectors. Phase II illustrates the concatenation of the outputs from these detectors to create a 25-dimensional vector, which is then fed into a decision tree model for classification. This decision tree model is used for both bonafide/spoofed classification and spoofing attack algorithm characterization.

Getting Started

  1. Create a virtual environment using conda (recommended for visualizing decision trees with the Graphviz application).

    • Download miniconda (https://docs.conda.io/en/latest/miniconda.html).

    • Install miniconda by running the downloaded script.

    • Create a new environment (python=3.10 recommended):

      conda create -n spoof_env python=3.10
      
    • To install a package:

      conda install -n spoof_env <package_name>
      
    • To install the Graphviz executables:

      conda install -n spoof_env graphviz
      
    • Activate the conda environment:

      conda activate spoof_env
      
  2. Install the dependencies listed in requirements.txt:

pip install -r requirements.txt

Data Preparation

To download the ASVspoof 2019 logical access dataset [2]:

python download_dataset.py

(Alternative) The dataset can also be downloaded and prepared manually.

Phase I

1. Inference Embedding Extraction

The binary output layer of the AASIST model is stripped, and the remaining architecture is used to produce 160-dimensional embeddings for all audio files in the training, development, and evaluation sets.

To extract AASIST embeddings:

python inference_embedding_extraction.py

A set of embeddings is available in Embeddings/AASIST/ for further use.
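
For illustration, the sketch below shows one way to obtain such embeddings in PyTorch: a forward hook captures the input reaching the model's final binary classification layer. The attribute name out_layer is an assumption and may differ in the actual codebase; see inference_embedding_extraction.py for the repository's implementation.

import torch

def extract_embedding(model, waveform):
    """Return the penultimate (160-dim) representation for one utterance.

    `model` is assumed to be a trained AASIST network whose binary
    (bonafide/spoof) head is a linear module named `model.out_layer`.
    """
    captured = {}

    def hook(module, inputs, output):
        # The input arriving at the output layer is the embedding we want.
        captured["emb"] = inputs[0].detach()

    handle = model.out_layer.register_forward_hook(hook)
    model.eval()
    with torch.no_grad():
        model(waveform.unsqueeze(0))  # add a batch dimension: (1, num_samples)
    handle.remove()
    return captured["emb"].squeeze(0)  # shape: (160,)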

2. Training Probabilistic Feature Detectors

The ASVspoof 2019 dataset provides detailed metadata about the characteristics of each spoofing attack by organizing the spoofing methods into seven attribute sets: Input, Input Processor, Duration, Conversion, Speaker Representation, Output, and Waveform Generation. A probabilistic detector is trained for each attribute set: it takes the 160-dimensional "raw" AASIST embedding as its (shared) input and is trained against ground-truth labels to predict posterior probabilities for the presence or absence of the attributes associated with each spoofing attack algorithm.

To train a probabilistic feature detector for an attribute set:

python emb_main.py

The attribute set number, the model architecture of the probabilistic feature detector, and related parameters can be set in the configuration file emb_model_AASIST.conf. In our experiments, a two-layer neural network with 64 and 32 neurons in its hidden layers outperformed the other tested architectures (a single hidden layer of 0, 4, 8, 32, or 64 neurons) and suited all attribute sets best. A set of trained probabilistic feature detectors is available in probabilistic_detectors/ for further use.
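
As a minimal sketch, the best-performing detector architecture described above could look as follows in PyTorch; num_classes depends on the attribute set, and the class name AttributeDetector is illustrative rather than the repository's actual code.

import torch.nn as nn

class AttributeDetector(nn.Module):
    """Small MLP mapping a 160-dim AASIST embedding to attribute posteriors."""

    def __init__(self, num_classes, emb_dim=160):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(emb_dim, 64),      # first hidden layer: 64 neurons
            nn.ReLU(),
            nn.Linear(64, 32),           # second hidden layer: 32 neurons
            nn.ReLU(),
            nn.Linear(32, num_classes),  # logits; softmax yields posteriors
        )

    def forward(self, x):
        return self.net(x)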

Phase II

1. Concatenation of Posterior Probabilities

All audio recordings in the dataset are passed through the seven probabilistic feature detectors, and the resulting posterior probabilities are concatenated for each audio file to form a 25-dimensional embedding.

To compute and concatenate the outputs of the probabilistic feature detectors:

python create_df.py

The dataset split (train/dev/eval), the choice of applying the softmax or logit function to the probabilistic feature detectors' outputs, and a common model architecture for all detectors can be set in the configuration file emb_model_AASIST.conf.

A set of dataframes obtained this way is available in df_posterior_probabilities/ for further use.
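
For illustration, a minimal sketch of the concatenation step, assuming detectors is a list of the seven trained detector models (e.g., instances of the AttributeDetector sketched above); whether softmax scores or raw logits are concatenated is the configuration choice mentioned above.

import torch

def attribute_vector(embedding, detectors, apply_softmax=True):
    """Concatenate the seven detectors' outputs into one 25-dim vector."""
    outputs = []
    with torch.no_grad():
        for detector in detectors:
            logits = detector(embedding.unsqueeze(0))  # (1, num_classes)
            scores = torch.softmax(logits, dim=-1) if apply_softmax else logits
            outputs.append(scores.squeeze(0))
    # The class counts of the seven attribute sets sum to 25.
    return torch.cat(outputs)  # shape: (25,)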

2. Decision Tree Modelling

Decision tree models are trained on these 25-dimensional embeddings for two tasks:

  • Bonafide versus spoof classification:

      python decision_tree.py --BonafideSpoof

  • Spoofing attack algorithm attribution:

      python decision_tree.py --SpoofAttacks

Results are stored in decision_tree_results/. The relative paths of the dataframes and the maximum depth of the decision tree can be set in the configuration file emb_model_AASIST.conf.
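
As a minimal sketch of this step, assuming a 25-dimensional feature matrix X_train and label vector y_train, a scikit-learn decision tree can be trained and exported for Graphviz rendering as follows; the max_depth value and function names are illustrative, not the repository's configuration.

from sklearn.tree import DecisionTreeClassifier, export_graphviz

def train_tree(X_train, y_train, max_depth=5):
    """Fit a depth-limited decision tree on the 25-dim attribute vectors."""
    tree = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
    tree.fit(X_train, y_train)
    return tree

def save_tree(tree, feature_names, path="tree.dot"):
    """Write a Graphviz .dot file; render with: dot -Tpng tree.dot -o tree.png"""
    export_graphviz(tree, out_file=path, feature_names=feature_names,
                    filled=True, rounded=True)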

License

MIT License

Copyright (c) 2024 Manasi Chhibber

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

Acknowledgements

  • This research has been partially supported by the Academy of Finland (Decision No. 349605, project "SPEECHFAKES"). The author additionally acknowledges CSC – IT Center for Science, Finland, for the use of computational resources.
  • This repository is built on top of the AASIST repository.
  • The dataset used here is ASVspoof 2019 [2].

References

[1] AASIST: Audio Anti-Spoofing using Integrated Spectro-Temporal Graph Attention Networks

@article{Jung2021AASIST,
  author={Jung, Jee-weon and Heo, Hee-Soo and Tak, Hemlata and Shim, Hye-jin and Chung, Joon Son and Lee, Bong-Jin and Yu, Ha-Jin and Evans, Nicholas},
  title={AASIST: Audio Anti-Spoofing using Integrated Spectro-Temporal Graph Attention Networks},
  journal={arXiv preprint arXiv:2110.01200},
  year={2021}
}
[2] ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech

@article{wang2020asvspoof,
  title={ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech},
  author={Wang, Xin and Yamagishi, Junichi and Todisco, Massimiliano and Delgado, H{\'e}ctor and Nautsch, Andreas and Evans, Nicholas and Sahidullah, Md and Vestman, Ville and Kinnunen, Tomi and Lee, Kong Aik and others},
  journal={Computer Speech \& Language},
  volume={64},
  pages={101114},
  year={2020},
  publisher={Elsevier}
}
