This project detects deepfake audio using Wav2Vec2, a state-of-the-art self-supervised speech representation model developed by Facebook AI. With the rise of AI-generated content, deepfake audio poses serious threats in domains such as cybersecurity, media, legal evidence, and personal privacy. The system fine-tunes the Wav2Vec2 architecture on a labeled dataset containing both real human voices and synthetically generated audio samples, with the goal of automatically classifying an audio clip as real or deepfake with high accuracy.
- Deepfake Detection: Identifies whether an audio sample is real or fake.
- Pre-trained Model Support: Uses state-of-the-art deep learning models.
- User-Friendly Interface: Simple script-based execution for ease of use.
- Scalable & Efficient: Can be integrated into real-time applications.
├── app/ # Main application files
│ └── app.py # Script to run inference
├── models/ # Pre-trained models
├── scripts/ # Utility scripts
├── data/ # Contains raw audio files (ignored in .gitignore)
├── upload/ # For storing temporary files (ignored in .gitignore)
├── requirements.txt # Dependencies
├── training.log # Training history
├── .gitignore # Ignoring unnecessary files
└── README.md # Project documentation (this file)
git clone https://github.com/lakshiitakalyanasundaram/Lakshiita_kalyanasundaram.git
cd Lakshiita_kalyanasundaram
pip install -r requirements.txt
python app/app.py --input path/to/audio.wav
- Preprocessing: The audio file is preprocessed (e.g., noise reduction, feature extraction).
- Model Inference: The trained model classifies the audio as real or fake.
- Output: The result is displayed as real (✅) or deepfake (❌).
- Dataset: Trained on a dataset of real and fake audio samples.
- Model Used: Wav2Vec2 encoder with an LSTM + CNN hybrid classifier.
- Accuracy: Achieved ~72% accuracy in testing.
- Media Verification: Detect and verify the authenticity of voice recordings in journalism and broadcasting.
- Cybersecurity: Prevent voice spoofing attacks in authentication systems, especially in financial and biometric applications.
- Forensic Analysis: Assist law enforcement and legal investigations by identifying AI-generated speech in evidence materials.
- AI Ethics & Policy Compliance: Ensure responsible use of generative AI by identifying deepfake content, helping organizations comply with AI ethics policies.
- AI-Generated Speech Detection: Effectively distinguish synthetically generated voices from real human speech using deep learning.
I have used the 3004lakshu/for-norm dataset available on Hugging Face. It contains labeled samples of real and deepfake audio.
- Install the Hugging Face datasets library:
  pip install datasets
- Load the dataset in your script:
  from datasets import load_dataset
  dataset = load_dataset("3004lakshu/Deepfake-Audio")
- Explore the dataset:
  print(dataset)
Link: https://huggingface.co/datasets/3004lakshu/Deepfake-Audio
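As a quick sanity check, you can also inspect a single example. The field names below (`train`, `audio`, `label`) are assumptions based on typical Hugging Face audio datasets; print the keys first to confirm the actual schema:

```python
from datasets import load_dataset

dataset = load_dataset("3004lakshu/Deepfake-Audio")
sample = dataset["train"][0]  # assumes a "train" split exists
print(sample.keys())          # confirm the available fields
# Typical audio datasets expose the raw waveform and label like this:
# sample["audio"]["array"], sample["audio"]["sampling_rate"], sample["label"]
```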
The classification model is built on top of the Wav2Vec2 encoder. It extracts high-level audio embeddings from the input waveform. These embeddings are passed through a combination of LSTM and CNN layers, which capture temporal and local patterns in the speech signal.
This hybrid model architecture improves classification accuracy by combining both sequence and feature learning capabilities.
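As a rough illustration, the hybrid could be sketched in PyTorch as below; the layer sizes, kernel width, and last-frame pooling are assumptions for readability, not the exact hyperparameters used in training:

```python
import torch.nn as nn
from transformers import Wav2Vec2Model

class DeepfakeClassifier(nn.Module):
    def __init__(self, hidden=128):
        super().__init__()
        # Pre-trained encoder: raw waveform -> 768-dim frame embeddings.
        self.encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
        # 1-D CNN over time captures local patterns in the embeddings.
        self.conv = nn.Conv1d(768, hidden, kernel_size=5, padding=2)
        # LSTM captures longer-range temporal structure.
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        # Single logit: real vs. deepfake.
        self.head = nn.Linear(hidden, 1)

    def forward(self, waveform):                               # (B, samples)
        feats = self.encoder(waveform).last_hidden_state       # (B, T, 768)
        x = self.conv(feats.transpose(1, 2)).transpose(1, 2)   # (B, T, hidden)
        x, _ = self.lstm(x)
        return self.head(x[:, -1])                             # (B, 1) logit
```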
- Preprocessing:
  - Audio normalization
  - Silence trimming
  - Conversion to 16 kHz mono WAV
- Training (a configuration sketch follows this list):
  - Loss Function: Binary Cross Entropy
  - Optimizer: Adam
  - Epochs: 20
  - Dataset Split: 80% train, 20% test
  - Validation Accuracy: ~72%
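A minimal sketch of that training configuration is shown below; the random placeholder tensors, batch size, and learning rate are assumptions (the real pipeline feeds preprocessed 16 kHz waveforms from the dataset), and `DeepfakeClassifier` refers to the architecture sketch above:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset, random_split

# Placeholder tensors standing in for preprocessed 1-second 16 kHz clips.
waveforms = torch.randn(100, 16000)
labels = torch.randint(0, 2, (100, 1)).float()   # assumed: 0 = real, 1 = fake

full = TensorDataset(waveforms, labels)
n_train = int(0.8 * len(full))                   # 80% train / 20% test split
train_set, test_set = random_split(full, [n_train, len(full) - n_train])

model = DeepfakeClassifier()
criterion = nn.BCEWithLogitsLoss()               # binary cross entropy on logits
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

for epoch in range(20):                          # 20 epochs, as listed above
    for x, y in DataLoader(train_set, batch_size=8, shuffle=True):
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
```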
This repository contains a fine-tuned .pth model using the wav2vec2 architecture for audio deepfake detection. The model has been trained and optimized on a custom dataset for this task.
- Install the Hugging Face Hub library and log in (if you haven't already):
  pip install huggingface_hub
  huggingface-cli login
- Download the model:
  from huggingface_hub import hf_hub_download
  model_path = hf_hub_download(
      repo_id="3004lakshu/wav2vec2_trained",
      filename="deepfake_model.pth",
  )
https://huggingface.co/3004lakshu/wav2vec2_trained
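Once downloaded, the checkpoint can be loaded for inference. Whether the .pth file stores a plain state dict (assumed below) or a full pickled module depends on how it was saved, so adjust accordingly:

```python
import torch

# Assumes the checkpoint holds a state dict for the classifier sketched earlier.
model = DeepfakeClassifier()
state = torch.load(model_path, map_location="cpu")
model.load_state_dict(state)
model.eval()  # disable training-only behavior (e.g., dropout) for inference
```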
To understand the goals, objectives, and evaluation criteria of this internship task, please refer to the detailed documentation below:
Task Document (Google Docs):
View Full Internship Task Document
A screen recording of the Streamlit interface demonstrating the detection process has been included.
Watch the demo here:
Feel free to fork the repo, open an issue, or submit a pull request to improve the project!
MIT License. Use it freely, but give credit where it's due.
If this project helps you, give it a star on GitHub!