This project explores the tradeoff between dataset-specific performance and generalizability when fine-tuning OpenAI's Whisper speech recognition model.
The goal of this experiment was to improve Whisper's performance on the LibriSpeech dataset while observing how different fine-tuning approaches affect generalizability to out-of-sample audio.
Fine-tuning successfully reduced Word Error Rate (WER) on the LibriSpeech dataset:
- The base model achieved a WER of ~0.19 on test-clean and ~0.27 on test-other
- Fine-tuning with just one epoch significantly improved performance
- Further epochs continued to reduce WER on the target dataset
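For context, WER evaluation follows the usual pattern: transcribe each sample, normalize the text, and compare against the reference. Below is a minimal sketch using the openai-whisper and jiwer packages; the file name and normalization details are assumptions, and the project's own evaluation code lives in src/utils.py.

```python
import string

import jiwer
import whisper

def evaluate_wer(model, samples):
    """Corpus-level WER over (audio_path, reference_text) pairs."""
    def normalize(text):
        # Strip punctuation and case so WER reflects word choice only
        stripped = text.translate(str.maketrans("", "", string.punctuation))
        return " ".join(stripped.lower().split())

    hypotheses, references = [], []
    for audio_path, reference in samples:
        result = model.transcribe(audio_path, language="en")
        hypotheses.append(normalize(result["text"]))
        references.append(normalize(reference))
    return jiwer.wer(references, hypotheses)

model = whisper.load_model("base")
print(evaluate_wer(model, [("sample.flac", "Hello, my name is Izaak")]))
```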
However, as shown in `logs/old-logs/out-of-sample-eval.log`, the fine-tuned models completely lost their ability to transcribe simple English phrases outside the training distribution:
- Base model (no fine-tuning):
  - Successfully transcribed "Hello, my name is Izaak" and "Hello, my name is Tolga"
  - Overall WER: 0.4
- Fine-tuned models (standard approach):
  - Failed to transcribe simple phrases, producing outputs like "HELLO MY MAIMS ISICK"
  - Overall WER increased dramatically to 1.0-1.1
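This failure mode is easy to reproduce with a side-by-side check along these lines. The sketch below assumes the fine-tuned weights were saved as a PyTorch state dict; the checkpoint path and audio file name are hypothetical.

```python
import torch
import whisper

# Load the pretrained base model and a fine-tuned copy of the same architecture
base = whisper.load_model("base")
finetuned = whisper.load_model("base")
finetuned.load_state_dict(torch.load("models/finetuned.pt", map_location="cpu"))

for name, model in [("base", base), ("fine-tuned", finetuned)]:
    # A phrase far outside the LibriSpeech audiobook distribution
    result = model.transcribe("hello_izaak.wav", language="en")
    print(f"{name}: {result['text']}")
```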
To address catastrophic forgetting, several techniques were implemented:
- LoRA (Low-Rank Adaptation): Fine-tuning only a small set of low-rank adapter parameters while freezing the original weights
- KL Divergence: Penalizing drift of the fine-tuned model's output distribution away from the original model's (see the sketch after this list)
- EWC (Elastic Weight Consolidation): Penalizing changes to parameters that are important for general transcription
These techniques successfully preserved performance on the small out-of-sample dataset, demonstrating practical ways to balance domain-specific gains against general transcription ability.
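As an illustration, here is a minimal sketch of the KL-divergence regularizer, assuming a frozen copy of the base model produces `teacher_logits` and the model being fine-tuned produces `student_logits` of shape (batch, seq_len, vocab). The names and the 0.5 weight are illustrative; the project's actual training code lives in src/trainer.py.

```python
import torch
import torch.nn.functional as F

def kl_regularized_loss(student_logits, teacher_logits, labels, kl_weight=0.5):
    # Standard cross-entropy on the LibriSpeech target tokens
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,  # padding positions
    )
    # KL divergence between the frozen base model's output distribution
    # (teacher) and the fine-tuned model's (student); penalizing it
    # discourages catastrophic forgetting
    kl = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.log_softmax(teacher_logits, dim=-1),
        log_target=True,
        reduction="batchmean",
    )
    return ce + kl_weight * kl
```

LoRA and EWC slot into the same training loop differently: LoRA changes which parameters receive gradients, while EWC adds a quadratic penalty on parameter drift weighted by estimated parameter importance.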
The project includes code to visualize log-mel spectrograms of audio inputs, providing insight into how the model processes speech data. Example visualizations can be found in the `visualize/` directory.
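A short sketch of how such a figure can be produced with Whisper's bundled audio utilities; the file names are illustrative, and the project's own code is in src/visualize.py.

```python
import matplotlib.pyplot as plt
import whisper

audio = whisper.load_audio("sample.flac")
audio = whisper.pad_or_trim(audio)  # fixed 30-second window, Whisper's input size
mel = whisper.log_mel_spectrogram(audio)  # tensor of (80 mel bins, 3000 frames)

plt.figure(figsize=(10, 4))
plt.imshow(mel.numpy(), aspect="auto", origin="lower")
plt.xlabel("Time frames")
plt.ylabel("Mel frequency bins")
plt.title("Log-mel spectrogram")
plt.colorbar()
plt.savefig("visualize/spectrogram.png", bbox_inches="tight")
```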
whisper/
├── data/ # LibriSpeech dataset (downloaded automatically)
├── logs/ # Training and evaluation logs
├── models/ # Saved model checkpoints
├── src/ # Source code
│ ├── config.py # Configuration settings
│ ├── dataset.py # LibriSpeech dataset loader
│ ├── main.py # Main entry point
│ ├── trainer.py # Training functionality
│ ├── utils.py # Evaluation utilities
│   └── visualize.py # Spectrogram visualization
└── visualize/ # Generated spectrogram images