Python project for audio transcription using OpenAI Whisper and Mistral Voxtral models.
This project provides a command-line tool for transcribing audio files using OpenAI's Whisper models and Mistral's Voxtral models. You can transcribe individual files or batch process entire directories with support for multiple output formats.
After setting up the environment (see Development setup below):
# Make sure dependencies are installed and in sync
uv sync
# Transcribe all audio files in the audio/ directory to CSV
uv run python src/transcribe_audio.py
# Or use specific options
uv run python src/transcribe_audio.py --model whisper-tiny --format json
uv run python src/transcribe_audio.py --model whisper-tiny --format duckdb
The transcription script supports several command-line options:
uv run python src/transcribe_audio.py [OPTIONS]
Available Options:

- `--model MODEL`: Choose the transcription model - Whisper or Voxtral (default: `whisper-small`)
- `--format FORMAT`: Output format for results (default: `csv`)
- `--language LANGUAGE`: Language code for transcription (default: `en`). Whisper supports 99 languages, Voxtral supports 8.
- `--max-new-tokens TOKENS`: Maximum number of tokens to generate (default: `400`). Whisper models have a maximum limit of 448 tokens.
- `--input-path PATH`: Directory containing audio files (default: `./audio`)
- `--output-path PATH`: Directory for output files (default: `./output`)
- `--all-audio`: Re-process all files, including previously transcribed ones
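For orientation, these options map onto a standard `argparse` interface. The sketch below is illustrative only; the exact parser in `src/transcribe_audio.py` may be structured differently.

```python
# Minimal sketch of the CLI surface described above (argparse assumed;
# the real script may organize this differently).
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Transcribe audio files with Whisper or Voxtral"
    )
    parser.add_argument("--model", default="whisper-small",
                        help="Transcription model, e.g. whisper-tiny or voxtral-mini")
    parser.add_argument("--format", default="csv",
                        choices=["csv", "json", "parquet", "duckdb"],
                        help="Output format for results")
    parser.add_argument("--language", default="en", help="ISO 639-1 language code")
    parser.add_argument("--max-new-tokens", type=int, default=400,
                        help="Maximum tokens to generate (Whisper caps at 448)")
    parser.add_argument("--input-path", default="./audio",
                        help="Directory containing audio files")
    parser.add_argument("--output-path", default="./output",
                        help="Directory for output files")
    parser.add_argument("--all-audio", action="store_true",
                        help="Re-process all files, including previously transcribed ones")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```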
Choose from different Whisper models based on your speed vs accuracy needs:
Model | Description | Size | Use Case |
---|---|---|---|
`whisper-tiny` | Fastest model, least accurate | ~39 MB | Quick testing, real-time |
`whisper-small` | Fast model, good accuracy | ~244 MB | Recommended default |
`whisper-medium` | Balanced speed/accuracy | ~769 MB | High-quality transcription |
`whisper-large-v3-turbo` | Best accuracy, slower | ~1550 MB | Maximum quality needed |
Whisper Language Support: Supports 99 languages including English, Spanish, French, German, Chinese, Japanese, Korean, Arabic, Hindi, and many more. Use ISO 639-1 language codes (e.g., `en`, `es`, `fr`, `de`, `zh`, `ja`, `ko`, `ar`, `hi`).
For multilingual speech recognition with advanced capabilities:
Model | Description | Size | Use Case |
---|---|---|---|
`voxtral-mini` | Multilingual ASR model | ~3B params | Fast multilingual transcription |
`voxtral-small` | High-quality multilingual ASR | ~24B params | Best multilingual accuracy |
Voxtral Language Support: Currently supports 8 languages: English (`en`), Spanish (`es`), French (`fr`), Portuguese (`pt`), Hindi (`hi`), German (`de`), Dutch (`nl`), and Italian (`it`).
Note: Voxtral models require additional dependencies. See Voxtral Setup below.
Save your transcriptions in multiple formats:
Format | Extension | Description | Best For |
---|---|---|---|
`csv` | `.csv` | Comma-separated values | Excel, data analysis |
`json` | `.json` | JavaScript Object Notation | Web applications, APIs |
`parquet` | `.parquet` | Apache Parquet columnar | Big data, analytics |
`duckdb` | `.duckdb` | DuckDB database | SQL queries, complex analysis |
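To illustrate how one set of result rows can land in all four formats, here is a hedged sketch using pandas and DuckDB; the function name, file names, and column handling are illustrative, not the script's actual API.

```python
# Hedged sketch: persisting transcription rows in the four supported formats.
import duckdb
import pandas as pd


def save_results(rows: list[dict], fmt: str, output_path: str) -> None:
    df = pd.DataFrame(rows)
    if fmt == "csv":
        df.to_csv(f"{output_path}/transcribed_audio.csv", index=False)
    elif fmt == "json":
        df.to_json(f"{output_path}/transcribed_audio.json", orient="records", indent=2)
    elif fmt == "parquet":
        df.to_parquet(f"{output_path}/transcribed_audio.parquet", index=False)
    elif fmt == "duckdb":
        con = duckdb.connect(f"{output_path}/transcribed_audio.duckdb")
        # Create the table from the DataFrame schema on first run, then append.
        con.execute("CREATE TABLE IF NOT EXISTS transcriptions AS SELECT * FROM df WHERE FALSE")
        con.execute("INSERT INTO transcriptions SELECT * FROM df")
        con.close()
    else:
        raise ValueError(f"Unsupported format: {fmt}")
```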
Basic transcription:
# Transcribe all audio files with default settings
uv run python src/transcribe_audio.py
Choose model and format:
# Use tiny Whisper model for fast processing, save as JSON
uv run python src/transcribe_audio.py --model whisper-tiny --format json
# Use large Whisper model for best quality, save to database
uv run python src/transcribe_audio.py --model whisper-large-v3-turbo --format duckdb
# Use Voxtral model for multilingual transcription (requires setup)
uv run python src/transcribe_audio.py --model voxtral-mini --format json
Language-specific transcription:
# Transcribe Spanish audio
uv run python src/transcribe_audio.py --language es
# Transcribe French audio with more tokens for detailed output
uv run python src/transcribe_audio.py --language fr --max-new-tokens 600
# Use Voxtral for German transcription
uv run python src/transcribe_audio.py --model voxtral-mini --language de
# Transcribe Hindi audio with Whisper (Voxtral also supports Hindi)
uv run python src/transcribe_audio.py --language hi --model whisper-medium
Token length control:
# Short summaries (fewer tokens)
uv run python src/transcribe_audio.py --max-new-tokens 200
# Default length
uv run python src/transcribe_audio.py # Uses 400 tokens
# Longer transcriptions (Whisper max is 448)
uv run python src/transcribe_audio.py --max-new-tokens 448
# Voxtral models can use more tokens
uv run python src/transcribe_audio.py --max-new-tokens 800 --model voxtral-mini
Re-process all files:
# Force re-transcription of all files (ignores existing results)
uv run python src/transcribe_audio.py --all-audio --format parquet
Production batch processing:
# Process large batches with balanced Whisper model
uv run python src/transcribe_audio.py --model whisper-medium --format duckdb
# Process multilingual content with Voxtral
uv run python src/transcribe_audio.py --model voxtral-small --format parquet --all-audio
# Batch process Spanish content with custom token limit
uv run python src/transcribe_audio.py --model whisper-medium --language es --max-new-tokens 500 --format duckdb
# High-quality multilingual processing
uv run python src/transcribe_audio.py --model voxtral-small --language fr --max-new-tokens 600 --format parquet
# Process files from custom directories
uv run python src/transcribe_audio.py --input-path /custom/audio --output-path /custom/results
# Batch process with custom paths and settings
uv run python src/transcribe_audio.py --input-path ~/recordings --output-path ~/transcriptions --model whisper-medium --format duckdb
Supported Audio Formats:
- MP3 (`.mp3`)
- WAV (`.wav`)
- FLAC (`.flac`)
- M4A (`.m4a`)
- OGG (`.ogg`)
File Organization:

- Place audio files in the `audio/` directory (or specify a custom path with `--input-path`)
- The script automatically discovers all supported audio files
- Files are processed in alphabetical order
- Results are saved to the `output/` directory (or specify a custom path with `--output-path`)
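A minimal sketch of the discovery step described above might look like this; the helper name and exact sorting are assumptions, not the script's actual code.

```python
# Hedged sketch of audio file discovery: find supported files and return them
# in alphabetical order, matching the behavior described above.
from pathlib import Path

SUPPORTED_EXTENSIONS = {".mp3", ".wav", ".flac", ".m4a", ".ogg"}


def discover_audio_files(input_path: str = "./audio") -> list[Path]:
    directory = Path(input_path)
    directory.mkdir(parents=True, exist_ok=True)  # created if missing
    files = [p for p in directory.iterdir()
             if p.is_file() and p.suffix.lower() in SUPPORTED_EXTENSIONS]
    return sorted(files, key=lambda p: p.name.lower())
```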
All transcription results include:
- File ID: Unique identifier based on filename and size
- Filename: Original audio file name
- File Size: Size in bytes
- Transcription Time: Processing duration in seconds
- Transcription Text: The actual transcribed text
- Model ID: Which transcription model (Whisper or Voxtral) was used
- Timestamps: When processing started and completed
Example CSV Output:
file_id,filename,file_size_bytes,transcription_time_seconds,transcription_text,model_id,started_at,processed_at
a1b2c3d4e5f6g7h8,sample.mp3,1048576,2.34,"Hello world, this is a test recording.",openai/whisper-small,2024-01-01T12:00:00Z,2024-01-01T12:00:02Z
Automatic Device Detection:
- Uses CUDA GPU if available for faster processing
- Falls back to CPU automatically
- Model precision adjusted based on device (float16 for GPU, float32 for CPU)
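In practice this device selection comes down to a few lines of PyTorch; the sketch below illustrates the idea and is not necessarily identical to the script's implementation.

```python
# Hedged sketch of automatic device and precision selection with PyTorch.
import torch


def select_device_and_dtype() -> tuple[str, torch.dtype]:
    if torch.cuda.is_available():
        return "cuda", torch.float16  # GPU: half precision for speed and memory
    return "cpu", torch.float32       # CPU fallback: full precision


device, dtype = select_device_and_dtype()
print(f"Using device={device}, dtype={dtype}")
```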
Processing Speed Examples:
- CPU: typically slower than real time (10 seconds of audio ≈ 50-100 seconds of processing)
- GPU: close to or faster than real time (10 seconds of audio ≈ 5-20 seconds of processing)
- Actual speed varies by model size and hardware
The script avoids re-processing files by:
- Generating unique file IDs based on filename + file size
- Checking existing results in the output file
- Skipping previously processed files (unless `--all-audio` is used)
- Appending new results to existing output files
This makes it efficient for processing large directories incrementally.
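One way to realize this skip logic is a short hash over the filename and size. The helper below is a hedged sketch; the actual ID scheme and function names in the script may differ.

```python
# Hedged sketch: derive a stable file ID from filename + size and skip files
# whose IDs already appear in previous results.
import hashlib
from pathlib import Path


def file_id(path: Path) -> str:
    raw = f"{path.name}:{path.stat().st_size}".encode()
    return hashlib.sha256(raw).hexdigest()[:16]


def files_to_process(files: list[Path], existing_ids: set[str], all_audio: bool) -> list[Path]:
    if all_audio:
        return files  # --all-audio: re-process everything
    return [p for p in files if file_id(p) not in existing_ids]
```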
The script handles various error conditions gracefully:
- Missing audio directory: Creates directory if needed
- Unsupported file formats: Skips with warning
- Corrupted audio files: Continues processing other files
- Model loading errors: Provides clear error messages
- Disk space issues: Fails gracefully with error details
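The pattern behind "continue on bad files, report the problem" is a per-file try/except around the transcription call, roughly like this sketch (function names are illustrative):

```python
# Hedged sketch: keep a batch running when a single file fails, but log the
# error so it can be investigated later.
import logging
from pathlib import Path
from typing import Callable

logger = logging.getLogger(__name__)


def transcribe_batch(files: list[Path], transcribe_one: Callable[[Path], dict]) -> list[dict]:
    results = []
    for path in files:
        try:
            results.append(transcribe_one(path))
        except Exception as exc:  # corrupted audio, decode errors, etc.
            logger.warning("Skipping %s: %s", path.name, exc)
    return results
```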
Using with other tools:
# Process audio and analyze results with DuckDB
uv run python src/transcribe_audio.py --format duckdb
echo "SELECT model_id, AVG(transcription_time_seconds) FROM transcriptions GROUP BY model_id;" | duckdb output/transcribed_audio.duckdb
# Export to CSV for Excel analysis
uv run python src/transcribe_audio.py --format csv
# Open output/transcribed_audio.csv in Excel
# Process and convert to different format
uv run python src/transcribe_audio.py --format parquet
# Use pandas/polars to read parquet file in data science workflows
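For downstream analysis in Python, the Parquet and DuckDB outputs can be read back with the usual libraries. A hedged sketch, using the table and column names shown in the output description above:

```python
# Hedged sketch: loading results for analysis after a transcription run.
import duckdb
import pandas as pd

# Parquet output -> pandas DataFrame
df = pd.read_parquet("output/transcribed_audio.parquet")
print(df[["filename", "transcription_time_seconds"]].head())

# DuckDB output -> SQL query from Python
con = duckdb.connect("output/transcribed_audio.duckdb", read_only=True)
avg_times = con.execute(
    "SELECT model_id, AVG(transcription_time_seconds) AS avg_seconds "
    "FROM transcriptions GROUP BY model_id"
).fetchdf()
print(avg_times)
```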
Monitoring progress:
# Watch processing in real-time
uv run python src/transcribe_audio.py --model whisper-small --format json 2>&1 | tee transcription.log
To use Mistral's Voxtral models for multilingual speech recognition, you need to install additional dependencies.
Voxtral models require the latest development version of the `transformers` library and additional audio processing dependencies.
- Install development transformers (required for Voxtral support):

  uv pip install git+https://github.com/huggingface/transformers

- Install Mistral audio dependencies:

  uv pip install --upgrade "mistral-common[audio]"

- Activate the environment and verify installation:

  source .venv/bin/activate
  python src/transcribe_audio.py --help | grep -A 10 "Available models:"

You should see both Whisper and Voxtral models listed if installation was successful.

Important: You must activate the virtual environment with `source .venv/bin/activate` before testing Voxtral models to ensure proper dependency resolution.

Note: These extra installation steps may become obsolete once Voxtral models are available in a future stable release of Hugging Face transformers.
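A quick Python-level check (in addition to the `--help` grep above) is to try importing the Voxtral model class. The class name below is an assumption based on the development `transformers` API and may change before a stable release.

```python
# Hedged sketch: verify that the installed transformers build exposes Voxtral.
# The class name is an assumption about the development API.
try:
    from transformers import VoxtralForConditionalGeneration  # noqa: F401
    print("Voxtral support detected in transformers.")
except ImportError:
    print("Voxtral classes not found - install transformers from GitHub as shown above.")
```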
Feature | Whisper | Voxtral |
---|---|---|
Languages | 99+ languages | 8 languages (en, es, fr, pt, hi, de, nl, it) |
Model Size | 39MB - 1.5GB | 3B - 24B parameters |
Speed | Fast to moderate | Moderate to slow |
Accuracy | High for English | Very high for supported languages |
Dependencies | Standard transformers | Development transformers + mistral-common |
Use Case | General transcription | Advanced multilingual ASR |
Token Control | Yes (--max-new-tokens) | Yes (--max-new-tokens) |
If Voxtral models don't appear:
- Ensure you installed the development version of transformers
- Check that mistral-common[audio] is properly installed
- Restart your environment after installation
If you get import errors:
# Clean reinstall
uv pip uninstall transformers mistral-common
uv pip install git+https://github.com/huggingface/transformers
uv pip install --upgrade "mistral-common[audio]"
Performance considerations:
- Voxtral models require more GPU memory than Whisper
- Use `voxtral-mini` for faster processing
- Use `voxtral-small` only with sufficient GPU memory (>8GB recommended; see the quick check below)
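To check available GPU memory before committing to `voxtral-small`, a small PyTorch probe like the following can help (threshold taken from the recommendation above):

```python
# Hedged sketch: report available GPU memory before choosing a Voxtral model.
import torch

if torch.cuda.is_available():
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"GPU memory: {total_gb:.1f} GB")
    if total_gb < 8:
        print("Below the ~8 GB recommendation - prefer voxtral-mini or a Whisper model.")
else:
    print("No CUDA GPU detected - Voxtral models will be slow on CPU.")
```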
This project requires Python 3.11 or 3.12 (not 3.13 due to dependency constraints) and the following tools:
- Python 3.11-3.12: Required for compatibility with audio processing dependencies
- uv: Modern Python package manager for dependency management
- just: Command runner for common development tasks
- cmake: Required for building audio processing dependencies
- git: Version control
On macOS/Linux with Homebrew:
brew install just uv cmake
On Windows:
winget install Casey.Just astral-sh.uv
# Install cmake via Visual Studio Build Tools or from cmake.org
On Linux (Ubuntu/Debian):
# Install via Homebrew
brew install just uv
# Install other tools via apt
sudo apt install cmake build-essential pkg-config
git clone <repository-url>
cd audio-transcription
Quick setup (recommended):
just get-started
Manual setup:
# Create virtual environment with correct Python version
uv venv --python 3.12
uv sync
The project uses `uv` for environment management:
# Activate the environment
source .venv/bin/activate
# Or run commands directly with uv
uv run python your_script.py
uv run jupyter lab
Common development commands using `just`:
just lab # Launch Jupyter Lab
just lint-py # Lint Python code
just fmt-python # Format Python code
just fmt-all # Format all code (Python, SQL, Markdown)
just pre-commit-run # Run pre-commit hooks
just test # Run core test suite
just test-cov # Run tests with coverage report
This project includes a comprehensive test suite with 74 passing tests that validate all aspects of the audio transcription functionality.
tests/
├── assets/ # Test audio files for integration tests
│ └── audio/ # Real audio files (add your own)
├── integration/ # Integration tests with real files
│ └── test_real_audio.py # Tests using actual audio files
├── unit/ # Fast unit tests (mocked)
│ ├── test_file_operations.py # ✅ File ID, audio discovery, sizes
│ ├── test_data_formats.py # ✅ CSV, JSON, Parquet, DuckDB ops
│ ├── test_model_loading.py # ✅ Whisper model loading & validation
│ ├── test_transcription.py # ✅ Core transcription with mocks
│ ├── test_cli.py # ✅ CLI argument parsing
│ └── test_error_handling.py # ⚠️ Error handling (3 edge cases)
└── conftest.py # Shared fixtures and test utilities
Quick Testing (Recommended):
# Run core working tests (fast, ~6 seconds)
just test
# Or: uv run python -m pytest
Comprehensive Testing:
# Run all working unit tests
just test-unit
# Run integration tests with real audio files
just test-integration
# Run slow/comprehensive tests
just test-slow
# Run ALL tests (including broken ones for debugging)
just test-all
# Run only broken tests for debugging
just test-broken
Coverage Reports:
# Terminal coverage report
just test-cov
# HTML coverage report (opens in browser)
just test-cov-html
# XML coverage report (for CI)
just test-cov-xml
✅ Working Tests (74 tests):
- File Operations (12 tests): File ID generation, audio discovery, file sizes
- Data Formats (22 tests): Save/load operations for CSV, JSON, Parquet, DuckDB
- Model Loading (13 tests): Whisper model validation, device handling
- Transcription Core (11 tests): Audio processing with mocked models
- CLI Interface (16 tests): Command-line argument parsing and workflow integration
- Error Handling: 3 filesystem permission/error detection edge cases
- Place audio files in `tests/assets/audio/`
- Keep files small (< 1MB each, 1-30 seconds duration)
- Document sources in `tests/assets/README.md`
- Run integration tests: `just test-integration`
Integration tests will skip gracefully if no audio files are present.
- Fast by default: Core tests run in ~6 seconds using mocks
- No model downloads: Uses mocked ML models to avoid heavy downloads
- Graceful skipping: Integration tests skip when real audio files unavailable
- Comprehensive coverage: Tests all output formats and error scenarios
- CI-ready: Provides detailed coverage reports for continuous integration
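The "graceful skipping" behavior is a standard pytest pattern; here is a hedged sketch of what such an integration test could look like (file names and assertions are illustrative, not the actual test code):

```python
# Hedged sketch of an integration test that skips when no real audio exists.
from pathlib import Path

import pytest

AUDIO_DIR = Path("tests/assets/audio")


def test_transcribes_real_audio():
    audio_files = sorted(AUDIO_DIR.glob("*.mp3"))
    if not audio_files:
        pytest.skip("No real audio files available in tests/assets/audio/")
    # A fuller test would run the transcription pipeline and assert on the text.
    assert audio_files[0].stat().st_size > 0
```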
The project enforces high code quality standards through automated tools:
# Format all code (Python, SQL, Markdown)
just fmt-all
# Format only Python code
just fmt-python
# Lint Python code with ruff
just lint-py
# Run all pre-commit hooks
just pre-commit-run
The project uses comprehensive pre-commit hooks that run automatically before each commit:
- File validation: YAML, JSON, TOML syntax checking
- Python validation: `validate-pyproject` for pyproject.toml
- Spell checking: `codespell` with custom ignore list
- Markdown formatting: `markdownlint-fix` with auto-fixing
- Python formatting: `ruff-format` for consistent code style
- Python linting: `ruff-check` with comprehensive rule set
Setup pre-commit hooks:
# Install pre-commit hooks (run once)
uv run pre-commit install
# Run hooks manually on all files
just pre-commit-run
# Update hook versions
just update-reqs
- Line length: 88 characters (follows Black standard)
- Python target: 3.12 compatibility
- Docstrings: Required for public functions and classes
- Type hints: Encouraged but not enforced
- Import sorting: Automatic via ruff
- Formatting: Automatic via ruff (replaces Black)
VS Code (recommended):
- Install Python extension
- Set Python interpreter to `.venv/bin/python` (or `.venv/Scripts/python.exe` on Windows)
- Install Jupyter extension for notebook support
- Install Python Test Explorer for integrated test running
The project includes demonstration notebooks showcasing different transcription models:
- `notebooks/demo_whisper_transcription.ipynb`: Demonstrates audio transcription using OpenAI's Whisper model
  - Uses the `openai/whisper-small` model for faster inference
  - Includes interactive audio players for testing
  - Shows transcription accuracy comparison with original text
- `notebooks/demo_voxtral_transcription.ipynb`: Demonstrates audio transcription using Mistral's Voxtral model
  - Uses the `mistralai/Voxtral-Mini-3B-2507` model
  - Multilingual speech recognition capabilities
  - Interactive audio processing with timing metrics
# Launch Jupyter Lab
just lab
# Or run directly with uv
uv run jupyter lab
Both notebooks include sample audio files and demonstrate:
- Model loading and setup
- Audio file processing with interactive players
- Transcription accuracy comparison
- Performance timing metrics
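The core pattern both notebooks build on is loading an ASR pipeline and feeding it an audio file. A minimal sketch with Hugging Face `transformers` follows; the notebooks add interactive players and timing around this, and the file path here is only an example.

```python
# Minimal sketch: transcribe one file with the Whisper model used in the demo
# notebook (openai/whisper-small). Requires ffmpeg for audio decoding.
import torch
from transformers import pipeline

device = 0 if torch.cuda.is_available() else -1
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",
    device=device,
)

result = asr(
    "audio/sample.mp3",  # example path; any supported audio file works
    generate_kwargs={"language": "en", "max_new_tokens": 400},
)
print(result["text"])
```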
- Fork and clone the repository
- Set up the development environment:

  just get-started
  source .venv/bin/activate

- Install pre-commit hooks:

  uv run pre-commit install

- Create a feature branch:

  git checkout -b feature/your-feature-name

- Make your changes following the code style standards
- Run tests to ensure everything works:

  just test
  just lint-py

- Commit your changes (pre-commit hooks will run automatically):

  git add .
  git commit -m "Add your descriptive commit message"

- Push and create a pull request
When adding new functionality:
- Write tests first (TDD approach recommended)
- Update documentation in README.md and CLAUDE.md
- Ensure code quality by running `just fmt-all` and `just lint-py`
- Add integration tests if working with real audio files
- Update notebooks if the feature affects demo functionality
- Tests pass (`just test`)
- Code is formatted (`just fmt-all`)
- Code is linted (`just lint-py`)
- Documentation is updated
- Pre-commit hooks pass
- Integration tests work (if applicable)
Python 3.13 Issues:
If you encounter errors with `sentencepiece` or other dependencies, ensure you're using Python 3.11 or 3.12:
uv python pin 3.12
uv sync
Missing cmake: Audio dependencies require cmake for compilation. Install via your system package manager.
Voxtral Model Issues:
If you encounter issues with the Voxtral model, ensure you have the correct version of `uv` and that the model is downloaded correctly:
uv sync
uv pip install git+https://github.com/huggingface/transformers
uv pip install --upgrade "mistral-common[audio]"
# Or shortcut
just venv
source .venv/bin/activate # On Linux/macOS
# .venv\Scripts\activate # On Windows