Python project for audio transcription using OpenAI Whisper and Mistral Voxtral models.
This project provides a command-line tool for transcribing audio files using OpenAI's Whisper models and Mistral's Voxtral models. You can transcribe individual files or batch process entire directories with support for multiple output formats.
After setting up the environment (see Development setup below):
# Make sure dependencies are installed and in sync
uv sync
# Transcribe all audio files in the audio/ directory to CSV
uv run python src/transcribe_audio.py
# Or use specific options
uv run python src/transcribe_audio.py --model whisper-tiny --format json
uv run python src/transcribe_audio.py --model whisper-tiny --format duckdb
The transcription script supports several command-line options:
uv run python src/transcribe_audio.py [OPTIONS]
Available Options:

- `--model MODEL`: Choose the transcription model - Whisper or Voxtral (default: `whisper-small`)
- `--format FORMAT`: Output format for results (default: `csv`)
- `--language LANGUAGE`: Language code for transcription (default: `en`). Whisper supports 99 languages, Voxtral supports 8.
- `--max-new-tokens TOKENS`: Maximum number of tokens to generate (default: `400`). Whisper models have a maximum limit of 448 tokens.
- `--input-path PATH`: Directory containing audio files (default: `./audio`)
- `--output-path PATH`: Directory for output files (default: `./output`)
- `--all-audio`: Re-process all files, including previously transcribed ones
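For orientation, these options map onto a standard `argparse` interface. The sketch below is illustrative only; the exact parser in `src/transcribe_audio.py` may be structured differently.

```python
# Minimal sketch of the CLI surface described above (argparse assumed;
# the real script may organize this differently).
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Transcribe audio files with Whisper or Voxtral"
    )
    parser.add_argument("--model", default="whisper-small",
                        help="Transcription model, e.g. whisper-tiny or voxtral-mini")
    parser.add_argument("--format", default="csv",
                        choices=["csv", "json", "parquet", "duckdb"],
                        help="Output format for results")
    parser.add_argument("--language", default="en", help="ISO 639-1 language code")
    parser.add_argument("--max-new-tokens", type=int, default=400,
                        help="Maximum tokens to generate (Whisper caps at 448)")
    parser.add_argument("--input-path", default="./audio",
                        help="Directory containing audio files")
    parser.add_argument("--output-path", default="./output",
                        help="Directory for output files")
    parser.add_argument("--all-audio", action="store_true",
                        help="Re-process all files, including previously transcribed ones")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```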
Choose from different Whisper models based on your speed vs accuracy needs:
Model | Description | Size | Use Case |
---|---|---|---|
`whisper-tiny` | Fastest model, least accurate | ~39 MB | Quick testing, real-time |
`whisper-small` | Fast model, good accuracy | ~244 MB | Recommended default |
`whisper-medium` | Balanced speed/accuracy | ~769 MB | High-quality transcription |
`whisper-large-v3-turbo` | Best accuracy, slower | ~1550 MB | Maximum quality needed |
Whisper Language Support: Supports 99 languages including English, Spanish, French, German, Chinese, Japanese, Korean, Arabic, Hindi, and many more. Use ISO 639-1 language codes (e.g., `en`, `es`, `fr`, `de`, `zh`, `ja`, `ko`, `ar`, `hi`).
For multilingual speech recognition with advanced capabilities:
Model | Description | Size | Use Case |
---|---|---|---|
`voxtral-mini` | Multilingual ASR model | ~3B params | Fast multilingual transcription |
`voxtral-small` | High-quality multilingual ASR | ~24B params | Best multilingual accuracy |
Voxtral Language Support: Currently supports 8 languages: English (`en`), Spanish (`es`), French (`fr`), Portuguese (`pt`), Hindi (`hi`), German (`de`), Dutch (`nl`), and Italian (`it`).
Note: Voxtral models require additional dependencies. See Voxtral Setup below.
Save your transcriptions in multiple formats:
Format | Extension | Description | Best For |
---|---|---|---|
`csv` | `.csv` | Comma-separated values | Excel, data analysis |
`json` | `.json` | JavaScript Object Notation | Web applications, APIs |
`parquet` | `.parquet` | Apache Parquet columnar | Big data, analytics |
`duckdb` | `.duckdb` | DuckDB database | SQL queries, complex analysis |
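To illustrate how one set of result rows can land in all four formats, here is a hedged sketch using pandas and DuckDB; the function name, file names, and column handling are illustrative, not the script's actual API.

```python
# Hedged sketch: persisting transcription rows in the four supported formats.
import duckdb
import pandas as pd


def save_results(rows: list[dict], fmt: str, output_path: str) -> None:
    df = pd.DataFrame(rows)
    if fmt == "csv":
        df.to_csv(f"{output_path}/transcribed_audio.csv", index=False)
    elif fmt == "json":
        df.to_json(f"{output_path}/transcribed_audio.json", orient="records", indent=2)
    elif fmt == "parquet":
        df.to_parquet(f"{output_path}/transcribed_audio.parquet", index=False)
    elif fmt == "duckdb":
        con = duckdb.connect(f"{output_path}/transcribed_audio.duckdb")
        # Create the table from the DataFrame schema on first run, then append.
        con.execute("CREATE TABLE IF NOT EXISTS transcriptions AS SELECT * FROM df WHERE FALSE")
        con.execute("INSERT INTO transcriptions SELECT * FROM df")
        con.close()
    else:
        raise ValueError(f"Unsupported format: {fmt}")
```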
Basic transcription:
# Transcribe all audio files with default settings
uv run python src/transcribe_audio.py
Choose model and format:
# Use tiny Whisper model for fast processing, save as JSON
uv run python src/transcribe_audio.py --model whisper-tiny --format json
# Use large Whisper model for best quality, save to database
uv run python src/transcribe_audio.py --model whisper-large-v3-turbo --format duckdb
# Use Voxtral model for multilingual transcription (requires setup)
uv run python src/transcribe_audio.py --model voxtral-mini --format json
Language-specific transcription:
# Transcribe Spanish audio
uv run python src/transcribe_audio.py --language es
# Transcribe French audio with more tokens for detailed output
uv run python src/transcribe_audio.py --language fr --max-new-tokens 600
# Use Voxtral for German transcription
uv run python src/transcribe_audio.py --model voxtral-mini --language de
# Transcribe Hindi audio with Whisper (Voxtral also supports Hindi)
uv run python src/transcribe_audio.py --language hi --model whisper-medium
Token length control:
# Short summaries (fewer tokens)
uv run python src/transcribe_audio.py --max-new-tokens 200
# Default length
uv run python src/transcribe_audio.py # Uses 400 tokens
# Longer transcriptions (Whisper max is 448)
uv run python src/transcribe_audio.py --max-new-tokens 448
# Voxtral models can use more tokens
uv run python src/transcribe_audio.py --max-new-tokens 800 --model voxtral-mini
Re-process all files:
# Force re-transcription of all files (ignores existing results)
uv run python src/transcribe_audio.py --all-audio --format parquet
Production batch processing:
# Process large batches with balanced Whisper model
uv run python src/transcribe_audio.py --model whisper-medium --format duckdb
# Process multilingual content with Voxtral
uv run python src/transcribe_audio.py --model voxtral-small --format parquet --all-audio
# Batch process Spanish content with custom token limit
uv run python src/transcribe_audio.py --model whisper-medium --language es --max-new-tokens 500 --format duckdb
# High-quality multilingual processing
uv run python src/transcribe_audio.py --model voxtral-small --language fr --max-new-tokens 600 --format parquet
# Process files from custom directories
uv run python src/transcribe_audio.py --input-path /custom/audio --output-path /custom/results
# Batch process with custom paths and settings
uv run python src/transcribe_audio.py --input-path ~/recordings --output-path ~/transcriptions --model whisper-medium --format duckdb
Supported Audio Formats:
- MP3 (`.mp3`)
- WAV (`.wav`)
- FLAC (`.flac`)
- M4A (`.m4a`)
- OGG (`.ogg`)
File Organization:

- Place audio files in the `audio/` directory (or specify a custom path with `--input-path`)
- The script automatically discovers all supported audio files
- Files are processed in alphabetical order
- Results are saved to the `output/` directory (or specify a custom path with `--output-path`)
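A minimal sketch of the discovery step described above might look like this; the helper name and exact sorting are assumptions, not the script's actual code.

```python
# Hedged sketch of audio file discovery: find supported files and return them
# in alphabetical order, matching the behavior described above.
from pathlib import Path

SUPPORTED_EXTENSIONS = {".mp3", ".wav", ".flac", ".m4a", ".ogg"}


def discover_audio_files(input_path: str = "./audio") -> list[Path]:
    directory = Path(input_path)
    directory.mkdir(parents=True, exist_ok=True)  # created if missing
    files = [p for p in directory.iterdir()
             if p.is_file() and p.suffix.lower() in SUPPORTED_EXTENSIONS]
    return sorted(files, key=lambda p: p.name.lower())
```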
All transcription results include:
- File ID: Unique identifier based on filename and size
- Filename: Original audio file name
- File Size: Size in bytes
- Transcription Time: Processing duration in seconds
- Transcription Text: The actual transcribed text
- Model ID: Which transcription model (Whisper or Voxtral) was used
- Timestamps: When processing started and completed
Example CSV Output:
file_id,filename,file_size_bytes,transcription_time_seconds,transcription_text,model_id,started_at,processed_at
a1b2c3d4e5f6g7h8,sample.mp3,1048576,2.34,"Hello world, this is a test recording.",openai/whisper-small,2024-01-01T12:00:00Z,2024-01-01T12:00:02Z
Automatic Device Detection:
- Uses CUDA GPU if available for faster processing
- Falls back to CPU automatically
- Model precision adjusted based on device (float16 for GPU, float32 for CPU)
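In practice this device selection comes down to a few lines of PyTorch; the sketch below illustrates the idea and is not necessarily identical to the script's implementation.

```python
# Hedged sketch of automatic device and precision selection with PyTorch.
import torch


def select_device_and_dtype() -> tuple[str, torch.dtype]:
    if torch.cuda.is_available():
        return "cuda", torch.float16  # GPU: half precision for speed and memory
    return "cpu", torch.float32       # CPU fallback: full precision


device, dtype = select_device_and_dtype()
print(f"Using device={device}, dtype={dtype}")
```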
Processing Speed Examples:
- CPU: typically slower than real time (10 seconds of audio ≈ 50-100 seconds of processing)
- GPU: close to or faster than real time (10 seconds of audio ≈ 5-20 seconds of processing)
- Actual speed varies by model size and hardware
The script avoids re-processing files by:
- Generating unique file IDs based on filename + file size
- Checking existing results in the output file
- Skipping previously processed files (unless `--all-audio` is used)
- Appending new results to existing output files
This makes it efficient for processing large directories incrementally.
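One way to realize this skip logic is a short hash over the filename and size. The helper below is a hedged sketch; the actual ID scheme and function names in the script may differ.

```python
# Hedged sketch: derive a stable file ID from filename + size and skip files
# whose IDs already appear in previous results.
import hashlib
from pathlib import Path


def file_id(path: Path) -> str:
    raw = f"{path.name}:{path.stat().st_size}".encode()
    return hashlib.sha256(raw).hexdigest()[:16]


def files_to_process(files: list[Path], existing_ids: set[str], all_audio: bool) -> list[Path]:
    if all_audio:
        return files  # --all-audio: re-process everything
    return [p for p in files if file_id(p) not in existing_ids]
```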
The script handles various error conditions gracefully:
- Missing audio directory: Creates directory if needed
- Unsupported file formats: Skips with warning
- Corrupted audio files: Continues processing other files
- Model loading errors: Provides clear error messages
- Disk space issues: Fails gracefully with error details
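The pattern behind "continue on bad files, report the problem" is a per-file try/except around the transcription call, roughly like this sketch (function names are illustrative):

```python
# Hedged sketch: keep a batch running when a single file fails, but log the
# error so it can be investigated later.
import logging
from pathlib import Path
from typing import Callable

logger = logging.getLogger(__name__)


def transcribe_batch(files: list[Path], transcribe_one: Callable[[Path], dict]) -> list[dict]:
    results = []
    for path in files:
        try:
            results.append(transcribe_one(path))
        except Exception as exc:  # corrupted audio, decode errors, etc.
            logger.warning("Skipping %s: %s", path.name, exc)
    return results
```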
Using with other tools:
# Process audio and analyze results with DuckDB
uv run python src/transcribe_audio.py --format duckdb
echo "SELECT model_id, AVG(transcription_time_seconds) FROM transcriptions GROUP BY model_id;" | duckdb output/transcribed_audio.duckdb
# Export to CSV for Excel analysis
uv run python src/transcribe_audio.py --format csv
# Open output/transcribed_audio.csv in Excel
# Process and convert to different format
uv run python src/transcribe_audio.py --format parquet
# Use pandas/polars to read parquet file in data science workflows
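For downstream analysis in Python, the Parquet and DuckDB outputs can be read back with the usual libraries. A hedged sketch, using the table and column names shown in the output description above:

```python
# Hedged sketch: loading results for analysis after a transcription run.
import duckdb
import pandas as pd

# Parquet output -> pandas DataFrame
df = pd.read_parquet("output/transcribed_audio.parquet")
print(df[["filename", "transcription_time_seconds"]].head())

# DuckDB output -> SQL query from Python
con = duckdb.connect("output/transcribed_audio.duckdb", read_only=True)
avg_times = con.execute(
    "SELECT model_id, AVG(transcription_time_seconds) AS avg_seconds "
    "FROM transcriptions GROUP BY model_id"
).fetchdf()
print(avg_times)
```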
Monitoring progress:
# Watch processing in real-time
uv run python src/transcribe_audio.py --model whisper-small --format json 2>&1 | tee transcription.log
To use Mistral's Voxtral models for multilingual speech recognition, you need to install additional dependencies.
Voxtral models require the latest development version of the `transformers` library and additional audio processing dependencies.
- Install development transformers (required for Voxtral support):

  uv pip install git+https://github.com/huggingface/transformers

- Install Mistral audio dependencies:

  uv pip install --upgrade "mistral-common[audio]"

- Activate the environment and verify installation:

  source .venv/bin/activate
  python src/transcribe_audio.py --help | grep -A 10 "Available models:"

You should see both Whisper and Voxtral models listed if installation was successful.

Important: You must activate the virtual environment with `source .venv/bin/activate` before testing Voxtral models to ensure proper dependency resolution.

Note: These extra installation steps may become obsolete once Voxtral models are available in a future stable release of Hugging Face transformers.
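A quick Python-level check (in addition to the `--help` grep above) is to try importing the Voxtral model class. The class name below is an assumption based on the development `transformers` API and may change before a stable release.

```python
# Hedged sketch: verify that the installed transformers build exposes Voxtral.
# The class name is an assumption about the development API.
try:
    from transformers import VoxtralForConditionalGeneration  # noqa: F401
    print("Voxtral support detected in transformers.")
except ImportError:
    print("Voxtral classes not found - install transformers from GitHub as shown above.")
```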
Feature | Whisper | Voxtral |
---|---|---|
Languages | 99+ languages | 8 languages (en, es, fr, pt, hi, de, nl, it) |
Model Size | 39MB - 1.5GB | 3B - 24B parameters |
Speed | Fast to moderate | Moderate to slow |
Accuracy | High for English | Very high for supported languages |
Dependencies | Standard transformers | Development transformers + mistral-common |
Use Case | General transcription | Advanced multilingual ASR |
Token Control | Yes (--max-new-tokens) | Yes (--max-new-tokens) |
If Voxtral models don't appear:
- Ensure you installed the development version of transformers
- Check that mistral-common[audio] is properly installed
- Restart your environment after installation
If you get import errors:
# Clean reinstall
uv pip uninstall transformers mistral-common
uv pip install git+https://github.com/huggingface/transformers
uv pip install --upgrade "mistral-common[audio]"
Performance considerations:
- Voxtral models require more GPU memory than Whisper
- Use `voxtral-mini` for faster processing
- Use `voxtral-small` only with sufficient GPU memory (>8GB recommended; see the quick check below)
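To check available GPU memory before committing to `voxtral-small`, a small PyTorch probe like the following can help (threshold taken from the recommendation above):

```python
# Hedged sketch: report available GPU memory before choosing a Voxtral model.
import torch

if torch.cuda.is_available():
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"GPU memory: {total_gb:.1f} GB")
    if total_gb < 8:
        print("Below the ~8 GB recommendation - prefer voxtral-mini or a Whisper model.")
else:
    print("No CUDA GPU detected - Voxtral models will be slow on CPU.")
```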
This project requires Python 3.11 or 3.12 (not 3.13 due to dependency constraints) and the following tools:
- Python 3.11-3.12: Required for compatibility with audio processing dependencies
- uv: Modern Python package manager for dependency management
- just: Command runner for common development tasks
- cmake: Required for building audio processing dependencies
- git: Version control
On macOS/Linux with Homebrew:
brew install just uv cmake
On Windows:
winget install Casey.Just astral-sh.uv
# Install cmake via Visual Studio Build Tools or from cmake.org
On Linux (Ubuntu/Debian):
# Install via Homebrew
brew install just uv
# Install other tools via apt
sudo apt install cmake build-essential pkg-config
git clone <repository-url>
cd audio-transcription
Quick setup (recommended):
just get-started
Manual setup:
# Create virtual environment with correct Python version
uv venv --python 3.12
uv sync
The project uses `uv` for environment management:
# Activate the environment
source .venv/bin/activate
# Or run commands directly with uv
uv run python your_script.py
uv run jupyter lab
Common development commands using `just`:
just lab # Launch Jupyter Lab
just lint-py # Lint Python code
just fmt-python # Format Python code
just fmt-all # Format all code (Python, SQL, Markdown)
just pre-commit-run # Run pre-commit hooks
just test # Run core test suite
just test-cov # Run tests with coverage report
This project includes a comprehensive test suite with 74 passing tests that validate all aspects of the audio transcription functionality.
tests/
├── assets/ # Test audio files for integration tests
│ └── audio/ # Real audio files (add your own)
├── integration/ # Integration tests with real files
│ └── test_real_audio.py # Tests using actual audio files
├── unit/ # Fast unit tests (mocked)
│ ├── test_file_operations.py # ✅ File ID, audio discovery, sizes
│ ├── test_data_formats.py # ✅ CSV, JSON, Parquet, DuckDB ops
│ ├── test_model_loading.py # ✅ Whisper model loading & validation
│ ├── test_transcription.py # ✅ Core transcription with mocks
│ ├── test_cli.py # ✅ CLI argument parsing
│ └── test_error_handling.py # ⚠️ Error handling (3 edge cases)
└── conftest.py # Shared fixtures and test utilities
Quick Testing (Recommended):
# Run core working tests (fast, ~6 seconds)
just test
# Or: uv run python -m pytest
Comprehensive Testing:
# Run all working unit tests
just test-unit
# Run integration tests with real audio files
just test-integration
# Run slow/comprehensive tests
just test-slow
# Run ALL tests (including broken ones for debugging)
just test-all
# Run only broken tests for debugging
just test-broken
Coverage Reports:
# Terminal coverage report
just test-cov
# HTML coverage report (opens in browser)
just test-cov-html
# XML coverage report (for CI)
just test-cov-xml
✅ Working Tests (74 tests):
- File Operations (12 tests): File ID generation, audio discovery, file sizes
- Data Formats (22 tests): Save/load operations for CSV, JSON, Parquet, DuckDB
- Model Loading (13 tests): Whisper model validation, device handling
- Transcription Core (11 tests): Audio processing with mocked models
- CLI Interface (16 tests): Command-line argument parsing and workflow integration
- Error Handling: 3 filesystem permission/error detection edge cases
- Place audio files in `tests/assets/audio/`
- Keep files small (< 1MB each, 1-30 seconds duration)
- Document sources in `tests/assets/README.md`
- Run integration tests: `just test-integration`
Integration tests will skip gracefully if no audio files are present.
- Fast by default: Core tests run in ~6 seconds using mocks
- No model downloads: Uses mocked ML models to avoid heavy downloads
- Graceful skipping: Integration tests skip when real audio files unavailable
- Comprehensive coverage: Tests all output formats and error scenarios
- CI-ready: Provides detailed coverage reports for continuous integration
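The "graceful skipping" behavior is a standard pytest pattern; here is a hedged sketch of what such an integration test could look like (file names and assertions are illustrative, not the actual test code):

```python
# Hedged sketch of an integration test that skips when no real audio exists.
from pathlib import Path

import pytest

AUDIO_DIR = Path("tests/assets/audio")


def test_transcribes_real_audio():
    audio_files = sorted(AUDIO_DIR.glob("*.mp3"))
    if not audio_files:
        pytest.skip("No real audio files available in tests/assets/audio/")
    # A fuller test would run the transcription pipeline and assert on the text.
    assert audio_files[0].stat().st_size > 0
```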
The project enforces high code quality standards through automated tools:
# Format all code (Python, SQL, Markdown)
just fmt-all
# Format only Python code
just fmt-python
# Lint Python code with ruff
just lint-py
# Run all pre-commit hooks
just pre-commit-run
The project uses comprehensive pre-commit hooks that run automatically before each commit:
- File validation: YAML, JSON, TOML syntax checking
- Python validation: `validate-pyproject` for pyproject.toml
- Spell checking: `codespell` with custom ignore list
- Markdown formatting: `markdownlint-fix` with auto-fixing
- Python formatting: `ruff-format` for consistent code style
- Python linting: `ruff-check` with comprehensive rule set
Setup pre-commit hooks:
# Install pre-commit hooks (run once)
uv run pre-commit install
# Run hooks manually on all files
just pre-commit-run
# Update hook versions
just update-reqs
- Line length: 88 characters (follows Black standard)
- Python target: 3.12 compatibility
- Docstrings: Required for public functions and classes
- Type hints: Encouraged but not enforced
- Import sorting: Automatic via ruff
- Formatting: Automatic via ruff (replaces Black)
VS Code (recommended):
- Install Python extension
- Set Python interpreter to `.venv/bin/python` (or `.venv/Scripts/python.exe` on Windows)
- Install Jupyter extension for notebook support
- Install Python Test Explorer for integrated test running
The project includes demonstration notebooks showcasing different transcription models:
- `notebooks/demo_whisper_transcription.ipynb`: Demonstrates audio transcription using OpenAI's Whisper model
  - Uses the `openai/whisper-small` model for faster inference
  - Includes interactive audio players for testing
  - Shows transcription accuracy comparison with original text
- `notebooks/demo_voxtral_transcription.ipynb`: Demonstrates audio transcription using Mistral's Voxtral model
  - Uses the `mistralai/Voxtral-Mini-3B-2507` model
  - Multilingual speech recognition capabilities
  - Interactive audio processing with timing metrics
# Launch Jupyter Lab
just lab
# Or run directly with uv
uv run jupyter lab
Both notebooks include sample audio files and demonstrate:
- Model loading and setup
- Audio file processing with interactive players
- Transcription accuracy comparison
- Performance timing metrics
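The core pattern both notebooks build on is loading an ASR pipeline and feeding it an audio file. A minimal sketch with Hugging Face `transformers` follows; the notebooks add interactive players and timing around this, and the file path here is only an example.

```python
# Minimal sketch: transcribe one file with the Whisper model used in the demo
# notebook (openai/whisper-small). Requires ffmpeg for audio decoding.
import torch
from transformers import pipeline

device = 0 if torch.cuda.is_available() else -1
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",
    device=device,
)

result = asr(
    "audio/sample.mp3",  # example path; any supported audio file works
    generate_kwargs={"language": "en", "max_new_tokens": 400},
)
print(result["text"])
```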
- Fork and clone the repository
- Set up the development environment:

  just get-started
  source .venv/bin/activate

- Install pre-commit hooks:

  uv run pre-commit install

- Create a feature branch:

  git checkout -b feature/your-feature-name

- Make your changes following the code style standards
- Run tests to ensure everything works:

  just test
  just lint-py

- Commit your changes (pre-commit hooks will run automatically):

  git add .
  git commit -m "Add your descriptive commit message"

- Push and create a pull request
When adding new functionality:
- Write tests first (TDD approach recommended)
- Update documentation in README.md and CLAUDE.md
- Ensure code quality by running `just fmt-all` and `just lint-py`
- Add integration tests if working with real audio files
- Update notebooks if the feature affects demo functionality
- Tests pass (`just test`)
- Code is formatted (`just fmt-all`)
- Code is linted (`just lint-py`)
- Documentation is updated
- Pre-commit hooks pass
- Integration tests work (if applicable)
Python 3.13 Issues:
If you encounter errors with `sentencepiece` or other dependencies, ensure you're using Python 3.11 or 3.12:
uv python pin 3.12
uv sync
Missing cmake: Audio dependencies require cmake for compilation. Install via your system package manager.
Voxtral Model Issues:
If you encounter issues with the Voxtral model, ensure you have the correct version of `uv` and that the model is downloaded correctly:
uv sync
uv pip install git+https://github.com/huggingface/transformers
uv pip install --upgrade "mistral-common[audio]"
# Or shortcut
just venv
source .venv/bin/activate # On Linux/macOS
# .venv\Scripts\activate # On Windows