Transcribe speech 100x faster and 100x cheaper with Modal and NeMo ASR models.
- Clone this repo
- Install `uv`
- Build the virtual environment: `uv sync`
- Set up your Modal account: `modal setup`
- Add a Modal API token to your environment if necessary: `modal token new`
Any NeMo ASR model should work, though it may be necessary to handle model-specific kwargs to `transcribe`.
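For context, a minimal sketch of direct NeMo usage (not this repo's Modal pipeline) shows where those model-specific kwargs come in. It assumes `nemo_toolkit[asr]` is installed and a local `sample.wav` exists:

```python
# Hedged sketch of direct NeMo ASR usage, assuming `nemo_toolkit[asr]`
# is installed and "sample.wav" exists locally.
import nemo.collections.asr as nemo_asr

# Download (or load from cache) a pretrained checkpoint by its identifier.
model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v2")

# Some models accept or require extra kwargs here (e.g. decoding or timestamp
# options), which is why a new model may need small code changes.
outputs = model.transcribe(["sample.wav"], batch_size=1)
print(outputs[0])
```
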
We've tested the following models:
- `nvidia/parakeet-tdt-0.6b-v2` (default): Hyperfast, English-only transcription
- `nvidia/canary-1b-flash`: Just regular fast multilingual transcription
The first run of each model incurs a small latency cost to download the weights to cache. Subsequent runs load the weights from the Modal Volume `transcription-models`.
First, stage the data (one-time setup) on the Modal Volume `transcription-datasets`:

```bash
modal run -m run::stage_data
```
This downloads audio files from the HuggingFace ESB test subsets: AMI, Earnings22, GigaSpeech, LibriSpeech (clean/other), SPGISpeech, TEDLIUM, VoxPopuli.
Then run batch transcription with the defaults:

```bash
modal run -m run::batch_transcription
```
Or run with arguments:

```bash
modal run -m run::batch_transcription \
  --model_id nvidia/parakeet-tdt-0.6b-v2 \
  --gpu-type L40S \
  --gpu-batch-size 128 \
  --num-requests 25 \
  --job-id my-transcription-job
```
| Argument | Default | Description |
|---|---|---|
| `--model_id` | `nvidia/parakeet-tdt-0.6b-v2` | NeMo ASR model identifier |
| `--gpu-type` | `L40S` | GPU type for transcription function |
| `--gpu-batch-size` | `128` | Number of audio files per GPU batch |
| `--num-requests` | `25` | Number of parallel Modal function calls |
| `--output-path` | `results` | Path for results directory |
| `--job-id` | Auto-generated if not provided | Job identifier |
Results are saved to the Modal Volume `transcription-results` in two formats:

- **Summary** (`/results_summaries/results_summary_{job_id}.csv`): aggregated metrics (WER, RTFX, timing)
- **Detailed** (`/results/{job_id}.csv`): individual transcriptions, ground truth, dataset info
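As an illustration of consuming the detailed CSV, here is a hedged sketch; the column names (`dataset`, `wer`, `audio_seconds`) are assumptions for the example, not the repo's actual schema:

```python
# Hedged sketch of aggregating a detailed results CSV into summary metrics.
# The columns ("dataset", "wer", "audio_seconds") are illustrative assumptions.
import csv
import io

# Stand-in for reading /results/{job_id}.csv from the Volume.
detailed = io.StringIO(
    "dataset,wer,audio_seconds\n"
    "librispeech_clean,2.1,120.0\n"
    "ami,14.8,300.0\n"
)
rows = list(csv.DictReader(detailed))

# Aggregate: mean WER across datasets.
mean_wer = sum(float(r["wer"]) for r in rows) / len(rows)
print(f"mean WER: {mean_wer:.2f}%")  # -> mean WER: 8.45%
```
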
- WER: Word Error Rate (%), calculated on normalized text, per request
- RTFX: real-time factor (audio duration / processing time), per request
- Total Runtime: end-to-end execution time for the whole job
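As a rough illustration of the two metrics (not the repo's exact code), WER and RTFX can be computed like this:

```python
# Hedged sketch of WER and RTFX; the repo's implementation may differ.

def word_error_rate(ref: str, hyp: str) -> float:
    """WER (%) = word-level edit distance / number of reference words."""
    rw, hw = ref.split(), hyp.split()
    # dp[j] holds the edit distance between the first i reference words
    # and the first j hypothesis words.
    dp = list(range(len(hw) + 1))
    for i in range(1, len(rw) + 1):
        prev_diag, dp[0] = dp[0], i
        for j in range(1, len(hw) + 1):
            cur = dp[j]
            dp[j] = min(
                dp[j] + 1,                             # deletion
                dp[j - 1] + 1,                         # insertion
                prev_diag + (rw[i - 1] != hw[j - 1]),  # substitution or match
            )
            prev_diag = cur
    return 100.0 * dp[-1] / max(len(rw), 1)


def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Real-time factor: seconds of audio transcribed per second of compute."""
    return audio_seconds / processing_seconds


print(word_error_rate("a b c", "a x c"))  # one substitution in three words
print(rtfx(3600.0, 36.0))                 # an hour of audio in 36 s -> 100.0
```
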
The `normalizer` module in this repo, used to process text and score WER, is pulled from the HuggingFace ASR Leaderboard.
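To give a flavor of what normalization does before scoring, here is a greatly simplified stand-in; the real HuggingFace ASR Leaderboard normalizer also handles abbreviations, number spelling, and more:

```python
# Greatly simplified stand-in for the ASR Leaderboard text normalizer:
# lowercase, strip punctuation (keeping apostrophes), collapse whitespace.
import re


def normalize(text: str) -> str:
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)  # drop punctuation except apostrophes
    return re.sub(r"\s+", " ", text).strip()


print(normalize("Hello, World!  It's 9 a.m."))  # -> hello world it's 9 a m
```
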