Transcribe speech 100x faster and 100x cheaper with Modal and NeMo ASR models.
- Clone this repo
- Install `uv`
- Build the virtual environment: `uv sync`
- Set up your Modal account: `modal setup`
- Add a Modal API token to your environment if necessary: `modal token new`
Any NeMo ASR model should work, though it may be necessary to handle model-specific kwargs to `transcribe`.
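For context, a minimal sketch of direct NeMo usage (not this repo's Modal pipeline) shows where those model-specific kwargs come in. It assumes `nemo_toolkit[asr]` is installed and a local `sample.wav` exists:

```python
# Hedged sketch of direct NeMo ASR usage, assuming `nemo_toolkit[asr]`
# is installed and "sample.wav" exists locally.
import nemo.collections.asr as nemo_asr

# Download (or load from cache) a pretrained checkpoint by its identifier.
model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v2")

# Some models accept or require extra kwargs here (e.g. decoding or timestamp
# options), which is why a new model may need small code changes.
outputs = model.transcribe(["sample.wav"], batch_size=1)
print(outputs[0])
```
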
We've tested the following models:
- `nvidia/parakeet-tdt-0.6b-v2` (default): Hyperfast, English-only transcription
- `nvidia/canary-1b-flash`: Just regular fast multilingual transcription
The first run of each model incurs a small latency cost to download the weights to cache. Subsequent runs load the weights from the Modal Volume `transcription-models`.
First, stage the data (one-time setup) on the Modal Volume `transcription-datasets`:

```bash
modal run -m run::stage_data
```
This downloads audio files from the HuggingFace ESB test subsets: AMI, Earnings22, GigaSpeech, LibriSpeech (clean/other), SPGISpeech, TEDLIUM, VoxPopuli.
Then run batch transcription with the defaults:

```bash
modal run -m run::batch_transcription
```
Or run with arguments:

```bash
modal run -m run::batch_transcription \
  --model_id nvidia/parakeet-tdt-0.6b-v2 \
  --gpu-type L40S \
  --gpu-batch-size 128 \
  --num-requests 25 \
  --job-id my-transcription-job
```
| Argument | Default | Description |
|---|---|---|
| `--model_id` | `nvidia/parakeet-tdt-0.6b-v2` | NeMo ASR model identifier |
| `--gpu-type` | `L40S` | GPU type for transcription function |
| `--gpu-batch-size` | `128` | Number of audio files per GPU batch |
| `--num-requests` | `25` | Number of parallel Modal function calls |
| `--output-path` | `results` | Path for results directory |
| `--job-id` | Auto-generated if not provided | Job identifier |
Results are saved to the Modal Volume `transcription-results` in two formats:

- **Summary** (`/results_summaries/results_summary_{job_id}.csv`): aggregated metrics (WER, RTFX, timing)
- **Detailed** (`/results/{job_id}.csv`): individual transcriptions, ground truth, dataset info
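As an illustration of consuming the detailed CSV, here is a hedged sketch; the column names (`dataset`, `wer`, `audio_seconds`) are assumptions for the example, not the repo's actual schema:

```python
# Hedged sketch of aggregating a detailed results CSV into summary metrics.
# The columns ("dataset", "wer", "audio_seconds") are illustrative assumptions.
import csv
import io

# Stand-in for reading /results/{job_id}.csv from the Volume.
detailed = io.StringIO(
    "dataset,wer,audio_seconds\n"
    "librispeech_clean,2.1,120.0\n"
    "ami,14.8,300.0\n"
)
rows = list(csv.DictReader(detailed))

# Aggregate: mean WER across datasets.
mean_wer = sum(float(r["wer"]) for r in rows) / len(rows)
print(f"mean WER: {mean_wer:.2f}%")  # -> mean WER: 8.45%
```
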
- WER: Word Error Rate (%), calculated on normalized text, per request
- RTFX: real-time factor (audio duration / processing time), per request
- Total Runtime: end-to-end execution time for the whole job
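As a rough illustration of the two metrics (not the repo's exact code), WER and RTFX can be computed like this:

```python
# Hedged sketch of WER and RTFX; the repo's implementation may differ.

def word_error_rate(ref: str, hyp: str) -> float:
    """WER (%) = word-level edit distance / number of reference words."""
    rw, hw = ref.split(), hyp.split()
    # dp[j] holds the edit distance between the first i reference words
    # and the first j hypothesis words.
    dp = list(range(len(hw) + 1))
    for i in range(1, len(rw) + 1):
        prev_diag, dp[0] = dp[0], i
        for j in range(1, len(hw) + 1):
            cur = dp[j]
            dp[j] = min(
                dp[j] + 1,                             # deletion
                dp[j - 1] + 1,                         # insertion
                prev_diag + (rw[i - 1] != hw[j - 1]),  # substitution or match
            )
            prev_diag = cur
    return 100.0 * dp[-1] / max(len(rw), 1)


def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Real-time factor: seconds of audio transcribed per second of compute."""
    return audio_seconds / processing_seconds


print(word_error_rate("a b c", "a x c"))  # one substitution in three words
print(rtfx(3600.0, 36.0))                 # an hour of audio in 36 s -> 100.0
```
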
The `normalizer` module in this repo, used to process text and score WER, is pulled from the HuggingFace ASR Leaderboard.
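To give a flavor of what normalization does before scoring, here is a greatly simplified stand-in; the real HuggingFace ASR Leaderboard normalizer also handles abbreviations, number spelling, and more:

```python
# Greatly simplified stand-in for the ASR Leaderboard text normalizer:
# lowercase, strip punctuation (keeping apostrophes), collapse whitespace.
import re


def normalize(text: str) -> str:
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)  # drop punctuation except apostrophes
    return re.sub(r"\s+", " ", text).strip()


print(normalize("Hello, World!  It's 9 a.m."))  # -> hello world it's 9 a m
```
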