
OLMoASR: Open Models and Data for Training Robust Speech Recognition Models

This repository illustrates the steps taken to train OLMoASR models, from the initial data processing through model evaluation.

Contents

  - Data
  - Quickstart
  - Available Models
  - Usage
  - Team and Acknowledgements
  - License
  - Citing

Data

Before starting the Quickstart tutorial, you'll need to download the data (audio-transcript pairs) and organize it in the directory structure shown below before moving on to data processing:

shard_00000/
├── pair_id_1/
│   ├── audio_pair_id_1.ext
│   └── transcript_pair_id_1.ext
├── pair_id_2/
│   ├── audio_pair_id_2.ext
│   └── transcript_pair_id_2.ext
├── pair_id_3/
│   ├── audio_pair_id_3.ext
│   └── transcript_pair_id_3.ext
├── pair_id_4/
│   ├── audio_pair_id_4.ext
│   └── transcript_pair_id_4.ext
└── ...

You can download the data from the OLMoASR-Pool dataset on HuggingFace.
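If you prefer to script the download, below is a minimal sketch using the huggingface_hub client. The repository id and shard pattern are assumptions here; check the OLMoASR-Pool dataset page for the exact names and file layout.

from huggingface_hub import snapshot_download

# Pull a single shard of audio-transcript pairs to start with.
# NOTE: the dataset id and shard pattern are assumed -- confirm them on the
# OLMoASR-Pool HuggingFace page before running.
snapshot_download(
    repo_id="allenai/OLMoASR-Pool",
    repo_type="dataset",
    allow_patterns=["shard_00000/*"],
    local_dir="data/raw",
)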

Quickstart

In the following subsections, we'll walk through how to set up the environment, process the data, train a model, and evaluate it.

Setup

To get full functionality, ensure you have Python >= 3.8 and a virtual environment. Then run:

git clone https://github.com/allenai/OLMoASR.git
cd OLMoASR
pip install -r requirements/requirements.txt
pip install -e .

We use ffmpeg for data processing and wandb for training logs, so please make sure both are installed and configured.
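A quick, optional sanity check that both dependencies are reachable from your environment:

import importlib.util
import shutil

# ffmpeg must be on PATH for audio processing; wandb must be importable for logging.
print("ffmpeg:", shutil.which("ffmpeg") or "NOT FOUND")
print("wandb:", "installed" if importlib.util.find_spec("wandb") else "NOT FOUND")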

Data Processing and Filtering

Once you've downloaded and organized your data, follow these steps to process it:

  1. Transform all your transcripts into JSONL format to be suitable for tagging and filtering using scripts/data/processing/text_to_jsonl.py
  2. Segment all your full-length audio files into 30s-long audio chunks using olmoasr/preprocess.py
  3. Perform document-level tagging using scripts/data/filtering/data_tagger.py
  4. Segment transcript files into 30s-long transcript chunks using olmoasr/preprocess.py
  5. Perform segment-level tagging using scripts/data/filtering/data_tagger.py
  6. Perform audio-text language alignment using scripts/data/filtering/assign_audio_lang_data.py, scripts/data/filtering/tag_audio_lang.py and scripts/data/filtering/data_tagger.py
  7. Filter based on a specified configuration of conditions using scripts/data/filtering/process_tagged_data.py
  8. (Optional) Randomly subsample from filtered data mix to get training data

Steps 2 and 3 can be performed concurrently if you have the compute available. Step 6 is technically also a tagging task, but it involves more complex steps than heuristics-based tagging.

Your data should be a JSONL file where each line is in the following format:

{
	"id": <str>, # unique identifier for audio-transcript pair
	"seg_id": <str>, # unique identifier for segment audio-transcript pair
	"subtitle_file": <str>, # path where transcript file is located (segmented/unsegmented depending on which step of data processing you're at)
	"audio_file": <str>, # path where audio file is located (segmented/unsegmented depending on which step of data processing you're at)
	"timestamp": <str>, # start and end times of segment
	"mach_timestamp": <str>, # optional - if you have an associated machine transcript, start and end times of associated machine transcript segment
	"seg_text": <str> , # cleaned text in the transcript segment
	"mach_seg_text": <str>, # optional - cleaned text in machine transcript segment
	"seg_content": <str>, # raw text in the transcript segment
	"mach_seg_content": <str>, # raw text in machine transcript segment
	"edit_dist": <float>, # optional - document-level WER between machine and manually-uploaded transcript
	"seg_edit_dist": <float>, # optional - segment-level WER between machine and manually-uploaded transcript
	"audio_lang": <str>, # language in audio
	"text_lang": <str>, # language in transcript
	"casing": <str>, # optional - dominant casing of transcript
	"repeating_lines": <bool>, # optional - presence of repeating lines in transcript
	"length": <float>, # duration of audio
	"num_words": <int>, # number of words in transcript (cleaned text)
	"seg_num_words": <int> # number of words in transcript segment (cleaned text)
}
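To illustrate how these fields feed the filtering step, here is a minimal sketch that keeps only segments passing a couple of hand-picked conditions. The field choices and threshold are illustrative assumptions, not the released filtering configuration, which is driven by scripts/data/filtering/process_tagged_data.py and its config.

import json

def keep(record: dict) -> bool:
    # Illustrative conditions only -- the real filters come from the
    # configuration passed to scripts/data/filtering/process_tagged_data.py.
    return (
        record.get("audio_lang") == "en"
        and record.get("text_lang") == "en"
        and record.get("seg_edit_dist", 0.0) <= 0.5  # assumed threshold
    )

with open("tagged_segments.jsonl") as src, open("filtered_segments.jsonl", "w") as dst:
    for line in src:
        record = json.loads(line)
        if keep(record):
            dst.write(json.dumps(record) + "\n")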

Training

Once you've processed your data, you're ready to train a model. We use torchrun for distributed training; below is an example of the bash script you'll use to launch it:

# REPLICAS - number of compute nodes
# GPU_COUNT - number of GPUs
# SCRIPT - path to the training script (DDP or FSDP variant)
torchrun --nnodes ${REPLICAS}:${REPLICAS} --nproc_per_node ${GPU_COUNT} ${SCRIPT} \
      --model_variant=${MODEL_SIZE} \ # size of model you're training
      --exp_name=${EXP_NAME} \ # experiment name
      --job_type=${JOB_TYPE} \ # type of job (e.g. debug, filtering, tuning)
      --samples_dicts_dir=${SAMPLES_DICTS_DIR} \ # directory where data lives
      --train_steps=${TRAIN_STEPS} \ # total steps for training
      --epoch_steps=${EPOCH_STEPS} \ # steps for training per epoch
      --ckpt_file_name=None \ # KEEP None, this will be automatically generated
      --ckpt_dir=${CKPT_DIR} \ # where to save the checkpoint
      --log_dir=${LOG_DIR} \ # where to log wandb and other things you want to log
      --eval_dir=${EVAL_DIR} \ # directory where eval datasets live
      --run_id_dir=${RUN_ID_DIR} \ # directory where wandb run_ids are cached
      --lr=${LEARNING_RATE} \ # learning rate
      --betas=${BETAS} \ # beta values
      --eps=${EPS} \ # epsilon value
      --weight_decay=${WEIGHT_DECAY} \ # weight decay value
      --max_grad_norm=${MAX_GRAD_NORM} \ # max clipping grad norm
      --eff_batch_size=${EFFECTIVE_BATCH_SIZE} \ # global batch size (across GPUs)
      --train_batch_size=${BATCH_SIZE} \ # per GPU batch size
      --eval_batch_size=${EVAL_BATCH_SIZE} \ # per GPU batch size for running evals
      --num_workers=${NUM_WORKERS} \ # number of dataloader workers
      --prefetch_factor=${PREFETCH_FACTOR} \ # prefetch factor
      --pin_memory=${PIN_MEMORY} \ # whether to pin memory
      --shuffle=${SHUFFLE} \ # shuffle data in DistributedSampler
      --persistent_workers=${PERSISTENT_WORKERS} \ # whether to have persistent workers
      --run_eval=${RUN_EVAL} \ # whether to run evaluation in training loop
      --train_log_freq=${TRAIN_LOG_FREQ} \ # frequency to log training results to wandb
      --eval_freq=${EVAL_FREQ} \ # frequency to run evaluation in loop
      --ckpt_freq=${CKPT_FREQ} \ # frequency to save checkpoints
      --verbose=${VERBOSE} \ # verbose setting for debugging
      --precision=${PRECISION} \ # precision type
      --hardware=${HARDWARE} \ # type of hardware training on (for efficiency tracking)
      --async_eval=${ASYNC_EVAL} \ # whether to do asynchronous evaluation
      --eval_script_path=${EVAL_SCRIPT_PATH} \ # path to evaluation script (for async eval)
      --eval_wandb_log=${EVAL_WANDB_LOG} \ # whether to log to wandb for evals (for async eval)
      --eval_on_gpu=${EVAL_ON_GPU} # whether to run async eval on GPU or CPU

Note that the inline comments above are for explanation only and should be removed before running: a trailing backslash only continues the command when it is the last character on the line. See configs/job_configs/training for a more detailed guide to the torchrun launch scripts and some example training scripts.
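Before launching, it's also worth checking that the global and per-GPU batch sizes are consistent with your node and GPU counts. A small sketch of the arithmetic, assuming any remaining factor is covered by gradient accumulation inside the training script (an assumption; see the script for the actual behavior):

# Example values -- substitute your own.
replicas = 4            # REPLICAS: number of compute nodes
gpu_count = 8           # GPU_COUNT: GPUs per node
train_batch_size = 8    # BATCH_SIZE: per-GPU batch size
eff_batch_size = 1024   # EFFECTIVE_BATCH_SIZE: global batch size

world_size = replicas * gpu_count
per_step = world_size * train_batch_size
assert eff_batch_size % per_step == 0, "global batch size should be a multiple of the per-step batch"
# Assumed: the training script accumulates gradients to make up the difference.
print(f"world size = {world_size}, per-step batch = {per_step}, grad-accum steps = {eff_batch_size // per_step}")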

Evaluation

To run evaluation, you'll first need to acquire the evaluation sets. With the exception of evaluation sets that must be purchased and the Artie Bias Corpus [1], you can use scripts/eval/get_eval_set.py to download a dataset by passing in its name.

After that, you can use scripts/eval/eval.py to run evaluation. See scripts/eval for more information on the evaluation sets and the other scripts.
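A hedged sketch of driving both scripts from Python; the flag name and dataset identifier below are assumptions, so check each script's --help for the actual interface:

import subprocess

# Hypothetical flag and dataset name -- consult scripts/eval for the real interface.
dataset = "librispeech_test_clean"
subprocess.run(["python", "scripts/eval/get_eval_set.py", "--dataset", dataset], check=True)
subprocess.run(["python", "scripts/eval/eval.py", "--dataset", dataset], check=True)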

Available Models

OLMoASR is a series of ASR models trained on a randomly subsampled version of OLMoASR-Mix, a web-scale, 1M-hour audio-text dataset collected from the public internet. All models perform English short-form and long-form speech recognition and produce sentence-level timestamps.

Model checkpoints can be downloaded from the OLMoASR page on HuggingFace.

Short-form Speech Recognition

All numbers below are word error rate (WER, %); lower is better.

| Dataset | OLMoASR-tiny.en | OLMoASR-base.en | OLMoASR-small.en | OLMoASR-medium.en | OLMoASR-large.en | OLMoASR-large.en-v2 |
|---|---|---|---|---|---|---|
| Librispeech-test.clean | 5.1 | 3.7 | 3.0 | 3.5 | 2.6 | 2.7 |
| Librispeech-test.other | 12.3 | 9.0 | 7.0 | 5.7 | 5.9 | 5.6 |
| TED-LIUM3 | 5.5 | 4.6 | 4.2 | 5.0 | 4.5 | 4.2 |
| WSJ | 5.6 | 4.3 | 3.8 | 3.6 | 3.7 | 3.6 |
| CallHome | 23.9 | 20.5 | 16.7 | 14.3 | 16.5 | 15.0 |
| Switchboard | 18.7 | 14.0 | 13.2 | 12.7 | 12.7 | 11.7 |
| CommonVoice5.1 | 25.1 | 18.5 | 13.1 | 11.3 | 11.1 | 11.1 |
| Artie | 19.3 | 13.6 | 9.6 | 7.5 | 7.9 | 7.8 |
| CORAAL | 25.7 | 21.5 | 19.6 | 18.7 | 18.7 | 18.1 |
| CHiME6 | 45.2 | 38.0 | 30.6 | 28.5 | 30.7 | 29.4 |
| AMI-IHM | 24.2 | 20.4 | 18.7 | 16.9 | 16.4 | 17.1 |
| AMI-SDM | 55.4 | 47.8 | 39.9 | 38.3 | 38.8 | 38.0 |
| VoxPopuli | 11.6 | 9.7 | 8.7 | 8.4 | 8.1 | 8.0 |
| Fleurs | 9.7 | 6.7 | 5.0 | 4.4 | 4.5 | 4.2 |
| Average | 20.5 | 16.6 | 13.8 | 12.8 | 13.0 | 12.6 |

Long-form Speech Recognition

| Dataset | OLMoASR-tiny.en | OLMoASR-base.en | OLMoASR-small.en | OLMoASR-medium.en | OLMoASR-large.en | OLMoASR-large.en-v2 |
|---|---|---|---|---|---|---|
| TED-LIUM3 | 4.8 | 3.9 | 3.6 | 3.3 | 3.5 | 3.6 |
| Meanwhile | 12.6 | 10.2 | 7.4 | 6.9 | 8.8 | 10.0 |
| Kincaid46 | 13.6 | 11.2 | 10.2 | 9.4 | 10.0 | 10.1 |
| Rev16 | 14.0 | 12.0 | 11.5 | 12.5 | 11.5 | 11.1 |
| Earnings-21 | 14.2 | 11.1 | 10.1 | 9.5 | 9.9 | 9.8 |
| Earnings-22 | 20.0 | 15.6 | 14.0 | 13.5 | 13.5 | 13.5 |
| CORAAL | 30.2 | 26.1 | 23.4 | 21.9 | 22.4 | 22.1 |
| Average | 15.6 | 12.9 | 11.5 | 11.0 | 11.4 | 11.5 |

Usage

Currently, only Python usage is supported; CLI support is in development. To transcribe audio, run the code below:

import olmoasr

model = olmoasr.load_model("medium", inference=True)
result = model.transcribe("audio.mp3")
print(result)

The result dictionary conforms to the following JSON schema:
{
  "type": "object",
  "properties": {
    "text": {
      "type": "string"
    },
    "segments": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "id": { "type": "integer" },
          "seek": { "type": "integer" },
          "start": { "type": "number" },
          "end": { "type": "number" },
          "text": { "type": "string" },
          "tokens": {
            "type": "array",
            "items": { "type": "integer" }
          },
          "temperature": { "type": "number" },
          "avg_logprob": { "type": "number" },
          "compression_ratio": { "type": "number" },
          "no_speech_prob": { "type": "number" }
        },
        "required": [
          "id",
          "seek",
          "start",
          "end",
          "text",
          "tokens",
          "temperature",
          "avg_logprob",
          "compression_ratio",
          "no_speech_prob"
        ],
        "additionalProperties": false
      }
    },
    "language": {
      "type": "string"
    }
  },
  "required": ["text", "segments", "language"],
  "additionalProperties": false
}
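For example, the sentence-level timestamps can be read straight out of the result returned above:

# Continuing from the transcription example: print each segment with its timestamps.
for seg in result["segments"]:
    print(f"[{seg['start']:.2f}s -> {seg['end']:.2f}s] {seg['text']}")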

Team and Acknowledgements

Team (* = equal contribution): Huong Ngo, Matt Deitke, Martijn Bartelds, Sarah Pratt, Josh Gardner*, Matt Jordan*, Ludwig Schmidt*

The code was developed with reference to OpenAI's Whisper codebase. We are grateful to Ai2 and UW for resource support, to OpenAI for open-sourcing a portion of their code and making their pretrained checkpoints available, and to Jong Wook Kim for clarifications throughout the project.

License

Citing

Coming soon.

Footnotes

  1. This dataset is no longer available online from its original source. If you'd like a copy of the evaluation set, please visit the OLMoASR HuggingFace page.