Welcome to the repository for "Whisper-LM," an extension to OpenAI's Whisper models that integrates n-gram and large language models (LM) to enhance automatic speech recognition (ASR) performance, particularly for low-resource languages. This repository contains scripts and tools used in our research and can also be adapted for other languages or models.
For those looking to fine-tune Whisper models specifically, we recommend starting with the Whisper Fine-Tuning Event scripts provided by Hugging Face. However, feel free to use your own fine-tuned models with this code.
For users interested in a transformers library-compatible implementation, visit whisper-lm-transformers. That repository reimplements the functionality using the transformers Whisper model, and is more user friendly. Note that results may vary slightly due to minor differences in internal workings.
To reproduce our results, make sure the correct version of Whisper is installed and use the requirements file, which was used with Python 3.8:
pip install -U openai_whisper==20230918
pip install -r requirements.txt
If you need a more customized installation, these are the required packages:
datasets
jiwer
kenlm==0.2.0
librosa
numpy
openai_whisper==20231117
optuna: used by the lm_optimize.py script.
tabulate
torch
tqdm
transformers
Here, we are going to perform a simple transcription using Whisper with an LM.
Start by downloading the desired LM:
wget -O 5gram-eu.bin https://aholab.ehu.eus/~xzuazo/models/Basque%20LMs/5gram.bin
This step is only needed if you want to use a model in Hugging Face format. Such a model needs to be converted back to the OpenAI format before it can be used here.
For example, to use the fine-tuned Tiny-size Whisper model from zuazo/whisper-tiny-eu, convert it to the OpenAI format:
./convert_hf_to_openai.py \
--checkpoint zuazo/whisper-tiny-eu \
--whisper_dump_path zuazo-whisper-tiny-eu.pt
Finally, to perform a simple transcription using the converted model and an LM:
>>> import whisper
>>> # Hack Whisper to support LM and load the options interface to set it up:
>>> from whisper_decoder_with_lm import LMOptions
>>> # Select an audio file:
>>> audio_path = "tests/fixtures/euf_07973_00797482883.mp3"
>>> # Set original Whisper transcription options (this is important):
>>> decode_options = {
... "language": "eu",
... "without_timestamps": True,
... "temperature": 0.0,
... "beam_size": 5,
... }
>>> transcribe_options = {"task": "transcribe", **decode_options}
>>> # Set LM-specific options:
>>> LMOptions().lm_path = "5gram-eu.bin"
>>> LMOptions().lm_alpha = 0.33582369
>>> LMOptions().lm_beta = 0.68825565
>>> # Load the model and transcribe the audio:
>>> model = whisper.load_model("zuazo-whisper-tiny-eu.pt")
>>> result = model.transcribe(audio_path, **transcribe_options)
>>> result["text"]
'Talka diskoetxearekin grabatzen ditut beti, abestien maketak.'
To use a large language model (LLM), there is the llm_path argument, which has exactly the same syntax and is used together with the same lm_alpha and lm_beta parameters. The llm_path argument supports Hugging Face model names:
>>> # Set LLM-specific options:
>>> LMOptions().llm_path = "HiTZ/latxa-7b-v1.2"
>>> LMOptions().lm_alpha = 2.73329396
>>> LMOptions().lm_beta = 0.00178595
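The role of lm_alpha and lm_beta can be pictured with a small sketch. It assumes the common shallow-fusion scoring, where the ASR log-probability is combined with a weighted LM log-probability and a word-count bonus; this formula is an assumption here, not taken from the repository code, and the names fused_score, rerank and toy_lm are hypothetical. The sketch only rescores finished hypotheses, whereas the actual decoder may apply the LM during beam search.

```python
from typing import Callable, List, Tuple


def fused_score(
    asr_logprob: float,
    lm_logprob: float,
    n_words: int,
    lm_alpha: float,
    lm_beta: float,
) -> float:
    """Assumed shallow-fusion score: ASR + alpha * LM + beta * word count."""
    return asr_logprob + lm_alpha * lm_logprob + lm_beta * n_words


def rerank(
    candidates: List[Tuple[str, float]],
    lm_scorer: Callable[[str], float],
    lm_alpha: float,
    lm_beta: float,
) -> str:
    """Return the candidate text with the highest fused score."""

    def score(item: Tuple[str, float]) -> float:
        text, asr_logprob = item
        n_words = len(text.split())
        return fused_score(asr_logprob, lm_scorer(text), n_words, lm_alpha, lm_beta)

    return max(candidates, key=score)[0]


def toy_lm(text: str) -> float:
    """Made-up LM (illustration only) that prefers the word 'kaixo'."""
    return -1.0 if "kaixo" in text else -5.0


# Two hypothetical beam candidates as (text, ASR log-probability) pairs.
beams = [("kaiso mundua", -2.0), ("kaixo mundua", -2.5)]
best = rerank(beams, toy_lm, lm_alpha=0.5, lm_beta=0.1)
print(best)
```

With lm_alpha set to 0 the LM is ignored and the candidate with the higher ASR score wins; a larger lm_alpha lets the LM overrule the acoustic model, and lm_beta offsets the LM's tendency to favor shorter outputs.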
To see a more complete example of how to use an LM with Whisper, check the whisper_evaluate.py script that is used to generate the evaluations in Common Voice, or other datasets hosted in Hugging Face. There is also the whisper_evaluate_external.py script that is used to evaluate the models in datasets outside the Hugging Face Hub.
These are the n-gram language models used in the paper:
- Basque: 5gram-eu.bin
- Galician: 5gram-gl-27M.bin
- Catalan: 5gram-ca-27M.bin
- Spanish: 5gram-es-27M.bin
And these are the large language models:
- Basque: HiTZ/latxa-7b-v1.2
- Galician: proxectonos/Carballo-cerebras-1.3B
- Catalan: projecte-aina/FLOR-6.3B
- Spanish: projecte-aina/FLOR-6.3B
This codebase is structured to facilitate reproduction of our results and to aid others in extending Whisper models with LMs for additional languages. Here's how you can use this repository:
Instructions and scripts for fine-tuning are available here. This process should be done prior to LM integration if using non-English or underrepresented languages.
Feel free to utilize our scripts to generate text corpora:
./lm_corpora_create.sh --lang eu --opusall corpora-eu.txt
and then build language models using KenLM:
make LLANG=eu lm
or create your own KenLM model.
Keep in mind that the quality of the texts used to create the language model considerably affects its effectiveness.
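One rough way to act on this is to filter the corpus before handing it to KenLM. The sketch below is not part of the repository; the function name clean_corpus and its rules (collapse whitespace, drop very short lines, drop exact duplicates) are assumptions chosen for illustration, and the right filters depend on your corpus.

```python
import re
from typing import Iterable, Iterator


def clean_corpus(lines: Iterable[str], min_words: int = 3) -> Iterator[str]:
    """Yield whitespace-normalized, de-duplicated lines with enough words."""
    seen = set()
    for line in lines:
        # Collapse runs of whitespace (including the trailing newline).
        text = re.sub(r"\s+", " ", line).strip()
        # Skip empty or very short lines, which add little n-gram context.
        if len(text.split()) < min_words:
            continue
        # Skip exact duplicates so repeated boilerplate is not over-counted.
        if text in seen:
            continue
        seen.add(text)
        yield text


sample = [
    "Kaixo  mundua, zer moduz?\n",
    "Kaixo mundua, zer moduz?\n",
    "bai\n",
    "\n",
]
cleaned = list(clean_corpus(sample))
print(cleaned)
```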
Optimize the alpha and beta parameters for the LM:
./lm_optimizer.py "zuazo-whisper-tiny-eu.pt" \
--dataset_split "train+validation" \
--dataset_name "eu" \
--language "eu" \
--beam_size 5 \
--lm_path "5gram-eu.bin" \
--n_trials 100 \
--journal_storage \
--n_jobs 32
We can also optimize for a large language model using the --llm_path argument:
./lm_optimizer.py "zuazo-whisper-tiny-eu.pt" \
--dataset_split "train+validation" \
--dataset_name "eu" \
--dataset_shuffle 'True' \
--dataset_n 4000 \
--language "eu" \
--beam_size 5 \
--batch_size 16 \
--llm_path "HiTZ/latxa-7b-v1.2" \
--lm_alpha_min 0 --lm_beta_min 0 \
--lm_alpha_max 3 --lm_beta_max 3 \
--n_trials 100 \
--journal_storage \
--n_jobs 1
In this case, we limit the jobs to 1 per GPU, because both the Whisper model and the LLM are loaded into GPU memory. This was run on 7 NVIDIA A100-SXM4-80GB GPUs.
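The search over lm_alpha and lm_beta can be pictured with a much simpler loop. The sketch below uses plain random search as a stand-in for Optuna, whose samplers are more sophisticated; it is not the optimizer script's implementation. The names random_search and toy_objective are hypothetical, and the toy objective replaces the real step of decoding a validation set and measuring the error. The default ranges mirror the --lm_alpha_min/max and --lm_beta_min/max values shown above.

```python
import random
from typing import Callable, Tuple


def random_search(
    objective: Callable[[float, float], float],
    alpha_range: Tuple[float, float] = (0.0, 3.0),
    beta_range: Tuple[float, float] = (0.0, 3.0),
    n_trials: int = 100,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (alpha, beta, value) for the lowest objective value seen."""
    if n_trials < 1:
        raise ValueError("n_trials must be at least 1")
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        alpha = rng.uniform(*alpha_range)
        beta = rng.uniform(*beta_range)
        value = objective(alpha, beta)
        if best is None or value < best[2]:
            best = (alpha, beta, value)
    return best


def toy_objective(alpha: float, beta: float) -> float:
    """Hypothetical stand-in for 'decode the validation set and compute WER'."""
    return (alpha - 0.3) ** 2 + (beta - 0.7) ** 2


best_alpha, best_beta, best_value = random_search(toy_objective, n_trials=200)
print(f"lm_alpha={best_alpha:.3f} lm_beta={best_beta:.3f} value={best_value:.4f}")
```

In a real run each trial is expensive, since it decodes many utterances, which is why the number of trials and parallel jobs matters.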
Evaluate the performance on standard datasets or your own data:
./whisper_evaluate.py "zuazo-whisper-tiny-eu.pt" \
--dataset "mozilla-foundation/common_voice_13_0" \
--dataset_name "eu" \
--dataset_split "test" \
--language "eu" \
--beam_size 5 \
--lm_path "5gram-eu.bin" \
--lm_alpha 0.33582369 --lm_beta 0.68825565
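ASR evaluations are usually reported as word error rate (WER), and jiwer, listed among the required packages, computes it. For readers unfamiliar with the metric, here is a minimal plain-Python sketch; it is not the evaluation script's code, the function names are hypothetical, and the script may normalize text before scoring, which this sketch does not do.

```python
from typing import List


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion of a reference word
                    curr[j - 1] + 1,     # insertion of a hypothesis word
                    prev[j - 1] + cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1]


def wer(references: List[str], hypotheses: List[str]) -> float:
    """Total word errors divided by total reference words over a corpus."""
    errors = 0
    words = 0
    for ref, hyp in zip(references, hypotheses):
        ref_words = ref.split()
        errors += edit_distance(ref_words, hyp.split())
        words += len(ref_words)
    return errors / words


score = wer(["kaixo mundua zer moduz"], ["kaixo mundu zer moduz"])
print(score)
```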
If the dataset is not in Hugging Face, we can use the whisper_evaluate_external.py script:
./whisper_evaluate_external.py "zuazo-whisper-tiny-eu.pt" \
~/ahomytts \
--language "eu" \
--lm_path "5gram-eu.bin" \
--lm_alpha 0.33582369 --lm_beta 0.68825565 \
--beam_size 5
The dataset is expected to have the transcriptions in *.txt files with the same name as the audio files.
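To check a dataset directory against this layout before running the evaluation, a small helper can pair each audio file with its transcription. This is a sketch, not the script's own loader: the function name iter_pairs, the set of audio extensions, and the recursive directory walk are assumptions.

```python
import tempfile
from pathlib import Path
from typing import Iterator, Tuple

# Assumed audio extensions; adjust to match your dataset.
AUDIO_EXTENSIONS = {".wav", ".mp3", ".flac"}


def iter_pairs(root: Path) -> Iterator[Tuple[Path, str]]:
    """Yield (audio_path, transcription) for audio files with a same-name .txt."""
    for audio in sorted(root.rglob("*")):
        if audio.suffix.lower() not in AUDIO_EXTENSIONS:
            continue
        transcript = audio.with_suffix(".txt")
        if transcript.is_file():
            yield audio, transcript.read_text(encoding="utf-8").strip()


# Demo on a temporary directory with one complete pair and one orphan audio file.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "utt1.wav").write_bytes(b"")
    (root / "utt1.txt").write_text("kaixo mundua\n", encoding="utf-8")
    (root / "utt2.wav").write_bytes(b"")  # no transcription, so it is skipped
    pairs = [(audio.name, text) for audio, text in iter_pairs(root)]

print(pairs)
```

Audio files without a matching .txt file are silently skipped here; for a real check you may prefer to report them.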
The notebooks/ directory contains the code used to generate the tables and plots of the article.
Contributions are welcome! Please refer to CONTRIBUTING.md for guidelines on how to propose improvements, report issues, or submit pull requests.
If you find this helpful in your research, please cite:
@misc{dezuazo2025whisperlmimprovingasrmodels,
title={Whisper-LM: Improving ASR Models with Language Models for Low-Resource Languages},
author={Xabier de Zuazo and Eva Navas and Ibon Saratxaga and Inma Hernáez Rioja},
year={2025},
eprint={2503.23542},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2503.23542},
}
Please check the related preprint, arXiv:2503.23542, for more details.