This repository contains code to fine-tune Whisper models on custom datasets. It includes:

- `fine_tune_whisper.py`: The core script for fine-tuning, data preprocessing, training, and saving models.
- `fine_tuning_example.py`: An example script demonstrating how to use `fine_tune_whisper.py` for fine-tuning.
- `basque_fine_tuning.py`: An example script for fine-tuning Whisper on a Basque dataset and evaluating the model's performance.
Install dependencies with:
```bash
pip install -r requirements.txt
```
Update the following parameters at the top of `fine_tuning_example.py` to configure data paths, model selection, and training settings:
```python
# Save directory for checkpoints
SAVE_DIR = "./SAVE_DIR"

# Whisper model variant (options: "openai/whisper-base", "openai/whisper-medium", "openai/whisper-large", "openai/whisper-large-v3")
MODEL_NAME = "openai/whisper-medium"

# Target language (must match dataset and be supported by Whisper)
LANGUAGE = "Basque"

# Training parameters
BATCH_SIZE = 24                # batch size per GPU
MAX_EPOCHS = 30                # total training epochs
ACCUMULATE_GRAD_BATCHES = 24   # gradient accumulation steps
LEARNING_RATE = 1e-4           # optimizer learning rate
```
Whisper supports a fixed set of languages. You can find the full list of supported languages here. If your language is not included, select the closest supported one (e.g., Swahili for Shimaore, German for Alsatian). Alternatively, you may omit the `LANGUAGE` parameter to enable automatic language detection.
Adjust `BATCH_SIZE` and `ACCUMULATE_GRAD_BATCHES` depending on your available GPU memory. Choose a `BATCH_SIZE` that balances memory usage with training stability. If you encounter memory limitations, reduce the batch size and use `ACCUMULATE_GRAD_BATCHES` to accumulate gradients over multiple steps, simulating a larger effective batch size while avoiding memory overflow.
For large models (whisper-large, whisper-large-v3), it is recommended to set `LEARNING_RATE = 1e-5`. Smaller updates help preserve the model's pre-existing knowledge and reduce the risk of overshooting good solutions in the parameter space.
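For example, if you hit out-of-memory errors with the defaults, an adjustment like the one below keeps the effective batch size roughly constant (the numbers are illustrative, not recommendations from this repository):

```python
# Effective batch size = BATCH_SIZE * ACCUMULATE_GRAD_BATCHES * number_of_GPUs.
# With the defaults above on 2 GPUs: 24 * 24 * 2 = 1152 samples per optimizer step.
BATCH_SIZE = 12                # halve the per-GPU batch to reduce memory usage...
ACCUMULATE_GRAD_BATCHES = 48   # ...and double the accumulation: 12 * 48 * 2 = 1152
LEARNING_RATE = 1e-5           # lower learning rate when fine-tuning whisper-large / large-v3
```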
You can load your dataset using one of the following two methods:
You can directly use Common Voice datasets available on Hugging Face. This method allows you to load and preprocess the data easily.
Example usage:
```python
# load and preprocess data
raw_dataset, processed_dataset = load_and_preprocess_data(
    "mozilla-foundation/common_voice_17_0",  # Dataset name on Hugging Face
    language="eu",                           # ISO 639-1 language code
    use_huggingface=True,
    token="hf_token",                        # Your Hugging Face access token
    feature_extractor=feature_extractor,
    tokenizer=tokenizer
)
```
If you prefer to use a custom dataset from a local directory, organize your files so that each audio file (.wav or .mp3) has a corresponding .txt transcription file with the same name. Directory structure example:
```
your_dataset/
├── sample_001.wav
├── sample_001.txt
├── sample_002.wav
├── sample_002.txt
└── ...
```
To load and preprocess the data:
```python
raw_dataset, processed_dataset = load_and_preprocess_data(path="your_dataset/")
```
Audio file requirements:

- Maximum duration: 30 seconds per clip
- Sampling rate: 16 kHz (audio will be automatically resampled if necessary)
To comply with Whisper's 30-second audio limit, make sure your audio files are segmented and aligned with their transcriptions. This can be done with forced-alignment tools such as Wav2Vec2 Forced Alignment or the Montreal Forced Aligner (MFA).
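Before training, it can help to verify clip lengths and sampling rates yourself. The sketch below uses torchaudio, which is not necessarily a dependency of this repository; it only illustrates the 30-second and 16 kHz constraints described above:

```python
import torchaudio

def check_and_resample(path: str, target_sr: int = 16_000, max_seconds: float = 30.0):
    """Warn about clips longer than Whisper's 30 s limit and resample to 16 kHz."""
    waveform, sr = torchaudio.load(path)
    duration = waveform.shape[1] / sr
    if duration > max_seconds:
        print(f"{path}: {duration:.1f}s exceeds the {max_seconds:.0f}s limit; segment it before training")
    if sr != target_sr:
        waveform = torchaudio.functional.resample(waveform, orig_freq=sr, new_freq=target_sr)
    return waveform, target_sr
```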
During preprocessing, all transcriptions are automatically:
- Fully lowercased
- Stripped of all punctuation
This normalization ensures the model focuses solely on phonetic and linguistic content rather than casing or punctuation.
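The exact preprocessing lives in `fine_tune_whisper.py`; the snippet below is only a sketch of the behavior described above (lowercasing and punctuation removal):

```python
import string

def normalize_transcription(text: str) -> str:
    """Illustrative normalization: lowercase the text and strip punctuation."""
    text = text.lower()
    return text.translate(str.maketrans("", "", string.punctuation)).strip()

print(normalize_transcription("Kaixo, Mundua!"))  # -> "kaixo mundua"
```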
Once parameters and data are configured, launch training with:
```bash
python fine_tuning_example.py
```
✅ The best-performing model will be automatically saved as `model.pt` in the specified `SAVE_DIR`.
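To use the fine-tuned checkpoint afterwards, something along these lines should work, assuming `model.pt` holds a state dict compatible with Hugging Face's `WhisperForConditionalGeneration` (check `fine_tune_whisper.py` for the exact save format):

```python
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# Assumption: model.pt is a torch-saved state_dict; adapt if the script saves a full model object.
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-medium")
model.load_state_dict(torch.load("./SAVE_DIR/model.pt", map_location="cpu"))

processor = WhisperProcessor.from_pretrained(
    "openai/whisper-medium", language="Basque", task="transcribe"
)
```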
We have focused on a selection of low-resource languages spoken in France, including:
- Basque
- Alsatian
- Shimaore
The table below summarizes the results obtained after fine-tuning Whisper models on these languages. Key metrics include the size of the training data, the duration of fine-tuning, and transcription performance. The Whisper Language Setting column shows the language setting used in Whisper for each case.
| Data language | Whisper Language Setting | Base Model | Training Data (hrs) | Training Duration | WER (%) | CER (%) |
|---|---|---|---|---|---|---|
| Basque | Basque | whisper-medium | 116h20 | 13h08 | 8.7 | 1.6 |
| Alsatian | German | whisper-large-v3 | 9h36 | 18min | 47.2 | 26.1 |
| Shimaore | Swahili | whisper-large-v3 | 1h28 | 5min | 69.2 | 29.1 |
Experiments were conducted using two Nvidia Tesla L40S GPUs, each with 45 GiB of memory.
For Basque, the model was fine-tuned using the Common Voice Basque dataset, while private datasets were used for the other languages.
To reproduce the Basque results, run:
```bash
python basque_fine_tuning.py
```
This script fine-tunes the Whisper model on the Basque dataset and displays the model's performance metrics (WER and CER) upon completion.
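For reference, WER and CER can be computed with the `jiwer` library; this is only an illustration of the reported metrics, and the repository's own evaluation code may differ:

```python
import jiwer

# Toy example: one reference transcription and one model hypothesis.
reference = ["kaixo mundua"]
hypothesis = ["kaixo munduak"]

print(f"WER: {jiwer.wer(reference, hypothesis):.2%}")  # word error rate
print(f"CER: {jiwer.cer(reference, hypothesis):.2%}")  # character error rate
```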
We are actively working on expanding support to additional regional and low-resource languages.