This repository contains a complete pipeline for fine-tuning NVIDIA's NeMo QuartzNet model for Automatic Speech Recognition (ASR) using the LibriSpeech dataset, orchestrated with Valohai.
This project demonstrates how to:
- Download and preprocess the LibriSpeech dataset
- Fine-tune a pre-trained QuartzNet15x5 model
- Evaluate the model's performance
- Orchestrate the entire workflow using Valohai
- `prepare-dataset.py`: Downloads and preprocesses LibriSpeech data
- `train.py`: Fine-tunes the QuartzNet model
- `evaluate.py`: Evaluates model performance using Word Error Rate (WER)
- `valohai.yaml`: Defines the Valohai pipeline and execution steps
- `requirements.txt`: Core Python dependencies
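As a rough sketch of what a step definition in `valohai.yaml` looks like (the image and command lines below are illustrative assumptions, not copied from this repo):

```yaml
- step:
    name: prepare-dataset
    image: python:3.10
    command:
      - pip install -r requirements.txt
      - python prepare-dataset.py {parameters}
```

Each script in the repo maps to one such step, and the pipeline section of the file wires the steps together so outputs of one execution become inputs of the next.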
1. **Prepare Dataset**
   - Downloads a subset of LibriSpeech ("mini" version)
   - Converts FLAC files to WAV
   - Creates manifest files for training, validation, and testing
2. **Train Model**
   - Fine-tunes the pre-trained QuartzNet15x5 model
   - Uses the train and validation manifests
   - Configurable epochs, learning rate, and batch size
3. **Evaluate Model**
   - Calculates Word Error Rate (WER) on the test set
   - Generates predictions and compares them with the ground truth
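NeMo manifests are JSON-lines files in which each line records an audio file's path, its duration in seconds, and its transcript. A minimal sketch of building one entry (the helper name is illustrative; `prepare-dataset.py` may structure this differently):

```python
import json
import wave


def manifest_entry(wav_path: str, transcript: str) -> str:
    """Return one NeMo-style manifest line: JSON with audio path, duration, and text."""
    # Read the duration directly from the WAV header.
    with wave.open(wav_path, "rb") as w:
        duration = w.getnframes() / w.getframerate()
    return json.dumps({
        "audio_filepath": wav_path,
        "duration": round(duration, 3),
        "text": transcript.lower().strip(),  # NeMo transcripts are typically lowercased
    })
```

One such line is written per utterance into `train_manifest.json`, `val_manifest.json`, and `test_manifest.json`.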
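Word Error Rate is the word-level edit distance (substitutions + insertions + deletions) between the reference transcript and the model's hypothesis, divided by the number of reference words. A self-contained sketch of the metric (the actual `evaluate.py` may rely on a library implementation instead):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate via word-level Levenshtein distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

A perfect transcription scores 0.0; a WER above 1.0 is possible when the hypothesis contains many spurious insertions.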
This project uses LibriSpeech, a corpus of approximately 1000 hours of 16kHz read English speech. The pipeline is configured to use the "mini" subset by default, which includes:
- `dev-clean-2`: A small development set
- `train-clean-5`: A small training set (5 hours)

We customized the dataset to include a test set for the evaluation step and used:

- `test-clean`: The standard test set
The pipeline fine-tunes NVIDIA's QuartzNet15x5, a convolutional neural network for speech recognition that achieves state-of-the-art results on LibriSpeech.
This project uses code from NVIDIA's NeMo toolkit, which is licensed under the Apache License 2.0.