This repository contains a speech-to-text and language identification system built in PyTorch. It uses a mixture-of-experts architecture that combines transformer, RNN and CNN branches, supports various audio and video formats, and works with multiple languages and dialects.

The project started as a pet project for picking up PyTorch skills through speech recognition. The model was trained on private data comprising thousands of hours of speech. I published it for educational purposes and, hopefully, to help someone with captioning, transcription or speech analysis.
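The gist of the mixture can be sketched as follows: a small gating network weights the outputs of the three branches and combines them into a single representation. This is a simplified illustration with hypothetical module and parameter names, not the actual implementation in model/mixture_of_experts.py.

```python
import torch
import torch.nn as nn

class SimpleMoEHead(nn.Module):
    """Illustrative mixture-of-experts head: a gating network produces
    per-expert weights and the expert outputs are combined as a weighted sum.
    Sketch only -- the real model lives in model/mixture_of_experts.py."""

    def __init__(self, feature_dim: int, num_experts: int = 3):
        super().__init__()
        self.gate = nn.Linear(feature_dim, num_experts)

    def forward(self, expert_outputs: list[torch.Tensor]) -> torch.Tensor:
        # expert_outputs: list of (batch, time, feature_dim) tensors,
        # e.g. from the transformer, RNN and CNN branches
        stacked = torch.stack(expert_outputs, dim=-2)          # (B, T, E, D)
        pooled = stacked.mean(dim=-2)                          # (B, T, D) summary for gating
        weights = torch.softmax(self.gate(pooled), dim=-1)     # (B, T, E)
        return (weights.unsqueeze(-1) * stacked).sum(dim=-2)   # (B, T, D)

# Usage with dummy branch outputs
if __name__ == "__main__":
    branches = [torch.randn(2, 50, 256) for _ in range(3)]
    moe = SimpleMoEHead(feature_dim=256)
    print(moe(branches).shape)  # torch.Size([2, 50, 256])
```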
- Mixture architecture: combines transformer, RNN and CNN branches in a mixture-of-experts model
- Multi-language support
- Configuration: easily configure audio properties (frequency, sampling rate, channels, encoding) and input/output settings through a YAML config
- Advanced audio processing: supports feature extraction (spectrograms, mel-spectrograms) and optional augmentation
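As an illustration of the kind of feature extraction involved, a mel-spectrogram can be computed with torchaudio. This is a minimal sketch with placeholder parameter values; it is not the code in utils/audio_processing.py.

```python
import torch
import torchaudio

# Load an audio file and resample to a target rate (placeholder values)
waveform, sample_rate = torchaudio.load("path/to/audio/file.wav")
target_rate = 16000
if sample_rate != target_rate:
    waveform = torchaudio.functional.resample(waveform, sample_rate, target_rate)

# Compute a log-mel-spectrogram as model input features
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=target_rate, n_fft=400, hop_length=160, n_mels=80
)(waveform)
log_mel = torch.log(mel + 1e-6)
print(log_mel.shape)  # (channels, n_mels, frames)
```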
```
.
├── README.md
├── requirements.txt
├── config.yaml
├── train.py                 # training entry point
├── infer.py                 # inference script for captioning and transcription
├── data_loader.py           # custom dataset and data loader definitions
├── model/                   # contains all model definitions
│   ├── __init__.py
│   ├── base_model.py
│   ├── transformer_module.py
│   ├── rnn_module.py
│   ├── cnn_module.py
│   └── mixture_of_experts.py
├── utils/                   # utility modules for logging, config management, audio processing and metrics
│   ├── logger.py
│   ├── config.py
│   ├── audio_processing.py
│   └── metrics.py
└── tests/                   # unit tests for model, data loader and utilities
    ├── test_model.py
    ├── test_data_loader.py
    └── test_utils.py
```
- Clone:

  ```bash
  git clone git@github.com:avrtt/MoE-speech-recognition.git
  cd MoE-speech-recognition
  ```

- Create and activate a virtual environment:

  ```bash
  python -m venv venv
  source venv/bin/activate  # Windows: venv\Scripts\activate
  ```

- Install the required packages:

  ```bash
  pip install -r requirements.txt
  ```
Edit the `config.yaml` file to adjust model hyperparameters, training settings, data paths and audio processing parameters.
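For illustration only, such a config might look roughly like the snippet below; the key names and values are hypothetical placeholders, not the actual schema expected by the project.

```yaml
# Illustrative example only -- keys and values are placeholders,
# not the actual schema read by utils/config.py.
data:
  train_path: data/train
  val_path: data/val
audio:
  sample_rate: 16000
  channels: 1
  n_mels: 80
model:
  num_experts: 3
  hidden_dim: 256
training:
  batch_size: 16
  epochs: 50
  learning_rate: 0.0003
  checkpoint_dir: checkpoints/
```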
Place your speech data in the `data/` folder, then start training:
```bash
python train.py --config config.yaml
```
This will parse your configuration, load your dataset, initialize the mixture-of-experts model and start the training loop. Model weights will be saved to the specified checkpoint directory.
After training, use the inference script to transcribe audio files:
```bash
python infer.py --config config.yaml --audio_path path/to/audio/file.wav
```
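If you need to transcribe many files, the same command can be looped over a folder (this assumes `infer.py` handles one file per call, as documented above):

```bash
# Transcribe every WAV file in the data/ folder, one invocation per file
for f in data/*.wav; do
  python infer.py --config config.yaml --audio_path "$f"
done
```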
Run the unit tests to verify functionality:
```bash
pytest tests/
```
MIT