avrtt/MoE-speech-recognition

This repository contains a speech-to-text and language identification system built in PyTorch around a mixture-of-experts architecture that combines transformer, RNN and CNN experts. It supports a variety of audio and video formats and works with multiple languages and dialects.
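The gist of the mixture is easy to sketch: a learned gate scores each expert branch, and the model output is the gate-weighted sum of the branch outputs. The snippet below is a minimal illustration of that pattern, not the repository's actual model (that lives in model/mixture_of_experts.py); all class names and dimensions are made up for the example.

    import torch
    import torch.nn as nn

    class TinyMoE(nn.Module):
        """Minimal mixture-of-experts sketch: a gate weights three expert branches."""
        def __init__(self, dim=80):
            super().__init__()
            self.experts = nn.ModuleList([
                nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),  # transformer branch
                nn.GRU(dim, dim, batch_first=True),                                  # RNN branch
                nn.Sequential(nn.Conv1d(dim, dim, 3, padding=1), nn.ReLU()),         # CNN branch
            ])
            self.gate = nn.Linear(dim, 3)  # one score per expert

        def forward(self, x):  # x: (batch, frames, dim)
            outs = [
                self.experts[0](x),
                self.experts[1](x)[0],                               # GRU returns (output, h_n)
                self.experts[2](x.transpose(1, 2)).transpose(1, 2),  # Conv1d expects (batch, dim, frames)
            ]
            weights = torch.softmax(self.gate(x.mean(dim=1)), dim=-1)  # (batch, 3)
            stacked = torch.stack(outs, dim=-1)                        # (batch, frames, dim, 3)
            return (stacked * weights[:, None, None, :]).sum(dim=-1)

    moe = TinyMoE()
    print(moe(torch.randn(2, 100, 80)).shape)  # torch.Size([2, 100, 80])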

The project started out as a pet project for picking up PyTorch through speech recognition. The model was trained on private data comprising thousands of hours of speech. It is published for educational purposes, in the hope that it helps with captioning, transcription or speech analysis.

Features

  • Mixture-of-experts architecture: combines transformer, RNN and CNN expert branches in a single model
  • Multi-language support
  • Configuration: audio properties (frequency, sampling rate, channels, encoding) and input/output settings are set through a YAML config
  • Advanced audio processing: feature extraction (spectrograms, mel-spectrograms) and optional augmentation; see the sketch after this list
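
Mel-spectrogram extraction of the kind the audio pipeline performs can be done with torchaudio. This is a sketch with assumed parameter values; the actual values belong in config.yaml.

    import torch
    import torchaudio

    # Assumed parameters for illustration; the real values come from config.yaml.
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=16000, n_fft=400, hop_length=160, n_mels=80
    )
    to_db = torchaudio.transforms.AmplitudeToDB()

    waveform = torch.randn(1, 16000)   # one second of fake mono audio
    features = to_db(mel(waveform))    # (channels, n_mels, frames)
    print(features.shape)              # torch.Size([1, 80, 101])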

Project structure

.
├── README.md
├── requirements.txt
├── config.yaml
├── train.py # training entry point
├── infer.py # inference script for captioning and transcription
├── data_loader.py # custom dataset and data loader definitions
├── model/ # contains all model definitions
│   ├── __init__.py
│   ├── base_model.py
│   ├── transformer_module.py
│   ├── rnn_module.py
│   ├── cnn_module.py
│   └── mixture_of_experts.py
├── utils/ # utility modules for logging, config management, audio processing and metrics
│   ├── logger.py
│   ├── config.py
│   ├── audio_processing.py
│   └── metrics.py
└── tests/ # unit tests for model, data loader and utilities
    ├── test_model.py
    ├── test_data_loader.py
    └── test_utils.py

Installation

  1. Clone:

    git clone git@github.com:avrtt/MoE-speech-recognition.git
    cd MoE-speech-recognition
  2. Create and activate a virtual environment:

    python -m venv venv
    source venv/bin/activate # Windows: venv\Scripts\activate
  3. Install the required packages:

    pip install -r requirements.txt

Configuration

Edit the config.yaml file to adjust model hyperparameters, training settings, data paths and audio processing parameters.
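
The key names below are hypothetical and only illustrate the kind of settings grouped in the file; the shipped config.yaml defines the real schema.

    # Hypothetical layout, not the repository's actual schema
    audio:
      sample_rate: 16000
      channels: 1
      n_mels: 80
    training:
      batch_size: 16
      learning_rate: 0.0003
      checkpoint_dir: checkpoints/
    data:
      train_path: data/train
      val_path: data/val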

Usage

Training

Place your speech data in the data/ folder. Then, start training:

python train.py --config config.yaml

This will parse your configuration, load your dataset, initialize the mixture-of-experts model and start the training loop. Model weights will be saved to the specified checkpoint directory.
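
In outline, the loop follows the standard PyTorch recipe. Below is a minimal sketch with a stand-in model and random tensors; the actual script builds the dataset from data_loader.py and the model from model/, driven by config.yaml.

    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader, TensorDataset

    # Stand-in model and random data; train.py builds these from config.yaml instead.
    model = nn.Linear(80, 32)
    data = TensorDataset(torch.randn(64, 80), torch.randint(0, 32, (64,)))
    loader = DataLoader(data, batch_size=16, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
    loss_fn = nn.CrossEntropyLoss()

    for epoch in range(2):
        for features, labels in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(features), labels)
            loss.backward()
            optimizer.step()
        torch.save(model.state_dict(), "checkpoint.pt")  # hypothetical checkpoint path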

Inference

After training, use the inference script to transcribe audio files:

python infer.py --config config.yaml --audio_path path/to/audio/file.wav
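
infer.py wraps the whole pipeline; for programmatic use, the usual PyTorch inference pattern looks roughly like this. The stand-in model, checkpoint path and greedy decode are illustrative assumptions, not the repository's confirmed API.

    import torch
    import torch.nn as nn

    # Stand-in model for illustration; the real script restores the trained
    # mixture-of-experts checkpoint instead.
    model = nn.Linear(80, 32)
    # model.load_state_dict(torch.load("checkpoints/best.pt"))  # hypothetical path
    model.eval()

    features = torch.randn(1, 120, 80)   # (batch, frames, mel bins)
    with torch.no_grad():
        logits = model(features)         # (batch, frames, vocab size)
        tokens = logits.argmax(dim=-1)   # greedy per-frame decoding
    print(tokens.shape)                  # torch.Size([1, 120])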

Testing

Run the unit tests to verify functionality:

pytest tests/
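
As a flavour of what such tests check, a pytest-style shape test might look like this (illustrative only; the shipped tests live in tests/):

    import torch

    def test_output_shape():
        model = torch.nn.Linear(80, 32)  # stand-in for the real model
        x = torch.randn(4, 80)
        assert model(x).shape == (4, 32)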

License

MIT
