This repository contains a speech-to-text and language identification system built in PyTorch. It uses a mixture-of-experts architecture that combines transformer, RNN and CNN branches, supports various audio and video formats, and works with multiple languages and dialects.

The project started as a pet project for picking up PyTorch skills through speech recognition. The model was trained on private data comprising thousands of hours of speech. I published it for educational purposes and, hopefully, to help someone with captioning, transcription or speech analysis.
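The gist of the mixture can be sketched as follows: a small gating network weights the outputs of the three branches and combines them into a single representation. This is a simplified illustration with hypothetical module and parameter names, not the actual implementation in model/mixture_of_experts.py.

```python
import torch
import torch.nn as nn

class SimpleMoEHead(nn.Module):
    """Illustrative mixture-of-experts head: a gating network produces
    per-expert weights and the expert outputs are combined as a weighted sum.
    Sketch only -- the real model lives in model/mixture_of_experts.py."""

    def __init__(self, feature_dim: int, num_experts: int = 3):
        super().__init__()
        self.gate = nn.Linear(feature_dim, num_experts)

    def forward(self, expert_outputs: list[torch.Tensor]) -> torch.Tensor:
        # expert_outputs: list of (batch, time, feature_dim) tensors,
        # e.g. from the transformer, RNN and CNN branches
        stacked = torch.stack(expert_outputs, dim=-2)          # (B, T, E, D)
        pooled = stacked.mean(dim=-2)                          # (B, T, D) summary for gating
        weights = torch.softmax(self.gate(pooled), dim=-1)     # (B, T, E)
        return (weights.unsqueeze(-1) * stacked).sum(dim=-2)   # (B, T, D)

# Usage with dummy branch outputs
if __name__ == "__main__":
    branches = [torch.randn(2, 50, 256) for _ in range(3)]
    moe = SimpleMoEHead(feature_dim=256)
    print(moe(branches).shape)  # torch.Size([2, 50, 256])
```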
- Mixture architecture: combines transformer, RNN and CNN branches in a mixture-of-experts model
- Multi-language support
- Configuration: easily configure audio properties (frequency, sampling rate, channels, encoding) and input/output settings through a YAML config
- Advanced audio processing: supports feature extraction (spectrograms, mel-spectrograms) and optional augmentation
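As an illustration of the kind of feature extraction involved, a mel-spectrogram can be computed with torchaudio. This is a minimal sketch with placeholder parameter values; it is not the code in utils/audio_processing.py.

```python
import torch
import torchaudio

# Load an audio file and resample to a target rate (placeholder values)
waveform, sample_rate = torchaudio.load("path/to/audio/file.wav")
target_rate = 16000
if sample_rate != target_rate:
    waveform = torchaudio.functional.resample(waveform, sample_rate, target_rate)

# Compute a log-mel-spectrogram as model input features
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=target_rate, n_fft=400, hop_length=160, n_mels=80
)(waveform)
log_mel = torch.log(mel + 1e-6)
print(log_mel.shape)  # (channels, n_mels, frames)
```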
```
.
├── README.md
├── requirements.txt
├── config.yaml
├── train.py                 # training entry point
├── infer.py                 # inference script for captioning and transcription
├── data_loader.py           # custom dataset and data loader definitions
├── model/                   # contains all model definitions
│   ├── __init__.py
│   ├── base_model.py
│   ├── transformer_module.py
│   ├── rnn_module.py
│   ├── cnn_module.py
│   └── mixture_of_experts.py
├── utils/                   # utility modules for logging, config management, audio processing and metrics
│   ├── logger.py
│   ├── config.py
│   ├── audio_processing.py
│   └── metrics.py
└── tests/                   # unit tests for model, data loader and utilities
    ├── test_model.py
    ├── test_data_loader.py
    └── test_utils.py
```
- Clone:

  ```bash
  git clone git@github.com:avrtt/MoE-speech-recognition.git
  cd MoE-speech-recognition
  ```

- Create and activate a virtual environment:

  ```bash
  python -m venv venv
  source venv/bin/activate  # Windows: venv\Scripts\activate
  ```

- Install the required packages:

  ```bash
  pip install -r requirements.txt
  ```
Edit the `config.yaml` file to adjust model hyperparameters, training settings, data paths and audio processing parameters.
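For illustration only, such a config might look roughly like the snippet below; the key names and values are hypothetical placeholders, not the actual schema expected by the project.

```yaml
# Illustrative example only -- keys and values are placeholders,
# not the actual schema read by utils/config.py.
data:
  train_path: data/train
  val_path: data/val
audio:
  sample_rate: 16000
  channels: 1
  n_mels: 80
model:
  num_experts: 3
  hidden_dim: 256
training:
  batch_size: 16
  epochs: 50
  learning_rate: 0.0003
  checkpoint_dir: checkpoints/
```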
Place your speech data in the `data/` folder, then start training:
```bash
python train.py --config config.yaml
```
This will parse your configuration, load your dataset, initialize the mixture-of-experts model and start the training loop. Model weights will be saved to the specified checkpoint directory.
After training, use the inference script to transcribe audio files:
```bash
python infer.py --config config.yaml --audio_path path/to/audio/file.wav
```
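If you need to transcribe many files, the same command can be looped over a folder (this assumes `infer.py` handles one file per call, as documented above):

```bash
# Transcribe every WAV file in the data/ folder, one invocation per file
for f in data/*.wav; do
  python infer.py --config config.yaml --audio_path "$f"
done
```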
Run the unit tests to verify functionality:
```bash
pytest tests/
```
MIT