End-to-end speech-to-text system using a custom Wav2Vec-style model built in PyTorch, with CTC decoding for transcription from raw waveform input.
This project implements a simplified Wav2Vec architecture for automatic speech recognition (ASR). It processes raw audio waveforms and produces text transcriptions using:
- A convolutional feature encoder
- A Transformer-based context network
- A CTC projection head with log-softmax output
Raw Audio → FeatureEncoder (CNN) → ContextNetwork (Transformer) → CTC Head → Transcription
FeatureEncoder
Extracts latent representations from raw waveforms using 1D convolutions. Downsamples and projects the input into compact feature maps.
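A minimal sketch of what such an encoder might look like; the layer count, kernel sizes, strides, and the `d_model` width are illustrative assumptions, not the project's actual configuration:

```python
import torch
import torch.nn as nn

class FeatureEncoder(nn.Module):
    """CNN feature encoder: raw waveform (B, samples) -> latent frames (B, T', d_model)."""

    def __init__(self, d_model: int = 512):
        super().__init__()
        # Strided 1D convolutions downsample in time while projecting to d_model channels.
        self.conv = nn.Sequential(
            nn.Conv1d(1, d_model, kernel_size=10, stride=5),
            nn.GELU(),
            nn.Conv1d(d_model, d_model, kernel_size=8, stride=4),
            nn.GELU(),
            nn.Conv1d(d_model, d_model, kernel_size=4, stride=2),
            nn.GELU(),
        )

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        x = self.conv(waveform.unsqueeze(1))  # add channel dim -> (B, d_model, T')
        return x.transpose(1, 2)              # (B, T', d_model)
```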
ContextNetwork
A stack of Transformer encoder layers that capture long-range dependencies and contextual information in the latent feature sequence.
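A sketch built on PyTorch's `nn.TransformerEncoder`; the layer count, head count, and feed-forward width are placeholder values rather than the project's actual hyperparameters:

```python
import torch
import torch.nn as nn

class ContextNetwork(nn.Module):
    """Transformer encoder stack over the latent feature sequence."""

    def __init__(self, d_model: int = 512, n_heads: int = 8, n_layers: int = 6):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=d_model,
            nhead=n_heads,
            dim_feedforward=4 * d_model,
            batch_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (B, T', d_model) -> contextualized (B, T', d_model)
        return self.encoder(features)
```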
CTC Head
A 1x1 convolution followed by log-softmax to map transformer outputs to a vocabulary distribution at each timestep, enabling training with CTC loss.
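A sketch of the head and of how the three stages compose end to end. The `CTCHead` and `Wav2VecCTC` names, the vocabulary size, and the constructor that takes the encoder and context network as arguments are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CTCHead(nn.Module):
    """1x1 convolution over time followed by log-softmax over the vocabulary."""

    def __init__(self, d_model: int = 512, vocab_size: int = 32):
        super().__init__()
        self.proj = nn.Conv1d(d_model, vocab_size, kernel_size=1)

    def forward(self, context: torch.Tensor) -> torch.Tensor:
        # context: (B, T', d_model) -> log-probabilities (B, T', vocab_size)
        logits = self.proj(context.transpose(1, 2)).transpose(1, 2)
        return F.log_softmax(logits, dim=-1)

class Wav2VecCTC(nn.Module):
    """Composes feature encoder, context network, and CTC head into one model."""

    def __init__(self, feature_encoder: nn.Module, context_network: nn.Module, ctc_head: nn.Module):
        super().__init__()
        self.feature_encoder = feature_encoder
        self.context_network = context_network
        self.ctc_head = ctc_head

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        features = self.feature_encoder(waveform)   # (B, T', d_model)
        context = self.context_network(features)    # (B, T', d_model)
        return self.ctc_head(context)               # (B, T', vocab_size)
```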
Together, the components support:
- End-to-end training with CTC Loss
- Greedy decoding
- Contrastive loss pretraining
- Classification head for auxiliary tasks
```
python app/cli.py --mode train
```
Trains the model on raw audio files and paired transcripts using CTC loss.
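For illustration, a single training step could look like the sketch below. The `training_step` helper, the blank index of 0, and the assumption that every output frame is valid (no padding-aware input lengths) are placeholders, not the project's actual training loop:

```python
import torch
import torch.nn as nn

ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

def training_step(model, waveforms, targets, target_lengths, optimizer):
    """One CTC training step: forward, loss, backward, update."""
    log_probs = model(waveforms)                       # (B, T', V), already log-softmaxed
    input_lengths = torch.full(
        (log_probs.size(0),), log_probs.size(1), dtype=torch.long
    )
    # nn.CTCLoss expects time-major log-probs: (T', B, V)
    loss = ctc_loss(log_probs.transpose(0, 1), targets, input_lengths, target_lengths)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```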
```
python app/cli.py --mode predict --audio path/to/audio.wav
```
Prints the decoded transcription using greedy CTC decoding.
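Greedy CTC decoding takes the per-frame argmax, collapses consecutive repeats, and removes blanks. A minimal sketch, assuming blank index 0:

```python
import torch

def greedy_ctc_decode(log_probs: torch.Tensor, blank: int = 0):
    """Collapse repeats and drop blanks from the per-frame argmax path.

    log_probs: (B, T', V) -> list of token-id sequences, one per batch item.
    """
    best_path = log_probs.argmax(dim=-1)  # (B, T')
    decoded = []
    for path in best_path:
        tokens, prev = [], None
        for idx in path.tolist():
            if idx != blank and idx != prev:
                tokens.append(idx)
            prev = idx
        decoded.append(tokens)
    return decoded
```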
You can pretrain the model using unlabeled data and a contrastive loss:
```python
contrastive_loss(context, features, mask_indices)
```
This encourages the context representation at each masked timestep to match the correct latent feature.
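As a rough illustration of the idea (not the repository's actual implementation), an InfoNCE-style loss over masked positions could be sketched as follows. The `(batch, time)` format of `mask_indices`, the use of other timesteps of the same utterance as distractors, and the temperature constant are all assumptions:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(context, features, mask_indices):
    """InfoNCE-style objective: at each masked timestep the context vector should
    score the true latent feature above features from other timesteps."""
    temperature = 0.1  # assumed scaling constant
    losses = []
    for b, t in mask_indices:                 # assumed (batch, time) index pairs
        anchor = context[b, t]                # contextualized vector at the masked step
        candidates = features[b]              # (T', d_model): true feature plus distractors
        sims = F.cosine_similarity(anchor.unsqueeze(0), candidates, dim=-1) / temperature
        target = torch.tensor([t], device=sims.device)
        losses.append(F.cross_entropy(sims.unsqueeze(0), target))
    return torch.stack(losses).mean()
```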
✅ Wav2Vec-style encoder + transformer context
✅ CTC decoding for ASR
✅ Greedy decoder (CTC collapse)