This repository contains a pipeline for evaluating the grammatical accuracy of spoken audio by combining ASR (Automatic Speech Recognition) with a deep learning regression model. The pipeline transcribes audio with Wav2Vec2, preprocesses the transcripts, and predicts grammar scores using a BERT + BiLSTM regression model.
Given an audio clip of a candidate speaking, the task is to predict a grammar score (1–5) reflecting the quality of their spoken grammar.
- `train.csv`: Contains audio file names + grammar labels.
- `audios_train/`: Folder of training `.wav` files.
- `test.csv` / `audios_test/`: Evaluation set without labels.
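
A minimal sketch of loading the training metadata with pandas; the column names `filename` and `label` are assumptions, not necessarily the actual headers in `train.csv`:

```python
import pandas as pd

# Load training metadata; "filename" and "label" column names are assumed.
train_df = pd.read_csv("train.csv")

# Build full paths to the corresponding training .wav files.
train_df["path"] = "audios_train/" + train_df["filename"]
print(train_df.head())
```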
- Model: `facebook/wav2vec2-base-960h`
- Output: Lowercased text using `batch_decode` (see the sketch below).
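
A minimal transcription sketch using the `transformers` and `torchaudio` libraries, assuming 16 kHz mono input; the actual `texttranscription.py` may batch files or handle resampling differently:

```python
import torch
import torchaudio
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

def transcribe(wav_path: str) -> str:
    waveform, sr = torchaudio.load(wav_path)
    # Wav2Vec2 expects 16 kHz audio; resample if the file differs.
    if sr != 16000:
        waveform = torchaudio.transforms.Resample(sr, 16000)(waveform)
    inputs = processor(waveform.squeeze().numpy(),
                       sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    pred_ids = torch.argmax(logits, dim=-1)
    # This checkpoint decodes to uppercase text; lowercase it here.
    return processor.batch_decode(pred_ids)[0].lower()
```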
- Remove punctuation, normalize whitespace, and lowercase the transcripts (see the sketch below).
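
A minimal sketch of this cleanup step using only the standard library; the actual `preprocess.py` may differ:

```python
import re
import string

def preprocess(text: str) -> str:
    # Lowercase, strip punctuation, and collapse repeated whitespace.
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

print(preprocess("Hello,   WORLD!!"))  # -> "hello world"
```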
- Architecture: BERT embeddings → BiLSTM → dense regression head (see the sketch after this list).
- Loss: MSELoss, with optional Pearson/CCC loss.
- Framework: PyTorch.
- Metric: Pearson correlation on the validation set.
- Visualization: Score distributions, loss curves.
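
A minimal PyTorch sketch of the architecture above, assuming `bert-base-uncased`, a 256-unit BiLSTM, and mean pooling over tokens; these hyperparameter choices are assumptions, not the exact configuration in `grammar_score_model.py`. A CCC-based loss is included to illustrate the optional alternative to MSELoss:

```python
import torch
import torch.nn as nn
from transformers import BertModel

class GrammarScorer(nn.Module):
    """BERT embeddings -> BiLSTM -> dense regression head."""

    def __init__(self, hidden_size: int = 256):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.lstm = nn.LSTM(
            input_size=self.bert.config.hidden_size,
            hidden_size=hidden_size,
            batch_first=True,
            bidirectional=True,
        )
        self.head = nn.Linear(2 * hidden_size, 1)  # single regression output

    def forward(self, input_ids, attention_mask):
        # Token-level BERT embeddings (last hidden state).
        embeddings = self.bert(input_ids,
                               attention_mask=attention_mask).last_hidden_state
        lstm_out, _ = self.lstm(embeddings)
        # Mean-pool over tokens, then regress to a single grammar score.
        pooled = lstm_out.mean(dim=1)
        return self.head(pooled).squeeze(-1)

def ccc_loss(preds: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    # Concordance Correlation Coefficient loss (1 - CCC), an optional
    # alternative or addition to MSELoss as noted above.
    pred_mean, tgt_mean = preds.mean(), targets.mean()
    pred_var = preds.var(unbiased=False)
    tgt_var = targets.var(unbiased=False)
    cov = ((preds - pred_mean) * (targets - tgt_mean)).mean()
    ccc = 2 * cov / (pred_var + tgt_var + (pred_mean - tgt_mean) ** 2)
    return 1 - ccc
```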
| Metric | Value |
|---|---|
| Pearson (val) | 0.599 |
| MSE (val) | 5.78 (50 epochs) |
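
The validation metrics can be computed with `scipy` and NumPy, for example; this is a sketch assuming `val_preds` and `val_labels` hold the predicted and gold grammar scores (the sample values are illustrative only):

```python
import numpy as np
from scipy.stats import pearsonr

# Predicted and gold grammar scores (illustrative placeholder values).
val_preds = np.array([3.2, 4.1, 2.8, 4.9])
val_labels = np.array([3.0, 4.0, 3.0, 5.0])

r, _ = pearsonr(val_preds, val_labels)
mse = np.mean((val_preds - val_labels) ** 2)
print(f"Pearson: {r:.3f}, MSE: {mse:.3f}")
```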
```bash
# Step 1: Transcribe audio
python texttranscription.py

# Step 2: Preprocess text
python preprocess.py

# Step 3: Train model
python grammar_score_model.py

# Step 4: Run inference on test set
python inference.py
```