MahaDhwani is a corpus comprising 279K hours of raw audio across 22 Indian languages and English. We propose a framework to create large raw audio datasets for under-represented languages by collating publicly accessible audio content.
- Video IDs are fetched from a PostgreSQL table that holds every video ID along with its metadata and related info (initially blank).
- Video IDs are assigned to each VM through Dataflow.
- Audio is downloaded on each VM using yt-dlp and then uploaded to the cloud bucket.
- After successful upload, the PostgreSQL table is updated with the metadata, bucket_path, duration, file size, etc.
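The per-video work described above can be sketched as two small helpers: one that builds the yt-dlp command for audio-only download, and one that builds the parameterised SQL used to record the result. The table name `videos`, the column names, and the function names are illustrative assumptions; the actual `pipeline.py` and table schema may differ.

```python
# Sketch of the per-video work done on each VM (hypothetical helper names).

def download_command(video_id: str, out_dir: str = "/tmp/audio") -> list[str]:
    """Build the yt-dlp invocation that extracts audio only.

    Run it with e.g. subprocess.run(download_command(vid), check=True).
    """
    return [
        "yt-dlp",
        "--extract-audio",          # keep audio, drop video
        "--audio-format", "wav",
        "-o", f"{out_dir}/{video_id}.%(ext)s",
        f"https://www.youtube.com/watch?v={video_id}",
    ]

def update_sql(video_id: str, bucket_path: str, duration_s: float,
               size_bytes: int) -> tuple[str, tuple]:
    """Parameterised UPDATE executed after a successful bucket upload."""
    query = (
        "UPDATE videos SET bucket_path = %s, duration = %s, file_size = %s "
        "WHERE video_id = %s"
    )
    params = (bucket_path, duration_s, size_bytes, video_id)
    return query, params
```

Returning the query and parameters separately keeps the update safe to run through a driver such as psycopg2 without string interpolation.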
- Set up a PostgreSQL table and a cloud bucket, and update the code in `pipeline.py` accordingly.
- Set up a GCP account for Dataflow access.
- Build and push the image from the provided `Dockerfile` to set up the VM environments. Make sure the Apache Beam version in the Dockerfile matches the local environment (Python 3.10 and Apache Beam SDK 2.58.1 were used here).
- Run the bash script: `bash run.sh`
- To filter video IDs based on metadata, the `dataflow_pipeline/languages/*/video_ids_*.csv` files can be used.
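A minimal sketch of such a metadata filter, assuming the CSVs carry `video_id` and `duration` columns (check the actual files for the real column names):

```python
import csv

def filter_video_ids(csv_path: str, max_duration_s: float) -> list[str]:
    """Keep only videos no longer than max_duration_s seconds.

    Assumes columns named "video_id" and "duration"; adjust to the
    actual header of the video_ids_*.csv files.
    """
    kept = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            if float(row["duration"]) <= max_duration_s:
                kept.append(row["video_id"])
    return kept
```

The filtered list can then be written back to the PostgreSQL table so that only the selected IDs are dispatched to VMs.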
| Model | Link |
|---|---|
| MahaDhwani Pretrained Conformer Encoder | HF |
We compare the performance of ASR models fine-tuned on the IndicVoices dataset, starting from (i) random initialisation, (ii) the pretrained checkpoint of an English ASR model, Nvidia-En-SSL, and (iii) the IndicConformer-HMS model pretrained on MahaDhwani. Each of the above is trained with 12.5%, 50%, and 100% of the labeled training data.
We also compare IndicConformer-HMS (pretrained on MahaDhwani) with existing massively multilingual models, namely USM, Whisper, and MMS, and find that it significantly outperforms all of them.