MahaDhwani is a corpus comprising 279K hours of raw audio across 22 Indian languages and English. We propose a framework to create large raw audio datasets for under-represented languages by collating publicly accessible audio content.
- Video IDs are fetched from a PostgreSQL table that holds every video ID along with its metadata and related info (initially blank).
- Video IDs are assigned to each VM through Dataflow.
- Audio is downloaded on each VM using yt-dlp and then uploaded to the cloud bucket.
- After successful upload, the PostgreSQL table is updated with the metadata, bucket_path, duration, file size, etc.
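The per-video work described above can be sketched as two small helpers: one that builds the yt-dlp command for audio-only download, and one that builds the parameterised SQL used to record the result. The table name `videos`, the column names, and the function names are illustrative assumptions; the actual `pipeline.py` and table schema may differ.

```python
# Sketch of the per-video work done on each VM (hypothetical helper names).

def download_command(video_id: str, out_dir: str = "/tmp/audio") -> list[str]:
    """Build the yt-dlp invocation that extracts audio only.

    Run it with e.g. subprocess.run(download_command(vid), check=True).
    """
    return [
        "yt-dlp",
        "--extract-audio",          # keep audio, drop video
        "--audio-format", "wav",
        "-o", f"{out_dir}/{video_id}.%(ext)s",
        f"https://www.youtube.com/watch?v={video_id}",
    ]

def update_sql(video_id: str, bucket_path: str, duration_s: float,
               size_bytes: int) -> tuple[str, tuple]:
    """Parameterised UPDATE executed after a successful bucket upload."""
    query = (
        "UPDATE videos SET bucket_path = %s, duration = %s, file_size = %s "
        "WHERE video_id = %s"
    )
    params = (bucket_path, duration_s, size_bytes, video_id)
    return query, params
```

Returning the query and parameters separately keeps the update safe to run through a driver such as psycopg2 without string interpolation.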
- Set up a PostgreSQL table and a cloud bucket, and update the code in `pipeline.py` accordingly.
- Set up a GCP account for Dataflow access.
- Build and push the image from the provided `Dockerfile` to set up the VM environments. Make sure the Apache Beam version in the Dockerfile matches the local environment (Python 3.10 and Apache Beam SDK 2.58.1 were used here).
- Run the bash script: `bash run.sh`
- To filter video IDs based on metadata, the `dataflow_pipeline/languages/*/video_ids_*.csv` files can be used.
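A minimal sketch of such a metadata filter, assuming the CSVs carry `video_id` and `duration` columns (check the actual files for the real column names):

```python
import csv

def filter_video_ids(csv_path: str, max_duration_s: float) -> list[str]:
    """Keep only videos no longer than max_duration_s seconds.

    Assumes columns named "video_id" and "duration"; adjust to the
    actual header of the video_ids_*.csv files.
    """
    kept = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            if float(row["duration"]) <= max_duration_s:
                kept.append(row["video_id"])
    return kept
```

The filtered list can then be written back to the PostgreSQL table so that only the selected IDs are dispatched to VMs.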
| Model | Link |
|---|---|
| MahaDhwani Pretrained Conformer Encoder | HF |
We compare the performance of ASR models fine-tuned on the IndicVoices dataset, starting from (i) random initialisation, (ii) the pretrained checkpoint of an English ASR model, Nvidia-En-SSL, and (iii) the IndicConformer-HMS model pretrained on MahaDhwani. Each of the above is trained with 12.5%, 50%, and 100% of the labeled training data.
We also compare IndicConformer-HMS (pretrained on MahaDhwani) with existing massively multilingual models, namely USM, Whisper, and MMS, and find that it significantly outperforms all of them.