Moore Speech Corpora Toolkit

Modular tools to collect, preprocess, align, and prepare Mooré speech/text data for Text-to-Speech (TTS) and Automatic Speech Recognition (ASR).

This toolkit reproduces the full Moore Speech Corpora data collection pipeline.

Features

The toolkit provides modular, CLI-friendly scripts to:

Scrape Mooré audio and text from online sources (Bible, YouTube, etc.)
Segment long audio recordings and align them with their texts
Preprocess audio: resampling, mono conversion
Normalize Mooré text
Denoise and enhance audio
Export datasets to different formats (Hugging Face, LJSpeech, etc.)
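
As an illustration of the text normalization step listed above, here is a minimal Python sketch. The toolkit's actual normalization rules are not described in this README, so the punctuation set and the function name normalize_text are assumptions chosen for the example: it applies Unicode NFC composition, lowercases, strips common punctuation, and collapses whitespace.

```python
import re
import unicodedata

# Hypothetical punctuation set; the toolkit's real rules may differ.
_PUNCT = re.compile(r'["“”«»(){}\[\]<>.,;:!?…]')
_SPACES = re.compile(r"\s+")


def normalize_text(text: str) -> str:
    """Return a lowercase, NFC-composed, punctuation-free version of text."""
    # NFC merges a base letter with its combining marks,
    # so "e" followed by U+0301 becomes the single character "é".
    text = unicodedata.normalize("NFC", text)
    text = text.lower()
    # Replace punctuation with spaces so neighbouring words stay separated.
    text = _PUNCT.sub(" ", text)
    # Collapse the whitespace runs left behind and trim both ends.
    return _SPACES.sub(" ", text).strip()


if __name__ == "__main__":
    print(normalize_text("  Moore\u0301,   «Speech»   Corpora! "))
```

Hyphens and apostrophes are deliberately left untouched here, since they may be meaningful inside words; whether the real pipeline keeps them is not stated.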

The toolkit is structured as follows:

moore-toolkit/
├─ crawlers/             # Bible, YouTube, etc.
├─ preprocessing/        # Resample, normalize text
├─ forced_alignement/    # MMS scripts & wrappers
├─ data_export/          # HF dataset prep & push
├─ utils/                # Shared helpers
├─ environment.yml
└─ README.md

Installation & Setup

git clone https://github.com/anyantudre/MooreSpeechCorpora.git
cd MooreSpeechCorpora
conda env create -f environment.yml
conda activate mooredata

Python 3.10.11 is strongly recommended.

Quick Start

Data Crawling: crawls audio and text from sources such as the Bible and YouTube.

# example: crawl the Mooré Bible
sh ./crawlers/bible/crawl.sh

See crawlers/README.md for full instructions and more details.

Preprocessing: resamples audio, converts it to mono, and normalizes text.

# example: resample Mooré data
bash preprocessing/resample.sh --input_folder datasets/moore/bible/raw --output_folder datasets/moore/bible/resampled

See preprocessing/README.md for full instructions and more details.
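
To make the two audio operations concrete, the sketch below shows mono conversion and resampling on plain Python lists. It is not what preprocessing/resample.sh does internally (that script is not shown here); the function names are hypothetical, and linear interpolation is used only because it is easy to read. It has no anti-aliasing filter, so a dedicated audio tool remains the better choice for real data.

```python
from typing import List, Sequence


def to_mono(frames: Sequence[Sequence[float]]) -> List[float]:
    """Average the channels of each frame into a single mono sample."""
    return [sum(frame) / len(frame) for frame in frames]


def resample_linear(samples: Sequence[float], src_rate: int, dst_rate: int) -> List[float]:
    """Resample by linear interpolation (illustrative, no anti-aliasing)."""
    if src_rate <= 0 or dst_rate <= 0:
        raise ValueError("sample rates must be positive")
    if not samples:
        return []
    # Number of output samples covering the same duration.
    n_out = max(1, round(len(samples) * dst_rate / src_rate))
    step = src_rate / dst_rate  # source samples advanced per output sample
    last = len(samples) - 1
    out = []
    for i in range(n_out):
        pos = i * step
        left = min(int(pos), last)
        right = min(left + 1, last)
        frac = pos - left
        # Weighted average of the two nearest source samples.
        out.append(samples[left] * (1.0 - frac) + samples[right] * frac)
    return out


if __name__ == "__main__":
    stereo = [(0.0, 0.0), (0.5, 1.5), (2.0, 2.0), (3.0, 3.0)]
    mono = to_mono(stereo)
    # Halve the sample rate; the rates here are arbitrary example values.
    print(resample_linear(mono, src_rate=4, dst_rate=2))
```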

Forced Alignment: outputs segmented audio and manifest.json files for each chapter.

# run forced alignment
bash forced_alignement/align_and_segment.sh \
  --audio_folder datasets/moore/bible/resampled \
  --text_folder datasets/moore/bible/resampled \
  --output_folder datasets/moore/bible/aligned \
  --lang mos \
  --uroman_path ../uroman/bin

See forced_alignement/README.md for full instructions and more details.
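
Once alignment has run, the per-chapter manifests can be gathered for inspection. The sketch below rests on assumptions that this README does not confirm: that the output folder holds one subfolder per chapter, that each manifest.json is in JSON Lines form (one JSON object per line), and that each entry carries a duration field. The function names are hypothetical.

```python
import json
from pathlib import Path
from typing import Dict, Iterator, List


def iter_manifest(path: Path) -> Iterator[Dict]:
    """Yield one dict per non-empty line of a JSON Lines manifest."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def load_aligned(root: Path) -> List[Dict]:
    """Collect segments from every <chapter>/manifest.json under root."""
    rows = []
    # Assumed layout: <root>/<chapter>/manifest.json
    for manifest in sorted(root.glob("*/manifest.json")):
        for entry in iter_manifest(manifest):
            # Tag each segment with the name of its chapter folder.
            rows.append({**entry, "chapter": manifest.parent.name})
    return rows


if __name__ == "__main__":
    segments = load_aligned(Path("datasets/moore/bible/aligned"))
    total = sum(seg.get("duration", 0.0) for seg in segments)
    print(f"{len(segments)} segments, {total / 3600:.2f} hours")
```

If the manifests turn out to be single JSON documents rather than JSON Lines, iter_manifest would need to call json.load on the whole file instead.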

Dataset Preparation/Export: builds a dataset with the columns audio, transcription, duration, and chapter, and uploads it to the Hugging Face Hub.

python data_export/prepare_hf_dataset.py \
  --input_folder datasets/moore/bible/aligned \
  --repo_id anyantudre/moore-speech-bible \
  --hf_token hf_xxxx

See data_export/README.md for full instructions and more details.
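
For the LJSpeech format mentioned under Features, the following sketch turns rows with the four columns above into LJSpeech-style metadata lines (id|transcription|normalized transcription). It is not the toolkit's exporter: the function name, the max_seconds filter, and the choice to treat the audio column as a file path and to repeat the transcription in the third field are all assumptions made for illustration.

```python
from pathlib import Path
from typing import Dict, Iterable, List


def to_ljspeech_lines(rows: Iterable[Dict], max_seconds: float = float("inf")) -> List[str]:
    """Format rows as LJSpeech metadata lines: id|transcription|normalized."""
    lines = []
    for row in rows:
        # Hypothetical filter: skip clips longer than max_seconds.
        if row["duration"] > max_seconds:
            continue
        clip_id = Path(row["audio"]).stem  # file name without its extension
        # The pipe is the field separator, so it must not appear in the text.
        text = row["transcription"].replace("|", " ").strip()
        lines.append(f"{clip_id}|{text}|{text}")
    return lines


if __name__ == "__main__":
    example_rows = [
        {"audio": "aligned/GEN_001/segment0.wav", "transcription": "first clip",
         "duration": 3.2, "chapter": "GEN_001"},
        {"audio": "aligned/GEN_001/segment1.wav", "transcription": "second clip",
         "duration": 41.0, "chapter": "GEN_001"},
    ]
    for line in to_ljspeech_lines(example_rows, max_seconds=30.0):
        print(line)
```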

Denoising & Enhancement (optional): applies Resemble Enhance to improve audio quality, optionally skipping enhancement or keeping original audio.

python denoising/denoise_and_push.py \
  --dataset_id anyantudre/moore-speech-bible \
  --output_repo_id anyantudre/moore-speech-bible-denoised \
  --hf_token hf_xxxx \
  --enhance_audio \
  --keep_original_audio

The resulting dataset will include denoised_audio and optionally enhanced_audio fields, depending on the flags.

See denoising/README.md for full instructions and parameters.
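
The relationship between the two flags and the output fields can be sketched as follows. This mirrors only what is stated above; it is not the code of denoise_and_push.py. It assumes both flags are plain boolean switches and that the retained original column is named audio, as in the exported dataset.

```python
import argparse
from typing import List, Optional, Sequence


def build_parser() -> argparse.ArgumentParser:
    """Parser with only the two switches discussed above."""
    parser = argparse.ArgumentParser(description="Sketch of the denoising options")
    parser.add_argument("--enhance_audio", action="store_true")
    parser.add_argument("--keep_original_audio", action="store_true")
    return parser


def output_columns(argv: Optional[Sequence[str]] = None) -> List[str]:
    """Return the audio columns expected in the resulting dataset."""
    args = build_parser().parse_args(argv)
    columns = ["denoised_audio"]  # always produced
    if args.enhance_audio:
        columns.append("enhanced_audio")
    if args.keep_original_audio:
        # Assumption: the original column keeps the name "audio".
        columns.insert(0, "audio")
    return columns


if __name__ == "__main__":
    print(output_columns(["--enhance_audio", "--keep_original_audio"]))
```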

Contributing

Contributions are more than welcome! Please read CONTRIBUTING.md for guidelines on how to get started.

Acknowledgments

cawoylel: this repo is largely inspired by their excellent work on the Fula language!
Facebook AI Research Fairseq for multilingual alignment tools
bible.com for Mooré audio and text
Uroman for romanization
Resemble Enhance for speech enhancement
