A curated, community-maintained list of free and open datasets for Artificial Intelligence and Machine Learning projects focused on Africa.
Each dataset is tagged by domain, described with Kaggle-style metadata, and checked to be free for research or commercial use where its license permits.
Note: Always check individual dataset licenses before use.
- Natural Language Processing (NLP)
- Speech / Voice
- Computer Vision & Wildlife
- Geospatial & Agriculture
- Climate & Weather
- Health & Demographics
- Contribution Guide
### MasakhaNER

- Description: Named Entity Recognition datasets for multiple African languages.
- Languages: Yoruba, Hausa, Igbo, Swahili, Amharic, Wolof, Kinyarwanda, and more.
- Size: ~20K annotated sentences.
- Samples: Named entities tagged in context.
- Tasks: NER model training, evaluation, transfer learning.
- Source: Masakhane Project.
- Link: https://github.com/masakhane-io/masakhaner
- License: Mixed permissive licenses.
- Last Updated: 2021-05
- Best For: Low-resource NLP research, multilingual NER.
### MasakhaPOS

- Description: Part-of-speech (POS) tagged datasets for African languages.
- Languages: Yoruba, Hausa, Igbo, Swahili, Wolof, etc.
- Size: ~10K sentences.
- Samples: Tokenized and tagged sentences.
- Tasks: POS tagging model development.
- Source: Masakhane Project.
- Link: https://github.com/masakhane-io/masakhapos
- License: CC BY-SA 4.0.
- Last Updated: 2020-11
- Best For: Linguistic modeling & POS benchmarking.
### African Storybook

- Description: Children's storybooks in multiple African languages.
- Languages: Zulu, Xhosa, Swahili, Amharic, Hausa, etc.
- Size: 3,000+ books.
- Samples: Parallel text in multiple languages.
- Tasks: Machine translation, text generation.
- Source: African Storybook Project.
- Link: https://www.africanstorybook.org
- License: CC BY 4.0.
- Last Updated: 2023-04
- Best For: Multilingual MT, literacy applications.
- Languages: Swahili, Chichewa, Amharic, Luganda, Kinyarwanda, and more.
- Link: https://commonvoice.mozilla.org/en/datasets
- License: CC0.
- Best For: ASR, TTS.
- Link: https://github.com/getalp/ALFFA_PUBLIC
- Best For: Low-resource ASR.
- Link: http://openslr.org
- Size: ~3.2M images.
- Link: http://lila.science/datasets/snapshot-serengeti
- Link: https://africapolis.org
- Link: https://tahmo.org
- Link: https://fews.net
- Link: https://dhsprogram.com
- Link: https://aho.afro.who.int
We welcome pull requests to add, update, or improve dataset entries.
Steps:
- Fork this repository.
- Add your dataset entry in the correct section using the template below.
### Dataset Name
  
- **Description**: Short description of the dataset.
- **Languages / Geography**: List languages or regions covered.
- **Size**: Approximate size in MB/GB or number of samples.
- **Samples**: Brief description of sample type.
- **Tasks**: List AI/ML tasks supported.
- **Source**: Organization or project name.
- **Link**: [Dataset link](https://example.com)
- **License**: License type.
- **Last Updated**: YYYY-MM.
- **Best For**: Suggested research/application areas.
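
Before opening a pull request, it can help to check that a new entry fills in every field of the template. The following is a minimal Python sketch of such a check. The `check_entry` helper, its messages, and the decision to accept `Languages`, `Geography`, or `Languages / Geography` for the coverage field are assumptions made for illustration; they are not part of this repository's tooling.

```python
import re

# Field names are taken from the entry template; everything else here is hypothetical.
REQUIRED_FIELDS = [
    "Description", "Size", "Samples", "Tasks",
    "Source", "Link", "License", "Last Updated", "Best For",
]
# Assumption: an entry may name its coverage field in any of these ways.
COVERAGE_FIELDS = ("Languages / Geography", "Languages", "Geography")

# Matches lines such as "- **License**: CC BY 4.0."
FIELD_RE = re.compile(r"^- \*\*(?P<name>[^*]+)\*\*:\s*(?P<value>.*)$")
# YYYY-MM, with an optional trailing period as in the template.
DATE_RE = re.compile(r"^\d{4}-(0[1-9]|1[0-2])\.?$")


def check_entry(markdown: str) -> list[str]:
    """Return a list of problems found in one dataset entry (empty if none)."""
    problems = []
    lines = [ln.strip() for ln in markdown.strip().splitlines() if ln.strip()]
    if not lines or not lines[0].startswith("### "):
        problems.append("entry should start with a '### Dataset Name' heading")

    # Collect "name -> value" pairs from the bullet lines.
    fields = {}
    for ln in lines:
        match = FIELD_RE.match(ln)
        if match:
            fields[match.group("name").strip()] = match.group("value").strip()

    for name in REQUIRED_FIELDS:
        if not fields.get(name):
            problems.append(f"missing or empty field: {name}")
    if not any(fields.get(name) for name in COVERAGE_FIELDS):
        problems.append("missing or empty field: Languages / Geography")

    last_updated = fields.get("Last Updated", "")
    if last_updated and not DATE_RE.match(last_updated):
        problems.append("Last Updated should use the YYYY-MM format")
    return problems
```

Calling `check_entry(text)` on the Markdown of a single entry returns an empty list when every template field is present, and otherwise a list of human-readable problems to fix before submitting.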