🌍 Awesome African AI Datasets

A curated, community-maintained list of free and open datasets for Artificial Intelligence and Machine Learning projects focused on Africa.

Each dataset is tagged by domain, carries Kaggle-style metadata (size, tasks, license, last update), and is checked to be freely available. Whether research or commercial use is permitted depends on the individual license.

✅ Always check individual dataset licenses before use.

🗣 Natural Language Processing (NLP)

MasakhaNER

Description: Named Entity Recognition datasets for multiple African languages.
Languages: Yoruba, Hausa, Igbo, Swahili, Amharic, Wolof, Kinyarwanda, and more.
Size: ~20K annotated sentences.
Samples: Named entities tagged in context.
Tasks: NER model training, evaluation, transfer learning.
Source: Masakhane Project.
Link: https://github.com/masakhane-io/masakhaner
License: Mixed permissive licenses.
Last Updated: 2021-05
Best For: Low-resource NLP research, multilingual NER.
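As a rough illustration of what "named entities tagged in context" looks like in practice, the sketch below parses CoNLL-style text (one token and one BIO tag per line, blank line between sentences) and collapses the tags into entity spans. The file layout is an assumption to check against the repository, and the sample sentences are made up for illustration rather than taken from the corpus.

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]  # list of (token, tag) pairs


def read_conll(lines: Iterable[str]) -> List[Sentence]:
    """Parse 'token tag' lines; blank lines separate sentences."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.strip()
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        # Split on the last whitespace run so the tag is always the final field.
        token, tag = line.rsplit(maxsplit=1)
        current.append((token, tag))
    if current:
        sentences.append(current)
    return sentences


def extract_entities(sentence: Sentence) -> List[Tuple[str, str]]:
    """Collapse BIO tags into (entity_text, entity_type) spans."""
    entities: List[Tuple[str, str]] = []
    words: List[str] = []
    etype = None
    for token, tag in sentence:
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != etype
        )
        if starts_new:
            if words:
                entities.append((" ".join(words), etype))
            words, etype = [token], tag[2:]
        elif tag.startswith("I-"):
            words.append(token)
        else:  # an "O" tag closes any open entity
            if words:
                entities.append((" ".join(words), etype))
            words, etype = [], None
    if words:
        entities.append((" ".join(words), etype))
    return entities


# Made-up sample for illustration only; not taken from MasakhaNER.
SAMPLE = """\
Wanjiru B-PER
anaishi O
Nairobi B-LOC
. O

Amina B-PER
Yusuf I-PER
alizaliwa O
Kano B-LOC
"""

sentences = read_conll(SAMPLE.splitlines())
for sentence in sentences:
    print(extract_entities(sentence))
```

To use it on a real file, pass an open file handle to `read_conll` instead of the sample lines. A line without a tag raises `ValueError`, which is usually what you want when sanity-checking a download.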

MasakhaPOS

Description: POS-tagged datasets for African languages.
Languages: Yoruba, Hausa, Igbo, Swahili, Wolof, etc.
Size: ~10K sentences.
Samples: Tokenized and tagged sentences.
Tasks: POS tagging model development.
Source: Masakhane Project.
Link: https://github.com/masakhane-io/masakhapos
License: CC BY-SA 4.0.
Last Updated: 2020-11
Best For: Linguistic modeling & POS benchmarking.

African Storybooks Corpus

Description: Children's storybooks in multiple African languages.
Languages: Zulu, Xhosa, Swahili, Amharic, Hausa, etc.
Size: 3,000+ books.
Samples: Parallel text in multiple languages.
Tasks: Machine translation, text generation.
Source: African Storybook Project.
Link: https://www.africanstorybook.org
License: CC BY 4.0.
Last Updated: 2023-04
Best For: Multilingual MT, literacy applications.

🎙 Speech / Voice

Mozilla Common Voice — African Languages

Languages: Swahili, Chichewa, Amharic, Luganda, Kinyarwanda, and more.
Link: https://commonvoice.mozilla.org/en/datasets
License: CC0.
Best For: ASR, TTS.
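For ASR work, a common first step is pairing audio clips with their transcripts. The sketch below assumes a tab-separated metadata file with `path`, `sentence`, `up_votes`, and `down_votes` columns; verify the column names against the release you download. The rows and the `min_net_votes` threshold are hypothetical and only there to make the example runnable.

```python
import csv
import io

# Hypothetical rows for illustration; real releases ship far more columns and clips.
SAMPLE_TSV = (
    "client_id\tpath\tsentence\tup_votes\tdown_votes\n"
    "a1\tclip_0001.mp3\tHabari ya asubuhi\t2\t0\n"
    "b2\tclip_0002.mp3\tKaribu sana\t2\t1\n"
    "c3\tclip_0003.mp3\tAsante\t1\t3\n"
)


def load_clips(tsv_file, min_net_votes=1):
    """Return (audio_path, transcript) pairs whose net votes meet the threshold."""
    # QUOTE_NONE keeps quotation marks inside transcripts as literal text.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    pairs = []
    for row in reader:
        net_votes = int(row["up_votes"]) - int(row["down_votes"])
        if net_votes >= min_net_votes:
            pairs.append((row["path"], row["sentence"]))
    return pairs


clips = load_clips(io.StringIO(SAMPLE_TSV))
for path, sentence in clips:
    print(path, "->", sentence)
```

For a real file, replace the `io.StringIO` object with `open(tsv_path, encoding="utf-8", newline="")`.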

ALFFA Public Yoruba, Hausa & Wolof Speech Corpora

Link: https://github.com/getalp/ALFFA_PUBLIC
Best For: Low-resource ASR.

OpenSLR African Corpora

Link: http://openslr.org

📸 Computer Vision & Wildlife

Snapshot Serengeti (LILA)

Size: ~3.2M images.
Link: http://lila.science/datasets/snapshot-serengeti
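Before downloading millions of images, it can help to inspect the label distribution from the metadata alone. The sketch below assumes a COCO-style JSON file with `images`, `annotations`, and `categories` lists; check the dataset page for the actual schema. The sample records and file names are invented for illustration.

```python
import json
from collections import Counter

# Invented metadata in a COCO-style layout; not real dataset records.
SAMPLE_JSON = """
{
  "images": [
    {"id": "img1", "file_name": "site_a/img1.jpg"},
    {"id": "img2", "file_name": "site_a/img2.jpg"},
    {"id": "img3", "file_name": "site_b/img3.jpg"}
  ],
  "annotations": [
    {"id": "ann1", "image_id": "img1", "category_id": 1},
    {"id": "ann2", "image_id": "img2", "category_id": 1},
    {"id": "ann3", "image_id": "img3", "category_id": 2}
  ],
  "categories": [
    {"id": 1, "name": "zebra"},
    {"id": 2, "name": "lion"}
  ]
}
"""


def species_counts(metadata: dict) -> Counter:
    """Count annotations per category name."""
    names = {category["id"]: category["name"] for category in metadata["categories"]}
    return Counter(names[ann["category_id"]] for ann in metadata["annotations"])


counts = species_counts(json.loads(SAMPLE_JSON))
for species, n in counts.most_common():
    print(species, n)
```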

African Wildlife Dataset (Kaggle)

Link: https://www.kaggle.com/datasets/biancaferreira/african-wildlife

🛰 Geospatial & Agriculture

AfriCultuReS Crop Type Dataset

Link: https://africultures.net/data

Africapolis Urban Data

Link: https://africapolis.org

🌦 Climate & Weather

CHIRPS

TAHMO Weather Stations

FEWS NET Africa Rainfall Estimates

🏥 Health & Demographics

DHS Program — African Countries

Link: https://dhsprogram.com

WHO African Health Observatory Data

Link: https://aho.afro.who.int

Global Health Observatory — Africa

Link: https://www.who.int/data/gho

🤝 Contribution Guide

We welcome pull requests to add, update, or improve dataset entries.

Steps:

1. Fork this repository.
2. Add your dataset entry in the correct section using the template below.

Dataset Entry Template

### Dataset Name
![Domain](https://img.shields.io/badge/Domain-DOMAIN-COLOR) ![License](https://img.shields.io/badge/License-LICENSETYPE-green) ![Year](https://img.shields.io/badge/Year-YYYY-orange)
- **Description**: Short description of the dataset.
- **Languages / Geography**: List languages or regions covered.
- **Size**: Approximate size in MB/GB or number of samples.
- **Samples**: Brief description of sample type.
- **Tasks**: List AI/ML tasks supported.
- **Source**: Organization or project name.
- **Link**: [Dataset link](https://example.com)
- **License**: License type.
- **Last Updated**: YYYY-MM.
- **Best For**: Suggested research/application areas.
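
Contributors who prefer to generate entries rather than type them by hand could use something like the sketch below. It is not part of this repository's tooling: `REQUIRED_FIELDS`, `badge`, and `render_entry` are hypothetical helpers, and the badge URLs assume the shields.io static badge pattern `label-message-color`, in which a literal dash is written `--`, a literal underscore `__`, and a space `_`. The example values come from the MasakhaPOS entry above; the "NLP" domain label and the "blue" color are placeholders.

```python
from urllib.parse import quote

# Field names mirror the bullet labels in the entry template.
REQUIRED_FIELDS = [
    "Description", "Languages / Geography", "Size", "Samples", "Tasks",
    "Source", "Link", "License", "Last Updated", "Best For",
]


def badge(label: str, message: str, color: str) -> str:
    """Build a markdown image for a shields.io static badge."""
    def esc(text: str) -> str:
        # Escape dashes and underscores first, then turn spaces into underscores.
        text = text.replace("-", "--").replace("_", "__").replace(" ", "_")
        return quote(text, safe="")
    return f"![{label}](https://img.shields.io/badge/{esc(label)}-{esc(message)}-{color})"


def render_entry(name: str, fields: dict, domain: str, domain_color: str = "blue") -> str:
    """Render one dataset entry; raise ValueError if a template field is missing."""
    missing = [field for field in REQUIRED_FIELDS if not fields.get(field)]
    if missing:
        raise ValueError(f"missing fields: {', '.join(missing)}")
    year = fields["Last Updated"][:4]  # expects the YYYY-MM form
    lines = [
        f"### {name}",
        " ".join([
            badge("Domain", domain, domain_color),
            badge("License", fields["License"], "green"),
            badge("Year", year, "orange"),
        ]),
    ]
    for field in REQUIRED_FIELDS:
        value = fields[field]
        if field == "Link":
            value = f"[Dataset link]({value})"
        lines.append(f"- **{field}**: {value}")
    return "\n".join(lines)


example = {
    "Description": "POS-tagged datasets for African languages.",
    "Languages / Geography": "Yoruba, Hausa, Igbo, Swahili, Wolof, etc.",
    "Size": "~10K sentences.",
    "Samples": "Tokenized and tagged sentences.",
    "Tasks": "POS tagging model development.",
    "Source": "Masakhane Project.",
    "Link": "https://github.com/masakhane-io/masakhapos",
    "License": "CC BY-SA 4.0",
    "Last Updated": "2020-11",
    "Best For": "Linguistic modeling & POS benchmarking.",
}

print(render_entry("MasakhaPOS", example, domain="NLP"))
```

Raising on missing fields keeps incomplete entries out of a pull request, at the cost of rejecting datasets for which some metadata is genuinely unknown. A maintainer might prefer to relax that check to a warning.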

📂 Table of Contents

🗣 Natural Language Processing (NLP)

MasakhaNER

MasakhaPOS

African Storybooks Corpus

🎙 Speech / Voice

Mozilla Common Voice — African Languages

ALFFA Public Yoruba, Hausa & Wolof Speech Corpora

OpenSLR African Corpora

📸 Computer Vision & Wildlife

Snapshot Serengeti (LILA)

African Wildlife Dataset (Kaggle)

🛰 Geospatial & Agriculture

AfriCultuReS Crop Type Dataset

Africapolis Urban Data

🌦 Climate & Weather

CHIRPS

TAHMO Weather Stations

FEWS NET Africa Rainfall Estimates

🏥 Health & Demographics

DHS Program — African Countries

WHO African Health Observatory Data

Global Health Observatory — Africa

🤝 Contribution Guide

Dataset Entry Template
