📌 Official DOI: https://doi.org/10.5281/zenodo.15798592
🤗 Also available on Hugging Face
📥 Download Dataset (.rar from Zenodo)
ELNER-DZ was created by Bouguettoucha Hadjer Hanine and Djouablia Ilhem as part of their Master's thesis. It is the first large-scale dataset designed for Named Entity Recognition (NER) and Entity Linking (EL) in the Algerian Arabic dialect (Darija), covering both Arabic script and Arabizi (Latin script).
This dataset contains over 2 million dialectal sentences labeled with more than 1.9 million named entities and linked to Wikidata QIDs.
- Name: ELNER-DZ
- Languages: Arabic (`ar` for MSA, `arq` for dialectal), Arabizi (Latin), French (`fr`), English (`en`)
- Script: Arabic and Latin (Arabizi)
- Format: JSON (compressed in `data.rar`)
- Annotations:
  - Named Entity spans (start, end)
  - NER labels (PER, LOC, ORG, etc.)
  - Normalized forms
  - Wikidata QIDs
- `data/data.rar`: Compressed archive containing `data.json`
- `examples/loading_example.py`: Script to extract and load the dataset
- `LICENSE`: CC-BY-4.0
- `dataset_card.md`: Hugging Face dataset summary
{
  "id": 188,
  "text": "3reft wa7ed lperson khadem f Yassir",
  "entities": [
    {
      "start": 29,
      "end": 35,
      "label": "ORG",
      "wikidata_id": "Q117156470",
      "normalized": "Yassir"
    }
  ]
}

- `PER`: Person
- `LOC`: Location
- `ORG`: Organization
- `PROD`: Product
- `LAW`: Legal texts or rules
- `LANG`: Language
- `EVENT`: Events
- `DATE`: Temporal expressions
- `NORP`: Nationality/Religious/Political groups
- `SPORT`: Sports & Competitions
- `SYMPTOM`, `DISEASE`: Medical categories
- `MISC`: Miscellaneous
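
In the sample record above, `start` and `end` behave like zero-based character offsets with an exclusive end, so `text[start:end]` yields the entity surface form. The following is a minimal sketch for reading spans under that assumption. The helper name `entity_surfaces` is hypothetical, and whether every record in the dataset follows the same offset convention has not been verified here.

```python
# Sample record copied from the example above.
record = {
    "id": 188,
    "text": "3reft wa7ed lperson khadem f Yassir",
    "entities": [
        {
            "start": 29,
            "end": 35,
            "label": "ORG",
            "wikidata_id": "Q117156470",
            "normalized": "Yassir",
        }
    ],
}


def entity_surfaces(rec):
    """Return (surface, label, wikidata_id) for each entity in a record.

    Assumes start/end are character offsets into rec["text"],
    with an exclusive end (as in the sample record).
    """
    return [
        (rec["text"][ent["start"]:ent["end"]], ent["label"], ent["wikidata_id"])
        for ent in rec["entities"]
    ]


print(entity_surfaces(record))
```

Comparing each extracted surface form with the `normalized` field is a quick way to spot records where the offsets do not line up.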
- Named Entity Recognition (NER)
- Entity Linking (EL) with Wikidata
- Dialectal NLP in Algerian Arabic
- Code-switching and multiscript modeling
- Low-resource transfer learning
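
For the NER use case, character-level spans usually have to be turned into token-level tags before training a sequence labeler. The sketch below shows one possible conversion to BIO tags using plain whitespace tokenization. Both the helper `to_bio` and the tokenization choice are assumptions made for illustration, not part of the dataset or its official tooling; a real pipeline would more likely align spans with a subword tokenizer.

```python
import re


def to_bio(text, entities):
    """Whitespace-tokenize text and derive BIO tags from character spans.

    Assumes each entity has "start", "end" (exclusive) and "label" keys.
    """
    tokens, tags = [], []
    for match in re.finditer(r"\S+", text):
        tag = "O"
        for ent in entities:
            # A token is part of an entity if the two character ranges overlap.
            if match.start() < ent["end"] and match.end() > ent["start"]:
                # The first token of the span gets B-, later tokens get I-.
                prefix = "B" if match.start() <= ent["start"] else "I"
                tag = f"{prefix}-{ent['label']}"
                break
        tokens.append(match.group())
        tags.append(tag)
    return tokens, tags


tokens, tags = to_bio(
    "3reft wa7ed lperson khadem f Yassir",
    [{"start": 29, "end": 35, "label": "ORG"}],
)
print(list(zip(tokens, tags)))
```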
pip install datasets rarfile
sudo apt-get install unrar  # For Linux

python examples/loading_example.py

Or manually extract and load:
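import rarfile
from datasets import load_dataset

# Extract data.json from the archive into data/ (requires the unrar tool)
with rarfile.RarFile("data/data.rar") as rf:
    rf.extractall("data/")

# Load the extracted JSON file as a Hugging Face dataset
dataset = load_dataset("json", data_files="data/data.json", split="train")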
print(dataset[0])

- Source: Social media, dialogues, e-commerce, Wikidata SPARQL
- Annotation:
  - Semi-automated and rule-based extraction
  - Manual normalization of entity surface forms
  - Wikidata QID linking via SPARQL and fallback search
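
When working with the linked entities, it can help to sanity-check the `wikidata_id` values and turn them into Wikidata page URLs. The sketch below is not the authors' linking pipeline. The helper names are hypothetical, and the handling of missing or malformed IDs is an assumption, since the source does not say how unlinked entities are represented.

```python
import re

# Wikidata item identifiers are the letter Q followed by a positive integer.
QID_PATTERN = re.compile(r"Q[1-9]\d*")


def is_valid_qid(value):
    """Return True if value looks like a Wikidata item ID such as 'Q117156470'."""
    return isinstance(value, str) and QID_PATTERN.fullmatch(value) is not None


def wikidata_url(qid):
    """Build the Wikidata page URL for a QID, or return None if it is not valid.

    Assumption: entities without a link, if any exist, carry a null,
    empty, or otherwise non-QID value.
    """
    if not is_valid_qid(qid):
        return None
    return f"https://www.wikidata.org/wiki/{qid}"


print(wikidata_url("Q117156470"))
```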
- Bouguettoucha Hadjer Hanine
- Djouablia Ilhem
This dataset is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.
@dataset{bouguettoucha_djouablia_2025,
  author    = {Bouguettoucha, Hadjer Hanine and Djouablia, Ilhem},
  title     = {ELNER-DZ: A Dataset for Named Entity Recognition and Linking in Algerian Arabic},
  year      = 2025,
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.15798592},
  url       = {https://doi.org/10.5281/zenodo.15798592}
}