LangResourceAtlas is a curated repository that categorizes over 500 languages into "high," "medium," and "low" resource groups based on the availability of their digital linguistic data. It is intended as a foundational resource for researchers, developers, and practitioners working on multilingual Natural Language Processing (NLP) tasks, especially those focusing on less-resourced languages. Its goals are:
- Standardizing Resource Levels: Offering a shared reference for understanding language resource availability across a wide spectrum of languages, with languages identified by ISO 639-3 codes and writing systems by ISO 15924 codes.
- Consolidating Information: Bringing together insights from various prominent datasets and research efforts into a single, accessible location.
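As an illustration of the two standards mentioned above, the sketch below parses a combined language-script tag. ISO 639-3 codes are three lowercase letters and ISO 15924 codes are four letters with the first capitalized. The `lang_Script` tag form (for example `eng_Latn`) and the names `LangScript` and `parse_tag` are assumptions made for this example, not necessarily the format used in this repository's data files. The check covers only the shape of the codes, not whether they are actually registered.

```python
import re
from typing import NamedTuple


class LangScript(NamedTuple):
    lang: str    # ISO 639-3 code: three lowercase letters, e.g. "eng"
    script: str  # ISO 15924 code: four letters, first capitalized, e.g. "Latn"


# Hypothetical tag format "<lang>_<Script>"; an assumption for illustration.
_TAG_PATTERN = re.compile(r"([a-z]{3})_([A-Z][a-z]{3})")


def parse_tag(tag: str) -> LangScript:
    """Split a tag such as "eng_Latn" into its language and script codes.

    Only the shape of each code is validated; the function does not check
    the codes against the ISO registries.
    """
    match = _TAG_PATTERN.fullmatch(tag)
    if match is None:
        raise ValueError(f"not a well-formed language-script tag: {tag!r}")
    return LangScript(lang=match.group(1), script=match.group(2))


if __name__ == "__main__":
    print(parse_tag("eng_Latn"))
    print(parse_tag("zho_Hans"))
```

Keeping the script code next to the language code lets the same language be listed separately for each writing system it uses.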
The categorization in LangResourceAtlas is informed by a careful analysis and synthesis of information derived from, but not limited to, the following critical data sources and research initiatives:
- FineWeb: Large-scale web-crawled text data, providing insights into monolingual text availability.
- MaLA Corpus: A multilingual corpus (including both monolingual and parallel data) designed for adapting language models to massively multilingual scenarios.
- Joshi et al. (2020): A study that looks at the relation between the types of languages, resources, and their representation in NLP conferences to understand the trajectory that different languages have followed over time. Joshi, P., Santy, S., Budhiraja, A., Bali, K., & Choudhury, M. (2020). The state and fate of linguistic diversity and inclusion in the NLP world. arXiv preprint arXiv:2004.09095.
- DCAD-2000: TODO
- Glot500-c: Languages are treated as head languages if they are supported by XLM-R, and as tail languages otherwise.
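The Glot500-c rule in the list above (head if supported by XLM-R, otherwise tail) amounts to a set-membership test. The following is a minimal sketch under stated assumptions: the function name `glot500_group`, the tag format, and the tiny `XLMR_SUPPORTED_EXAMPLE` set are hypothetical and only illustrative. The real XLM-R language list is much longer and should be taken from the XLM-R release itself.

```python
# Illustrative subset only: NOT the full XLM-R language list.
# The "<lang>_<Script>" tag format is an assumption for this example.
XLMR_SUPPORTED_EXAMPLE = frozenset({"eng_Latn", "fin_Latn", "deu_Latn"})


def glot500_group(tag: str, xlmr_supported: frozenset = XLMR_SUPPORTED_EXAMPLE) -> str:
    """Return "head" if the language is supported by XLM-R, else "tail"."""
    return "head" if tag in xlmr_supported else "tail"


if __name__ == "__main__":
    # "qaa" lies in the ISO 639 range reserved for local use, so it serves
    # here as a placeholder for a language outside the supported set.
    for tag in ("eng_Latn", "qaa_Latn"):
        print(tag, glot500_group(tag))
```

Passing the supported set as a parameter keeps the rule separate from any particular copy of the language list.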
We welcome contributions to improve the accuracy and coverage of LangResourceAtlas! If you have:
- New data sources that can inform resource categorization.
- Corrections to existing categorizations.
- Suggestions for improving the methodology or data format.
Please open an issue or submit a pull request.
For any questions or inquiries, please open an issue on this GitHub repository or join our Discord server MaLA-LM.
If you use LangResourceAtlas in your research or work, please cite it using the following BibTeX entry:
@misc{LangResourceAtlas,
author = {Li, Zihao and Ji, Shaoxiong},
title = {{LangResourceAtlas: A Comprehensive Map of Language Resource Categorization}},
year = {2025},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/MaLA-LM/LangResourceAtlas}}
}