TomC333/NLP-Georgian-Language-Corpus

NLP Georgian Language Corpus 🇬🇪

This project started as a university assignment for natural language processing: collecting clean Georgian text from Common Crawl data.

Scraping Georgian websites proved tricky: regexes, language detection, and URL filters all fell short. Treat this code as a reference point, not a perfect solution.
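
To illustrate the kind of character-based filtering mentioned above, here is a minimal sketch. It is not taken from this repository. It assumes a simple heuristic: keep a piece of text only if most of its letters fall in the Mkhedruli range of the Unicode Georgian block (U+10D0 to U+10FF). The function names `georgian_ratio` and `looks_georgian` and the `0.8` threshold are hypothetical choices for the example.

```python
import re

# Assumption: modern Georgian (Mkhedruli) letters live in U+10D0..U+10FF.
GEORGIAN_LETTER = re.compile(r"[\u10D0-\u10FF]")


def georgian_ratio(text: str) -> float:
    """Return the share of alphabetic characters that are Georgian letters."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        # No letters at all (empty string, digits, punctuation only).
        return 0.0
    georgian = sum(1 for ch in letters if GEORGIAN_LETTER.match(ch))
    return georgian / len(letters)


def looks_georgian(text: str, threshold: float = 0.8) -> bool:
    """Hypothetical filter: accept text whose letters are mostly Georgian."""
    return georgian_ratio(text) >= threshold


if __name__ == "__main__":
    samples = ["საქართველო", "Hello world", "თბილისი (Tbilisi)"]
    for sample in samples:
        print(sample, round(georgian_ratio(sample), 2), looks_georgian(sample))
```

A ratio check like this is cheap, but it also shows why such filters fall short: mixed-language pages, navigation menus, and transliterated text can land on either side of any fixed threshold.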

The resulting corpus is publicly available on Hugging Face:
👉 https://huggingface.co/datasets/TomC333/georgian-language-corpus

It’s not massive or perfect, but it’s a useful starting point for anyone interested in Georgian NLP.