This project started as a university assignment for natural language processing: collecting clean Georgian text from Common Crawl data.
Scraping Georgian websites proved tricky: regexes, language detection, and URL filters all fell short. Treat this code as a reference point, not a perfect solution.
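
To give a sense of the kind of character-level heuristic involved, here is a minimal sketch of a script-ratio filter. It is not the code from this project: the function names, the `threshold` of 0.8, and the `min_letters` cutoff of 20 are all illustrative assumptions. It only checks whether letters fall in the Georgian Unicode blocks, so it shares the weaknesses mentioned above (it cannot judge text quality, and mixed-language pages can slip through or be rejected).

```python
# Hypothetical helper names; the Unicode ranges are the standard Georgian blocks.
GEORGIAN_RANGES = (
    (0x10A0, 0x10FF),  # Georgian (Asomtavruli and Mkhedruli)
    (0x1C90, 0x1CBF),  # Georgian Extended (Mtavruli)
    (0x2D00, 0x2D2F),  # Georgian Supplement (Nuskhuri)
)


def is_georgian(ch: str) -> bool:
    """Return True if the character's code point lies in a Georgian block."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in GEORGIAN_RANGES)


def georgian_ratio(text: str) -> float:
    """Fraction of alphabetic characters that are Georgian (0.0 if no letters)."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(1 for ch in letters if is_georgian(ch)) / len(letters)


def looks_georgian(text: str, threshold: float = 0.8, min_letters: int = 20) -> bool:
    """Keep text that has enough letters and is mostly Georgian script.

    Both defaults are assumed values for illustration, not tuned settings.
    """
    letter_count = sum(1 for ch in text if ch.isalpha())
    return letter_count >= min_letters and georgian_ratio(text) >= threshold
```

A filter like this is cheap enough to run on every crawled page before any heavier language detection, but it should be read as a first pass rather than a complete solution.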
The resulting corpus is publicly available on Hugging Face:
👉 https://huggingface.co/datasets/TomC333/georgian-language-corpus
It’s not massive or perfect, but it’s a useful starting point for anyone interested in Georgian NLP.