Web scraper application for online Guarani text, developed under the GuaranIA project as part of the UCA Autumn of Code 2025 initiative.
Manually identified websites that contain text in Guarani:
- Secretaría Nacional de Cultura Paraguay: Paraguayan government site
- Secretaría de Políticas Lingüísticas: Paraguayan government site
- ABC Color: Paraguayan newspaper
- Facultad de Humanidades, Ciencias Sociales y Cultura Guaraní: Paraguayan university
- Yvy Marãe'ỹ: institute for cultural studies
- Misa Guarani: church readings
- Portal Guarani: history and culture of Paraguay
- Guarani Raity: a Guarani library
- Vikipetã: Wikipedia in Guarani
- jw.org: Jehovah's Witnesses site
- Última Hora: Paraguayan newspaper
- Ñane Ñe'ẽ Guarani: blog about Guarani
- GuaraniMeme: blog about Guarani
- lenguagurani: blog about Guarani
- Constitución: Paraguayan constitution in Guarani
- Guarani Renda: bilingual site
- Sociedad Biblica Paraguay: biblical passages
- Ministerio de Economia y Finanzas Paraguay: Guarani-language articles from a Paraguayan government site
Requirements:
- Python 3.12+
- pip (Python package manager)
Installation:
- Clone the repository:
    git clone https://github.com/guaran-ia/guarascrapper
    cd guarascrapper
- Create and activate a virtual environment (recommended):
    python3 -m venv venv
    # On Windows
    venv\Scripts\activate
    # On macOS/Linux
    source venv/bin/activate
- Install dependencies:
    pip3 install -r requirements.txt
- Download the FastText language identification model:
    mkdir -p src/guarani_scraper/utils/lang_model
    curl https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin -o src/guarani_scraper/utils/lang_model/lid.176.bin
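
For orientation, here is a minimal sketch of how the downloaded `lid.176.bin` model could be used to decide whether a piece of text is Guarani. It is not the project's actual `lang_detector.py`: the function names `is_guarani` and `detect` and the `threshold` value are hypothetical, chosen only for illustration. The sketch relies on the `fasttext` Python package (`load_model` and `predict`) and on the fact that fastText labels take the form `__label__<code>`, with `gn` as the code for Guarani.

```python
from pathlib import Path

# Location used by the download step in this README.
MODEL_PATH = Path("src/guarani_scraper/utils/lang_model/lid.176.bin")

# fastText prefixes each language code with "__label__"; "gn" is Guarani.
GUARANI_LABEL = "__label__gn"


def is_guarani(labels, probabilities, threshold=0.5):
    """Return True if the top prediction is Guarani with enough confidence.

    `threshold` is a hypothetical cut-off, not a value taken from the project.
    """
    if not labels:
        return False
    return labels[0] == GUARANI_LABEL and float(probabilities[0]) >= threshold


def detect(text, model):
    """Classify one piece of text with a loaded fastText model."""
    # fastText's predict() processes a single line, so collapse newlines first.
    labels, probabilities = model.predict(text.replace("\n", " "), k=1)
    return is_guarani(labels, probabilities)


if __name__ == "__main__" and MODEL_PATH.exists():
    import fasttext  # third-party package: pip install fasttext

    model = fasttext.load_model(str(MODEL_PATH))
    print(detect("Mba'éichapa reiko", model))
```

Keeping the threshold check in its own function makes it easy to tune the cut-off without reloading the model.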
Option 1: Scrape from CSV file
To run the scraper using the included list of Guarani websites:
    python3 cli.py --csv data/web_sources.csv
Option 2: Scrape a single URL
To scrape a specific website:
    python3 cli.py --url https://guaranimeme.blogspot.com
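
The two options above can be pictured with a small sketch of how a command-line entry point might turn `--csv` or `--url` into a list of start URLs. This is an illustration only, not the contents of `cli.py`: the helper names `build_parser`, `urls_from_csv`, and `resolve_targets` are hypothetical, treating the two flags as mutually exclusive is an assumption, and so is the `url` column name in the CSV file.

```python
import argparse
import csv


def build_parser():
    """Build a parser that accepts exactly one of --csv or --url."""
    parser = argparse.ArgumentParser(
        prog="cli.py", description="Scrape Guarani text from websites."
    )
    # Assumption: the two options cannot be combined in one run.
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("--csv", help="CSV file listing websites to scrape")
    group.add_argument("--url", help="single website to scrape")
    return parser


def urls_from_csv(path, column="url"):
    """Read start URLs from a CSV file.

    The `url` column name is an assumption about data/web_sources.csv.
    Rows with an empty URL cell are skipped.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        return [
            row[column].strip()
            for row in csv.DictReader(handle)
            if row.get(column)
        ]


def resolve_targets(argv):
    """Return the list of URLs selected by the command-line arguments."""
    args = build_parser().parse_args(argv)
    if args.url:
        return [args.url]
    return urls_from_csv(args.csv)
```

For example, `resolve_targets(["--url", "https://guaranimeme.blogspot.com"])` returns a one-element list containing that URL.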
The scraped Guarani text is saved in the corpus directory.
You can modify the following files to adjust the scraper's behavior:
- src/guarani_scraper/settings.py: Adjust crawling settings like delay, throttling, and user agent
- src/guarani_scraper/guarani_scraper/utils/lang_detector.py: Fine-tune the language detection logic
- data/web_sources.csv: Add or remove websites to be scraped
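
As an illustration of the crawling settings mentioned above, here is a sketch of what delay, throttling, and user agent entries could look like. It assumes the scraper is built on Scrapy, which the `settings.py` layout suggests but this README does not state. The setting names are standard Scrapy settings; every value, including the user agent string, is a placeholder rather than the project's real configuration.

```python
# Hypothetical excerpt of a Scrapy-style settings.py; values are placeholders.

# Identify the crawler to site owners (placeholder string).
USER_AGENT = "guarascrapper (+https://github.com/guaran-ia/guarascrapper)"

# Respect each site's robots.txt rules.
ROBOTSTXT_OBEY = True

# Base delay, in seconds, between requests to the same site.
DOWNLOAD_DELAY = 2.0

# Limit parallel requests per domain to keep the load on small sites low.
CONCURRENT_REQUESTS_PER_DOMAIN = 2

# Let Scrapy's AutoThrottle extension adapt the delay to server response times.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 2.0
AUTOTHROTTLE_MAX_DELAY = 30.0
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
```

Raising `DOWNLOAD_DELAY` or lowering `AUTOTHROTTLE_TARGET_CONCURRENCY` makes the crawl slower but gentler on the target sites.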
The scraper has been verified to work correctly on the following websites:
- GuaraniMeme: blog about Guarani
- Portal Guarani: history and culture of Paraguay
- Facultad de Humanidades, Ciencias Sociales y Cultura Guaraní: Paraguayan university
- Guarani Raity: a Guarani library
- Constitución: Paraguayan constitution in Guarani
- Vikipetã: Wikipedia in Guarani
- Agencia de Información Paraguaya: Paraguayan information agency
- jw.org: Jehovah's Witnesses site
- Secretaría de Políticas Lingüísticas Paraguay: Paraguayan government site
- Secretaría Nacional de Cultura Paraguay: Paraguayan government site
- Yvy Marãe'ỹ: institute for cultural studies
Additional work is required before the application can correctly scrape the following websites:
- ABC Color: Paraguayan newspaper
- Misa Guarani: church readings
- Última Hora: Paraguayan newspaper
The following websites have not been tested yet:
- Ñane Ñe'ẽ Guarani: blog about Guarani
- lenguagurani: blog about Guarani
- Ñe'ẽ: journal of linguistic and cultural research
- Guarani Renda: bilingual site
- Sociedad Biblica Paraguay: biblical passages
- Ministerio de Economia y Finanzas Paraguay: Guarani-language articles from a Paraguayan government site