Guarascraper

A web scraper for online Guarani text, developed under the GuaranIA project as part of the UCA Autumn of Code 2025 initiative.

Websites in Guarani

Manually identified websites that contain text in Guarani:

Secretaría Nacional de Cultura Paraguay: Paraguayan government site
Secretaría de Políticas Lingüísticas: Paraguayan government site
ABC Color: Paraguayan newspaper
Facultad de Humanidades, Ciencias Sociales y Cultura Guaraní: Paraguayan university
Yvy Marãe'ỹ: institute for cultural studies
Misa Guarani: church readings
Portal Guarani: history and culture of Paraguay
Guarani Raity: a kind of Guarani library
Vikipetã: Wikipedia in Guarani
jw.org: Jehovah's Witnesses site
Ultima hora: Paraguayan newspaper
Ñane Ñe'ẽ Guarani: blog about Guarani
GuaraniMeme: blog about Guarani
lenguagurani: blog about Guarani
Constitución: Paraguayan constitution in Guarani
Guarani Renda: bilingual site
Sociedad Biblica Paraguay: biblical passages
Ministerio de Economia y Finanzas Paraguay: articles in Guarani from a Paraguayan government site

Installation

Prerequisites

Python 3.12+
pip (Python package manager)

Setup Instructions

Clone the repository:

git clone https://github.com/guaran-ia/guarascrapper
cd guarascrapper

Create and activate a virtual environment (recommended):

python3 -m venv venv

# On Windows
venv\Scripts\activate

# On macOS/Linux
source venv/bin/activate

Install dependencies:

pip3 install -r requirements.txt

Download the FastText language identification model:

mkdir -p src/guarani_scraper/utils/lang_model

curl https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin -o src/guarani_scraper/utils/lang_model/lid.176.bin
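Once the model is downloaded, it can be sanity-checked from Python. The following is a minimal sketch, not the project's actual lang_detector.py. It assumes the `fasttext` Python package is installed (check requirements.txt); the helper names `load_model`, `parse_prediction` and `is_guarani`, and the 0.5 confidence threshold, are hypothetical. FastText returns labels of the form `__label__<code>`, and lid.176 labels Guarani as `gn`.

```python
from pathlib import Path

# Path used by the curl command above.
MODEL_PATH = Path("src/guarani_scraper/utils/lang_model/lid.176.bin")
LABEL_PREFIX = "__label__"


def load_model(path=MODEL_PATH):
    """Load the fastText language identification model from disk."""
    import fasttext  # imported lazily so the helpers below work without it

    return fasttext.load_model(str(path))


def parse_prediction(labels, probs, target="gn", threshold=0.5):
    """Return True if the top label is `target` with probability >= threshold.

    The threshold is an illustrative assumption, not a project setting.
    """
    if not labels:
        return False
    code = labels[0].removeprefix(LABEL_PREFIX)
    return code == target and float(probs[0]) >= threshold


def is_guarani(text, model, threshold=0.5):
    """Classify a piece of text with a loaded fastText model."""
    # fastText predicts one line at a time, so newlines are flattened first.
    labels, probs = model.predict(text.replace("\n", " "), k=1)
    return parse_prediction(labels, probs, threshold=threshold)
```

With the model file in place, `is_guarani("Mba'éichapa", load_model())` would run a single prediction; the result depends on the model, so no expected output is shown here.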

Usage

Basic Usage

Option 1: Scrape from CSV file

To run the scraper using the included list of Guarani websites:

python3 cli.py --csv data/web_sources.csv

Option 2: Scrape a single URL

To scrape a specific website:

python3 cli.py --url https://guaranimeme.blogspot.com

The scraped Guarani text is saved in the corpus directory.
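To scrape a handful of sites without editing the CSV file, the single-URL option can be called in a loop. The sketch below is one possible way to do that, assuming it is run from the repository root with the virtual environment active. It relies only on the `--url` flag documented above; the function names `build_command` and `scrape_all` and the `dry_run` parameter are hypothetical.

```python
import subprocess
import sys


def build_command(url):
    """Build the documented single-URL invocation of cli.py."""
    # sys.executable keeps the call inside the active virtual environment.
    return [sys.executable, "cli.py", "--url", url]


def scrape_all(urls, dry_run=True):
    """Run cli.py once per URL; with dry_run, only return the commands."""
    commands = [build_command(url) for url in urls]
    if not dry_run:
        for command in commands:
            # check=True raises CalledProcessError if a run exits with an error.
            subprocess.run(command, check=True)
    return commands
```

Calling `scrape_all(["https://guaranimeme.blogspot.com"], dry_run=False)` would run the same command as Option 2 above.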

Configuration

You can modify the following files to adjust the scraper's behavior:

src/guarani_scraper/settings.py: Adjust crawling settings like delay, throttling, and user agent
src/guarani_scraper/guarani_scraper/utils/lang_detector.py: Fine-tune the language detection logic
data/web_sources.csv: Add or remove websites to be scraped
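As an illustration of the crawling settings mentioned above, here is a sketch of what a polite configuration could look like. It assumes the project is built on Scrapy, which this README does not state explicitly; the setting names are standard Scrapy settings, while the values and the user agent string are illustrative placeholders rather than the project's defaults.

```python
# Hypothetical excerpt for settings.py, assuming a Scrapy-based project.

# Identify the crawler to site owners (placeholder string).
USER_AGENT = "guarascraper (+https://example.org/contact)"

# Respect robots.txt rules published by each site.
ROBOTSTXT_OBEY = True

# Fixed minimum delay, in seconds, between requests to the same site.
DOWNLOAD_DELAY = 2.0

# AutoThrottle adapts the delay to the observed server latency.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 30.0
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
```

Longer delays and a low target concurrency reduce load on the smaller sites in the list, at the cost of slower crawls.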

Acknowledgement

The scraper has been verified to work correctly on the following websites:

GuaraniMeme: blog about Guarani
Portal Guarani: history and culture of Paraguay
Facultad de Humanidades, Ciencias Sociales y Cultura Guaraní: Paraguayan university
Guarani Raity: a kind of Guarani library
Constitución: Paraguayan constitution in Guarani
Vikipetã: Wikipedia in Guarani
Agencia de Información Paraguaya: Paraguayan information agency
jw.org: Jehovah's Witnesses site
Secretaría de Políticas Lingüísticas Paraguay: Paraguayan government site
Secretaría Nacional de Cultura Paraguay: Paraguayan government site
Yvy Marãe'ỹ: institute for cultural studies

Additional work is required before the application can correctly scrape the following websites:

ABC Color: Paraguayan newspaper
Misa Guarani: church readings
Ultima hora: Paraguayan newspaper

The following websites have not been tested yet:

Ñane Ñe'ẽ Guarani: blog about Guarani
lenguagurani: blog about Guarani
Ñe'ẽ: journal of linguistic and cultural research
Guarani Renda: bilingual site
Sociedad Biblica Paraguay: biblical passages
Ministerio de Economia y Finanzas Paraguay: articles in Guarani from a Paraguayan government site
