guaran-ia/guarascraper

Guarascraper

A web scraper for Guarani text published online, developed under the GuaranIA project as part of the UCA Autumn of Code 2025 initiative.


Websites in Guarani

Manually identified websites that contain text in Guarani

Installation

Prerequisites

  • Python 3.12+
  • pip (Python package manager)

Setup Instructions

  1. Clone the repository:

    git clone https://github.com/guaran-ia/guarascraper
    cd guarascraper
  2. Create and activate a virtual environment (recommended):

    python3 -m venv venv
    
    # On Windows
    venv\Scripts\activate
    
    # On macOS/Linux
    source venv/bin/activate
  3. Install dependencies:

    pip3 install -r requirements.txt
  4. Download the FastText language identification model:

    mkdir -p src/guarani_scraper/utils/lang_model
    
    curl https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin -o src/guarani_scraper/utils/lang_model/lid.176.bin
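
The downloaded model is what lets the scraper tell Guarani apart from other languages. The sketch below shows one way such a check could look. It is not the project's actual `lang_detector.py`: the function name `is_guarani`, the `Predictor` alias, and the 0.5 confidence default are all assumptions made for illustration. It relies only on two facts about fastText: `lid.176` labels Guarani as `__label__gn`, and `predict` expects a single line of text.

```python
from typing import Callable, Sequence, Tuple

# Label that fastText's lid.176 model uses for Guarani (ISO 639-1 code "gn").
GUARANI_LABEL = "__label__gn"

# Hypothetical alias: anything that behaves like a fastText model's predict method.
Predictor = Callable[..., Tuple[Sequence[str], Sequence[float]]]


def is_guarani(text: str, predict: Predictor, min_confidence: float = 0.5) -> bool:
    """Return True if the top predicted language is Guarani with enough confidence.

    min_confidence is an assumed default, not a value taken from the project.
    """
    # fastText predicts one line at a time, so collapse newlines and extra spaces.
    single_line = " ".join(text.split())
    if not single_line:
        return False
    labels, probs = predict(single_line, k=1)
    return labels[0] == GUARANI_LABEL and float(probs[0]) >= min_confidence
```

With the `fasttext` Python package installed, the real model could be plugged in as `is_guarani(text, fasttext.load_model("src/guarani_scraper/utils/lang_model/lid.176.bin").predict)`. Passing the predictor as an argument keeps the check easy to test without the model file.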

Usage

Basic Usage

Option 1: Scrape from a CSV file

To run the scraper using the included list of Guarani websites:

python3 cli.py --csv data/web_sources.csv
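
When adding sites to the CSV, it can help to check which URLs a file would yield. The following is a minimal sketch of that check, not the code `cli.py` uses. The helper `read_source_urls` and the column name `url` are hypothetical; check the header of `data/web_sources.csv` for the real column name.

```python
import csv
import io
from typing import List, TextIO


def read_source_urls(handle: TextIO, column: str = "url") -> List[str]:
    """Collect non-empty, de-duplicated URLs from one CSV column.

    The default column name "url" is an assumption for illustration.
    """
    reader = csv.DictReader(handle)
    seen = set()
    urls = []
    for row in reader:
        value = (row.get(column) or "").strip()
        if value and value not in seen:
            seen.add(value)
            urls.append(value)
    return urls


# In-memory sample standing in for data/web_sources.csv.
sample = io.StringIO(
    "name,url\n"
    "Guarani Meme,https://guaranimeme.blogspot.com\n"
    "Duplicate,https://guaranimeme.blogspot.com\n"
    "Empty,\n"
)
print(read_source_urls(sample))  # ['https://guaranimeme.blogspot.com']
```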

Option 2: Scrape a single URL

To scrape a specific website:

python3 cli.py --url https://guaranimeme.blogspot.com

The scraped Guarani text is saved in the corpus directory.

Configuration

You can modify the following files to adjust the scraper's behavior:

  • src/guarani_scraper/settings.py: Adjust crawling settings like delay, throttling, and user agent
  • src/guarani_scraper/guarani_scraper/utils/lang_detector.py: Fine-tune the language detection logic
  • data/web_sources.csv: Add or remove websites to be scraped
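
As an illustration of the crawling settings mentioned above, here is a sketch of what a polite configuration might contain. It assumes the project is built on Scrapy, which this README does not state; the setting names are standard Scrapy settings, but every value, the bot name, and the contact address are placeholders rather than the project's defaults.

```python
# Illustrative values only; assumes a Scrapy-style settings.py.
BOT_NAME = "guarani_scraper"  # assumed from the directory name

# Wait at least this many seconds between requests to the same site.
DOWNLOAD_DELAY = 2.0

# Let AutoThrottle adapt the delay to how quickly each server responds.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 2.0
AUTOTHROTTLE_MAX_DELAY = 30.0

# Identify the crawler to site owners; the contact address is a placeholder.
USER_AGENT = "guarascraper (+mailto:contact@example.org)"

# Honour robots.txt rules on the scraped sites.
ROBOTSTXT_OBEY = True
```

Raising `DOWNLOAD_DELAY` and keeping AutoThrottle on trades crawl speed for a lighter load on small community sites, which is usually the safer default for this kind of corpus building.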

Acknowledgement

The scraper proved to work correctly on the following identified websites

Additional work is required to have the application correctly scrape the following identified sites

The following identified websites have not been tested yet
