Python scraper for the Polish Parliament (Sejm) API - extracts MP details and voting records with PostgreSQL storage and Docker support

seszele64/sejm-scraper

Sejm API Scraper

This project provides tools to scrape and store data from the Polish Parliament (Sejm) API. It consists of two main components: MP Scraper and Voting Scraper.

Features

  • MP Scraper: Fetches information about Members of Parliament
  • Voting Scraper: Retrieves voting records for MPs
  • PostgreSQL Storage: Stores all data in a structured database
  • Docker Support: Easy database setup and management

Project Structure

sejm-api-scraper
├── docker
│   ├── docker-compose.yml       # Docker configuration for services
│   └── postgres
│       └── init.sql             # SQL commands for database initialization
├── src
│   ├── __init__.py              # Marks src as a Python package
│   ├── config
│   │   ├── __init__.py          # Marks config as a Python package
│   │   └── settings.py          # Configuration settings and environment variables
│   ├── db
│   │   ├── __init__.py          # Marks db as a Python package
│   │   ├── connection.py         # Database connection logic
│   │   └── models.py             # Database models using an ORM
│   ├── schemas
│   │   ├── __init__.py          # Marks schemas as a Python package
│   │   └── parliament_models.py   # Data models specific to the Sejm API
│   ├── scrapers
│   │   ├── __init__.py          # Marks scrapers as a Python package
│   │   ├── base_scraper.py      # Base scraper class with common functionality
│   │   ├── mp_scraper.py        # MP scraper for retrieving member data
│   │   └── voting_scraper.py    # Voting scraper for retrieving voting records
│   └── utils
│       ├── __init__.py          # Marks utils as a Python package
│       ├── logger.py             # Logging setup for the application
│       └── helpers.py            # Utility functions for the application
├── tests
│   ├── __init__.py              # Marks tests as a Python package
│   ├── conftest.py              # Configuration for pytest
│   ├── test_mp_scraper.py       # Unit tests for the MP scraper
│   └── test_voting_scraper.py   # Unit tests for the voting scraper
├── .env.example                  # Example environment variables
├── .gitignore                    # Files and directories to ignore by Git
├── Dockerfile                    # Docker image definition for the application
├── requirements.txt              # Python dependencies for the project
├── setup.py                      # Packaging information for the application
└── README.md                     # Documentation for the project
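As an illustration of what `src/config/settings.py` might contain, here is a minimal sketch that builds a PostgreSQL connection URL from environment variables. The variable names (`DB_HOST`, `DB_PORT`, `DB_USER`, `DB_PASSWORD`, `DB_NAME`) and the `database_url` helper are assumptions made for this example; check `.env.example` for the names the project actually uses. The `postgres` user and `sejm_db` database defaults are taken from the commands in Database Setup.

```python
import os
from urllib.parse import quote_plus


def database_url(env=None):
    """Build a PostgreSQL URL from environment-style settings.

    All variable names read here are hypothetical, not taken from the project.
    """
    env = os.environ if env is None else env
    user = env.get("DB_USER", "postgres")
    password = env.get("DB_PASSWORD", "")
    host = env.get("DB_HOST", "localhost")
    port = env.get("DB_PORT", "5432")
    name = env.get("DB_NAME", "sejm_db")
    # Percent-encode credentials so characters such as "@" or "/" stay valid.
    auth = quote_plus(user)
    if password:
        auth += ":" + quote_plus(password)
    return f"postgresql://{auth}@{host}:{port}/{name}"
```

Passing a plain dict instead of `os.environ` keeps the function easy to exercise in tests without touching the real environment.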

Installation

  1. Clone this repository:

    git clone https://github.com/seszele64/sejm-scraper.git
    cd sejm-scraper
  2. Create and activate a virtual environment:

    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies:

    pip install -r requirements.txt

Database Setup

The project uses PostgreSQL running in Docker for data storage:

  1. Start the database:

    cd docker
    sudo docker-compose up -d
  2. Verify the database is running:

    sudo docker exec -it sejm_postgres psql -U postgres -d sejm_db -c "SELECT 1;"

API Documentation

The official API for the Polish Parliament (Sejm) can be found at https://api.sejm.gov.pl/sejm.html. It provides detailed information about available endpoints and their usage.

Running the Scrapers

MP Scraper

The MP scraper fetches data about Members of Parliament from the Sejm API and stores it in the database.

# Run with default settings
python -m src.scrapers.mp_scraper

# Run with debug logging
python -m src.scrapers.mp_scraper --debug

# Specify a different Sejm term (default: 10)
python -m src.scrapers.mp_scraper --term 9

Voting Scraper

The voting scraper retrieves the voting records of a given MP for a specific proceeding and date.

# Fetch voting data for an MP
python -m src.scrapers.voting_scraper --mp-id 123 --proceeding 1 --date 2023-11-13

# Run with debug logging
python -m src.scrapers.voting_scraper --mp-id 123 --proceeding 1 --date 2023-11-13 --debug
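For readers curious how such flags are usually wired up, the following is a rough sketch using `argparse` from the standard library. Whether the project uses `argparse`, and how it validates its inputs, is an assumption; only the flag names come from the commands above, and `build_parser` is a hypothetical helper.

```python
import argparse
from datetime import date


def build_parser():
    """Argument parser mirroring the voting scraper flags shown above."""
    parser = argparse.ArgumentParser(prog="voting_scraper")
    parser.add_argument("--mp-id", type=int, required=True, help="MP identifier")
    parser.add_argument("--proceeding", type=int, required=True, help="Proceeding number")
    # date.fromisoformat raises ValueError on a malformed date,
    # which argparse turns into a usage error.
    parser.add_argument("--date", type=date.fromisoformat, required=True,
                        help="Voting date in YYYY-MM-DD form")
    parser.add_argument("--debug", action="store_true", help="Enable debug logging")
    return parser
```

Parsing the date at the command-line boundary means the rest of the code can work with a `date` object rather than a raw string.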

Module Details

MP Scraper (src.scrapers.mp_scraper)

The MP scraper performs the following functions:

  • Fetches a list of all MPs for the specified term
  • Retrieves detailed information for each MP
  • Stores or updates MP data in the database
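The first two steps above can be sketched as follows. The endpoint layout (`/sejm/term{term}/MP` for the list and `/sejm/term{term}/MP/{id}` for details) is an assumption that should be checked against the official API documentation, and `mp_list_url`, `mp_details_url`, and `fetch_json` are hypothetical helpers, not the project's actual functions.

```python
import json
from urllib.request import urlopen

BASE_URL = "https://api.sejm.gov.pl/sejm"  # assumed base path; see the API docs


def mp_list_url(term=10):
    """URL for the list of MPs in a term (assumed endpoint layout)."""
    return f"{BASE_URL}/term{term}/MP"


def mp_details_url(mp_id, term=10):
    """URL for a single MP's details (assumed endpoint layout)."""
    return f"{BASE_URL}/term{term}/MP/{mp_id}"


def fetch_json(url, opener=urlopen, timeout=30):
    """GET a URL and decode the JSON body; `opener` is injectable for tests."""
    with opener(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Keeping URL construction separate from the network call makes both halves easy to test without hitting the live API.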

Key methods:

  • fetch_mp_list(): Retrieves the list of MPs
  • fetch_mp_details(mp_id): Gets detailed information for a specific MP
  • save_mp_to_db(mp_data, db_session): Saves or updates MP data in the database
  • run(): Executes the complete scraping process
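`save_mp_to_db` is described as saving or updating, which is commonly implemented as an upsert. The sketch below shows that pattern with `INSERT ... ON CONFLICT ... DO UPDATE`, using an in-memory SQLite database as a stand-in for PostgreSQL so that it runs without Docker. The table and column names come from the queries in Verifying Data; the `upsert_mp` helper and the sample rows are made up for illustration. The project itself uses an ORM (`src/db/models.py`), so its real code will look different.

```python
import sqlite3

UPSERT_SQL = """
INSERT INTO members_of_parliament (id, first_name, last_name, club)
VALUES (?, ?, ?, ?)
ON CONFLICT (id) DO UPDATE SET
    first_name = excluded.first_name,
    last_name = excluded.last_name,
    club = excluded.club
"""


def upsert_mp(conn, mp):
    """Insert an MP row, or update it if the id already exists."""
    conn.execute(UPSERT_SQL, (mp["id"], mp["first_name"], mp["last_name"], mp["club"]))
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE members_of_parliament ("
    "id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT, club TEXT)"
)
# Placeholder data, not real API output.
upsert_mp(conn, {"id": 1, "first_name": "Jan", "last_name": "Kowalski", "club": "A"})
upsert_mp(conn, {"id": 1, "first_name": "Jan", "last_name": "Kowalski", "club": "B"})
rows = conn.execute("SELECT id, club FROM members_of_parliament").fetchall()
```

After the second call the table still holds a single row for id 1, now with the updated club. With a PostgreSQL driver such as psycopg2 the placeholders would be `%s` rather than `?`.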

Voting Scraper (src.scrapers.voting_scraper)

The voting scraper retrieves and processes voting records:

  • Fetches voting data for specified MPs, proceedings, and dates
  • Processes and formats the data
  • Stores voting records in the database

Key methods:

  • fetch_voting_data(leg, proceeding, date, term): Retrieves voting data from the API
  • store_voting_data(voting_data, session): Saves voting records to the database
  • scrape(leg, proceeding, date): Complete process of fetching and storing voting data
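The fetch and format steps might look roughly like the sketch below. Several things here are assumptions: that `leg` is the MP's numeric identifier, that the endpoint follows the `/MP/{leg}/votings/{proceeding}/{date}` layout, and that each item in the response carries `vote` and `topic` keys. The output keys match the `voting_records` columns shown in Verifying Data; `voting_url` and `to_records` are hypothetical helpers.

```python
from datetime import date


def voting_url(leg, proceeding, voting_date, term=10):
    """Assumed URL layout for one MP's votes at a proceeding on a given day."""
    return (
        f"https://api.sejm.gov.pl/sejm/term{term}"
        f"/MP/{leg}/votings/{proceeding}/{voting_date.isoformat()}"
    )


def to_records(leg, voting_date, payload):
    """Map raw API items to rows shaped like the voting_records table.

    The "vote" and "topic" keys are hypothetical field names.
    """
    return [
        {
            "mp_id": leg,
            "voting_date": voting_date,
            "vote": item.get("vote"),
            "description": item.get("topic", ""),
        }
        for item in payload
    ]
```

Using `dict.get` with defaults lets the mapping tolerate items that lack an optional field instead of raising `KeyError` halfway through a batch.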

Verifying Data

To check if the data has been successfully stored:

# Connect to the database (run in your shell)
sudo docker exec -it sejm_postgres psql -U postgres -d sejm_db

-- View MP data (run these inside psql)
SELECT COUNT(*) FROM members_of_parliament;
SELECT id, first_name, last_name, club FROM members_of_parliament LIMIT 10;

-- View voting data
SELECT COUNT(*) FROM voting_records;
SELECT mp_id, voting_date, vote, description FROM voting_records LIMIT 10;

Troubleshooting

  • If you encounter database connection issues, confirm that Docker is running and that the sejm_postgres container is up (for example with sudo docker ps).
  • For API connection problems, check your internet connection and verify that the Sejm API endpoints are reachable.
  • Pass the --debug flag to either scraper for more detailed logs.
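A `--debug` flag of this kind usually just lowers the logging threshold. The sketch below shows one way to do that with the standard `logging` module; it is not taken from `src/utils/logger.py`, and the `configure_logging` name and the `sejm_scraper` logger name are hypothetical.

```python
import logging


def configure_logging(debug=False):
    """Return a logger at DEBUG level when debug is set, INFO otherwise."""
    level = logging.DEBUG if debug else logging.INFO
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    )
    logger = logging.getLogger("sejm_scraper")  # hypothetical logger name
    logger.setLevel(level)
    return logger
```

Setting the level on the named logger as well as in `basicConfig` keeps the function's effect predictable even if logging was already configured elsewhere.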

License

This project is licensed under the MIT License.
