This project provides tools to scrape and store data from the Polish Parliament (Sejm) API. It consists of two main components: MP Scraper and Voting Scraper.
- MP Scraper: Fetches information about Members of Parliament
- Voting Scraper: Retrieves voting records for MPs
- PostgreSQL Storage: Stores all data in a structured database
- Docker Support: Easy database setup and management
sejm-api-scraper
├── docker
│   ├── docker-compose.yml # Docker configuration for services
│   └── postgres
│       └── init.sql # SQL commands for database initialization
├── src
│   ├── __init__.py # Marks src as a Python package
│   ├── config
│   │   ├── __init__.py # Marks config as a Python package
│   │   └── settings.py # Configuration settings and environment variables
│   ├── db
│   │   ├── __init__.py # Marks db as a Python package
│   │   ├── connection.py # Database connection logic
│   │   └── models.py # Database models using an ORM
│   ├── schemas
│   │   ├── __init__.py # Marks schemas as a Python package
│   │   └── parliament_models.py # Data models specific to the Sejm API
│   ├── scrapers
│   │   ├── __init__.py # Marks scrapers as a Python package
│   │   ├── base_scraper.py # Base scraper class with common functionality
│   │   ├── mp_scraper.py # MP scraper for retrieving member data
│   │   └── voting_scraper.py # Voting scraper for retrieving voting records
│   └── utils
│       ├── __init__.py # Marks utils as a Python package
│       ├── logger.py # Logging setup for the application
│       └── helpers.py # Utility functions for the application
├── tests
│   ├── __init__.py # Marks tests as a Python package
│   ├── conftest.py # Configuration for pytest
│   ├── test_mp_scraper.py # Unit tests for the MP scraper
│   └── test_voting_scraper.py # Unit tests for the voting scraper
├── .env.example # Example environment variables
├── .gitignore # Files and directories to ignore by Git
├── Dockerfile # Docker image definition for the application
├── requirements.txt # Python dependencies for the project
├── setup.py # Packaging information for the application
└── README.md # Documentation for the project
- Clone this repository:
git clone https://github.com/yourusername/sejm-scraper.git
cd sejm-scraper
- Create and activate a virtual environment:
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
- Install dependencies:
pip install -r requirements.txt
The project uses PostgreSQL running in Docker for data storage:
- Start the database:
cd docker
sudo docker-compose up -d
- Verify the database is running:
sudo docker exec -it sejm_postgres psql -U postgres -d sejm_db -c "SELECT 1;"
The official API for the Polish Parliament (Sejm) can be found at https://api.sejm.gov.pl/sejm.html. It provides detailed information about available endpoints and their usage.
The MP scraper fetches data about Members of Parliament from the Sejm API and stores it in the database.
# Run with default settings
python -m src.scrapers.mp_scraper
# Run with debug logging
python -m src.scrapers.mp_scraper --debug
# Specify a different Sejm term (default: 10)
python -m src.scrapers.mp_scraper --term 9
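The flags above could be wired up roughly as follows. This is a minimal sketch, not the project's actual code: the flag names and the default term of 10 come from the commands above, while `build_parser` and `log_level` are hypothetical names chosen for illustration.

```python
import argparse
import logging


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented --term and --debug flags."""
    parser = argparse.ArgumentParser(prog="mp_scraper")
    parser.add_argument("--term", type=int, default=10,
                        help="Sejm term to scrape (default: 10)")
    parser.add_argument("--debug", action="store_true",
                        help="Enable debug logging")
    return parser


def log_level(args: argparse.Namespace) -> int:
    """Map the --debug flag to a logging level."""
    return logging.DEBUG if args.debug else logging.INFO


# Parse an explicit argument list, equivalent to: --term 9 --debug
args = build_parser().parse_args(["--term", "9", "--debug"])
logging.basicConfig(level=log_level(args))
logging.getLogger("mp_scraper").debug("Would scrape term %s", args.term)
```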
The voting scraper retrieves voting records for MPs.
# Fetch voting data for an MP
python -m src.scrapers.voting_scraper --mp-id 123 --proceeding 1 --date 2023-11-13
# Run with debug logging
python -m src.scrapers.voting_scraper --mp-id 123 --proceeding 1 --date 2023-11-13 --debug
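The `--date` value in the commands above is an ISO date (YYYY-MM-DD). One way to validate it before any request is made is to let `argparse` convert it with `date.fromisoformat`. This is only a sketch: `build_voting_parser` is a hypothetical name, and treating the three arguments as required is an assumption, not something the project documents.

```python
import argparse
from datetime import date


def build_voting_parser() -> argparse.ArgumentParser:
    """Hypothetical parser for the documented voting scraper flags."""
    parser = argparse.ArgumentParser(prog="voting_scraper")
    # Assumption: all three identifiers are mandatory.
    parser.add_argument("--mp-id", type=int, required=True)
    parser.add_argument("--proceeding", type=int, required=True)
    # date.fromisoformat raises ValueError on malformed input,
    # which argparse turns into a usage error.
    parser.add_argument("--date", type=date.fromisoformat, required=True)
    parser.add_argument("--debug", action="store_true")
    return parser


voting_args = build_voting_parser().parse_args(
    ["--mp-id", "123", "--proceeding", "1", "--date", "2023-11-13"]
)
```

With this approach a value such as `--date 13.11.2023` is rejected at the command line instead of producing a malformed API request.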
The MP scraper performs the following functions:
- Fetches a list of all MPs for the specified term
- Retrieves detailed information for each MP
- Stores or updates MP data in the database
Key methods:
- fetch_mp_list(): Retrieves the list of MPs
- fetch_mp_details(mp_id): Gets detailed information for a specific MP
- save_mp_to_db(mp_data, db_session): Saves or updates MP data in the database
- run(): Executes the complete scraping process
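How the four methods fit together can be sketched as below. This is an illustration under stated assumptions, not the project's implementation: the URL pattern `https://api.sejm.gov.pl/sejm/term{term}/MP` and the `id` field are assumed from the public API and should be checked against the official documentation, the class name `MPScraperSketch` is hypothetical, and a plain dict stands in for the database session so the example runs without PostgreSQL.

```python
import json
import urllib.request
from typing import Any, Callable

API_BASE = "https://api.sejm.gov.pl/sejm"  # assumed base URL


def http_get_json(url: str) -> Any:
    """Fetch a URL and decode its JSON body using only the standard library."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


class MPScraperSketch:
    """Illustrative stand-in for the MP scraper, not the real class."""

    def __init__(self, term: int = 10,
                 fetch_json: Callable[[str], Any] = http_get_json):
        self.term = term
        self.fetch_json = fetch_json  # injectable so it can be faked offline

    def fetch_mp_list(self) -> list:
        return self.fetch_json(f"{API_BASE}/term{self.term}/MP")

    def fetch_mp_details(self, mp_id: int) -> dict:
        return self.fetch_json(f"{API_BASE}/term{self.term}/MP/{mp_id}")

    def save_mp_to_db(self, mp_data: dict, db_session: dict) -> None:
        # The dict is keyed by MP id: an existing key is updated,
        # a new key is inserted (a simple upsert).
        db_session[mp_data["id"]] = mp_data

    def run(self, db_session: dict) -> int:
        for mp in self.fetch_mp_list():
            details = self.fetch_mp_details(mp["id"])
            self.save_mp_to_db(details, db_session)
        return len(db_session)


def fake_fetch(url: str) -> Any:
    """Offline replacement for the API, returning made-up data."""
    if url.endswith("/MP"):
        return [{"id": 1}, {"id": 2}]
    return {"id": int(url.rsplit("/", 1)[1]), "lastName": "Example"}


store: dict = {}
stored_count = MPScraperSketch(term=9, fetch_json=fake_fetch).run(store)
```

Passing the fetch function in as a parameter keeps the flow testable without network access; the real scraper may structure this differently.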
The voting scraper retrieves and processes voting records:
- Fetches voting data for specified MPs, proceedings, and dates
- Processes and formats the data
- Stores voting records in the database
Key methods:
- fetch_voting_data(leg, proceeding, date, term): Retrieves voting data from the API
- store_voting_data(voting_data, session): Saves voting records to the database
- scrape(leg, proceeding, date): Complete process of fetching and storing voting data
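The fetch and store steps above might look like the following sketch. The endpoint pattern `/term{term}/MP/{leg}/votings/{proceeding}/{date}` is an assumption based on the public API and should be verified against the official documentation. The output keys match the `voting_records` columns queried later in this README, but the input keys `vote` and `description` are hypothetical, as are the helper names `voting_url` and `to_record`.

```python
from datetime import date

API_BASE = "https://api.sejm.gov.pl/sejm"  # assumed base URL


def voting_url(leg: int, proceeding: int, day: date, term: int = 10) -> str:
    """Build the (assumed) endpoint for one MP's votes on a given day."""
    return f"{API_BASE}/term{term}/MP/{leg}/votings/{proceeding}/{day.isoformat()}"


def to_record(leg: int, day: date, item: dict) -> dict:
    """Flatten one API item into a row shaped like voting_records."""
    return {
        "mp_id": leg,
        "voting_date": day,
        # Hypothetical input keys; .get() yields None when a key is absent.
        "vote": item.get("vote"),
        "description": item.get("description"),
    }


example_url = voting_url(123, 1, date(2023, 11, 13))
example_row = to_record(123, date(2023, 11, 13), {"vote": "YES"})
```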
To check if the data has been successfully stored:
# Connect to the database
sudo docker exec -it sejm_postgres psql -U postgres -d sejm_db
-- View MP data (run inside psql)
SELECT COUNT(*) FROM members_of_parliament;
SELECT id, first_name, last_name, club FROM members_of_parliament LIMIT 10;
-- View voting data
SELECT COUNT(*) FROM voting_records;
SELECT mp_id, voting_date, vote, description FROM voting_records LIMIT 10;
- If you encounter database connection issues, ensure Docker is running and the PostgreSQL container is up
- For API connection problems, check your internet connection and verify the Sejm API endpoints are accessible
- Debug mode can be enabled with the --debug flag for more detailed logs
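For the API connectivity check mentioned above, a small reachability probe can help separate network problems from scraper bugs. This is a sketch using only the standard library; `api_reachable` is a hypothetical helper, not part of the project, and the `opener` parameter exists only so the function can be exercised without network access.

```python
import urllib.error
import urllib.request
from typing import Callable


def api_reachable(url: str,
                  opener: Callable = urllib.request.urlopen,
                  timeout: float = 10.0) -> bool:
    """Return True if the URL answers with a 2xx status, False otherwise."""
    try:
        with opener(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Covers DNS failures, refused connections, timeouts and HTTP errors.
        return False
```

Calling `api_reachable("https://api.sejm.gov.pl/sejm.html")` from the machine that runs the scrapers gives a quick yes or no before digging into debug logs.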
This project is licensed under the MIT License.