Marktplaats.nl (Dutch Classifieds) Listing Scraper.
usage: mpscraper [-h] [--limit LIMIT] [--headless HEADLESS] [--chromium-path CHROMIUM_PATH]
                 [--driver-path DRIVER_PATH] [--timeout TIMEOUT] [--recrawl-hours RECRAWL_HOURS]
                 [--data-dir DATA_DIR] [--wait-seconds WAIT_SECONDS]

options:
  -h, --help            show this help message and exit
  --limit, -l LIMIT     The limit of new listings to scrape. (MP_LIMIT) (default: 0)
  --headless HEADLESS   Run browser in headless mode. (MP_HEADLESS) (default: False)
  --chromium-path CHROMIUM_PATH
                        Path to Chromium executable. (default: /usr/bin/chromium)
  --driver-path DRIVER_PATH
                        Path to Chromium ChromeDriver executable. (default: None)
  --timeout, -t TIMEOUT
                        Seconds before timeout occurs. (MP_TIMEOUT_SECONDS) (default: 10)
  --recrawl-hours, -r RECRAWL_HOURS
                        Recrawl listings that haven't been checked for this many hours or more
                        (MP_RECRAWL_HOURS) (default: 24)
  --data-dir, -d DATA_DIR
                        Directory to save output data. (default: ./)
  --wait-seconds WAIT_SECONDS
                        Seconds to wait before re-trying after being rate-limited. (MP_WAIT_SECONDS)
                        (default: 10)
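
Several options above list an environment variable in parentheses (for example `MP_LIMIT`) next to their default. The sketch below shows one common way such a setup can be wired with `argparse`, plus a possible reading of the `--recrawl-hours` rule. It is not taken from the mpscraper source: the helper names `env_default`, `build_parser` and `needs_recrawl` are hypothetical, only a subset of the flags is reproduced, and the precedence (command-line flag over environment variable over built-in default) is an assumption.

```python
import argparse
import os
from datetime import datetime, timedelta, timezone


def env_default(name: str, fallback: str) -> str:
    """Return the environment variable's value if it is set, else the fallback."""
    return os.environ.get(name, fallback)


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical reconstruction covering only three of the documented flags.
    # Assumption: an explicit flag overrides the environment variable.
    parser = argparse.ArgumentParser(prog="mpscraper-sketch")
    parser.add_argument("--limit", "-l", type=int,
                        default=int(env_default("MP_LIMIT", "0")))
    parser.add_argument("--timeout", "-t", type=int,
                        default=int(env_default("MP_TIMEOUT_SECONDS", "10")))
    parser.add_argument("--recrawl-hours", "-r", type=float,
                        default=float(env_default("MP_RECRAWL_HOURS", "24")))
    return parser


def needs_recrawl(last_checked: datetime, recrawl_hours: float, now: datetime) -> bool:
    """True when the listing was last checked `recrawl_hours` or more ago."""
    return now - last_checked >= timedelta(hours=recrawl_hours)


if __name__ == "__main__":
    # Parse an explicit argument list instead of sys.argv for the demo.
    args = build_parser().parse_args(["--limit", "50"])
    now = datetime.now(timezone.utc)
    print(args.limit, args.timeout, args.recrawl_hours)
    print(needs_recrawl(now - timedelta(hours=30), args.recrawl_hours, now))
```

Reading the environment when the parser is built keeps the variables visible as defaults in `--help` output while still letting a flag on the command line take priority.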
mkdir data/ && chown -R 1000:1000 data/
docker run -it -v ${PWD}/data:/data ghcr.io/chadsr/marktplaats-scraper:latest
poetry install
poetry run mpscraper -d data/
- Category Classification Model - Predicts the appropriate Marktplaats category for a given listing title text.
- Category Statistics - Computes basic statistics for a given category and ranks listing types by views/popularity.
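
As a rough illustration of the kind of ranking the Category Statistics item describes, the sketch below groups listings by type and orders the types by mean view count. It is not the project's own code: the field names `listing_type` and `views`, the function `rank_by_mean_views`, and the sample records are all made up for this example, and the scraper's actual output schema may differ.

```python
from collections import defaultdict
from statistics import mean


def rank_by_mean_views(listings):
    """Group listings by type and rank the types by mean views, highest first."""
    views_by_type = defaultdict(list)
    for listing in listings:
        # "listing_type" and "views" are hypothetical field names.
        views_by_type[listing["listing_type"]].append(listing["views"])
    ranked = [(kind, mean(views)) for kind, views in views_by_type.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Made-up sample records, for illustration only.
sample = [
    {"listing_type": "road bike", "views": 120},
    {"listing_type": "road bike", "views": 80},
    {"listing_type": "city bike", "views": 300},
    {"listing_type": "kids bike", "views": 40},
]

for kind, average in rank_by_mean_views(sample):
    print(f"{kind}: {average:.1f} mean views")
```

The mean is only one possible popularity measure; a median or a views-per-day figure would be less sensitive to a few very popular or very old listings.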