A web crawler designed to scrape Pokémon card prices from TCGPlayer.com and export them to .csv files.
## Installation

1. Install Python 3, if you do not have it already.
2. Create a new virtual environment: `python -m venv .venv`
3. Enter the virtual environment:
   - PowerShell: `. .venv\Scripts\Activate.ps1`
   - cmd.exe: `.venv\Scripts\activate.bat`
   - Linux: `source .venv/bin/activate`
4. Install the dependencies and the Playwright browsers: `pip install -r requirements.txt`, then `playwright install`
## Usage

- Enter the virtual environment, if you are not in it already. (See step 3 of the installation instructions.)
- Run the crawler with the following command: `scrapy crawl main`
- A window will pop up with a list of sets that can be scraped. Check the ones that you want and then close the window.
- Wait for the crawl to finish; the CSV files are written when it completes.
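Each selected set is exported to its own .csv file. The exact columns are defined in `items.py` and `pipelines.py`; purely as an illustration (the column names here are assumptions), a file might look like:

```text
card_name,price
Charizard,412.50
Blastoise,120.00
```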
## Project Structure

| File | Purpose |
|---|---|
| settings.py | Settings for Scrapy and the spider. |
| pipelines.py | Pipeline that takes items and outputs them to CSV files. |
| items.py | The data structure for the scraped data. |
| spiders/main_spider.py | The spider code that handles requesting and parsing data. |
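To illustrate the pipeline's role, here is a minimal, hypothetical sketch of a CSV-exporting item pipeline using only the standard library. The field names (`set_name`, `card_name`, `price`) and the output layout are assumptions; see `pipelines.py` for the actual implementation.

```python
import csv
from pathlib import Path


class CsvExportPipeline:
    """Sketch of a Scrapy-style item pipeline: each scraped item is
    appended as a row to a CSV file named after its set."""

    def __init__(self, out_dir="output"):
        self.out_dir = Path(out_dir)
        self.out_dir.mkdir(exist_ok=True)
        self._writers = {}  # set_name -> (file handle, csv.writer)

    def process_item(self, item, spider=None):
        set_name = item["set_name"]
        if set_name not in self._writers:
            # First item for this set: open its file and write a header.
            f = open(self.out_dir / f"{set_name}.csv", "w", newline="")
            w = csv.writer(f)
            w.writerow(["card_name", "price"])
            self._writers[set_name] = (f, w)
        self._writers[set_name][1].writerow([item["card_name"], item["price"]])
        return item  # Scrapy pipelines return the item for later stages

    def close_spider(self, spider=None):
        # Called once the crawl ends; flush and close every open file.
        for f, _ in self._writers.values():
            f.close()
```

Scrapy calls `process_item` once per scraped item and `close_spider` when the crawl ends, which is why the file handles are kept open for the duration of the run.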
## Dependencies

| Dependency | Min Version | Reason Used | Notes |
|---|---|---|---|
| scrapy | 2.11.0 | Framework that orchestrates the scraping process and provides a CLI tool for running the scraper. | |
| playwright | 1.15 | Runs a headless browser that downloads dynamic content. | |
| scrapy-playwright | Special | Implements a Scrapy download handler that lets Scrapy download pages using Playwright. | This project uses a fork of scrapy-playwright that lets it run on Windows rather than just Linux. It is included in source form in this project rather than as a submodule. |
| wxPython | 4.2.1 | Used to implement the set selector window. | |
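For reference, scrapy-playwright is normally wired into a Scrapy project through two settings, as documented upstream; whether this project's `settings.py` matches this excerpt exactly is an assumption.

```python
# settings.py (excerpt) -- sketch of the standard scrapy-playwright setup;
# the fork used by this project may configure things slightly differently.

# Route HTTP(S) downloads through the Playwright-backed download handler.
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

# scrapy-playwright requires the asyncio-based Twisted reactor.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
```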