First, we should clone the repository (obviously!):
$ git clone https://github.com/MahdiRahmani80/divar-scraper.git
Next, we typically want to create a virtual environment. On Linux:
$ virtualenv .venv/
$ source .venv/bin/activate
$ pip install -r requirements.txt # Install required packages
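To confirm the virtual environment is actually active before installing anything, a quick standard-library check works on any platform (this is a general Python fact, not something specific to this project):

```python
import sys

# Inside an active virtualenv, sys.prefix points into .venv/,
# while sys.base_prefix still points at the system Python.
in_venv = sys.prefix != sys.base_prefix
print("virtualenv active:", in_venv)
```

If this prints `False`, re-run the `source .venv/bin/activate` step before `pip install`.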
After setting up the environment, we can configure the project in src/config/Setting.py:
- SAND_BOX_MODE: If True, the project runs in debug mode with more logs and limited scraping.
- USER_AGENT: Set your user agent here. It's easier to manage it in one place.
- IS_URL_UNIQUE_IN_DATA_BASE: If True, the scraper avoids collecting duplicate entries (URLs must be unique).
- IRAN_CITIES_JSON_PATH: The path to Iran's cities and provinces data, which is stored in the database.
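Put together, a Setting.py along these lines would cover the options above. The setting names come from the list; every value here is an illustrative assumption, not a project default:

```python
# Hypothetical sketch of src/config/Setting.py — values are illustrative only.
SAND_BOX_MODE = True                 # debug mode: more logs, limited scraping
USER_AGENT = "my-scraper/1.0"        # assumed placeholder; set your own UA string
IS_URL_UNIQUE_IN_DATA_BASE = True    # skip ads whose URL is already stored
IRAN_CITIES_JSON_PATH = "data/iran_cities.json"  # assumed path to the cities/provinces JSON
```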
In utils/Constant.py, we have several constants:
- DIVAR_ADDON: Use this to add query options to the scraper URL, like IDENTITY_VERIFIED, to filter ads accordingly.
- Other constants control scroll speed and behavior during scraping.
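As a rough illustration of how a query addon filters the listing URL — the constant name follows the text, but the exact query syntax and helper function are assumptions, not the repository's actual code:

```python
# Hypothetical example: appending a query addon to a listing URL.
BASE_URL = "https://divar.ir/s/tehran"          # assumed base listing URL
DIVAR_ADDON = "?identity-verified=true"          # assumed query format

def build_url(base: str, addon: str = "") -> str:
    """Join the base listing URL with an optional query addon."""
    return base + addon

print(build_url(BASE_URL, DIVAR_ADDON))
```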
This is the main part of the program:
asyncio.run(main(
    Setting.DEFAULT_SAVE_METHOD,              # how scraped ads are saved
    check_page=check_page,
    pages=get_page(start=1, page=2, step=1),  # revisit pages 1 through 2
    interval_sec=10,                          # wait 10 seconds between iterations
    max_iterations=None                       # no iteration limit; run until stopped
))
Here, you can configure how the scraper operates. For example, why would you want it to always revisit pages 1 and 2? Because new ads can arrive while you scroll during scraping, so rechecking the first pages helps ensure you're collecting fresh data for your dataset. 😊
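To make the pages argument concrete, here is a minimal sketch of what a generator like get_page(start=1, page=2, step=1) might yield — the real implementation lives in the repository; this version simply produces the page numbers the scraper revisits:

```python
# Hypothetical sketch of get_page — yields page numbers from `start`
# through `page` (inclusive), stepping by `step`.
def get_page(start: int = 1, page: int = 2, step: int = 1):
    yield from range(start, page + 1, step)

print(list(get_page(start=1, page=2, step=1)))  # → [1, 2]
```

With these defaults, pages 1 and 2 are rechecked on every iteration, which matches the fresh-data behavior described above.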