A web crawler that collects hotel data and reviews from Booking.com and generates a JSON dataset.
- GitHub repo link: https://github.com/MAPLELEAF3659/booking-web-crawler
- Made by MAPLELEAF3659
- Python version: 3.12.2
- Packages (see requirements.txt for the full list)
  - beautifulsoup4: 4.12.3
  - selenium: 4.27.1
  - tqdm: 4.67.1
- WebDriver: Chrome
- Create virtual environment
    pip install virtualenv
    virtualenv venv
- Enter virtual environment
  - Windows PowerShell
    .\venv\Scripts\Activate.ps1
  - Linux
    source venv/bin/activate
- Install required packages
    pip install -r requirements.txt
- Run main.py
    py main.py --search "東京澀谷" --check_in 2025-02-01 --check_out 2025-02-02 --num_adults 2 --num_children 0 --num_rooms 1
- Command arguments

| short name | full name | description | type | default value | required? |
|---|---|---|---|---|---|
| -s | --search | Search keywords. | str | None | Y |
| -ci | --check_in | Check-in date. Format: yyyy-MM-dd | str | None | Y if --check_out is given |
| -co | --check_out | Check-out date. Format: yyyy-MM-dd | str | None | Y if --check_in is given |
| -na | --num_adults | Number of adults. | int | 2 | N |
| -nc | --num_children | Number of children. | int | 0 | N |
| -nr | --num_rooms | Number of rooms. | int | 1 | N |
| -mi | --max_item | Maximum number of items (hotels) to crawl. | int | 999 | N |
| -mp | --max_page | Maximum number of review pages to crawl per item. | int | 999 | N |
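As a rough illustration of how the arguments above could be declared with Python's argparse, here is a minimal sketch. It is not taken from main.py: the names iso_date, build_parser, and parse_args are hypothetical, and the date-format check and the "both dates or neither" rule are assumptions based only on the argument descriptions.

```python
import argparse
from datetime import datetime


def iso_date(text: str) -> str:
    """Accept only yyyy-MM-dd strings; argparse turns a ValueError into a usage error."""
    datetime.strptime(text, "%Y-%m-%d")
    return text


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser that mirrors the documented arguments; main.py may differ.
    parser = argparse.ArgumentParser(description="Booking.com crawler (sketch)")
    parser.add_argument("-s", "--search", type=str, required=True,
                        help="search keywords")
    parser.add_argument("-ci", "--check_in", type=iso_date, default=None,
                        help="check-in date, yyyy-MM-dd")
    parser.add_argument("-co", "--check_out", type=iso_date, default=None,
                        help="check-out date, yyyy-MM-dd")
    parser.add_argument("-na", "--num_adults", type=int, default=2)
    parser.add_argument("-nc", "--num_children", type=int, default=0)
    parser.add_argument("-nr", "--num_rooms", type=int, default=1)
    parser.add_argument("-mi", "--max_item", type=int, default=999)
    parser.add_argument("-mp", "--max_page", type=int, default=999)
    return parser


def parse_args(argv=None) -> argparse.Namespace:
    parser = build_parser()
    args = parser.parse_args(argv)
    # Assumed rule: the two dates are required together, so one without
    # the other is rejected.
    if (args.check_in is None) != (args.check_out is None):
        parser.error("--check_in and --check_out must be given together")
    return args
```

Calling parse_args(["-s", "東京澀谷"]) in this sketch returns a namespace with the documented defaults (2 adults, 0 children, 1 room, and 999 for both limits).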
- The results are saved as a .json file in ./result/
The output is a JSON array of BookingData objects. Each BookingData object represents one hotel: its basic properties, its star rating (a Star object with count and type), and its user reviews (an OverallRating object, count, count_crawled, and an array of Review objects). The structure is described in detail in the following sections.
- Overall example:

[
  BookingData,
  {
    "name": "hotel name",
    "address": "hotel address",
    "slogan": "hotel slogan",
    "description": "hotel description",
    "star": {                        -> Star
      "count": 5,
      "type": "official"
    },
    "user_review": {
      "overall_rating": {            -> OverallRating
        "type": "booking",
        "average": 8.2,
        "staff": 9.1,
        "facilities": 8.2,
        "cleanliness": 8.4,
        "comfort": 8.5,
        "value": 8.0,
        "location": 9.4,
        "wifi": 9.0
      },
      "count": 2344,
      "count_crawled": 500,
      "reviews": [
        Review,
        {
          "user_name": "user name",
          "user_type": "user type",
          "country": "user country",
          "room_name": "room name",
          "num_night": 1,
          "stay_date": "yyyy-MM",
          "review_date": "yyyy-MM-dd",
          "title": "review title",
          "positive_description": "positive review description",
          "negative_description": "negative review description",
          "rating": 10.0
        },
        ...
      ]
    }
  },
  ...
]
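To show how the structure above might be consumed, here is a minimal sketch that loads the crawler output and pulls a few values out of each hotel. The function names load_results and summarize are hypothetical. Because the exact output file name is not stated here, the sketch simply reads every .json file under ./result/. It also assumes that nested objects may be missing or null.

```python
import json
from pathlib import Path


def load_results(result_dir="./result"):
    """Load every .json file in result_dir and return one flat list of hotels."""
    hotels = []
    for path in sorted(Path(result_dir).glob("*.json")):
        with path.open(encoding="utf-8") as f:
            hotels.extend(json.load(f))  # each file holds an array of BookingData
    return hotels


def summarize(hotel):
    """Return (name, star count, average rating, 'crawled/total') for one hotel."""
    star = hotel.get("star") or {}
    user_review = hotel.get("user_review") or {}
    rating = user_review.get("overall_rating") or {}
    return (
        hotel.get("name"),
        star.get("count"),
        rating.get("average"),
        f"{user_review.get('count_crawled')}/{user_review.get('count')}",
    )
```

A typical use would be `for hotel in load_results(): print(summarize(hotel))`.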
The BookingData object represents a hotel's core information, rating summary, and reviews.
Fields:
- name: (Optional, String) - The name of the hotel.
- address: (Optional, String) - The address of the hotel.
- slogan: (Optional, String) - The hotel's slogan or tagline.
- description: (Optional, String) - A brief description of the hotel.
- star: (Star) - Star rating details.
- user_review: (UserReview) - Summary and detailed user reviews.
The star object provides information about the hotel's star rating.
Fields:
- count: (Optional, Integer) - The star rating of the hotel (range: 0-5).
- type: (Optional, String) - Type of the star rating (e.g., "official", "booking").
The user_review object represents user feedback and reviews for the hotel.
Fields:
- overall_rating: (OverallRating) - Aggregated ratings of the hotel.
- count: (Optional, Integer) - Total number of reviews.
- count_crawled: (Optional, Integer) - Number of crawled reviews.
- reviews: (List[Review]) - Detailed individual reviews.
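Since count_crawled can be smaller than count (500 of 2344 in the overall example), it may be useful to know what fraction of a hotel's reviews a dataset actually covers. The following is a small sketch with a hypothetical helper name, crawl_coverage; it treats missing values and a zero total as "unknown".

```python
from typing import Optional


def crawl_coverage(user_review: dict) -> Optional[float]:
    """Return count_crawled / count, or None when it cannot be computed."""
    total = user_review.get("count")
    crawled = user_review.get("count_crawled")
    if not total or crawled is None:
        return None  # both fields are optional, and count may be 0
    return crawled / total


example = {"count": 2344, "count_crawled": 500, "reviews": []}
print(f"{crawl_coverage(example):.1%}")  # 21.3%
```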
The overall_rating object contains aggregated scores for various attributes of the hotel.
*All values except type range from 0.0 to 10.0.
Fields:
- type: (Optional, String) - Rating type. (options: "booking", "external")
- average: (Optional, Float) - Average overall rating.
- staff: (Optional, Float) - Rating for the hotel staff.
- facilities: (Optional, Float) - Rating for facilities.
- cleanliness: (Optional, Float) - Rating for cleanliness.
- comfort: (Optional, Float) - Rating for comfort.
- value: (Optional, Float) - Rating for value for money.
- location: (Optional, Float) - Rating for location.
- wifi: (Optional, Float) - Rating for Wi-Fi quality.
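The documented 0.0 to 10.0 range can be checked when loading a dataset. Below is a minimal sketch, assuming an overall_rating object has already been parsed into a Python dict; SCORE_FIELDS and invalid_scores are hypothetical names, and missing or null scores are treated as valid because every field is optional.

```python
SCORE_FIELDS = (
    "average", "staff", "facilities", "cleanliness",
    "comfort", "value", "location", "wifi",
)


def invalid_scores(overall_rating: dict) -> dict:
    """Return {field: value} for scores that are present but not in 0.0-10.0."""
    bad = {}
    for field in SCORE_FIELDS:
        value = overall_rating.get(field)
        if value is None:
            continue  # optional field: missing or null is allowed
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or not 0.0 <= value <= 10.0:
            bad[field] = value
    return bad
```

An empty result means every score that is present lies within the documented range; the type field is deliberately not checked because it is a string.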
The Review object represents a detailed review from a user.
Fields:
- user_name: (Optional, String) - Name of the reviewer.
- user_type: (Optional, String) - Type of user (options: "single", "family", "couple", "group").
- country: (Optional, String) - Reviewer's country of origin.
- room_name: (Optional, String) - Name or type of the room stayed in.
- num_stay_night: (Optional, Integer) - Number of nights stayed.
- stay_date: (Optional, String) - Date of stay. (format: yyyy-MM)
- review_date: (Optional, String) - Date of review. (format: yyyy-MM-dd)
- title: (Optional, String) - Title of the review.
- positive_description: (Optional, String) - Positive aspects mentioned in the review.
- negative_description: (Optional, String) - Negative aspects mentioned in the review.
- rating: (Optional, Float) - Overall rating provided by the user (range: 0.0-10.0).
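For downstream analysis, the Review fields above can be mirrored in a typed container. The sketch below is one possible way to do that with a dataclass; it is not part of the crawler itself. Two points are assumptions: the class parses review_date into a date object for convenience, and it accepts both num_stay_night (used in the field list) and num_night (used in the overall example), since the document shows both spellings.

```python
from dataclasses import dataclass, fields
from datetime import date, datetime
from typing import Optional


@dataclass
class Review:
    user_name: Optional[str] = None
    user_type: Optional[str] = None
    country: Optional[str] = None
    room_name: Optional[str] = None
    num_stay_night: Optional[int] = None
    stay_date: Optional[str] = None      # yyyy-MM, kept as text
    review_date: Optional[date] = None   # parsed from yyyy-MM-dd
    title: Optional[str] = None
    positive_description: Optional[str] = None
    negative_description: Optional[str] = None
    rating: Optional[float] = None

    @classmethod
    def from_dict(cls, raw: dict) -> "Review":
        data = dict(raw)  # copy so the caller's dict is not modified
        # Assumption: treat "num_night" as an alias of "num_stay_night".
        if "num_night" in data:
            data.setdefault("num_stay_night", data.pop("num_night"))
        if data.get("review_date"):
            data["review_date"] = datetime.strptime(
                data["review_date"], "%Y-%m-%d"
            ).date()
        known = {f.name for f in fields(cls)}
        # Unknown keys are dropped rather than raising a TypeError.
        return cls(**{k: v for k, v in data.items() if k in known})
```

With this in place, something like `[Review.from_dict(r) for r in hotel["user_review"]["reviews"]]` yields objects that can be filtered by user_type or sorted by review_date.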