This project is a web scraping tool designed to extract product information from an e-commerce site based on specific search terms. The tool automates the process of data collection and organization for further analysis.
Run the following command to install the required modules:
pip install -r requirements.txt
The complete scraping script is located in the scraper.ipynb file. Execute it sequentially from the beginning.
You can use the provided search terms in ./queries/M11207321_queries.txt or prepare your own.
If you want to use custom search terms, ensure they are saved in a .txt file, with one search term per line:
筆電
衣服
餅乾
洗衣精
衛生紙
...
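As a rough illustration of the one-term-per-line format, a query file could be read as follows. The helper name `load_queries` is hypothetical and is not taken from scraper.ipynb; the notebook may load the file differently.

```python
from pathlib import Path


def load_queries(path):
    """Return the search terms in a text file, one per line.

    Hypothetical helper for illustration only. Blank lines are skipped
    and surrounding whitespace is removed from each term.
    """
    # "utf-8-sig" also accepts plain UTF-8 and drops a leading BOM if present.
    text = Path(path).read_text(encoding="utf-8-sig")
    return [line.strip() for line in text.splitlines() if line.strip()]
```

For example, `load_queries("./queries/M11207321_queries.txt")` would return the provided terms as a list of strings.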
Open scraper.ipynb and locate the Parameter Setting section. Here, you can define or modify the following parameters:
- student_id: Your student ID.
- query_path: The path to the search terms.
- results_path: The path where scraping results will be saved.
- search_url: The e-commerce site URL to scrape (must be the Taiwan Coupang page).
- short_time_sleep: Short wait time.
- medium_time_sleep: Medium wait time.
- long_time_sleep: Long wait time.
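A parameter cell might look roughly like the sketch below. Every value other than the student ID and query path shown earlier is an assumption: the results folder, the URL placeholder, and the wait times (assumed here to be in seconds) are illustrative and should be replaced with the values your notebook actually uses.

```python
# Illustrative parameter values only; replace them with your own.
student_id = "M11207321"
query_path = "./queries/M11207321_queries.txt"

# Assumed output folder; the notebook may use a different location.
results_path = "./results"

# Placeholder: fill in the Taiwan Coupang page URL you intend to scrape.
search_url = "<Taiwan Coupang page URL>"

# Assumed to be seconds; the actual units and values may differ.
short_time_sleep = 1
medium_time_sleep = 3
long_time_sleep = 10
```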
During the scraping process, ensure you collect the following product information:
- Product Name
- Product Price
- Product URL
Save the collected data as a .csv file, including the following columns:
- product_name: Product name
- product_price: Product price
- product_url: Product URL
Ensure the .csv file is encoded in UTF-8-SIG.
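The column layout and encoding above can be sketched with Python's standard csv module. This is a minimal example, not the notebook's own code: the function name `save_products` is hypothetical, and it assumes each product is a dict keyed by the three column names.

```python
import csv

FIELDNAMES = ["product_name", "product_price", "product_url"]


def save_products(products, csv_path):
    """Write product dicts to a CSV file encoded as UTF-8 with a BOM.

    Hypothetical helper; assumes each dict has exactly the keys in FIELDNAMES.
    """
    # newline="" lets the csv module control line endings itself.
    with open(csv_path, "w", encoding="utf-8-sig", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        writer.writeheader()
        writer.writerows(products)
```

The `utf-8-sig` codec writes a byte order mark at the start of the file, which helps spreadsheet programs such as Excel detect the encoding and display Chinese product names correctly.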
After scraping product data for each search term, save the results using the following file naming convention: StudentID_QueryName.csv, e.g., M11207321_口罩.csv (if your student ID contains letters, use uppercase letters).
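The naming convention can be expressed as a small helper. The function names here are hypothetical, and the sketch assumes the search term contains no characters that are invalid in file names (such as `/`).

```python
from pathlib import Path


def result_filename(student_id, query):
    """Build a file name of the form StudentID_QueryName.csv.

    Letters in the student ID are converted to uppercase.
    """
    return f"{student_id.upper()}_{query}.csv"


def result_path(results_dir, student_id, query):
    """Join the results folder and the result file name."""
    return Path(results_dir) / result_filename(student_id, query)
```

For instance, `result_filename("m11207321", "口罩")` gives `M11207321_口罩.csv`.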
If you need to submit the scraped data, place all result files for the search terms into a folder named after your student ID, then compress the folder into a .zip file named after your student ID, e.g., M11207321.zip (if your student ID contains letters, use uppercase letters).
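If you prefer to create the archive from Python instead of by hand, one possible approach uses `shutil.make_archive`. The function name `zip_submission` is hypothetical, and the sketch assumes the result files are already in a folder named after your student ID.

```python
import shutil
from pathlib import Path


def zip_submission(folder):
    """Compress a student-ID folder into <folder name>.zip next to it.

    Hypothetical helper. The folder itself is kept as the top-level
    entry inside the archive.
    """
    folder = Path(folder).resolve()
    archive = shutil.make_archive(
        base_name=str(folder.parent / folder.name),  # ".zip" is appended
        format="zip",
        root_dir=str(folder.parent),
        base_dir=folder.name,
    )
    return Path(archive)
```

Calling `zip_submission("./M11207321")` would produce `M11207321.zip` in the same directory as the folder.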
- While scraping, you may open other windows, but do not close or minimize the Chrome window running the scraper (important).
- Ensure that the screen remains on during the scraping process (important).