
This project scrapes and analyzes auction data from Christie's and Sotheby's using Python, Selenium, and Pandas. It explores factors influencing artwork prices, particularly auction house estimates and artist popularity (approximated via Yahoo search result counts). It includes multi-stage scrapers, data cleaning notebooks, and exploratory data analysis.


marcusrprojects/What-Makes-Art-Valuable


What Makes Art Valuable: Data Scraping and Analysis of Auction Data


Project Overview

This project explores factors influencing artwork prices at major auction houses (Christie's and Sotheby's). Using custom web scrapers built with Python and Selenium, auction data was collected, including artwork details, estimates, and final sale prices. The data was then cleaned and processed using Pandas to analyze trends, particularly the relationship between auction house estimates, artist popularity (approximated via Yahoo search results), and final sale prices.

Key Features

  • Multi-Stage Web Scraping: Python scripts utilizing Selenium to navigate dynamic auction sites, collect auction/artwork URLs, and extract specific artwork features (price, artist, estimates, dimensions, etc.).
  • Data Cleaning & Processing: Jupyter Notebooks demonstrating data cleaning techniques with Pandas, including:
    • Handling inconsistencies in scraped data.
    • Parsing and separating estimate ranges (low/high).
    • Standardizing and converting currencies (GBP, EUR, HKD, etc.) to USD.
    • Filtering out non-painting/print lots.
  • Feature Engineering:
    • Calculation of artist age and determination of living status.
    • Creation of binary 'Sold' status based on price data.
    • Calculation of estimate accuracy (whether the final price fell below, within, or above the estimate range).
    • Integration of artist popularity metric derived from scraping Yahoo search result counts using Requests and BeautifulSoup.
  • Exploratory Data Analysis (EDA): Initial visualizations exploring the relationship between sale prices, estimates, and artist popularity. The estimate plots are consistent with an underestimation bias and an anchoring effect.
  • (Experimental) An included notebook explores image classification using Keras/TensorFlow (VGG16), though this feature was not integrated into the final analysis.
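The estimate-related cleaning and feature steps listed above can be sketched roughly as follows. This is a minimal illustration, not the project's notebook code: the function names, the assumed estimate string format (for example "GBP 10,000 - GBP 15,000"), and the exchange rates in `USD_RATES` are all assumptions made for the example.

```python
import re

# Placeholder exchange rates to USD (assumed for illustration, not the project's rates).
USD_RATES = {"USD": 1.0, "GBP": 1.25, "EUR": 1.10, "HKD": 0.13}

# Matches strings like "GBP 10,000 - GBP 15,000" or "USD 1,000-2,000".
ESTIMATE_RE = re.compile(
    r"(?P<cur>[A-Z]{3})\s*(?P<low>\d[\d,]*)\s*[-\u2013]\s*"
    r"(?:[A-Z]{3}\s*)?(?P<high>\d[\d,]*)"
)


def parse_estimate(text):
    """Return (currency, low, high) from an estimate string, or None if unparseable."""
    match = ESTIMATE_RE.search(text or "")
    if match is None:
        return None
    low = int(match["low"].replace(",", ""))
    high = int(match["high"].replace(",", ""))
    return match["cur"], low, high


def to_usd(amount, currency):
    """Convert an amount to USD; raises KeyError for a currency without a rate."""
    return amount * USD_RATES[currency]


def is_sold(price):
    """Binary 'Sold' flag: a lot counts as sold when a positive price was recorded."""
    return price is not None and price > 0


def estimate_accuracy(price, low, high):
    """Classify a sale price as 'below', 'within', or 'above' the estimate range."""
    if price < low:
        return "below"
    if price > high:
        return "above"
    return "within"
```

With Pandas, helpers like these would typically be applied column-wise, for example with `Series.apply`, after which the parsed tuples can be expanded into separate low and high estimate columns.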

Key Technologies

  • Python: Core programming language.
  • Selenium: Web browser automation and scraping dynamic websites.
  • Pandas: Data manipulation, cleaning, and analysis.
  • NumPy: Numerical operations.
  • Requests & BeautifulSoup: Scraping static content (used for Yahoo search results).
  • Jupyter Notebook: Development environment for scraping, cleaning, and analysis.
  • Matplotlib & Seaborn: Data visualization.
  • (Experimental): Keras / TensorFlow

Key Findings

  • A strong correlation was observed between auction house estimates (both low and high) and the final sale price, suggesting a possible anchoring effect.
  • The auction house (Christie's data was primarily used for this part) tended to underestimate artwork values: roughly 54% of lots sold above the high estimate.
  • Artist popularity, as approximated by Yahoo search result counts, did not show a strong correlation with final sale prices within this dataset.
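The two estimate-related measurements above (correlation with the final price, and the share of lots selling above the high estimate) could be computed along these lines. The rows in `LOTS` are invented toy values used only to make the sketch runnable; they are not taken from the project's data and do not reproduce its figures. The helper names are likewise assumptions.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def share_above_high(lots):
    """Fraction of lots whose sale price exceeded the high estimate."""
    return sum(price > high for _, high, price in lots) / len(lots)


# Toy rows of (low_estimate, high_estimate, sale_price) in USD; invented for illustration.
LOTS = [
    (1000, 1500, 1800),
    (2000, 3000, 2500),
    (5000, 7000, 9000),
    (800, 1200, 700),
    (10000, 15000, 16000),
]

lows = [low for low, _, _ in LOTS]
prices = [price for _, _, price in LOTS]

r_low = pearson(lows, prices)   # correlation between low estimate and sale price
share = share_above_high(LOTS)  # share of lots that sold above the high estimate
```

On a Pandas DataFrame the same quantities are available more directly, for instance via `Series.corr` and the mean of a boolean comparison between the price and high-estimate columns.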

Challenges

  • Data Collection: Navigating the complexities of Selenium for dynamic websites and handling inconsistencies across different auction/artwork page layouts.
  • Data Cleaning: Significant effort was required to standardize formats, currencies, and filter out irrelevant lots (e.g., furniture).
  • Scope: Lots could not be reliably filtered down to paintings and prints, and medium was not included as a feature. Both may introduce noise into the analysis (e.g., comparing a Picasso print to a Picasso painting).
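One simple way to approach the lot-filtering problem described above is a keyword match on the medium text. The sketch below is an assumption about how such a filter might look, not the project's actual rule: the keyword list, the `medium` field name, and the sample lots are all hypothetical. It also shows why the approach is unreliable, since any lot whose medium text is missing or unusual is silently dropped.

```python
# Hypothetical keyword list; the project's actual filter terms are not given.
KEEP_KEYWORDS = (
    "oil on canvas",
    "acrylic",
    "watercolor",
    "lithograph",
    "screenprint",
    "etching",
)


def is_painting_or_print(medium):
    """Return True when the medium text mentions any painting/print keyword."""
    text = (medium or "").lower()
    return any(keyword in text for keyword in KEEP_KEYWORDS)


# Invented sample lots for illustration only.
sample_lots = [
    {"title": "Untitled", "medium": "Oil on canvas"},
    {"title": "Side table", "medium": "Mahogany and brass"},
    {"title": "Portrait study", "medium": "Lithograph in colors"},
    {"title": "Unknown work", "medium": None},
]

kept = [lot for lot in sample_lots if is_painting_or_print(lot["medium"])]
```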

Usage Note

The web scrapers were developed based on the website structures of Christie's and Sotheby's at the time of the project's creation. Websites change frequently, so these scrapers will likely require significant updates to function correctly now. The cleaned data files (.csv) are provided for direct analysis.

Project Files

Data Collection and Cleaning

Data Processing and Visualization

Further Reading

For a more detailed discussion of the methodology, findings, and visualizations, please see the project blog post PDF.
