This project explores factors influencing artwork prices at major auction houses (Christie's and Sotheby's). Custom web scrapers built with Python and Selenium collected auction data, including artwork details, estimates, and final sale prices. The data was then cleaned and processed with Pandas to analyze trends, particularly the relationship between auction house estimates, artist popularity (approximated by Yahoo search result counts), and final sale prices.
- Multi-Stage Web Scraping: Python scripts utilizing Selenium to navigate dynamic auction sites, collect auction/artwork URLs, and extract specific artwork features (price, artist, estimates, dimensions, etc.).
- Data Cleaning & Processing: Jupyter Notebooks demonstrating data cleaning techniques with Pandas, including:
  - Handling inconsistencies in scraped data.
  - Parsing and separating estimate ranges (low/high).
  - Standardizing and converting currencies (GBP, EUR, HKD, etc.) to USD.
  - Filtering out non-painting/print lots.
- Feature Engineering:
  - Calculation of artist age and determination of living status.
  - Creation of binary 'Sold' status based on price data.
  - Calculation of estimate accuracy (whether the final price fell below, within, or above the estimate range).
  - Integration of artist popularity metric derived from scraping Yahoo search result counts using Requests and BeautifulSoup.
- Exploratory Data Analysis (EDA): Initial visualizations exploring the relationship between sale prices, estimates (suggesting an underestimation bias and a possible anchoring effect), and artist popularity.
- (Experimental) An included notebook explores image classification using Keras/TensorFlow (VGG16), though this feature was not integrated into the final analysis.
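
The estimate-related cleaning and feature steps above can be sketched in a few lines. This is a minimal illustration, not the code from the notebooks: the input format (`"GBP 8,000 - 12,000"`), the function names, and the exchange rates are all assumptions made for the example. The project itself used Pandas; the functions below are plain Python so they run without dependencies and could be applied to a DataFrame column with `Series.apply`.

```python
import re

# Hypothetical exchange rates to USD, for illustration only.
# These are NOT the rates used in the project.
USD_RATES = {"USD": 1.0, "GBP": 1.25, "EUR": 1.10, "HKD": 0.13}

# Assumed estimate format: a 3-letter currency code, then "low - high".
ESTIMATE_PATTERN = re.compile(r"\s*([A-Z]{3})\s*([\d,]+)\s*-\s*([\d,]+)\s*$")


def parse_estimate(text):
    """Split an estimate string into (currency, low, high), or None if it does not match."""
    match = ESTIMATE_PATTERN.match(text)
    if match is None:
        return None
    currency, low, high = match.groups()
    return currency, float(low.replace(",", "")), float(high.replace(",", ""))


def to_usd(amount, currency, rates=USD_RATES):
    """Convert an amount to USD using a fixed rate table."""
    return amount * rates[currency]


def is_sold(price):
    """Binary 'Sold' flag: a lot counts as sold when a positive price was recorded."""
    return price is not None and price > 0


def estimate_accuracy(price, low, high):
    """Label a sold lot as 'below', 'within', or 'above' its estimate range."""
    if price < low:
        return "below"
    if price > high:
        return "above"
    return "within"


currency, low, high = parse_estimate("GBP 8,000 - 12,000")
low_usd, high_usd = to_usd(low, currency), to_usd(high, currency)
print(low_usd, high_usd, estimate_accuracy(16000.0, low_usd, high_usd))
```

Returning `None` for strings that do not match keeps malformed rows visible, so they can be inspected or dropped explicitly rather than silently coerced.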
- Python: Core programming language.
- Selenium: Web browser automation and scraping dynamic websites.
- Pandas: Data manipulation, cleaning, and analysis.
- NumPy: Numerical operations.
- Requests & BeautifulSoup: Scraping static content (used for Yahoo search results).
- Jupyter Notebook: Development environment for scraping, cleaning, and analysis.
- Matplotlib & Seaborn: Data visualization.
- Keras / TensorFlow (experimental): Image classification with VGG16.
- A strong correlation was observed between auction house estimates (both low and high) and the final sale price, suggesting a potential anchoring bias effect.
- The analysis indicated a tendency for the auction house to underestimate artwork values: roughly 54% of lots sold above the high estimate. This part of the analysis relied primarily on Christie's data.
- Artist popularity, as approximated by Yahoo search result counts, did not show a strong correlation with final sale prices within this dataset.
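
The two statistics behind these findings, a correlation between estimates and prices and the share of lots selling above the high estimate, can be computed as sketched below. The function names are hypothetical, Pearson correlation is an assumption (the source does not say which coefficient was used), and the sample numbers are made-up toy values, not the project's data or results.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(var_x * var_y)


def share_above_high(prices, high_estimates):
    """Fraction of lots whose sale price exceeded the high estimate."""
    above = sum(1 for price, high in zip(prices, high_estimates) if price > high)
    return above / len(prices)


# Made-up toy lots for illustration only.
high_estimates = [1000, 2000, 3000, 4000]
prices = [1500, 1800, 4500, 5200]

print(pearson(high_estimates, prices))
print(share_above_high(prices, high_estimates))
```

A high correlation on its own is consistent with anchoring but does not establish it, since estimates and prices both track the underlying quality of the work.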
- Data Collection: Navigating the complexities of Selenium for dynamic websites and handling inconsistencies across different auction/artwork page layouts.
- Data Cleaning: Significant effort was required to standardize formats, currencies, and filter out irrelevant lots (e.g., furniture).
- Scope: Paintings and prints could not be filtered reliably, and medium was not included as a feature, so the analysis may contain noise (e.g., comparing a Picasso print to a Picasso painting).
The web scrapers were developed based on the website structures of Christie's and Sotheby's at the time of the project's creation. Websites change frequently, so these scrapers will likely require significant updates to function correctly now. The cleaned data files (.csv) are provided for direct analysis.
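
As a starting point for working with the cleaned files, the sketch below reads a CSV and reports its column names and row count. The helper name `summarize_csv` is hypothetical and the UTF-8 encoding is an assumption about the files; the column names are not documented here, so the sketch discovers them instead of assuming them. With Pandas installed, `pandas.read_csv` is the more usual entry point.

```python
import csv


def summarize_csv(path, encoding="utf-8"):
    """Return (column_names, row_count) for a CSV file with a header row."""
    with open(path, newline="", encoding=encoding) as handle:
        reader = csv.DictReader(handle)
        rows = list(reader)
        return list(reader.fieldnames or []), len(rows)
```

For example, `summarize_csv("Christies_Art_Objects_Clean.csv")` would list the columns available in the cleaned Christie's data before any analysis is written against them.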
- Stage1_Christies_Scraper.ipynb: Data collection and initial scraping.
- Stage2-Christies_Scraper.ipynb: Further data collection and scraping.
- Christies_Art_Objects_Clean.csv: Cleaned data from Christie's.
- Christies_data with popularity.csv: Data from Christie's with artist popularity measures.
- SothebysData_clean.csv: Cleaned data from Sotheby's.
- Art_Object_Info2.csv: Processed data with added features.
- SothebysData.csv: Data from Sotheby's.
- Sotheby's Scraper .ipynb: Scraping data from Sotheby's.
- Art_Object_URL.csv: URLs for art objects.
- Christies Data Visualization.ipynb: Visualizations of Christie's data.
- Christies_Art Data Cleaner.ipynb: Data cleaning for Christie's data.
- Christies_Art Data Cleaner_With Day, Month, Year.ipynb: Detailed data cleaning for Christie's data.
- Data Cleaner_Christies.ipynb: Data cleaning and processing for Christie's data.
- Data Visualization_Christies.ipynb: Further data visualization for Christie's data.
- ImageClassifier.ipynb: Image classification experiments.
- Sothebys_Data_Cleaner.ipynb: Data cleaning for Sotheby's data.
- What Makes Art Valuable_ Data Scraping and Exploratory Data Visualizations.pdf: The blog post about the project.
- relative popularity.ipynb: Notebook for analyzing artist popularity.
For a more detailed discussion of the methodology, findings, and visualizations, please see the project blog post PDF.