This project analyzes athlete data from the 2008, 2012, and 2016 Summer Olympics to identify sports in which the differences between male and female athletes' physical attributes are unusual. Using unsupervised outlier detection (DBSCAN and Isolation Forest), we explore how individual sports deviate from the norm in the age, height, weight, and BMI of their athletes.
- Data Preprocessing: Cleaned and prepared Olympic athlete data, handling duplicates and ensuring comparable sports across genders.
- Exploratory Data Analysis:
  - Visualized athlete counts by sport and gender using interactive Plotly graphs.
  - Created heatmaps to display average age, height, weight, and BMI across sports.
- Feature Engineering:
  - Calculated BMI for athletes.
  - Created rank difference features to compare male and female athletes within each sport.
- Outlier Detection:
  - Applied machine learning models (DBSCAN and Isolation Forest) to detect outlier sports.
  - Used Principal Component Analysis (PCA) for dimensionality reduction and visualization.
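
The feature engineering and outlier detection steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual notebook code. It assumes columns named `Sport`, `Sex` (`"M"`/`"F"`), `Age`, `Height` (cm), and `Weight` (kg), and it interprets "rank difference" as the male rank minus the female rank of each sport's mean attribute. The helper names, the parameter values (`eps`, `min_samples`, `contamination`), and the synthetic toy data are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

ATTRS = ["Age", "Height", "Weight", "BMI"]


def add_bmi(df):
    """Return a copy with BMI = weight (kg) / height (m)^2; Height is assumed to be in cm."""
    out = df.copy()
    out["BMI"] = out["Weight"] / (out["Height"] / 100) ** 2
    return out


def rank_differences(df):
    """Per sport: male rank minus female rank of each mean attribute (assumed definition)."""
    means = df.groupby(["Sport", "Sex"])[ATTRS].mean().unstack("Sex")
    male = means.xs("M", axis=1, level="Sex").rank()
    female = means.xs("F", axis=1, level="Sex").rank()
    return (male - female).add_suffix("_rank_diff")


def flag_outlier_sports(features, eps=1.5, min_samples=3, contamination=0.1):
    """Flag sports with DBSCAN (noise label -1) and Isolation Forest (prediction -1)."""
    X = StandardScaler().fit_transform(features)
    db_labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    iso_labels = IsolationForest(contamination=contamination, random_state=0).fit_predict(X)
    coords = PCA(n_components=2).fit_transform(X)  # 2-D projection for plotting
    return pd.DataFrame(
        {
            "dbscan_outlier": db_labels == -1,
            "iforest_outlier": iso_labels == -1,
            "pc1": coords[:, 0],
            "pc2": coords[:, 1],
        },
        index=features.index,
    )


# Synthetic toy data (not the Olympic dataset), only to show the call sequence.
rng = np.random.default_rng(0)
sports = ["Athletics", "Boxing", "Judo", "Rowing", "Swimming", "Tennis"]
rows = []
for sport in sports:
    for sex in ("M", "F"):
        for _ in range(20):
            rows.append(
                {
                    "Sport": sport,
                    "Sex": sex,
                    "Age": rng.normal(26, 4),
                    "Height": rng.normal(178 if sex == "M" else 167, 7),
                    "Weight": rng.normal(75 if sex == "M" else 61, 8),
                }
            )

toy = add_bmi(pd.DataFrame(rows))
features = rank_differences(toy)
result = flag_outlier_sports(features)
print(result)
```

Standardizing the features before DBSCAN matters because `eps` is a distance threshold. The values of `eps` and `contamination` used here are placeholders and would need tuning on the real data.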
Technologies used:
- Python
- Pandas & NumPy for data manipulation
- Matplotlib, Seaborn, and Plotly for visualization
- Scikit-learn for machine learning algorithms (DBSCAN, Isolation Forest, PCA)
Key findings:
- Gender differences in age, height, weight, and BMI vary considerably from sport to sport.
- The analysis revealed patterns in how gender differences manifest in various Olympic sports.
- In many sports, male and female athletes have similar physical attributes.
- Boxing was consistently identified as an outlier in the gender comparison across multiple physical metrics.
- The DBSCAN and Isolation Forest models produce nearly identical results.
- Z-score results also support these findings.
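
As a cross-check of the kind mentioned in the last finding, a z-score test on a single per-sport gender gap might look like the sketch below. The function name, the threshold of 2.0, and the gap values are hypothetical and chosen only for illustration. They are not results from this project.

```python
import pandas as pd


def zscore_outliers(gap, threshold=2.0):
    """Return the z-scores of entries whose absolute z-score exceeds the threshold."""
    z = (gap - gap.mean()) / gap.std(ddof=0)  # population standard deviation
    return z[z.abs() > threshold]


# Made-up male-minus-female mean height gaps (cm) for eight placeholder sports.
gaps = pd.Series(
    {
        "Sport A": 12.0,
        "Sport B": 11.5,
        "Sport C": 12.5,
        "Sport D": 11.0,
        "Sport E": 13.0,
        "Sport F": 12.2,
        "Sport G": 11.8,
        "Sport H": 2.0,
    }
)

flagged = zscore_outliers(gaps)
print(flagged)  # only "Sport H" exceeds |z| > 2 in this toy example
```

A sport flagged by both the z-score check and the model-based detectors is a stronger outlier candidate than one flagged by a single method.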
Total athletes in dataset
Male athlete averages
Female athlete averages
Correlations between different attributes
Tuning DBSCAN model parameters. The silhouette score is one indicator of how well the model performs.
Isolation Forest Sensitivity Analysis
DBSCAN Clustering
Isolation Forest Outlier Detection
Z Score Results
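
The DBSCAN parameter tuning and Isolation Forest sensitivity analysis listed above could be approached along the lines of the following sketch. It sweeps `eps` for DBSCAN and `contamination` for Isolation Forest on synthetic blob data. The function names, the parameter grids, and the choice to compute the silhouette score on non-noise points only are assumptions for this example, not the project's confirmed procedure.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.ensemble import IsolationForest
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler


def sweep_dbscan(X, eps_values, min_samples=5):
    """Map each eps to the silhouette score of its non-noise points (None if undefined)."""
    scores = {}
    for eps in eps_values:
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
        mask = labels != -1  # drop noise points before scoring
        n_clusters = len(set(labels[mask]))
        # The silhouette score needs at least 2 clusters and fewer clusters than points.
        if n_clusters >= 2 and mask.sum() > n_clusters:
            scores[eps] = float(silhouette_score(X[mask], labels[mask]))
        else:
            scores[eps] = None
    return scores


def sweep_isolation_forest(X, contaminations):
    """Map each contamination value to the number of points flagged as outliers."""
    counts = {}
    for c in contaminations:
        labels = IsolationForest(contamination=c, random_state=0).fit_predict(X)
        counts[c] = int((labels == -1).sum())
    return counts


# Synthetic data standing in for the standardized per-sport features.
X_raw, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.6, random_state=0)
X = StandardScaler().fit_transform(X_raw)

eps_grid = [0.1, 0.2, 0.3, 0.5]
contamination_grid = [0.01, 0.05, 0.1, 0.2]

silhouettes = sweep_dbscan(X, eps_grid)
outlier_counts = sweep_isolation_forest(X, contamination_grid)
print(silhouettes)
print(outlier_counts)
```

The silhouette score is only one indicator. A high score with most points labeled as noise is not a good fit, so the share of noise points is worth tracking alongside it.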