This project provides a comprehensive statistical analysis of the diamonds dataset using the R programming language. It covers a wide range of statistical techniques, from basic descriptive statistics to advanced modeling and predictive analysis.
We've created an interactive web application, based on our statistical analysis, that allows users to estimate diamond prices.
This user-friendly web application includes:
- Interactive pricing tool that uses our regression models
- Educational content about the 4Cs of diamonds
- Visualizations of how each factor affects diamond prices
- Mobile-responsive design for all devices
The app is built with React and Material UI, making it accessible to consumers and industry professionals alike. The source code for the web application is in the diamond-app directory of this repository.
The analysis is structured to match the topics covered in the university's Probability and Statistics course, including:
- Data exploration and summary statistics
- Data visualization techniques
- Probability distributions
- Correlation analysis
- Regression modeling
- Hypothesis testing
- ANOVA
- Dataset/diamonds.csv: The dataset containing diamond characteristics and prices
- src/utils.R: Utility functions used across the analysis scripts
- analysis/: Main R analysis scripts
  - main_analysis.R: Fundamental statistical analysis
  - advanced_modeling.R: Advanced statistical modeling and prediction
  - visualization_insights.R: Detailed visualizations and insights
- plots/: Generated visualizations
  - high_quality/: Publication-quality visualizations
- run_analysis.R: Main script to run the entire analysis
- Lab Files/: Reference materials from practical labs
- diamond-app/: Interactive web application for diamond price estimation
- R (version 4.0 or higher)
- RStudio (recommended for viewing and running the scripts)
- Node.js and npm (for running the web application)
The following packages are required to run all analyses. The scripts will attempt to install missing packages automatically:
- tidyverse
- ggplot2
- dplyr
- corrplot
- car
- stats (ships with base R, so it does not need to be installed)
- moments
- scales
- rpart
- rpart.plot
- randomForest
- caret
- e1071
- reshape2
- ggthemes
- GGally
- viridis
- plotly
- ggridges
- htmlwidgets
If you prefer to install all required packages in advance:
# Packages to install from CRAN. "stats" is omitted because it ships with
# base R and cannot be installed or updated through install.packages().
packages <- c(
  "tidyverse", "ggplot2", "dplyr", "corrplot", "car",
  "moments", "rpart", "rpart.plot", "randomForest", "caret", "e1071",
  "ggthemes", "GGally", "viridis", "plotly", "reshape2", "ggridges",
  "scales", "htmlwidgets"
)
install.packages(packages, dependencies = TRUE, repos = "https://cran.r-project.org")
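The scripts are described as installing missing packages automatically, but the mechanism is not shown here. The following is a minimal sketch of one common way to do that, not the project's actual code; `missing_packages()` and `install_if_missing()` are hypothetical helper names introduced for illustration.

```r
# Hypothetical helper (not taken from the project's scripts):
# return the subset of `pkgs` that cannot be loaded, i.e. is not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Hypothetical helper: install only the packages that are missing and
# invisibly return their names.
install_if_missing <- function(pkgs, repos = "https://cran.r-project.org") {
  to_install <- missing_packages(pkgs)
  if (length(to_install) > 0) {
    install.packages(to_install, dependencies = TRUE, repos = repos)
  }
  invisible(to_install)
}

# Example call (commented out so that running this sketch installs nothing):
# install_if_missing(c("corrplot", "moments", "ggridges"))
```

Checking first avoids reinstalling packages that are already present, which keeps repeated runs fast.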
- Open RStudio
- Open the run_analysis.R file
- Click "Source" or press Ctrl+Shift+Enter to run the entire script
This will execute all analyses in sequence and generate all plots and output files.
You can also run each analysis script separately for more focused analysis:
- Open RStudio
- First run source("src/utils.R") to load utility functions
- Then run any of the analysis scripts:
  - source("analysis/main_analysis.R") - Basic statistical analysis
  - source("analysis/advanced_modeling.R") - Advanced modeling
  - source("analysis/visualization_insights.R") - Visualizations
To run the Diamond Price Estimator web application locally:
- Navigate to the diamond-app directory: cd diamond-app
- Install dependencies: npm install
- Start the development server: npm start
- Open http://localhost:3000 in your browser to view the app
- Plots: Various statistical visualizations will be saved in the plots/ directory
- Analysis Results: CSV files with analysis results will be saved in the analysis/ directory
- Console Output: Detailed statistical results will be printed to the R console
- Web Application: An interactive diamond price estimator tool
The analysis generates a comprehensive set of visualizations to help understand the diamonds dataset. Below is a description of the key plots and what insights they provide:
- price_histogram.png: Histogram showing the distribution of diamond prices. The right-skewed pattern indicates that most diamonds are in the lower price range, with fewer very expensive diamonds.
- carat_histogram.png: Distribution of diamond carat weights. Shows common weight thresholds that may reflect market preferences.
- price_by_cut_boxplot.png: Box plots comparing diamond prices across different cut qualities. Helps identify if better cuts command higher prices.
- price_by_clarity_boxplot.png: Box plots showing how clarity affects diamond pricing. Clearer diamonds (higher clarity grades) generally have higher median prices.
- price_vs_carat_scatter.png: Scatter plot of price against carat with a trend line. Shows the strong positive relationship between a diamond's weight and its price.
- price_vs_carat_by_cut.png: Faceted scatter plots showing how the price-carat relationship varies across different cut qualities.
- price_qq_plot.png: Quantile-Quantile plot for diamond prices, evaluating if prices follow a normal distribution.
- price_normality_hist.png: Histogram with normal curve overlay, showing how price distribution deviates from normality.
- log_price_qq_plot.png and log_price_normality_hist.png: Similar plots for log-transformed prices, typically showing better normality.
- correlation_matrix.png: Heat map displaying correlations between numerical variables. Intense colors indicate stronger relationships (positive or negative).
- advanced_model_residuals.png: Residual plot from the regression model. Patterns may indicate model limitations or suggest transformations.
- decision_tree.png: Visualization of the decision tree model for predicting diamond prices, showing key decision points based on features.
- variable_importance.png: Bar chart from the random forest model showing which variables are most important in predicting diamond prices.
- cut_price_comparison.png: Density plots comparing price distributions between different cuts, accompanying the t-test comparisons.
- cut_color_heatmap.png: Visualization of the association between cut and color, related to chi-square test results.
- anova_cut_price.png: Violin plots comparing price distributions across cuts, visualizing ANOVA results.
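The last three plots accompany a t-test, a chi-square test, and an ANOVA. As a rough illustration of what those tests look like in base R, here is a minimal sketch. It runs on synthetic stand-in data, so its numbers say nothing about real diamonds, and it is not taken from the project's scripts. The column names `cut`, `color`, and `price` are assumptions about the layout of Dataset/diamonds.csv.

```r
# Synthetic stand-in for the diamonds data (NOT real results). To try this on
# the project data, a data frame read from Dataset/diamonds.csv could be used
# instead, assuming it has `cut`, `color`, and `price` columns.
set.seed(42)
n <- 600
cuts <- c("Good", "Very Good", "Ideal")
colors <- c("E", "G", "I")
demo <- data.frame(
  cut   = factor(sample(cuts, n, replace = TRUE), levels = cuts),
  color = factor(sample(colors, n, replace = TRUE), levels = colors)
)
demo$price <- rlnorm(n, meanlog = 8, sdlog = 0.8)

# Welch two-sample t-test: compare mean log-price between two cuts.
two_cuts <- droplevels(subset(demo, cut %in% c("Good", "Ideal")))
t_res <- t.test(log(price) ~ cut, data = two_cuts)

# One-way ANOVA: does mean log-price differ across all cuts?
aov_res <- summary(aov(log(price) ~ cut, data = demo))

# Chi-square test of independence between cut and color.
chi_res <- chisq.test(table(demo$cut, demo$color))

# p-values of the three tests.
t_res$p.value
aov_res[[1]][["Pr(>F)"]][1]
chi_res$p.value
```

The tests here use log-price because prices are strongly right-skewed; whether the project's scripts test raw or log-transformed prices is not stated in this document.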
The plots/high_quality/ directory contains more refined versions of key visualizations:
- cut_violin_plot.png: Enhanced violin plots showing the distribution of prices across different cut qualities.
- multidim_scatter_plot.png: Multi-dimensional scatter plot showing relationships between price, carat, clarity, and depth.
- correlation_heatmap.png: Detailed correlation matrix with hierarchical clustering to group related variables.
- interactive_scatter.html: Interactive plot allowing exploration of the relationship between price, carat, and other variables.
- facet_histograms.png: Grid of histograms showing price distributions across different combinations of cut and color.
- parallel_coords.png: Parallel coordinate plot for multi-dimensional analysis of numerical variables.
- ridgeline_plot.png: Overlapping density curves showing price distribution by clarity.
- price_bubble_plot.png: Bubble chart displaying average prices by cut, color, and clarity combinations.
- price_per_carat_boxplot.png: Box plots comparing the price-per-carat ratio across different cuts.
- value_analysis_hist.png: Histogram identifying potentially overpriced and underpriced diamonds.
- dimension_pairs.png: Scatter plot matrix showing relationships between diamond dimensions.
- market_share_pie.png: Pie chart showing the market share by cut.
- dashboard_summary.png: Summary dashboard with key metrics about the diamond dataset.
The visualizations collectively reveal several important patterns:
- Price Determinants: The plots show that carat weight is the strongest predictor of price, followed by clarity and cut quality.
- Non-linear Relationship: The price-carat relationship is non-linear, with larger diamonds commanding disproportionately higher prices.
- Quality Premium: Better cut quality generally commands a higher price-per-carat, visible in the box plots.
- Market Distribution: The distribution plots reveal market concentrations at specific carat thresholds (0.5, 1.0, etc.).
- Feature Interactions: Multiple plots demonstrate how cut, color, and clarity interact to influence pricing beyond their individual effects.
- Statistical Validation: The QQ-plots and residual analysis indicate that log-transformation improves the normality of price data.
- Predictive Models: The decision tree and random forest visualizations show which factors matter most when predicting diamond prices.
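Two of these patterns, the non-linear price-carat relationship and the effect of the log transform, can be illustrated with a few lines of base R. This is a minimal sketch on synthetic data generated from an assumed power law, so the coefficients are illustrative only and are not results from the diamonds dataset. The `skewness()` function is written out here so that the sketch needs no extra packages.

```r
# Synthetic stand-in data (NOT the real dataset): price follows an assumed
# power law in carat with multiplicative noise, price = 4000 * carat^1.7 * noise.
set.seed(1)
carat <- rlnorm(500, meanlog = -0.3, sdlog = 0.5)
price <- 4000 * carat^1.7 * exp(rnorm(500, sd = 0.2))

# Sample skewness (third standardized moment), written out in base R.
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Right-skewed prices become much more symmetric after a log transform.
c(raw = skewness(price), logged = skewness(log(price)))

# On the log-log scale a power law is a straight line. The slope estimates the
# exponent; a slope above 1 means price grows faster than carat weight.
fit <- lm(log(price) ~ log(carat))
coef(fit)
```

Fitting on the log-log scale turns a multiplicative relationship into a linear one, which is why a log transform often improves both the normality of prices and the behavior of regression residuals.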
- Package Installation Errors: If you encounter package installation errors, try running:
  install.packages("package_name", dependencies=TRUE, repos="https://cran.r-project.org")
- Working Directory Issues: If files aren't found, ensure your working directory is set to the project root:
  setwd("/path/to/project_directory")
- Memory Issues: If you run into memory issues during large dataset operations, restart the R session. In RStudio, run:
  .rs.restartR()
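For the working directory issue, a quick check before running anything can save time. The sketch below uses a hypothetical helper, `missing_project_paths()`, which is not part of the project. It tests for a few of the files and directories named in this README.

```r
# Hypothetical helper (not part of the project): report which of the files
# and directories named in this README are missing under `root`.
missing_project_paths <- function(root = getwd()) {
  expected <- c("run_analysis.R", "src/utils.R", "analysis", "Dataset/diamonds.csv")
  expected[!file.exists(file.path(root, expected))]
}

# Warn if the current working directory does not look like the project root.
missing <- missing_project_paths()
if (length(missing) > 0) {
  message("Not found from ", getwd(), ": ", paste(missing, collapse = ", "))
}
```

If the message lists every path, the working directory is most likely not the project root and setwd() should fix it.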
After running the basic analysis, you can:
- Modify the scripts to explore different aspects of the diamond dataset
- Change visualization parameters to highlight different patterns
- Experiment with different modeling techniques in the advanced_modeling.R script
- Export the high-quality visualizations for use in reports or presentations
- Extend the web application with additional features and visualizations