- Introduction
- Data Description
- Data Exploration and Preprocessing
- Unsupervised Model Training and Evaluation
- Model Selection and Key Findings
- Comparison with K-Means and Hierarchical Clustering
- Investment Strategy Implications
- Next Steps
- Final Takeaway
This project applies unsupervised learning to segment a sample of ten S&P 500 stocks based on financial characteristics such as returns, volatility, P/E ratio, dividend yield, and market capitalization. The objective is to help investors and portfolio managers refine investment strategies by identifying market patterns that traditional sector-based allocations miss.
Using DBSCAN, K-Means, and Hierarchical Clustering, this analysis uncovers distinct stock clusters and high-risk outliers, offering a deeper understanding of portfolio risks and opportunities.
The dataset consists of historical stock data (2021–2024) for ten major S&P 500 companies:
AAPL, MSFT, GOOGL, AMZN, TSLA, NVDA, META, NFLX, JPM, XOM.
- Daily Returns & Rolling Volatility – Measure price fluctuations.
- P/E Ratio – Evaluates stock valuation.
- Dividend Yield – Identifies income-generating stocks.
- Market Capitalization – Classifies stocks by size.
- Feature Engineering:
  - Computed 30-day rolling volatility and average daily returns.
  - Imputed missing fundamental data using sector-based median values.
- Feature Scaling & Dimensionality Reduction:
  - Applied StandardScaler to standardize each feature to zero mean and unit variance.
  - Used PCA (Principal Component Analysis) to reduce the features to two components (PC1 & PC2), which together explain 70.5% of the variance.
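The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual notebook code: the tickers (`AAA` to `FFF`), sectors, prices, and fundamentals are synthetic placeholders, the column names are hypothetical, averaging the rolling volatility per stock is an assumption about how it was aggregated, and the fallback to the overall median for a sector with no observed value is an added safeguard that the write-up does not mention.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: hypothetical tickers and values, not real market data.
rng = np.random.default_rng(0)
tickers = ["AAA", "BBB", "CCC", "DDD", "EEE", "FFF"]
dates = pd.bdate_range("2023-01-02", periods=250)
daily_sigma = [0.010, 0.012, 0.015, 0.020, 0.030, 0.035]
prices = pd.DataFrame(
    {t: 100 * np.exp(np.cumsum(rng.normal(0.0005, s, len(dates))))
     for t, s in zip(tickers, daily_sigma)},
    index=dates,
)

# Feature engineering: daily returns and 30-day rolling volatility per stock.
returns = prices.pct_change().dropna()
rolling_vol = returns.rolling(window=30).std()
features = pd.DataFrame({
    "avg_daily_return": returns.mean(),
    "avg_rolling_vol_30d": rolling_vol.mean(),  # NaN warm-up rows are skipped
})

# Fundamentals with gaps (placeholder numbers).
fundamentals = pd.DataFrame(
    {
        "sector": ["S1", "S1", "S1", "S2", "S2", "S3"],
        "pe_ratio": [20.0, np.nan, 30.0, 12.0, 14.0, 50.0],
        "dividend_yield": [1.0, 2.0, 3.0, 0.0, 4.0, np.nan],
        "market_cap_bn": [900.0, 700.0, 500.0, 300.0, 250.0, 100.0],
    },
    index=tickers,
)
num_cols = ["pe_ratio", "dividend_yield", "market_cap_bn"]

# Sector-median imputation; fall back to the overall median (an assumption)
# when a sector has no observed value for a column.
sector_median = fundamentals.groupby("sector")[num_cols].transform("median")
fundamentals[num_cols] = fundamentals[num_cols].fillna(sector_median)
fundamentals[num_cols] = fundamentals[num_cols].fillna(fundamentals[num_cols].median())

# Standardize, then project onto the first two principal components.
X = features.join(fundamentals[num_cols])
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
pcs = pca.fit_transform(X_scaled)
explained = pca.explained_variance_ratio_.sum()

print(pd.DataFrame(pcs, index=X.index, columns=["PC1", "PC2"]).round(2))
print(f"Variance explained by PC1 + PC2: {explained:.1%}")
```

The printed variance share will differ from the 70.5% reported above, since the inputs here are synthetic.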
Three clustering models were trained and evaluated:
- DBSCAN:
  - Parameters: eps = 0.9, min_samples = 3
  - Findings:
    - Identified 30% of stocks (3 of 10) as noise/outliers: NVDA, TSLA, NFLX.
    - Captured high-volatility, high-growth stocks that standard clustering missed.
- K-Means:
  - Optimal k = 3 (Silhouette Score: 0.5090)
  - Findings:
    - Assigned every stock to a cluster, so it could not flag outliers.
- Hierarchical Clustering:
  - Optimal k = 3 (Silhouette Score: 0.5133)
  - Findings:
    - Achieved a slightly higher silhouette score than K-Means, but lacked DBSCAN’s ability to handle outliers.
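As a rough sketch of how the three models might be fitted and compared with scikit-learn: the points below are hand-built stand-ins for the (PC1, PC2) coordinates, not the project's data, so the scores and chosen k will not match the figures above. The `eps` and `min_samples` values come from the write-up; the candidate range for k, `n_init`, `random_state`, and the default Ward linkage are assumptions, and `best_k` is a hypothetical helper.

```python
import numpy as np
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans
from sklearn.metrics import silhouette_score

# Hypothetical 2-D points: two tight groups plus three isolated points.
offsets = np.array([[0.0, 0.0], [0.3, 0.0], [-0.3, 0.0], [0.0, 0.3], [0.0, -0.3]])
X = np.vstack([
    offsets + [-2.0, 0.0],
    offsets + [2.0, 0.0],
    [[0.0, 6.0], [6.0, 6.0], [-6.0, -6.0]],
])

# DBSCAN: points that are not density-reachable from any core point get label -1.
db_labels = DBSCAN(eps=0.9, min_samples=3).fit_predict(X)
n_noise = int(np.sum(db_labels == -1))
print("DBSCAN labels:", db_labels.tolist())
print(f"Noise share: {n_noise / len(X):.0%}")


def best_k(make_model, X, k_values):
    """Return the k with the highest silhouette score, and that score."""
    scores = {k: silhouette_score(X, make_model(k).fit_predict(X)) for k in k_values}
    k = max(scores, key=scores.get)
    return k, scores[k]


k_values = range(2, 6)  # candidate range is an assumption
km_k, km_score = best_k(
    lambda k: KMeans(n_clusters=k, n_init=10, random_state=0), X, k_values
)
hc_k, hc_score = best_k(
    lambda k: AgglomerativeClustering(n_clusters=k), X, k_values
)
print(f"K-Means:      k={km_k}, silhouette={km_score:.4f}")
print(f"Hierarchical: k={hc_k}, silhouette={hc_score:.4f}")
```

K-Means and agglomerative clustering assign every point to some cluster, so the three isolated points can only show up as noise in the DBSCAN labels.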
DBSCAN provided the most insightful clustering results:
- Isolated high-risk outliers (NVDA, TSLA, NFLX) that K-Means and hierarchical clustering assigned to regular clusters.
- Grouped stable, dividend-paying stocks (JPM, XOM) into a distinct low-risk cluster.
- Provided a more flexible segmentation, since it does not force every stock into one of a fixed number of clusters.
| Cluster | Stocks | Characteristics |
|---|---|---|
| Cluster 0 | XOM, JPM | Low volatility, high stability, dividend-paying stocks. |
| Cluster 1 | AAPL, AMZN, GOOGL, META, MSFT | Growth-oriented tech stocks, moderate risk. |
| Outliers (-1) | NVDA, TSLA, NFLX | High volatility, high-growth potential, risk-prone. |

| Clustering Method | k | Silhouette Score | Outlier Detection | Insights |
|---|---|---|---|---|
| K-Means | 3 | 0.5090 | ❌ No | Forced segmentation; grouped tech stocks together. |
| Hierarchical | 3 | 0.5133 | ❌ No | Better segmentation, but still lacked outlier detection. |
| DBSCAN | - | N/A | ✅ Yes | Identified 30% of stocks as outliers, better capturing market nuances. |
- Outliers (TSLA, NVDA, NFLX) require caution due to extreme volatility.
- These stocks offer high reward potential but need tailored risk management.
- AAPL, AMZN, GOOGL, META, MSFT provide stable growth with moderate risk.
- JPM, XOM offer low volatility, making them ideal for defensive strategies.
🔹 Supervised Learning: Build a classifier to predict cluster membership for new stocks.
🔹 Outlier Deep-Dive: Analyze financial factors driving outlier classification.
🔹 Dynamic Clustering: Assess how stock clusters change over time.
🔹 Feature Enrichment: Incorporate news sentiment and analyst ratings for better predictions.
🔹 Parameter Optimization: Fine-tune DBSCAN’s eps and min_samples for better segmentation.
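For the parameter-optimization step, one common heuristic (not something the project has done yet) is to look at sorted k-nearest-neighbor distances and then check a few candidate `eps` values. The sketch below assumes hand-built 2-D points in place of the real PCA coordinates, and the candidate `eps` grid is arbitrary.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

# Hypothetical 2-D points: two tight groups plus three isolated points.
offsets = np.array([[0.0, 0.0], [0.3, 0.0], [-0.3, 0.0], [0.0, 0.3], [0.0, -0.3]])
X = np.vstack([
    offsets + [-2.0, 0.0],
    offsets + [2.0, 0.0],
    [[0.0, 6.0], [6.0, 6.0], [-6.0, -6.0]],
])

min_samples = 3

# k-distance heuristic: for each point, the distance to its min_samples-th
# nearest neighbor (the point itself counts, as in DBSCAN's core-point rule).
# A sharp jump in the sorted values suggests a range for eps.
nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
distances, _ = nn.kneighbors(X)
k_dist = np.sort(distances[:, -1])
print("Sorted k-distances:", np.round(k_dist, 2).tolist())

# Check how the clustering changes across a few candidate eps values.
summary = {}
for eps in (0.2, 0.5, 0.9, 8.0):
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    n_clusters = len(set(labels.tolist()) - {-1})
    n_noise = int(np.sum(labels == -1))
    summary[eps] = (n_clusters, n_noise)
    print(f"eps={eps}: clusters={n_clusters}, noise points={n_noise}")
```

With only ten stocks, any such tuning is sensitive to single points, so it makes sense to pair the numbers with a visual check of the resulting clusters.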
- DBSCAN’s density-based clustering revealed key outliers that traditional methods overlooked.
- It provides a powerful tool for portfolio diversification and risk-aware investing.
- This framework helps investors refine their strategies based on real stock behavior rather than predefined sector classifications.
🚀 Stay ahead of the market with data-driven insights!
📁 Stock-Market-Clustering-and-Predictive-Analysis
│── 📄 README.md
│── 📄 requirements.txt
│── 📄 clustering_analysis.ipynb
│── 📄 data_preprocessing.ipynb
│── 📄 visualization.ipynb
│── 📄 results_summary.md
│── 📂 data/
│ ├── historical_stock_data.csv
│ ├── processed_data.csv
│── 📂 models/
│ ├── dbscan_model.pkl
│ ├── kmeans_model.pkl
│ ├── hierarchical_model.pkl