Best Model Accuracy (Random Forest with SMOTE): 87.9%
As someone transitioning into data science with an interest in the banking and finance sector, I sought to take on a project that combines both business relevance and technical depth. This project focuses on predicting startup success using Crunchbase data. I approached it from three key angles:
- Regression to understand funding patterns
- Classification to predict startup outcomes
- Clustering to discover natural groupings among startups
The goal was to determine the factors that increase the likelihood of a startup's success (measured by IPOs, acquisitions, or a combination of both) while incorporating both machine learning techniques and business intelligence.
- Python Version: 3.11
- IDE: Jupyter Notebook
- Libraries: `pandas`, `numpy`, `matplotlib`, `seaborn`, `statsmodels`, `scikit-learn`, `imbalanced-learn`
- Dataset: An excerpt of a dataset released to the public by Crunchbase
- Filtered out companies younger than 3 years or older than 7 years to reduce noise (from 31,000+ to ~9,400 observations)
- Created a new `status` label: `Success` = Acquired, IPO, or both; `Failure` = Closed or no exit
- Engineered `number_degrees` from MBA, PhD, MS, and Other degree counts
- Dropped columns with high missing values (e.g., `acquired_companies`, `products_number`)
- Handled missing values selectively
- Removed funding outliers (above 99th percentile)
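The cleaning steps above can be sketched as follows. This is a minimal illustration on a toy frame, not the original notebook: the column names (`age_years`, `funding_total_usd`, the per-degree counts) are assumptions standing in for the Crunchbase excerpt's actual schema.

```python
import pandas as pd
import numpy as np

# Toy stand-in for the Crunchbase excerpt (column names are assumptions)
df = pd.DataFrame({
    "age_years": [2, 4, 6, 9, 5],
    "status": ["acquired", "ipo", "closed", "acquired", "operating"],
    "mba_degree": [1, 0, 2, 1, 0],
    "phd_degree": [0, 1, 0, 0, 1],
    "ms_degree": [1, 1, 0, 2, 0],
    "other_degree": [0, 0, 1, 0, 0],
    "funding_total_usd": [1e6, 5e7, 2e5, 9e9, 3e6],
})

# Keep companies between 3 and 7 years old to reduce noise
df = df[df["age_years"].between(3, 7)]

# Binary outcome: acquired or IPO counts as Success, everything else Failure
df["outcome"] = np.where(df["status"].isin(["acquired", "ipo"]),
                         "Success", "Failure")

# Engineer number_degrees as the sum of founder degree counts
degree_cols = ["mba_degree", "phd_degree", "ms_degree", "other_degree"]
df["number_degrees"] = df[degree_cols].sum(axis=1)

# Remove funding outliers above the 99th percentile
cap = df["funding_total_usd"].quantile(0.99)
df = df[df["funding_total_usd"] <= cap]
```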
EDA focused on understanding class balance and numerical relationships:
- Status Value Counts: Success vs Failure
- Log-Transformed Funding Distribution
- Correlation Heatmap of numeric variables
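The numeric side of that EDA boils down to three computations, sketched below on synthetic data (the frame and its values are illustrative, not from the real dataset); the resulting `value_counts`, log-transformed column, and correlation matrix are what feed the bar chart, histogram, and heatmap.

```python
import numpy as np
import pandas as pd

# Small synthetic frame standing in for the cleaned dataset
df = pd.DataFrame({
    "outcome": ["Success", "Failure", "Failure", "Success", "Failure"],
    "funding_total_usd": [1e6, 2e5, 5e5, 4e7, 8e5],
    "average_participants": [3.0, 1.0, 2.0, 5.0, 1.5],
})

# Class balance: how skewed is Success vs Failure?
print(df["outcome"].value_counts())

# log1p tames the heavy right tail of funding before plotting a histogram
df["log_funding"] = np.log1p(df["funding_total_usd"])

# Pairwise correlations of the numeric columns (the input to a heatmap)
corr = df[["log_funding", "average_participants"]].corr()
print(corr)
```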



- Dependent variable: `average_funded`
- Significant predictors: `average_participants`, `number_degrees`, `ipo`
- `offices` and `is_acquired` were not statistically significant

- Repeated the regression using scikit-learn's `LinearRegression`
- Log-transformed `average_funded` for normality
- Coefficients confirmed the importance of `average_participants` and `ipo`
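The scikit-learn version of the same idea, on simulated data (feature names and effect sizes are illustrative): because funding is roughly multiplicative, the model is fit on `log1p` of the target, and the coefficients are then read on the log scale.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 300
average_participants = rng.uniform(1, 6, n)
ipo = rng.integers(0, 2, n).astype(float)
X = np.column_stack([average_participants, ipo])

# Simulated multiplicative funding; fit on log1p(average_funded)
average_funded = np.expm1(1.0 + 0.4 * average_participants + 0.8 * ipo
                          + rng.normal(0, 0.2, n))
y = np.log1p(average_funded)

reg = LinearRegression().fit(X, y)
# Coefficients on the log scale recover the simulated effects
print(dict(zip(["average_participants", "ipo"], reg.coef_)))
```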

- Initially trained on imbalanced classes
- High overall accuracy but poor recall for "Success"
- Ineffective at identifying successful startups


- Used SMOTE to balance the dataset
- Improved recall and precision for the "Success" class
- More reliable at flagging high-potential startups


- Used SMOTE-balanced dataset
- Boosted accuracy from 70.7% to 87%
- Strong performance across all evaluation metrics



- Tuned `n_estimators`, `max_depth`, and `min_samples_split` using `GridSearchCV`
- Slight performance improvement
- Feature importance rankings remained stable
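A grid search over those three hyperparameters looks like this minimal sketch (the grid values and the synthetic dataset are illustrative, not the notebook's actual search space):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=400, random_state=0)

# Small grid over the three hyperparameters named above
param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [None, 10],
    "min_samples_split": [2, 5],
}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=3, n_jobs=-1)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

`best_estimator_` then exposes `feature_importances_`, which is how the stability of the rankings can be checked before and after tuning.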


For a final unsupervised learning step, I applied KMeans Clustering to group startups based on:
- `category_code` (encoded)
- `average_funded`
- `average_participants`
- Scaled all features using `StandardScaler`
- Used the elbow method to select k = 3 clusters
- Visualized results using 2D scatter plots
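The scaling, elbow check, and final fit can be sketched as below, on three synthetic groups standing in for the real features (the group centers and spreads are invented for illustration). Scaling first matters because raw funding amounts would otherwise dominate the Euclidean distances KMeans uses.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Three synthetic groups: encoded category, funding, participants
X = np.vstack([
    rng.normal([0, 1e6, 2], [1, 2e5, 0.5], size=(50, 3)),
    rng.normal([5, 1e7, 4], [1, 2e6, 0.5], size=(50, 3)),
    rng.normal([10, 5e7, 6], [1, 1e7, 0.5], size=(50, 3)),
])

# Scale first: funding would otherwise dominate the distance metric
X_scaled = StandardScaler().fit_transform(X)

# Elbow method: inertia for k = 1..6; the bend suggests the k to use
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=0)
               .fit(X_scaled).inertia_ for k in range(1, 7)}

# Final fit at the elbow (k = 3 here, matching the write-up)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)
```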



- Startups in similar sectors tend to attract similar funding amounts
- Higher `average_participants` is associated with higher-funding clusters
- These clusters help spot high-potential startups independently of IPO or acquisition status
This end-to-end machine learning project gave me the chance to:
- Work with a real-world business dataset
- Build interpretable and high-performing models
- Apply data cleaning, feature engineering, and EDA effectively
- Handle class imbalance with SMOTE
- Improve model performance through hyperparameter tuning
- Add unsupervised clustering to surface hidden patterns