
🚀 Predicting Startup Success with Crunchbase Data

🚀 Best Model Accuracy (Random Forest with SMOTE): 87.9%

📌 Project Overview

As someone transitioning into data science with an interest in the banking and finance sector, I sought a project that combines business relevance with technical depth. This project focuses on predicting startup success using Crunchbase data. I approached it from three key angles:

  • Regression to understand funding patterns
  • Classification to predict startup outcomes
  • Clustering to discover natural groupings among startups

The goal was to determine the factors that increase the likelihood of a startup's success (measured by IPOs, acquisitions, or a combination of both), while incorporating both machine learning techniques and business intelligence.


🧰 Code and Resources Used

  • Python Version: 3.11
  • IDE: Jupyter Notebook
  • Libraries: pandas, numpy, matplotlib, seaborn, statsmodels, scikit-learn, imbalanced-learn
  • Dataset: An excerpt of a dataset released to the public by Crunchbase

🧹 Data Cleaning

  • Filtered out companies younger than 3 years or older than 7 years to reduce noise (from 31,000+ to ~9,400 observations)
  • Created a new status label:
    • Success = Acquired, IPO, or both
    • Failure = Closed or no exit
  • Engineered number_degrees from the MBA, PhD, MS, and Other degree counts
  • Dropped columns with high missing values (e.g., acquired_companies, products_number)
  • Handled missing values selectively
  • Removed funding outliers (above 99th percentile)
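A minimal sketch of these cleaning steps in pandas. The file name and several column names (`age_years`, `mba_degree`, etc.) are assumptions for illustration; the notebook's actual schema may differ.

```python
import pandas as pd

# Hypothetical file/column names -- adjust to the actual Crunchbase excerpt
df = pd.read_csv("crunchbase_excerpt.csv")

# Keep companies between 3 and 7 years old to reduce noise
df = df[(df["age_years"] >= 3) & (df["age_years"] <= 7)]

# Binary status label: success = acquired and/or IPO, failure = closed or no exit
df["status"] = ((df["is_acquired"] == 1) | (df["ipo"] == 1)).astype(int)

# Engineer number_degrees from the individual degree counts
degree_cols = ["mba_degree", "phd_degree", "ms_degree", "other_degree"]
df["number_degrees"] = df[degree_cols].sum(axis=1)

# Drop columns with high missing values, then impute the rest selectively
df = df.drop(columns=["acquired_companies", "products_number"])
df["average_participants"] = df["average_participants"].fillna(
    df["average_participants"].median()
)

# Remove funding outliers above the 99th percentile
cap = df["average_funded"].quantile(0.99)
df = df[df["average_funded"] <= cap]
```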

📊 Exploratory Data Analysis (EDA)

EDA focused on understanding class balance and numerical relationships:

  • Status Value Counts: Success vs Failure
  • Log-Transformed Funding Distribution
  • Correlation Heatmap of numeric variables
[Figure: Correlation Heatmap]
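A condensed sketch of these EDA steps, reusing the cleaned `df` from the sketch above:

```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Class balance: Success vs Failure
print(df["status"].value_counts(normalize=True))

# Funding is heavily right-skewed, so plot the log-transformed distribution
sns.histplot(np.log1p(df["average_funded"]), bins=50)
plt.xlabel("log(1 + average_funded)")
plt.show()

# Correlation heatmap of the numeric variables
numeric = df.select_dtypes(include="number")
sns.heatmap(numeric.corr(), cmap="coolwarm")
plt.title("Correlation Heatmap")
plt.show()
```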

📈 Regression Modeling

🔹 Multiple Linear Regression (Statsmodels)

  • Dependent variable: average_funded
  • Significant predictors: average_participants, number_degrees, ipo
  • offices and is_acquired were not statistically significant
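Roughly how the statsmodels fit looks, using the predictors listed above (a sketch, not the exact notebook code):

```python
import statsmodels.api as sm

predictors = ["average_participants", "number_degrees", "ipo", "offices", "is_acquired"]
X = sm.add_constant(df[predictors])
y = df["average_funded"]

ols_model = sm.OLS(y, X).fit()
print(ols_model.summary())  # p-values indicate which predictors are significant
```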

🔹 Linear Regression (Scikit-learn)

  • Repeated regression using Scikit-learn's LinearRegression
  • Log-transformed average_funded for normality
  • Coefficients confirmed the importance of average_participants and ipo
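A sketch of the scikit-learn version with the log-transformed target (the train/test split is added here for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

features = ["average_participants", "number_degrees", "ipo", "offices"]
X_reg = df[features]
y_reg = np.log1p(df["average_funded"])  # log-transform for normality

Xr_train, Xr_test, yr_train, yr_test = train_test_split(
    X_reg, y_reg, test_size=0.2, random_state=42
)

lin_reg = LinearRegression().fit(Xr_train, yr_train)
print(dict(zip(features, lin_reg.coef_)))        # coefficient per predictor
print("Test R^2:", lin_reg.score(Xr_test, yr_test))
```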

🧪 Classification Modeling

🔸 Logistic Regression (Baseline, No SMOTE)

  • Initially trained on imbalanced classes
  • High overall accuracy but poor recall for "Success"
  • Ineffective at identifying successful startups
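A baseline sketch on the imbalanced data. The feature list is an assumption; the point is that overall accuracy hides the weak recall on the minority class:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

features = ["average_funded", "average_participants", "number_degrees", "offices"]
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["status"], test_size=0.2, random_state=42, stratify=df["status"]
)

log_reg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
# Accuracy looks fine, but recall for the minority "Success" class is poor
print(classification_report(y_test, log_reg.predict(X_test)))
```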

🔸 Logistic Regression (With SMOTE)

  • Used SMOTE to balance the dataset
  • Improved recall and precision for the "Success" class
  • More reliable at flagging high-potential startups
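One way to wire SMOTE in is an imbalanced-learn pipeline, so oversampling happens only on the training data and the test set stays untouched (a sketch, reusing the split above):

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

smote_logreg = Pipeline([
    ("smote", SMOTE(random_state=42)),           # oversample the minority class
    ("clf", LogisticRegression(max_iter=1000)),
])
smote_logreg.fit(X_train, y_train)
print(classification_report(y_test, smote_logreg.predict(X_test)))
```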

🌲 Random Forest Classification

  • Used SMOTE-balanced dataset
  • Boosted accuracy from 70.7% to 87%
  • Strong performance across all evaluation metrics
[Figure: feature importance]
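A sketch of the random forest trained on the SMOTE-balanced training data, plus the feature importance ranking:

```python
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_res, y_res)
print(classification_report(y_test, rf.predict(X_test)))

# Which features drive the predictions?
importances = pd.Series(rf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False))
```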

๐Ÿ› ๏ธ Hyperparameter Tuning

  • Tuned n_estimators, max_depth, and min_samples_split using GridSearchCV
  • Slight performance improvement
  • Feature importance rankings remained stable
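A sketch of the grid search; putting SMOTE and the forest in one imbalanced-learn pipeline keeps the oversampling inside each CV fold. The grid values are illustrative, not necessarily the ones used:

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([
    ("smote", SMOTE(random_state=42)),
    ("rf", RandomForestClassifier(random_state=42)),
])
param_grid = {
    "rf__n_estimators": [100, 200, 500],
    "rf__max_depth": [None, 10, 20],
    "rf__min_samples_split": [2, 5, 10],
}
grid = GridSearchCV(pipe, param_grid, scoring="f1", cv=5, n_jobs=-1)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)
```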

📚 Clustering Analysis (KMeans)

For a final unsupervised learning step, I applied KMeans Clustering to group startups based on:

  • category_code (encoded)
  • average_funded
  • average_participants

Steps:

  • Scaled all features using StandardScaler
  • Used the elbow method to select k = 3 as the optimal number of clusters
  • Visualized results using 2D scatter plots
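A sketch of the clustering steps, again on the cleaned `df`:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import LabelEncoder, StandardScaler

cluster_df = df[["category_code", "average_funded", "average_participants"]].copy()
cluster_df["category_code"] = LabelEncoder().fit_transform(cluster_df["category_code"])

X_scaled = StandardScaler().fit_transform(cluster_df)

# Elbow method: plot inertia for k = 1..10 and look for the bend
inertias = [
    KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled).inertia_
    for k in range(1, 11)
]
plt.plot(range(1, 11), inertias, marker="o")
plt.xlabel("k")
plt.ylabel("inertia")
plt.show()

# Fit the chosen k = 3 and visualize clusters in 2D
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_scaled)
plt.scatter(cluster_df["average_funded"], cluster_df["average_participants"], c=labels)
plt.xlabel("average_funded")
plt.ylabel("average_participants")
plt.show()
```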

Key Insights:

  • Startups in similar sectors tend to attract similar funding amounts
  • Higher average_participants is associated with higher funding clusters
  • These clusters help spot high-potential startups independent of IPO or acquisition status

💡 Final Thoughts

This end-to-end machine learning project gave me the chance to:

  • Work with a real-world business dataset
  • Build interpretable and high-performing models
  • Apply data cleaning, feature engineering, and EDA effectively
  • Handle class imbalance with SMOTE
  • Improve model performance through hyperparameter tuning
  • Add unsupervised clustering to surface hidden patterns
