Machine learning models trained on U.S. Census data to classify income levels, including preprocessing, PCA, SVD, and evaluation with MLP, logistic regression, and Naïve Bayes.

Yashasvi1714/Predictive_modelling-on-cenus-data

🧠 Income Classification with Census Data

Python Scikit-learn License: MIT

This project implements predictive models that classify adult income using U.S. Census Bureau data. The goal is to predict whether a person earns more than $50K per year using machine learning techniques.


📁 Files in This Repository

| File | Description |
| --- | --- |
| FinalProject.ipynb | Jupyter Notebook with complete code and outputs |
| README.md | Project overview and usage instructions |
| adult-dataset.csv | The dataset, extracted from the Census Bureau database at http://www.census.gov/ftp/pub/DES/www/welcome.html |

📊 Dataset Overview

The dataset contains demographic and employment-related attributes for U.S. adults. The target variable is Income:

  • <=50K or >50K

Features include:

  • Age (int)
  • Work Class (categorical)
  • Education (categorical)
  • Marital Status (categorical)
  • Occupation (categorical)
  • Race (categorical)
  • Sex (binary)
  • Hours per week (int)

🧹 Part A: Data Cleaning

  • Handled missing data by removing or imputing invalid entries
  • Converted categorical variables using one-hot encoding
  • Removed irrelevant columns
  • Ensured numeric data for model compatibility
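
The cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the column names, the `"?"` missing-value marker, the choice to drop rather than impute, and treating `fnlwgt` as the irrelevant column are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for adult-dataset.csv. Column names and the "?" marker
# are assumptions for illustration, not taken from the notebook.
raw = pd.DataFrame({
    "age": [39, 50, 38, 53, 28],
    "workclass": ["State-gov", "?", "Private", "Private", "Private"],
    "education": ["Bachelors", "Bachelors", "HS-grad", "11th", "Bachelors"],
    "fnlwgt": [77516, 83311, 215646, 234721, 338409],
    "hours_per_week": [40, 13, 40, 40, 40],
    "income": ["<=50K", "<=50K", "<=50K", "<=50K", ">50K"],
})

def clean(df: pd.DataFrame) -> pd.DataFrame:
    df = df.replace("?", np.nan)      # mark invalid entries as missing
    df = df.dropna()                  # this sketch drops them; imputing is the alternative
    df = df.drop(columns=["fnlwgt"])  # column assumed irrelevant for this example
    # Encode the target as 0/1 so every column ends up numeric.
    df["income"] = (df["income"] == ">50K").astype(int)
    # One-hot encode the remaining categorical columns.
    return pd.get_dummies(df, columns=["workclass", "education"], dtype=int)

cleaned = clean(raw)
print(cleaned.head())
```

Dropping rows is the simplest option; if many rows contain missing values, imputing the most frequent category may preserve more data.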

📉 Part B: Dimensionality Reduction

Applied both SVD and PCA to reduce dataset dimensions.

✅ Results:

  • Explained Variance: PCA components capturing >90% of variance
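
One way to reach a 90% explained-variance target with scikit-learn is sketched below. The data is synthetic and the scaling step is an assumption; the notebook may select components differently.

```python
import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in: 200 rows, 10 features driven by 3 latent factors plus noise.
latent = rng.normal(size=(200, 3))
X = latent @ rng.normal(size=(3, 10)) + 0.1 * rng.normal(size=(200, 10))
X_scaled = StandardScaler().fit_transform(X)

# A float n_components asks PCA to keep the fewest components whose
# cumulative explained variance exceeds that fraction.
pca = PCA(n_components=0.90)
X_pca = pca.fit_transform(X_scaled)

# TruncatedSVD needs an explicit component count; reuse the one PCA chose.
svd = TruncatedSVD(n_components=X_pca.shape[1], random_state=0)
X_svd = svd.fit_transform(X_scaled)

print("PCA components kept:", X_pca.shape[1])
print("PCA explained variance:", pca.explained_variance_ratio_.sum())
print("SVD explained variance:", svd.explained_variance_ratio_.sum())
```

On centered data the two methods give closely related projections. `TruncatedSVD` does not center its input, which makes it the usual choice for large sparse one-hot matrices.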

🤖 Part C: Model Training & Evaluation

1. Multi-Layer Perceptron (Neural Network)

  • Built with MLPClassifier
  • Evaluated using Confusion Matrix & Classification Report
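
A minimal sketch of this training and evaluation flow, using synthetic data in place of the census features. The hidden-layer size, split ratio, and scaling step are assumptions, not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data standing in for the cleaned census features.
X, y = make_classification(n_samples=500, n_features=10, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# MLPs are sensitive to feature scale, so fit the scaler on the training split only.
scaler = StandardScaler().fit(X_train)
mlp = MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=0)
mlp.fit(scaler.transform(X_train), y_train)
y_pred = mlp.predict(scaler.transform(X_test))

cm = confusion_matrix(y_test, y_pred)  # rows: true class, columns: predicted class
report = classification_report(y_test, y_pred, target_names=["<=50K", ">50K"])
print(cm)
print(report)
```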

2. Logistic Regression

  • Applied using Scikit-learn
  • Good baseline model
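
A baseline of this kind might look like the sketch below, again on synthetic data. The pipeline with `StandardScaler` and the `max_iter` value are assumptions chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the cleaned census features.
X, y = make_classification(n_samples=500, n_features=10, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Bundling scaling and the model keeps test data out of the scaler's fit.
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
baseline.fit(X_train, y_train)

baseline_accuracy = baseline.score(X_test, y_test)  # mean accuracy on held-out rows
print(f"Baseline accuracy: {baseline_accuracy:.3f}")
```

Comparing more complex models against this number shows whether their extra capacity is actually paying off.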

3. Naïve Bayes (GaussianNB)

  • Fast and interpretable model
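
The sketch below fits `GaussianNB` on synthetic data and inspects the per-class feature means it learns, which is one sense in which the model is interpretable. The data and split are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the cleaned census features.
X, y = make_classification(n_samples=500, n_features=10, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

nb = GaussianNB()
nb.fit(X_train, y_train)  # a single pass estimating per-class means and variances

proba = nb.predict_proba(X_test)  # one row per sample, one column per class
nb_accuracy = nb.score(X_test, y_test)

# theta_ holds the mean of each feature within each class.
print("Per-class feature means:\n", nb.theta_)
print(f"Accuracy: {nb_accuracy:.3f}")
```

GaussianNB assumes each feature is normally distributed within a class, which fits one-hot encoded columns poorly; that may partly explain weaker precision on data like this.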

4. K-Means Clustering

  • Unsupervised clustering on the feature columns, with Income excluded
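
A rough sketch of clustering with the target held out, using random stand-in data. The column names, the choice of two clusters, and the cross-tabulation against income afterward are assumptions, not the notebook's method.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Random stand-in data; column names are hypothetical.
df = pd.DataFrame({
    "age": rng.integers(18, 70, size=300),
    "hours_per_week": rng.integers(10, 60, size=300),
    "income": rng.integers(0, 2, size=300),
})

# Hold the target out so the clusters are formed from features alone.
features = df.drop(columns=["income"])
X = StandardScaler().fit_transform(features)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# Afterward, the held-out target can be compared against the cluster labels.
crosstab = pd.crosstab(labels, df["income"], rownames=["cluster"])
print(crosstab)
```

Cluster indices are arbitrary, so cluster 0 need not correspond to either income class.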

🧪 Results Summary

| Model | Accuracy | Precision | Recall | F1 Score |
| --- | --- | --- | --- | --- |
| MLP | ✅ High | ✅ High | ✅ High | ✅ High |
| Logistic Regression | Moderate | Moderate | Moderate | Moderate |
| Naïve Bayes | Moderate | Lower | Moderate | Moderate |
| K-Means (unsupervised) | - | - | - | - |
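
The four metrics in the table can be computed with scikit-learn as sketched below. The labels are a small hand-made example rather than project results, and `summarize` is a hypothetical helper name.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def summarize(y_true, y_pred):
    """Return the four summary metrics for a binary classifier (positive class = 1)."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
    }

# Hand-made example: 3 true positives, 1 false positive, 1 false negative, 3 true negatives.
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

metrics = summarize(y_true, y_pred)
print(metrics)  # every metric equals 0.75 for this example
```

Reporting numeric values like these for each model would make the comparison easier to check than qualitative labels.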

🚀 Getting Started

🔧 Install Dependencies

pip install numpy pandas matplotlib seaborn scikit-learn