Check it out: Students Math Score
This repository contains code for building a machine learning model to predict students' math scores based on various features such as gender, race/ethnicity, parental level of education, lunch type, and test preparation course.
The aim is to develop an accurate model, with a focus on production-level code and data pipelines that run from data acquisition through preprocessing to prediction.
The dataset used for this project consists of the following columns:
- Gender
- Race/Ethnicity
- Parental Level of Education
- Lunch Type
- Test Preparation Course
- Math Score (target variable)
- Reading Score
- Writing Score
Several machine learning algorithms were explored to develop the predictive model. The models used include:
- Linear Regression
- Ridge Regression
- Lasso Regression
- Support Vector Regression (SVR)
- Decision Tree Regression
- Random Forest Regression
- K-Nearest Neighbors Regression
- Gradient Boosting Regression
- AdaBoost Regression
- CatBoost Regression
- XGBoost Regression
To identify the best-performing models, techniques such as randomized search cross-validation (RandomizedSearchCV) were used to tune hyperparameters. Metrics such as mean squared error (MSE), root mean squared error (RMSE), and R-squared were used to assess predictive accuracy and generalization.
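For reference, the three metrics named above can be written out directly. This is a minimal pure-Python sketch, not the project's evaluation code (which is not shown here and may well rely on a library's built-in implementations). The function names and the sample scores are assumptions made for illustration.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error: the average of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: square root of MSE, in the target's units."""
    return math.sqrt(mse(y_true, y_pred))


def r2(y_true, y_pred):
    """R-squared: 1 - (residual sum of squares / total sum of squares).

    Undefined (division by zero) when every true value is identical.
    """
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical math scores, invented for this example only.
actual = [70.0, 80.0, 90.0]
predicted = [72.0, 78.0, 91.0]

print(f"MSE:  {mse(actual, predicted):.3f}")
print(f"RMSE: {rmse(actual, predicted):.3f}")
print(f"R2:   {r2(actual, predicted):.3f}")
```

RMSE is expressed in the same units as the math score, which makes it easier to interpret than MSE. R-squared compares the model against a baseline that always predicts the mean score.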
After thorough evaluation, the best-performing models were selected based on their predictive accuracy and performance metrics. These models were potentially combined or further fine-tuned to create the best final predictive model for estimating students' math scores based on the given features.
This project entails developing a comprehensive data processing and modeling pipeline utilizing Python, Flask, Docker, and AWS for deployment. Tasks include conducting thorough EDA, feature engineering, model training, website creation, Dockerization, CI/CD implementation, and AWS deployment setup. The objective is to deliver a robust, scalable solution for data analysis and predictive analytics.
- Set Up GitHub and Local Folder
  - Create GitHub repo and .gitignore
  - Create venv
  - Create setup.py
  - Create requirements.txt
- Create Source Code Structure
  - Create src directory and build the package (requirements.txt)
    - Create component files: data_ingestion.py, data_transformation.py, model_trainer.py
    - Create pipeline files: predict_pipeline.py, train_pipeline.py
    - Create exception, logger, and utils files: exceptions.py, logger.py, utils.py
- Exploratory Data Analysis (EDA) in Jupyter Notebook
  - Perform EDA
  - Handle missing values
  - Remove duplicate values
  - Data cleaning
  - Data imputation
  - Feature engineering
  - Train-test split
  - Identify best performing models
  - Model evaluation (R2)
- Create Simple Webpage for User Input
- Write Modular Code with respect to the Jupyter Notebook and Test on Local Server (Flask)
- Docker Configuration and Deployment
  - Docker setup and configuration

        sudo apt-get update -y
        sudo apt-get upgrade
        curl -fsSL https://get.docker.com -o get-docker.sh
        sudo sh get-docker.sh
        sudo usermod -aG docker ubuntu
        newgrp docker

  - Build Docker image
- Configure GitHub Workflow and CI/CD Action Runner
- Set Up AWS Resources for Deployment
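
The exception and logger files listed in the source-structure step are not reproduced in this README. As a rough sketch of what such helpers might contain, assuming only the Python standard library: `CustomException`, `get_logger`, and `safe_divide` are hypothetical names chosen for illustration, and the real project keeps this logic in separate files (exceptions.py and logger.py) rather than in one block.

```python
import logging
import sys


class CustomException(Exception):
    """Hypothetical exception that records where the original error occurred."""

    def __init__(self, message):
        # sys.exc_info() returns the exception currently being handled, if any.
        _, _, tb = sys.exc_info()
        if tb is not None:
            # Walk to the innermost frame, where the error was actually raised.
            while tb.tb_next is not None:
                tb = tb.tb_next
            location = f"{tb.tb_frame.f_code.co_filename}:{tb.tb_lineno}"
        else:
            location = "unknown location"
        super().__init__(f"{message} (raised at {location})")


def get_logger(name="student_score"):
    """Return a logger with one stream handler; repeated calls reuse it."""
    logger = logging.getLogger(name)
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger


def safe_divide(a, b):
    """Toy function showing how a component might wrap and log an error."""
    try:
        return a / b
    except ZeroDivisionError as err:
        get_logger().error("Division failed: %s", err)
        raise CustomException(str(err)) from err
```

Wrapping low-level errors this way keeps the file name and line number in the message, which can help when the same code runs inside a container where interactive debugging is less convenient.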