AQI prediction using Machine Learning (Linear Regression, Random Forest, XGBoost) with XGBoost as the best model, featuring data preprocessing, training, evaluation, and tuning for accurate air pollution insights. 🚀

SunnyRao07/Air-Quality-Index-AQI-Prediction

🌍 Air Quality Index (AQI) Prediction

📌 Project Overview

Air pollution is a critical environmental issue, and predicting air quality is essential for public health and policymaking.
This project uses Machine Learning (ML) techniques to predict the Air Quality Index (AQI) from the concentrations of key pollutants.
The models implemented include Linear Regression, Random Forest, and XGBoost, with XGBoost emerging as the best-performing model.


📂 Project Resources

🔹 Dataset (CSV File): Download city_day.csv
🔹 Dataset (Kaggle Link): View on Kaggle
🔹 Project Code (.ipynb): View Jupyter Notebook
🔹 Presentation (PPTX File): Download Project Report


📊 Dataset Overview

The dataset used for this project contains 29,531 records and 16 features, including:

  • Pollutants: PM2.5, PM10, NO₂, SO₂, CO, O₃, Benzene, etc.
  • Date & City: Identifying the location and time of recording.
  • AQI (Target Variable): Measures pollution severity and categorizes it into buckets (Good, Moderate, Poor, etc.).
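
The bucket labels can be derived from the numeric AQI. The sketch below assumes the dataset follows India's National Air Quality Index breakpoints (Good 0-50, Satisfactory 51-100, Moderate 101-200, Poor 201-300, Very Poor 301-400, Severe above 400). `aqi_bucket` is a hypothetical helper for illustration, not a function from the notebook:

```python
def aqi_bucket(aqi: float) -> str:
    """Map a numeric AQI to a category label (assumed India NAQI breakpoints)."""
    if aqi < 0:
        raise ValueError("AQI cannot be negative")
    # (upper bound, label) pairs, checked in ascending order.
    breakpoints = [
        (50, "Good"),
        (100, "Satisfactory"),
        (200, "Moderate"),
        (300, "Poor"),
        (400, "Very Poor"),
    ]
    for upper, label in breakpoints:
        if aqi <= upper:
            return label
    # Anything above the last breakpoint falls into the top category.
    return "Severe"


for value in (42, 150, 410):
    print(value, aqi_bucket(value))
```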

🛠 Data Preprocessing

✔️ Handling Missing Values:

  • Removed rows with missing AQI values.
  • Imputed missing pollutant values using mean imputation.
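
The two steps above can be sketched with pandas. This is a minimal illustration on a made-up frame; the column subset and the values are assumptions, and the notebook may structure this differently:

```python
import numpy as np
import pandas as pd

# Tiny stand-in for city_day.csv; the values are made up for illustration.
df = pd.DataFrame({
    "PM2.5": [80.0, np.nan, 60.0, 40.0],
    "CO": [1.2, 0.8, np.nan, 1.0],
    "AQI": [150.0, 120.0, np.nan, 90.0],
})

# 1. Drop rows where the target (AQI) is missing.
df = df.dropna(subset=["AQI"]).reset_index(drop=True)

# 2. Fill the remaining gaps in the pollutant columns with each column's mean.
pollutants = ["PM2.5", "CO"]
df[pollutants] = df[pollutants].fillna(df[pollutants].mean())

print(df)
```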

✔️ Feature Scaling:

  • Used StandardScaler to standardize numerical features (zero mean, unit variance).
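
As a rough sketch of what standardization does, the snippet below scales two made-up pollutant columns with scikit-learn's `StandardScaler`. The input values are illustrative only:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up pollutant readings: columns are PM2.5 and CO.
X = np.array([[80.0, 1.2], [60.0, 0.8], [40.0, 1.0]])

scaler = StandardScaler()
# fit_transform learns each column's mean and standard deviation, then rescales.
X_scaled = scaler.fit_transform(X)

print(X_scaled.mean(axis=0))  # approximately 0 for every column
print(X_scaled.std(axis=0))   # approximately 1 for every column
```

When a train/test split is involved, a common practice is to fit the scaler on the training data only and apply `scaler.transform` to the test data.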

✔️ Feature Selection:

  • Analyzed feature correlation using a heatmap to determine the most influential pollutants.
  • PM2.5 (0.65) and CO (0.68) showed the highest correlation with AQI.
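
The correlation analysis can be approximated as follows. The data here is synthetic (so the coefficients will not match the 0.65 and 0.68 reported above), and the column names are assumptions:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: AQI is built mostly from PM2.5 and CO, not SO2.
rng = np.random.default_rng(0)
n = 500
pm25 = rng.normal(60, 20, n)
co = rng.normal(1.0, 0.3, n)
so2 = rng.normal(10, 3, n)
aqi = 1.5 * pm25 + 40 * co + rng.normal(0, 15, n)
df = pd.DataFrame({"PM2.5": pm25, "CO": co, "SO2": so2, "AQI": aqi})

# Pearson correlation matrix (the pandas default).
corr = df.corr()

# Rank pollutants by their correlation with the target.
aqi_corr = corr["AQI"].drop("AQI").sort_values(ascending=False)
print(aqi_corr)
```

The `corr` matrix is what a heatmap would display, for example with `seaborn.heatmap(corr, annot=True)` if seaborn is the plotting library in use.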

🤖 Models Used

The project explores different regression models to predict AQI:

1️⃣ Linear Regression

🔹 Simple and interpretable but struggles with complex patterns.

2️⃣ Random Forest Regressor

🔹 Uses multiple decision trees for better accuracy and handles non-linearity.

3️⃣ XGBoost Regressor (🏆 Best Model)

✅ Typically more accurate and faster to train than Random Forest, with native handling of missing values.
✅ Captures complex relationships effectively through gradient boosting.
✅ Uses built-in regularization to limit overfitting and performs well on large datasets.


📈 Model Evaluation

To measure performance, we used the following metrics:

📌 Mean Absolute Error (MAE) – Lower values indicate better accuracy.
📌 Root Mean Squared Error (RMSE) – Penalizes large prediction errors.
📌 R² Score (Coefficient of Determination) – Measures variance explained by the model.
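
To make the three metrics concrete, here is a small pure-Python sketch that computes them from their definitions. `regression_metrics` is a hypothetical helper and the sample values are made up; in practice `sklearn.metrics` provides `mean_absolute_error`, `mean_squared_error`, and `r2_score`:

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R^2) for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n

    ss_res = sum(e * e for e in errors)          # residual sum of squares
    rmse = math.sqrt(ss_res / n)

    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1 - ss_res / ss_tot

    return mae, rmse, r2


mae, rmse, r2 = regression_metrics([100, 150, 200], [110, 140, 200])
print(f"MAE={mae:.3f}  RMSE={rmse:.3f}  R^2={r2:.3f}")
```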

📌 Model Comparison:

| Model | MAE | RMSE | R² Score |
|---|---|---|---|
| Linear Regression | High | High | Low |
| Random Forest | Moderate | Moderate | Moderate |
| XGBoost 🏆 | Low | Low | High |

🔧 Hyperparameter Tuning

To improve the XGBoost model, we optimized the following parameters using GridSearchCV:

  • n_estimators: [50, 100, 200]
  • learning_rate: [0.01, 0.1, 0.2]
  • max_depth: [3, 5, 7]

✅ Best Configuration:
learning_rate = 0.2, max_depth = 5, n_estimators = 100

The final model was saved using joblib for future predictions.


📌 Results & Predictions

  • The trained XGBoost Regressor effectively predicts AQI values for unseen data.
  • A line plot of actual vs. predicted AQI shows how closely the predictions track the observed values.
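
A plot of this kind could be produced roughly as below, assuming matplotlib. The `y_test` and `y_pred` arrays are synthetic placeholders for the real held-out values and model predictions:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Placeholder data: "predictions" are the actual values plus small noise.
rng = np.random.default_rng(1)
y_test = rng.uniform(50, 300, size=60)
y_pred = y_test + rng.normal(0, 15, size=60)

fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(y_test, label="Actual AQI")
ax.plot(y_pred, label="Predicted AQI", linestyle="--")
ax.set_xlabel("Test sample")
ax.set_ylabel("AQI")
ax.set_title("Actual vs. Predicted AQI")
ax.legend()
fig.tight_layout()
```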

🚀 Future Scope

🔹 Improve accuracy using Deep Learning (LSTMs, CNNs).
🔹 Integrate real-time AQI data & meteorological features.
🔹 Develop AI-powered dashboards & mobile apps for real-time AQI tracking.


🏆 Individual Contribution

This was a group project. I contributed to every stage of it, including:
✔️ Data Preprocessing & Cleaning
✔️ Model Implementation & Evaluation
✔️ Hyperparameter Tuning & Optimization
✔️ Model Interpretation & Documentation

