Air pollution is a critical environmental issue, and predicting air quality is essential for public health and policymaking.
This project uses Machine Learning (ML) techniques to predict the Air Quality Index (AQI) from key environmental pollutants.
The models implemented include Linear Regression, Random Forest, and XGBoost, with XGBoost emerging as the best-performing model.
🔹 Dataset (CSV File): Download city_day.csv
🔹 Dataset (Kaggle Link): View on Kaggle
🔹 Project Code (.ipynb): View Jupyter Notebook
🔹 Presentation (PPTX File): Download Project Report
The dataset used for this project contains 29,531 records and 16 features, including:
- Pollutants: PM2.5, PM10, NO₂, SO₂, CO, O₃, Benzene, etc.
- Date & City: Identifying the location and time of recording.
- AQI (Target Variable): Measures pollution severity and categorizes it into buckets (Good, Moderate, Poor, etc.).
✔️ Handling Missing Values:
- Removed rows with missing AQI values.
- Imputed missing pollutant values using mean imputation.
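The two steps above can be sketched with pandas as follows. This is a minimal illustration, not the notebook's actual code: the tiny frame and its values are made up, and the column names `PM2.5`, `CO`, and `AQI` are assumed to match those in `city_day.csv`.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for city_day.csv (illustrative values only).
df = pd.DataFrame({
    "PM2.5": [80.0, np.nan, 40.0, 60.0],
    "CO":    [1.0, 2.0, np.nan, 3.0],
    "AQI":   [150.0, 200.0, np.nan, 120.0],
})

# Step 1: drop rows where the target (AQI) is missing.
clean = df.dropna(subset=["AQI"]).copy()

# Step 2: fill remaining gaps in pollutant columns with each column's mean.
pollutants = ["PM2.5", "CO"]  # assumed subset of the pollutant columns
clean[pollutants] = clean[pollutants].fillna(clean[pollutants].mean())

print(clean)
```

Here the mean is computed after the AQI-less rows are removed, so dropped rows do not influence the imputed values.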
✔️ Feature Scaling:
- Used StandardScaler to standardize numerical features (zero mean, unit variance).
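A small sketch of what standardization does, assuming scikit-learn's `StandardScaler` and a made-up two-column feature matrix:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature matrix: two pollutant columns on very different scales.
X = np.array([
    [80.0, 1.0],
    [70.0, 2.0],
    [60.0, 3.0],
    [50.0, 6.0],
])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Each column now has (approximately) zero mean and unit variance.
print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
```

In a train/test workflow, a common choice is to call `fit` on the training split only and reuse the fitted scaler with `transform` on the test split; whether the notebook does this is not stated here.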
✔️ Feature Selection:
- Analyzed feature correlation using a heatmap to determine the most influential pollutants.
- PM2.5 (0.65) and CO (0.68) showed the highest correlation with AQI.
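The correlation ranking behind such a heatmap can be sketched with pandas alone. The data below is synthetic and chosen only to make the ranking obvious; it does not reproduce the 0.65 and 0.68 figures reported above.

```python
import pandas as pd

# Synthetic data: AQI is an exact linear function of PM2.5 here.
df = pd.DataFrame({
    "PM2.5": [10.0, 20.0, 30.0, 40.0, 50.0],
    "CO":    [0.5, 0.4, 0.9, 1.1, 1.0],
    "O3":    [30.0, 10.0, 25.0, 5.0, 20.0],
    "AQI":   [50.0, 90.0, 130.0, 170.0, 210.0],
})

# Pearson correlation of every feature with the target, strongest first.
corr_with_aqi = df.corr()["AQI"].drop("AQI").sort_values(ascending=False)
print(corr_with_aqi)
```

The full matrix returned by `df.corr()` is what a heatmap would typically display.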
The project explores different regression models to predict AQI:
🔹 Linear Regression: simple and interpretable, but struggles with complex patterns.
🔹 Random Forest: uses multiple decision trees for better accuracy and handles non-linearity.
🔹 XGBoost:
✅ More accurate, faster, and better at handling missing data than Random Forest.
✅ Captures complex relationships effectively with boosting techniques.
✅ Reduces overfitting and performs well on large datasets.
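The gap between a linear model and a tree ensemble on non-linear data can be illustrated on a synthetic problem. This is only a sketch: the quadratic target, sample size, and model settings are assumptions chosen for the demonstration, and XGBoost is left out to keep the example dependent on scikit-learn alone.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic non-linear target: y depends on x squared, plus a little noise.
rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(400, 1))
y = X[:, 0] ** 2 + rng.normal(0.0, 0.2, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit both models and record R² on the held-out split.
scores = {}
for name, model in [
    ("linear", LinearRegression()),
    ("forest", RandomForestRegressor(n_estimators=50, random_state=0)),
]:
    model.fit(X_train, y_train)
    scores[name] = r2_score(y_test, model.predict(X_test))

print(scores)
```

A straight line cannot follow a symmetric curve, so the linear model explains almost none of the variance here, while the forest fits it closely.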
To measure performance, we used the following metrics:
- Mean Absolute Error (MAE): lower values indicate better accuracy.
- Root Mean Squared Error (RMSE): penalizes large prediction errors.
- R² Score (Coefficient of Determination): measures the share of variance explained by the model.
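For reference, the three metrics can be computed by hand in a few lines. The helper name `regression_metrics` and the sample values are hypothetical; in practice, `sklearn.metrics` provides equivalent functions.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R²) for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)          # residual sum of squares
    rmse = math.sqrt(ss_res / n)

    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return mae, rmse, r2


# Made-up AQI values for illustration.
mae, rmse, r2 = regression_metrics([100, 150, 200, 250], [110, 140, 210, 230])
print(mae, rmse, r2)  # MAE = 12.5, RMSE ≈ 13.23, R² = 0.944
```

The single error of 20 contributes more than half of the squared-error total, which is why RMSE ends up above MAE.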
Model Comparison:
Model | MAE | RMSE | R² Score |
---|---|---|---|
Linear Regression | High | High | Low |
Random Forest | Moderate | Moderate | Moderate |
XGBoost (best) | Low | Low | High |
To improve the XGBoost model, we optimized the following parameters using GridSearchCV:
- n_estimators: [50, 100, 200]
- learning_rate: [0.01, 0.1, 0.2]
- max_depth: [3, 5, 7]
✅ Best Configuration: learning_rate = 0.2, max_depth = 5, n_estimators = 100
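A grid search over these values might look like the sketch below. The synthetic data, the `scoring` choice, and `cv=3` are assumptions, not settings taken from the notebook. If `xgboost` is not installed, the example falls back to scikit-learn's `GradientBoostingRegressor`, which is a different implementation that happens to accept the same three parameter names.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV

try:
    from xgboost import XGBRegressor as Booster
except ImportError:
    # Stand-in only: a different gradient-boosting implementation.
    from sklearn.ensemble import GradientBoostingRegressor as Booster

# Synthetic regression data (illustrative, not the AQI dataset).
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(150, 3))
y = 100 * X[:, 0] + 50 * X[:, 1] ** 2 + rng.normal(0.0, 1.0, size=150)

# The grid listed above: 3 x 3 x 3 = 27 candidate configurations.
param_grid = {
    "n_estimators": [50, 100, 200],
    "learning_rate": [0.01, 0.1, 0.2],
    "max_depth": [3, 5, 7],
}

search = GridSearchCV(
    Booster(random_state=0),
    param_grid,
    scoring="neg_mean_absolute_error",  # assumed metric
    cv=3,                               # assumed number of folds
)
search.fit(X, y)

print(search.best_params_)
```

The best parameters found on this synthetic data will generally differ from the configuration reported above, since they depend on the dataset.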
The final model was saved using joblib for future predictions.
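Saving and reloading a fitted estimator with joblib can be sketched as follows. A small `LinearRegression` stands in for the tuned XGBoost model, and the file name `aqi_model.joblib` is hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Stand-in model fitted on made-up data (the project used XGBoost).
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([10.0, 20.0, 30.0, 40.0])
model = LinearRegression().fit(X, y)

# Persist the fitted model to disk, then load it back.
path = os.path.join(tempfile.mkdtemp(), "aqi_model.joblib")  # hypothetical name
joblib.dump(model, path)
restored = joblib.load(path)

# The restored model produces the same predictions as the original.
print(restored.predict(X))
```

Any preprocessing objects fitted during training, such as a scaler, usually need to be saved alongside the model so that new data is transformed the same way.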
- The trained XGBoost Regressor effectively predicts AQI values for unseen data.
- A line plot was used to visualize Actual vs. Predicted AQI, confirming the model's accuracy.
🔹 Improve accuracy using Deep Learning (LSTMs, CNNs).
🔹 Integrate real-time AQI data & meteorological features.
🔹 Develop AI-powered dashboards & mobile apps for real-time AQI tracking.
This was a group project, and I contributed to every stage of it, including:
✔️ Data Preprocessing & Cleaning
✔️ Model Implementation & Evaluation
✔️ Hyperparameter Tuning & Optimization
✔️ Model Interpretation & Documentation