This repository contains the final project for our machine learning class. The project monitors air quality in Athens, Greece, with a focus on time series analysis and multivariate forecasting of PM2.5 levels using several machine learning and deep learning algorithms.
- Analyze air quality data from Athens as time series data.
- Evaluate and compare the performance of different models on various schemes.
The dataset used is "Regional Datasets for Air Quality Monitoring in European Cities", published at IGARSS 2024 (the IEEE International Geoscience and Remote Sensing Symposium), held July 7-12, 2024 in Athens, Greece. The dataset is also available on Kaggle, uploaded by the user Vladimir Demidov. Details of the data features are provided in the data card.
The project workflow more or less follows the given flowchart:
Explanation:
- Data Analysis: self-explanatory.
- Data Preprocessing: create temporal features and handle outliers.
- Model Construction/Initialization: the models used are Linear Regression, Random Forest, XGBoost, and LSTM with the given architecture:
- Compare and Evaluate: the models are compared and evaluated on various schemes. Most of the schemes are straightforward, but one is worth noting, namely scheme 4.
After some testing, we found that the model has a hard time predicting fluctuations and extreme values in the data. The intention of this scheme is to use bagging to compensate for the model's tendency to underpredict, without straying too far from the ground truth.
- w/o bagging
- with bagging
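
For readers unfamiliar with bagging, here is a minimal sketch of the general idea: fit the same base model on several bootstrap resamples of the training set and average the predictions. It is not the configuration used in this project. The least-squares base model, the function names (`fit_linear`, `predict_linear`, `bagged_predict`), the number of estimators, and the synthetic data are all assumptions made for illustration. It also treats rows as independent samples, which for time series presumes that lag and temporal features have already been built.

```python
import numpy as np


def fit_linear(X, y):
    """Fit ordinary least squares with an intercept; return the coefficients."""
    X1 = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return coef


def predict_linear(coef, X):
    """Predict with coefficients produced by fit_linear."""
    X1 = np.column_stack([np.ones(len(X)), X])
    return X1 @ coef


def bagged_predict(X_train, y_train, X_test, n_estimators=25, seed=0):
    """Average the predictions of models fitted on bootstrap resamples."""
    rng = np.random.default_rng(seed)
    n = len(X_train)
    preds = []
    for _ in range(n_estimators):
        idx = rng.integers(0, n, size=n)  # draw n rows with replacement
        coef = fit_linear(X_train[idx], y_train[idx])
        preds.append(predict_linear(coef, X_test))
    return np.mean(preds, axis=0)


# Synthetic stand-in data (hypothetical, not the Athens dataset).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 10.0 + rng.normal(scale=0.1, size=200)

# Chronological-style split: first 150 rows for training, the rest for testing.
X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

single = predict_linear(fit_linear(X_train, y_train), X_test)
bagged = bagged_predict(X_train, y_train, X_test)

rmse_single = float(np.sqrt(np.mean((single - y_test) ** 2)))
rmse_bagged = float(np.sqrt(np.mean((bagged - y_test) ** 2)))
```

Comparing `rmse_single` and `rmse_bagged` gives a without/with bagging comparison on the same test split. scikit-learn offers the same technique in library form as `sklearn.ensemble.BaggingRegressor`, which wraps an arbitrary base regressor.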
Special thanks to my colleagues, Keanu Taufan and Zelvan Wijaya, and to our machine learning class lecturer, Dini Adni Navastara, and teaching assistants for their support and guidance throughout this project.