This project was carried out as part of the AI and Applications course in the 2024-2025 academic year. It predicts next-day rainfall from Algerian weather data using machine learning and deep learning models.
The task is a binary classification of the RainTomorrow target, using weather data collected between January 2010 and December 2014. The target is "Yes" if next-day rainfall exceeds 1 mm, and "No" otherwise. We applied both traditional machine learning algorithms and deep learning techniques to this problem.
The project began with an extensive analysis of the dataset, which included:
- Handling missing values.
- Identifying the features most strongly correlated with the target variable.
- Transforming categorical variables into numerical features.
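
The three preparation steps above can be sketched with pandas. This is a minimal illustration, not the notebook's actual code: the column names `Humidity`, `Sunshine`, and `WindDir` are hypothetical, and median/mode imputation is only one possible choice for handling missing values.

```python
import numpy as np
import pandas as pd


def preprocess(df: pd.DataFrame, target: str = "RainTomorrow") -> pd.DataFrame:
    """Impute missing values and encode categorical columns (illustrative only)."""
    out = df.copy()
    # Encode the binary target: "Yes" -> 1, "No" -> 0; drop rows with no label.
    out[target] = out[target].map({"Yes": 1, "No": 0})
    out = out.dropna(subset=[target])
    out[target] = out[target].astype(int)

    num_cols = out.select_dtypes(include="number").columns.drop(target)
    cat_cols = out.select_dtypes(exclude="number").columns

    # Assumption: median for numeric gaps, most frequent value for categorical gaps.
    out[num_cols] = out[num_cols].fillna(out[num_cols].median())
    for col in cat_cols:
        out[col] = out[col].fillna(out[col].mode().iloc[0])

    # One-hot encode categorical variables into numerical features.
    return pd.get_dummies(out, columns=list(cat_cols), dtype=int)


# Toy data with hypothetical column names.
toy = pd.DataFrame({
    "Humidity": [80.0, 45.0, np.nan, 90.0, 30.0, 85.0],
    "Sunshine": [2.0, 9.5, 8.0, np.nan, 11.0, 1.5],
    "WindDir": ["N", "S", None, "N", "E", "N"],
    "RainTomorrow": ["Yes", "No", "No", "Yes", "No", "Yes"],
})
clean = preprocess(toy)

# Rank features by absolute correlation with the target.
correlations = clean.corr()["RainTomorrow"].drop("RainTomorrow")
print(correlations.sort_values(key=abs, ascending=False))
```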
Several machine learning models were implemented and evaluated, including:
- Logistic Regression
- SVM (Support Vector Machine)
- Random Forest
- K-Nearest Neighbors (K-NN)
- Decision Trees
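
As a rough sketch of how the five models listed above can be compared, the following trains each one with scikit-learn and reports recall on the minority class. The data here is synthetic (`make_classification` with an assumed 80/20 class split), and the hyperparameters are library defaults rather than the values used in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the weather data: about 20% positive ("rain") samples.
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.8, 0.2], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Scale-sensitive models are wrapped in a pipeline with StandardScaler.
models = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "Random Forest": RandomForestClassifier(random_state=0),
    "K-NN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

recalls = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # Recall on class 1 (rainy days) is the metric of interest here.
    recalls[name] = recall_score(y_test, model.predict(X_test), pos_label=1)
    print(f"{name}: recall(class 1) = {recalls[name]:.2f}")
```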
We explored various deep learning architectures to improve the prediction:
- Simple Fully Connected Networks
- Convolutional Models
- LSTM (Long Short-Term Memory) Models for capturing temporal dependencies in the data.
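
To illustrate how an LSTM can use temporal context, the sketch below slices daily features into fixed-length windows and classifies each window from the LSTM's final hidden state. It is written in plain PyTorch for brevity and is not the notebook's `LSTMModel`: the class name `LSTMSketch`, the helper `make_windows`, the 7-day window, and the hidden size are all assumptions.

```python
import torch
from torch import nn


class LSTMSketch(nn.Module):
    """Single-layer LSTM followed by a linear head that outputs one logit."""

    def __init__(self, n_features: int, hidden_size: int = 32):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_features, hidden_size=hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (batch, seq_len, n_features).
        _, (h_n, _) = self.lstm(x)
        # h_n[-1] is the last layer's final hidden state, shape (batch, hidden_size).
        return self.head(h_n[-1]).squeeze(-1)


def make_windows(features: torch.Tensor, labels: torch.Tensor, seq_len: int):
    """Build overlapping windows; each window takes the label of its last day."""
    n_windows = len(features) - seq_len + 1
    xs = torch.stack([features[i:i + seq_len] for i in range(n_windows)])
    ys = labels[seq_len - 1:]
    return xs, ys


torch.manual_seed(0)
features = torch.randn(30, 4)             # 30 days, 4 hypothetical features
labels = (torch.rand(30) > 0.8).float()   # random imbalanced labels for the demo

windows, targets = make_windows(features, labels, seq_len=7)
model = LSTMSketch(n_features=4)
logits = model(windows)
loss = nn.BCEWithLogitsLoss()(logits, targets)
print(windows.shape, logits.shape, loss.item())
```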
Evaluation metrics included accuracy, precision, recall, and F1 score. We paid particular attention to recall on the rainy class, because accuracy alone is misleading when the classes are imbalanced.
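
A small hand-made example shows why accuracy alone can mislead on imbalanced data. The labels below are invented for illustration: the classifier misses one of the two rainy days yet still reaches 90% accuracy.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# 8 dry days (0) and 2 rainy days (1); one rainy day is missed.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

accuracy = accuracy_score(y_true, y_pred)
# average="binary" reports the metrics for the positive class only.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, pos_label=1, average="binary"
)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# accuracy=0.90 precision=1.00 recall=0.50 f1=0.67
```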
Machine learning models such as Logistic Regression, SVM, and Random Forest demonstrated solid performance in predicting the absence of rain (class 0). However, they struggled to correctly identify rainy days (class 1), as evidenced by the low recall and F1 scores for this class. The class imbalance had a significant impact on performance.
Deep learning models, especially the LSTM networks, captured temporal dependencies better and improved recall for rainy days (class 1) compared to the machine learning models. However, this came at the cost of lower overall accuracy, which reflects the usual trade-off between recall and precision on imbalanced datasets.
In summary, machine learning models performed better for the majority class (no rain), while deep learning models, particularly LSTMs, showed better performance in predicting rainy days. However, both approaches faced challenges with the class imbalance in the dataset.
To improve results, further exploration of the following strategies could be valuable:
- Data augmentation to balance the classes.
- Generative Adversarial Networks (GANs) for synthetic data generation.
- Hyperparameter optimization with tools like Optuna to fine-tune models for the specific dataset.
- Data imputation based on seasonality to better handle missing data, especially for features such as sunlight, which follow a similar pattern from year to year.
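
As one possible reading of the seasonal imputation idea above, the sketch below fills each missing value with the mean of the same calendar month across all years. The function name `impute_by_month` and the column names `Date` and `Sunshine` are hypothetical, and monthly granularity is an assumption; a day-of-year or rolling average would be equally plausible.

```python
import numpy as np
import pandas as pd


def impute_by_month(df: pd.DataFrame, column: str, date_col: str = "Date") -> pd.DataFrame:
    """Fill gaps in `column` with the mean of the same calendar month."""
    out = df.copy()
    month = pd.to_datetime(out[date_col]).dt.month
    # transform("mean") returns one value per row; NaNs are ignored in the mean.
    monthly_mean = out.groupby(month)[column].transform("mean")
    out[column] = out[column].fillna(monthly_mean)
    return out


# Toy data: the January 2011 reading is missing.
toy = pd.DataFrame({
    "Date": ["2010-01-15", "2010-07-15", "2011-01-15", "2011-07-15"],
    "Sunshine": [4.0, 11.0, np.nan, 12.0],
})
filled = impute_by_month(toy, "Sunshine")
print(filled)
```

In this toy example the gap is filled with the January 2010 value, since that is the only other January observation. On real data the monthly means should be computed on the training split only, to avoid leaking information from the test period.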
While the current models provide a solid foundation for predicting rainfall, improving the detection of rainy days remains a key challenge, requiring more advanced methods and potentially more balanced datasets.
To run the notebooks, you need:
- Python installed on your machine.
- The scikit-learn, PyTorch, and PyTorch Lightning libraries.
- Jupyter or VS Code to open and run the notebooks.
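
A quick way to confirm that the libraries listed above are available is to check for them before opening the notebooks. This is only a convenience sketch: the helper `check_environment` is hypothetical, and the import name `pytorch_lightning` is an assumption (newer Lightning releases can also be imported as `lightning`).

```python
from importlib.util import find_spec

# Display name -> top-level import name (the Lightning name is an assumption).
REQUIRED = {
    "scikit-learn": "sklearn",
    "PyTorch": "torch",
    "PyTorch Lightning": "pytorch_lightning",
}


def check_environment(required=REQUIRED):
    """Return a dict mapping each library to True if it can be imported."""
    return {name: find_spec(module) is not None for name, module in required.items()}


status = check_environment()
for name, ok in status.items():
    print(f"{name}: {'found' if ok else 'missing'}")
```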
- rain-prediction-machine-learning.ipynb → Implements machine learning models such as Logistic Regression, SVM, Random Forest, K-NN, and Decision Trees.
- rain-prediction-deep-learning.ipynb → Implements deep learning models, including SimpleModel, ConvModel, and LSTMModel, using PyTorch Lightning.
- The LSTM model showed significant improvements in recall for predicting rainy days compared to other models.