This project is a machine learning pipeline designed to predict the efficiency of solar panels based on sensor and operational data. The goal is to help optimize solar panel performance, anticipate maintenance needs, and improve energy yield by providing accurate efficiency predictions.
- Renewable energy optimization: Solar panel efficiency can be affected by many factors (weather, age, soiling, etc.). Predicting efficiency helps operators maximize output and plan maintenance.
- Data-driven insights: By leveraging advanced machine learning, we can uncover hidden patterns and relationships in solar panel data.
- Automation: The pipeline automates feature selection, hyperparameter tuning, and model ensembling for robust, production-ready predictions.
- Loads and preprocesses solar panel data (from CSV files)
- Selects the most important features using LightGBM
- Tunes model hyperparameters using Optuna (Bayesian optimization)
- Trains a stacking ensemble of multiple models (LightGBM, XGBoost, CatBoost, RandomForest)
- Evaluates model performance with metrics such as RMSE, MAE, and R²
- Saves the trained model and selected features for reproducible predictions
- Generates predictions for new, unseen test data
- Python: The main programming language for data science and machine learning.
- pandas & NumPy: For data manipulation, cleaning, and numerical operations.
- scikit-learn: For model selection, feature selection, the stacking ensemble, and evaluation metrics.
- Key features used:
  - `SelectFromModel`: Feature selection based on model importance.
  - `StackingRegressor`: Combines multiple models for improved performance.
  - `train_test_split`, `KFold`: For splitting data and cross-validation.
- LightGBM: A fast, efficient gradient boosting framework.
- Key hyperparameters:
  - `learning_rate`: Controls how much the model learns in each iteration.
  - `num_leaves`: Number of leaves in one tree (controls complexity).
  - `max_depth`: Maximum tree depth.
  - `n_estimators`: Number of boosting rounds (trees).
  - `min_child_samples`: Minimum samples in a leaf.
  - `subsample`, `colsample_bytree`: Row/column sampling for regularization.
  - `reg_alpha`, `reg_lambda`: L1/L2 regularization.
  - `categorical_feature`: Native support for categorical columns.
  - `early_stopping`: Stops training when the validation score stops improving.
- XGBoost: Another high-performance gradient boosting library.
- Used as a base learner in the stacking ensemble.
- CatBoost: Gradient boosting with excellent categorical feature support.
- Used as a base learner in the stacking ensemble.
- Optuna: An automated hyperparameter optimization framework.
- Features:
- Bayesian optimization for efficient search.
- Parallel/distributed search support.
- Easy integration with scikit-learn and LightGBM.
- joblib: For saving and loading models and feature lists.
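Assuming joblib is the serializer behind the `.pkl` artifacts in `models/`, persisting and reloading might look like this sketch (the model and feature names are hypothetical):

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=50, n_features=5, random_state=0)
model = Ridge().fit(X, y)
selected_features = ["irradiance", "temperature", "humidity"]  # hypothetical names

# Persist the model and the feature list side by side, as the pipeline does
out_dir = Path(tempfile.mkdtemp())
joblib.dump(model, out_dir / "ensemble_model.pkl")
joblib.dump(selected_features, out_dir / "selected_features.pkl")

# Reload for reproducible predictions
loaded_model = joblib.load(out_dir / "ensemble_model.pkl")
loaded_features = joblib.load(out_dir / "selected_features.pkl")
print(loaded_features)
```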
- Matplotlib & Seaborn: For data visualization and exploratory data analysis (EDA).
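Assuming a Matplotlib-based EDA step, a tiny headless plotting sketch (the efficiency values are synthetic, not from the real dataset):

```python
import tempfile
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
efficiency = rng.normal(loc=18.0, scale=2.0, size=500)  # hypothetical efficiency values (%)

# Histogram of the target distribution -- a typical first EDA plot
fig, ax = plt.subplots()
ax.hist(efficiency, bins=30)
ax.set_xlabel("Panel efficiency (%)")
ax.set_ylabel("Count")
ax.set_title("Distribution of panel efficiency")
fig.savefig(Path(tempfile.mkdtemp()) / "efficiency_hist.png")
```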
Project structure:

```
dataset/
    Clean_X_Train.csv
    Clean_Test_Data.csv
src/
    modelling/
        model_training.py
        model_evaluation.py
    utils/
        visualization.py
reports/
    evaluation_report.csv
    final_submission.csv
models/
    ensemble_model.pkl
    selected_features.pkl
main.py
predict.py
requirements.txt
README.md
```
1. Install dependencies: `pip install -r requirements.txt`
2. Prepare your data: place the cleaned train/test CSVs in the `dataset/` folder.
3. Train and evaluate the model: `python main.py`
4. Generate predictions for new data: `python predict.py`
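`predict.py` presumably follows the shape below: load the saved model and feature list, score the cleaned test data, and write a submission CSV. Paths mirror the repository layout above, but the setup block, the Ridge model, and all feature names are stand-ins, not this project's actual code.

```python
import tempfile
from pathlib import Path

import joblib
import pandas as pd
from sklearn.linear_model import Ridge

# --- setup: stand-in artifacts (the real repo ships these under models/) ---
root = Path(tempfile.mkdtemp())
(root / "models").mkdir()
(root / "reports").mkdir()
features = ["irradiance", "temperature"]      # hypothetical feature names
train = pd.DataFrame({"irradiance": [800, 900, 1000, 650],
                      "temperature": [25, 30, 35, 20]})
target = pd.Series([18.5, 18.1, 17.6, 18.9])  # hypothetical efficiency (%)
joblib.dump(Ridge().fit(train[features], target), root / "models" / "ensemble_model.pkl")
joblib.dump(features, root / "models" / "selected_features.pkl")

# --- the predict flow: load artifacts, score new data, write a submission ---
model = joblib.load(root / "models" / "ensemble_model.pkl")
selected = joblib.load(root / "models" / "selected_features.pkl")
test_df = pd.DataFrame({"irradiance": [750, 880], "temperature": [22, 28]})
test_df["predicted_efficiency"] = model.predict(test_df[selected])
test_df.to_csv(root / "reports" / "final_submission.csv", index=False)
print(test_df["predicted_efficiency"].round(2).tolist())
```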
- Feature engineering: Add new features in your preprocessing scripts for better performance.
- Model tuning: Adjust the Optuna search space or the stacking ensemble in `model_training.py`.
- Evaluation: Extend `model_evaluation.py` with more metrics or plots.
- Team Elytra
  - Gopal Dutta
  - Chaitany Agrawal
This project is for educational and research purposes.