The `asf_hp_cost_estimator_model` repository contains the code to model and predict the cost of an air source heat pump:
- in residential properties
- installed as part of a retrofit (heat pumps installed in new builds or as part of a cluster of installations are excluded)
- in houses or bungalows (flats are excluded from the analysis)
- with 2 or more habitable rooms and a floor area between 20 and 500 m²
- for Scotland, Wales and English regions
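The scope above can be read as a set of row filters. The sketch below is a minimal illustration using pandas; every column name in it (`property_type`, `is_new_build`, `is_cluster`, `n_habitable_rooms`, `total_floor_area`) is hypothetical and may not match the names used in the repository, and the floor area bounds are assumed to be inclusive.

```python
import pandas as pd


def filter_to_model_scope(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only rows that match the modelling scope described above.

    All column names are hypothetical placeholders.
    """
    in_scope = (
        df["property_type"].isin(["House", "Bungalow"])  # flats excluded
        & ~df["is_new_build"]  # retrofits only
        & ~df["is_cluster"]  # no cluster installations
        & (df["n_habitable_rooms"] >= 2)
        & df["total_floor_area"].between(20, 500)  # m², bounds assumed inclusive
    )
    return df[in_scope].reset_index(drop=True)


# Made-up example rows, not real installation records.
example = pd.DataFrame(
    {
        "property_type": ["House", "Flat", "House", "Bungalow", "Bungalow"],
        "is_new_build": [False, False, False, True, False],
        "is_cluster": [False, False, False, False, False],
        "n_habitable_rooms": [5, 3, 6, 4, 3],
        "total_floor_area": [100.0, 55.0, 600.0, 80.0, 60.0],
    }
)
in_scope_example = filter_to_model_scope(example)
print(in_scope_example)
```

Only the first house and the last bungalow remain: the flat, the 600 m² house and the new-build bungalow fall outside the scope.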
Gradient boosting regressors with a quantile loss are fitted to create prediction intervals for the cost of an air source heat pump: models fitted at the 10th and 90th percentiles give the lower and upper bounds of an 80% prediction interval.
The target variable is the overall cost of installation and the predictors include:
- Total floor area
- Number of habitable rooms (2 to 8+)
- Number of days between 2007 and HP installation (as a measure of time)
- Property built form: detached, semi detached, mid terrace and end terrace
- Property type: bungalow and house
- Construction age band: pre-1929, 1930-1965, 1966-1982, 1983-2006 and 2007 onwards
- Region: Scotland, Wales, London, East Midlands, West Midlands, East of England, South East, South West, North West, North East and Yorkshire and the Humber.
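As a rough illustration of the approach, the sketch below fits two scikit-learn `GradientBoostingRegressor` models with a quantile loss, one at the 10th and one at the 90th percentile. It assumes scikit-learn as the library, uses synthetic data with a single predictor (floor area) and leaves hyperparameters at their defaults; the real pipeline uses the full predictor list above and tuned hyperparameters from the config.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in data (not MCS data): floor area in m² and a noisy cost in £.
floor_area = rng.uniform(20, 500, size=500)
X = floor_area.reshape(-1, 1)
y = 8000 + 20 * floor_area + rng.normal(0, 1500, size=500)

# One model per quantile: the 10th and 90th percentiles bound an 80% interval.
models = {
    q: GradientBoostingRegressor(loss="quantile", alpha=q, random_state=0).fit(X, y)
    for q in (0.1, 0.9)
}

new_property = np.array([[120.0]])  # a hypothetical 120 m² property
lower = models[0.1].predict(new_property)[0]
upper = models[0.9].predict(new_property)[0]
print(f"80% prediction interval: £{lower:,.0f} to £{upper:,.0f}")
```

Because the two quantile models are fitted independently, nothing forces the lower bound to stay below the upper bound for every input, so it is worth checking for crossings on real data.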
The latest model in use by the cost estimator tool was trained on data up to Q1 2025 (March 2025).
The MCS installations dataset is a subset of the MCS Installations Database (MID), and contains one record for each MCS certificate associated with a heat pump installation. The dataset contains records of both domestic and non-domestic air source, water/ground source and other types of heat pump installations. Features in the dataset include:
- information about the property: address, heat and water demand
- characteristics of the heat pump installed: type, model, manufacturer, capacity, flow temperature, SCOP
- information about the installation: commissioning date, overall cost of installation
The overall installation cost is the full cost of installation including materials and labour, not just the cost of the heat pump unit. Note that this is the cost before deducting government grants such as the Boiler Upgrade Scheme (BUS) grant or Home Energy Scotland (HES) grant.
MID data is used with permission from MCS and subject to the conditions of a data sharing agreement.
Property data comes from the EPC registers for England and Wales and for Scotland. The EPC registers provide data on building characteristics and energy efficiency measures, including:
- Property address and other location information
- Property characteristics such as number of rooms, property type and built form
- Heating system(s) installed
- Energy efficiency ratings
The EPC Register datasets are open-source and accessible to everyone.
The following location lookups are used:
- Postcode to OA (2021) to LSOA to MSOA to LAD (November 2024) Best Fit Lookup in the UK
- Local Authority District to Region (December 2024) Lookup in EN
- Local Authority District to Region (April 2021) Lookup in EN
The "CPI INDEX 05.3 : Household appliances, fitting and repairs 2015=100" series was sourced from the ONS inflation and price indices data.
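This README does not spell out how the index is applied. One common approach, shown here purely as an assumption, is to rescale each cost to the prices of a reference year by the ratio of index values. The function name and the index values below are illustrative only (apart from 2015=100, which is the base of the series); they are not real ONS figures.

```python
# Illustrative index values only; not real ONS figures (2015=100 is the series base).
cpi_by_year = {2015: 100.0, 2020: 110.0, 2024: 125.0}


def adjust_cost_to_reference_year(cost, install_year, reference_year, cpi):
    """Express a cost in reference-year prices using a price index.

    Hypothetical helper: adjusted = cost * CPI(reference) / CPI(installation).
    """
    return cost * cpi[reference_year] / cpi[install_year]


adjusted = adjust_cost_to_reference_year(
    10_000, install_year=2020, reference_year=2024, cpi=cpi_by_year
)
print(f"£{adjusted:,.2f}")
```

With these made-up values, a £10,000 installation from 2020 is scaled by 125/110 to express it in 2024 prices.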
The underlying dataset used to model the cost of an air source heat pump installation is the MCS installations dataset enhanced with EPC information about properties. MCS and EPC datasets are cleaned and preprocessed before being joined. Installations without EPC property information are removed from the analysis. The code for preprocessing and joining MCS to EPC is available in the `asf_core_data` GitHub repository.
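The effect of the join can be pictured with a small pandas sketch. This is not the `asf_core_data` implementation: the join key `property_id` and the other column names are hypothetical, and the real matching between MCS and EPC records may work differently.

```python
import pandas as pd

# Made-up records; "property_id" is a hypothetical join key.
mcs = pd.DataFrame(
    {"property_id": [1, 2, 3], "cost": [11000.0, 9500.0, 14000.0]}
)
epc = pd.DataFrame(
    {
        "property_id": [1, 3],
        "total_floor_area": [95.0, 140.0],
        "built_form": ["Semi-Detached", "Detached"],
    }
)

# An inner join keeps only installations that have EPC property information.
mcs_epc = mcs.merge(epc, on="property_id", how="inner")
n_dropped = len(mcs) - len(mcs_epc)
print(mcs_epc)
print(f"Installations removed for lack of EPC data: {n_dropped}")
```

Installation 2 has no EPC record, so it is dropped from the modelling dataset.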
The repository structure and key scripts are highlighted below:
asf_hp_cost_estimator_model
├───config/
│ Configuration scripts
│ ├─ base.yaml
├───getters/
│ Scripts with functions to load data from S3
│ ├─ data_getters.py
├───pipeline/
│ Subdirs with scripts to process data and produce outputs
│ ├─ data_processing/ - further data processing prior to modelling
│ │ ├─ process_installations_data.py
│ ├─ model_training/ - model training scripts
│ │ ├─ fit_cost_prediction_intervals.py
│ ├─ model_evaluation/ - scripts for model evaluation
│ │ ├─ cross_validation.py
│ ├─ hyperparameter_tuning/ - scripts for hyperparameter tuning
│ │ ├─ tune_hyperparameters.py
│ ├─ README.md - instructions to run the different pipelines
├───utils/
│ Utils for plotting and evaluation
│ ├─ plotting_utils.py
│ ├─ model_evaluation_utils.py
├───notebooks/
│ Notebooks for data and model exploration
These are instructions for data scientists at Nesta.
When data for a new quarter becomes available, follow the steps below to retrain the cost models (after the data has been processed with `asf_core_data`).
- Open an issue in this GitHub repository, such as "Retrain model with QX 202Y data"
- Update `asf_hp_cost_estimator_model/config/base.yaml`:
  - `cpi_reference_year`: update the CPI reference year accordingly
  - Location data sources: review and update location sources as required
  - `mcs_epc_filename_date`: update with the newest date of MCS-EPC data processing
- Re-run the hyperparameter tuning pipeline:
  - Run `python asf_hp_cost_estimator_model/pipeline/hyperparameter_tuning/tune_hyperparameters.py`
  - Take note of the hyperparameters logged
- Update `asf_hp_cost_estimator_model/config/base.yaml` after tuning hyperparameters:
  - Change `hyper_parameters` according to the hyperparameters logged in the previous step
- Re-run the cross-validation pipeline:
  - Run `python asf_hp_cost_estimator_model/pipeline/model_evaluation/cross_validation.py`
  - Assess the results logged
- Retrain models:
  - Run `python asf_hp_cost_estimator_model/pipeline/model_training/fit_cost_prediction_intervals.py`
  - Models are saved to S3
- Update sections "🆕 Latest data" and "🧩 Data sources" of this `README.md` to reflect changes.
- Let the tech/design team know that the model has been updated, so that they can restart the API.
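When assessing cross-validation results for an 80% prediction interval, one quantity that may be worth checking is empirical coverage: the share of held-out costs that fall between the predicted lower and upper bounds. The helper below is a hypothetical sketch with made-up numbers, and is not necessarily one of the metrics logged by `cross_validation.py`.

```python
def interval_coverage(actual, lower, upper):
    """Fraction of actual values that fall within [lower, upper]."""
    inside = [lo <= a <= hi for a, lo, hi in zip(actual, lower, upper)]
    return sum(inside) / len(inside)


# Made-up held-out costs and predicted bounds (£), for illustration only.
actual = [9000, 12000, 15000, 11000, 20000]
lower = [8000, 10000, 12000, 11500, 14000]
upper = [11000, 13000, 16000, 14000, 19000]

coverage = interval_coverage(actual, lower, upper)
print(f"Empirical coverage: {coverage:.0%}")
```

For a well-calibrated 80% interval, coverage on held-out data should sit close to 0.8; values well below that suggest the intervals are too narrow.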
- Meet the data science cookiecutter requirements, in brief:
  - Install: `direnv` and `conda`
- Run `make install` to configure the development environment:
  - Set up the conda environment
  - Configure `pre-commit`
Technical and working style guidelines
Project based on Nesta's data science project template (Read the docs here).