
Workflows for the Invasive Alien Species Digital Twin (IASDT), as part of the Horizon Europe project titled Biodiversity Digital Twin.
This is a collection of PyDoit workflows for data processing, data assimilation, state management, metadata management, data and HPC servicing, and job orchestration in the IASDT.
Caution
This repository is under active development and is not yet ready for public use. Please contact the author for more information.
Tip
Branches in this repository are named after the HPC systems on which the workflows are executed. For example, the lumi branch contains the workflows for the LUMI supercomputer and serves as the main branch.
- Table of Contents
- Overview
- Folder Descriptions
- Usage
- Create Documentation
- Logging
- Environment Variables
- Model and Data processing
- Data Storage and Availability
- Dashboard
- Metadata and RO-Crates
- Containerization
- License
- Cite as
- Author
- Contributors
The Invasive Alien Species Digital Twin (IASDT) is a digital twin that uses dynamic data-driven workflows for joint species distribution modelling of invasive alien species (IAS) in continental Europe with Hierarchical Modelling of Species Communities (HMSC) models. The IASDT uses biotic and abiotic datasets to estimate the current and forecast the future distribution of IAS in Europe under various climate scenarios. The IASDT is part of the Biodiversity Digital Twin (BioDT) project, which is funded by the European Union.
The workflows are written in Python and R and are automated using the PyDoit build automation tool. They are meant to be executed on HPC systems, mainly LUMI, and are designed to be modular and scalable, following the TwinEco framework. The workflows are designed to be used with the OPeNDAP Cloud Server to serve data to third-party applications.
A detailed overview can be found on the project wiki: https://wiki.eduuni.fi/x/Yg2cEw
The IASDT follows the TwinEco framework for building DTs in ecology. The pre-print:
Khan, T., de Koning, K., Endresen, D., Chala, D. and Kusch, E., 2024. TwinEco: A Unified Framework for Dynamic Data-Driven Digital Twins in Ecology. bioRxiv, pp.2024-07.
Figure 1: An overview of the Invasive Alien Species Digital Twin (IASDT) components. 1) Dynamic Data-Driven Application Systems (DDDAS) based workflows listen for changes in data sources (1.a. feedback loops), pull and process required data (1.b. data processing), merge and reconcile new and old data (1.c. data assimilation), version datasets and add metadata (1.d. state + FAIR metadata management), and transfer updated datasets (data + log files) to a data server (1.e. data servicing). 2) OPeNDAP Cloud Server services the datasets from the previous component and provides an interface to all IASDT data (input, output, metadata, and log files). The server also serves as an interface for third-party applications to access information contained in the IASDT. 3) IAS Joint Species Distribution Model is the modelling block of IASDT that uses input data to estimate gridded IAS numbers per habitat type. 4) IASDT dashboard presents aggregated results of IASDT in a simplified and intuitive manner to BioDT users and stakeholders and serves as a communication tool.
Figure 2: The study area is defined by the EEA Reference Grid and is divided into 10x10 km grid cells in the ETRS89-LAEA projection (EPSG:3035).
| Data | Dataset | Spatial resolution | Temporal resolution | Details | Source |
|---|---|---|---|---|---|
| Reference grid | EEA reference grid | 10 km | --- | The European Environment Agency's (EEA) reference grid at 10 km resolution in the Lambert Azimuthal Equal Area projection (EPSG:3035). All data listed below were processed onto this reference grid. | https://www.eea.europa.eu/en/datahub/datahubitem-view/3c362237-daa4-45e2-8c16-aaadfb1a003b |
| Species observations | Global Biodiversity Information Facility (GBIF) | points | > 1981 | The most up-to-date version of occurrence data is dynamically downloaded from GBIF using the rgbif R package (Chamberlain et al. 2023) (> 8 million occurrences, March 2024; Figure 2a). Doubtful occurrences and occurrences with high spatial uncertainty are excluded. | https://www.gbif.org/ |
| Species observations | European Alien Species Information Network (EASIN) | points | > 1981 | EASIN provides spatial data on 14,000 alien species. Species occurrences were downloaded using EASIN's API. Thirty-four partners shared their data with EASIN (including GBIF). Only non-GBIF data from EASIN were considered in the models (> 692 K observations for 483 IAS; March 2024; Figure 2b). | European Commission - Joint Research Centre - European Alien Species Information Network (EASIN): https://easin.jrc.ec.europa.eu/ |
| Species observations | Integrated European Long-Term Ecosystem, Critical Zone and socio-ecological Research (eLTER) | points | > 1981 | eLTER is a network of sites collecting ecological data for long-term research within the EU. Vegetation data from 137 eLTER sites were processed and homogenised. The final eLTER dataset comprises 5,265 observations from 46 sites, representing 110 IAS (Figure 2c). | https://elter-ri.eu/ |
| Habitat information | Corine Land Cover (CLC) | 100 m | 2017-2018 | The CLC dataset is a pan-European land-cover and land-use inventory with 44 thematic classes, ranging from broad forested areas to individual vineyards. We are currently using V2020_20u1 of the CLC data, but the data workflow is flexible enough to use future versions of CLC data. | --- |
| Climate data | Climatologies at high resolution for the Earth's land surface areas (CHELSA) | 30 arc seconds; ~ 1 km | 1981-2010; 2011-2040; 2041-2070; 2071-2100 | CHELSA provides global high-resolution data on various environmental variables, both for current conditions and for different future climate scenarios. Six ecologically meaningful and weakly correlated bioclimatic variables are used in the models: temperature seasonality (bio4); mean daily minimum air temperature of the coldest month (bio6); mean daily mean air temperature of the wettest quarter (bio8); annual precipitation amount (bio12); precipitation seasonality (bio15); mean monthly precipitation amount of the warmest quarter (bio18). In addition to current climate conditions, there are nine options from multiple CMIP6 climate models (3 shared socioeconomic pathways [ssp126, ssp370, ssp585] × 3 time slots [2011-2040, 2041-2070, 2071-2100]). | --- |
| Road intensity | Global Roads Inventory Project (GRIP) | lines | most recent | The total length of roads per grid cell was computed from the most recent version of the GRIP global roads database. | Meijer et al. (2018); https://www.globio.info/download-grip-dataset |
| Railway intensity | OpenRailwayMap | lines | most recent | The total length of railways per grid cell was computed from the most recent version of OpenRailwayMap. | https://www.openrailwaymap.org/ |
| Sampling bias | GBIF vascular plant observations | points | > 1981 | The total number of vascular plant observations per grid cell in the GBIF database was computed (> 230 million occurrences, March 2024). | https://www.gbif.org/ |
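As a small illustrative sketch of how source data end up on the reference grid (pyproj is an assumption here, not a stated project dependency), WGS84 coordinates can be projected onto the ETRS89-LAEA system (EPSG:3035) and snapped to a 10 km cell origin like this:

```python
from pyproj import Transformer

# Transform longitude/latitude (EPSG:4326) to ETRS89-LAEA (EPSG:3035)
to_laea = Transformer.from_crs("EPSG:4326", "EPSG:3035", always_xy=True)

lon, lat = 13.4, 52.5  # arbitrary example location
x, y = to_laea.transform(lon, lat)

cell_size = 10_000  # 10 km EEA reference grid cells
cell_x = (x // cell_size) * cell_size
cell_y = (y // cell_size) * cell_size
print(f"EPSG:3035: ({x:.0f}, {y:.0f}); grid cell origin: ({cell_x:.0f}, {cell_y:.0f})")
```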
- `assets/` --> static assets (images, videos, etc.)
- `datasets/` --> datasets divided into `raw`, `interim`, and `processed` sub-folders
- `docs/` --> software documentation
- `logs/` --> logs for workflow runs
- `notebooks/` --> Jupyter notebooks as a playground and testing environment
- `references/` --> reference files
- `workflows/` --> PyDoit workflows
  - `feedbackloop` --> feedback loop tasks for "listening" to data changes and downloading datasets
  - `process` --> data processing tasks
  - `state` --> state management tasks
  - `service` --> downstream data servicing and HPC management tasks
- `dodo.py` --> main entry point for the PyDoit workflows (a minimal task sketch follows this list)
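For orientation, here is a minimal, illustrative sketch of how tasks in `dodo.py` can be defined with PyDoit; the task names, scripts, and file paths below are assumptions, not the actual IASDT tasks.

```python
# dodo.py -- minimal PyDoit task sketch (illustrative; names and paths are assumptions)

def task_download_chelsa():
    """Feedback-loop task: fetch CHELSA climate layers when the upstream source changes."""
    return {
        "actions": ["python workflows/feedbackloop/download_chelsa.py"],
        "targets": ["datasets/raw/chelsa/.downloaded"],
        "verbosity": 2,
    }


def task_process_chelsa():
    """Data-processing task: aggregate CHELSA layers onto the 10 km EEA reference grid."""
    return {
        "actions": ["python workflows/process/process_chelsa.py"],
        "file_dep": ["datasets/raw/chelsa/.downloaded"],
        "targets": ["datasets/processed/chelsa_bioclim.nc"],
    }
```

With this layout, `doit process_chelsa` re-runs the processing task only when its file dependencies have changed.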
- Clone the repository to your local or cloud development environment.
- Create and configure the `.env` file with the necessary credentials and settings.
- Install all dependencies from the `requirements.txt` and `renv.lock` files.
- Use the workflow directory as the current working directory.
- List the available tasks from the CLI:
  `doit list`
- Run all tasks and actions with the `doit` command, or run an individual task with `doit <task-name>`.
- Parallel task execution can be enabled by running `doit -n 4` (`-n` defines the number of cores attached to the PyDoit runtime).

Alternatively, simply run the following command from the repository root to run all tasks:
`bash entry.sh`
Run the following code to create Sphinx documentation.
cd docs
make html
Note: LUMI timestamps are in Finnish local time.
IASDT workflows use Unix-style logging with the following logging levels:
- Warning: Warning logs
- Info: Informational logs
- Debug: Debugging logs
- Error: Errors
- Critical: Critical errors
Logging is mostly done using the `logging` module in Python. However, some tasks use R scripts that are submitted to the HPC Slurm queue; in such cases, the logs are stored in `.out` and `.err` files in the `logs/` directory.
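As a minimal sketch (the handler configuration and file names here are assumptions, not the repository's exact setup), a task can configure the Python `logging` module to write into the `logs/` directory like this:

```python
import logging
from pathlib import Path

# Write workflow logs into the logs/ directory (the file name is an assumed example)
log_dir = Path("logs")
log_dir.mkdir(exist_ok=True)

logging.basicConfig(
    filename=log_dir / "workflow.log",
    level=logging.DEBUG,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)

logger = logging.getLogger("iasdt.process")
logger.info("Starting CHELSA processing task")
logger.warning("Previous interim file found; it will be overwritten")
logger.error("Download failed; see the corresponding .err file for the Slurm job")
```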
The IASDT workflows use environment variables to pass parameters to tasks. This convention is defined in the `references/parameter_naming_conventions.txt` file. The naming convention for the environment variables is as follows:
Workflow layers
- FL=Feedback loop
- DP=Data Processing
- DA=Data Assimilation
- SM=State Management
- MM=Metadata Management
- DS=Data Servicing
- MC=Model Communication
Programming languages and tools
- R=R Lang
- Py=Python Lang
- Do=Docker
- PyDo=PyDoit
Convention
`<layer>_<programming tools>_<data source>_<parameter name>=<parameter value>`
Example
`DP_R_CHELSA_Gridsize=10`
All the required environment variables can be found in the `references/env-var-list.csv` file.
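A minimal sketch of how a task might read such a parameter, assuming the `.env` file is loaded with python-dotenv (the default value here is illustrative):

```python
import os
from dotenv import load_dotenv  # python-dotenv; assumed to be listed in requirements.txt

load_dotenv()  # read variables from the .env file into the environment

# DP_R_CHELSA_Gridsize: Data Processing layer, R tooling, CHELSA data source
grid_size = int(os.environ.get("DP_R_CHELSA_Gridsize", "10"))
print(f"CHELSA grid size: {grid_size} km")
```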
The model and data processing code is developed separately in an R package called `IASDT.R`. The package can be found in the IASDT.R GitHub repository.
IASDT.R package: https://github.com/BioDT/IASDT.R
The IASDT uses a custom-built and hosted Open-source Project for a Network Data Access Protocol (OPeNDAP) Catalog to serve data to any application. The OPeNDAP Catalog is hosted on a virtual machine (VM) and serves data to third-party applications, such as the IASDT dashboard, while providing an interface for users to access input/output data stored in the IASDT.
The OPeNDAP Catalog clones defined datasets from the HPC onto the VM using Docker and serves them using the Data Access Protocol (DAP), a data model for accessing remote scientific datasets. The advantage here is that DAP allows users to query subsets of the data files, while automatically giving variable-level access (see example) and automatically assigning metadata to the contents of each file (see example).
- Opendap Catalog Link: http://opendap.biodt.eu/
- Opendap Catalog documentation: https://khant.pages.ufz.de/opendap/chapters/concept/opendap.html
- Opendap Catalog code: https://git.ufz.de/khant/opendap
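As an illustrative sketch of DAP subsetting (the dataset path and variable name below are assumptions, not actual IASDT endpoints), a client such as xarray can request only a slice of a served file:

```python
import xarray as xr

# Hypothetical dataset URL on the OPeNDAP Catalog
url = "http://opendap.biodt.eu/example/iasdt_output.nc"

# Opening the URL fetches only the metadata (variables, dimensions, attributes)
ds = xr.open_dataset(url)

# DAP transfers only the requested subset, not the whole file
subset = ds["sdm_mean"].isel(x=slice(0, 100), y=slice(0, 100))  # first 100x100 cells
subset.load()  # only this slice is pulled over the network
```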
The IASDT dashboard is created using R Shiny and is linked to the DT's OPeNDAP server. The dashboard will be used to present the results of the IASDT to users and stakeholders in a simplified and intuitive manner.
- Dashboard code: https://github.com/allantsouza/IAS-pDT-BioDT-web-app
- Dashboard link: https://allantsouza.shinyapps.io/IAS-pDT-BioDT-web-app/
Sample screenshot:
The IASDT uses the Research Object Crate (RO-Crate) metadata standard to describe its data and workflows. RO-Crate is a community-driven specification for packaging research data with associated metadata. It is both machine-readable and human-readable and is designed to be used with a wide range of research data types, including datasets, software, and workflows.
We will use the PyDidIt software (developed in-house) for generating workflow crates and the RO-Crate Python library for generating RO-Crate metadata for the data. The RO-Crate metadata will be stored in the same directory as the data, and it will be used to describe the data and the workflows that generated the data.
- RO-Crate documentation: https://www.researchobject.org/ro-crate/
- RO-Crate Python library: https://pypi.org/project/rocrate/
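A minimal sketch of describing a processed dataset with the RO-Crate Python library (the file path and properties are assumed examples, not the actual IASDT metadata):

```python
from rocrate.rocrate import ROCrate

crate = ROCrate()
crate.add_file(
    "datasets/processed/chelsa_bioclim.nc",  # assumed example file
    properties={
        "name": "CHELSA bioclimatic variables on the 10 km EEA reference grid",
        "encodingFormat": "application/x-netcdf",
    },
)
crate.write("datasets/processed/ro-crate")  # writes the crate, including ro-crate-metadata.json, to this directory
```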
Parts of the IASDT (specifically the modelling) are containerized using Singularity. The containers package the IASDT modelling code and its dependencies and are used to run the modelling code on the HPC system, ensuring that the code runs in a consistent environment across different systems.
- Singularity documentation: https://sylabs.io/guides/3.7/user-guide/
- Container template (HMSC-HPC): https://github.com/BioDT/hmsc-container
- Container template (R): https://git.ufz.de/biodt/iasdt-modelling-containers
@misc{biodt_iasdt_2025,
author = {Khan, Taimur},
title = {BioDT: Invasive Alien Species Digital Twin (IASDT) Workflows},
year = {2025},
month = {02},
note = {Biodiversity Digital Twin (BioDT), Funded by the European Union},
doi = {10.5281/zenodo.14756907},
}
- Taimur Khan, Helmholtz Centre for Environmental Research - UFZ | Workflows, Architecture, HPC, Data Processing, Containerization, OPeNDAP server
- Ahmed El-Gabbas, Helmholtz Centre for Environmental Research - UFZ | IASDT.R, Modelling, Data Processing
- Dylan Kierans, German Climate Computing Center (DKRZ) | Containerization
- Julian Lopez Gordillo, Naturalis Biodiversity Center | Metadata and RO-Crates
- Oliver Wooland, University of Manchester | RO-Crates (PyDidIt)
- Allan Souza, Institute for Atmospheric and Earth System Research (INAR) | Dashboard
- Tuomas Rossi, CSC – IT Center for Science Ltd. | HPC, Containerization
- Jakub Konvicka, IT4Innovations | HPC (LEXIS)