
BioDT/uc-ias-workflows



Workflows for the Invasive Alien Species Digital Twin (IASDT), part of the Horizon Europe project titled Biodiversity Digital Twin.

This is a collection of PyDoit workflows for data processing, data assimilation, state management, metadata management, data and HPC servicing, and job orchestration in the IASDT.

Overview Paper: DOI:10.1101/2024.07.23.604592

Code DOI: 10.5281/zenodo.14756907

Caution

This repository is under active development and is not yet ready for public use. Please contact the author for more information.

Tip

Branches in this repository are named after the HPC systems on which the workflows are executed. For example, the lumi branch contains the workflows for the LUMI supercomputer and serves as the main branch.


Overview

The Invasive Alien Species Digital Twin (IASDT) is a digital twin that uses dynamic data-driven workflows for joint species distribution modelling of invasive alien species (IAS) in continental Europe with Hierarchical Modelling of Species Communities (HMSC) models. The IASDT combines biotic and abiotic datasets to estimate the current distribution of IAS in Europe and to forecast their future distribution under various climate scenarios. The IASDT is part of the Biodiversity Digital Twin (BioDT) project, funded by the European Union.

The workflows are written in Python and R and are automated using the PyDoit build automation tool. They are meant to be executed on HPC systems, mainly LUMI, and are designed to be modular and scalable, following the TwinEco framework. The workflows serve data to third-party applications through the OPeNDAP Cloud Server.
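To make the build-automation layer concrete, here is a minimal sketch of a PyDoit task; the task name, script, and file paths are illustrative placeholders rather than actual tasks from this repository:

# dodo.py -- minimal PyDoit sketch; names and paths are illustrative.
# PyDoit discovers functions named task_* and re-runs their actions
# only when a file in file_dep is newer than the declared targets.

def task_process_chelsa():
    """Hypothetical task: process a raw CHELSA download onto the reference grid."""
    return {
        "actions": ["python scripts/process_chelsa.py"],   # placeholder script
        "file_dep": ["datasets/raw/chelsa_bio4.tif"],      # placeholder input
        "targets": ["datasets/processed/chelsa_bio4.nc"],  # placeholder output
        "verbosity": 2,
    }

PyDoit schedules such tasks from their declared file dependencies, so only out-of-date parts of the pipeline are re-executed.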

A detailed overview can be found on the project wiki: https://wiki.eduuni.fi/x/Yg2cEw

Architectural overview

The IASDT follows the TwinEco framework for building digital twins (DTs) in ecology. The pre-print:

Khan, T., de Koning, K., Endresen, D., Chala, D. and Kusch, E., 2024. TwinEco: A Unified Framework for Dynamic Data-Driven Digital Twins in Ecology. bioRxiv, pp.2024-07. DOI:10.1101/2024.07.23.604592



Figure 1: An overview of the Invasive Alien Species Digital Twin (IASDT) components. 1) Dynamic Data-Driven Application Systems (DDDAS) based workflows listen for changes in data sources (1.a. feedback loops), pull and process required data (1.b. data processing), merge and reconcile new and old data (1.c. data assimilation), version datasets and add metadata (1.d. state + FAIR metadata management), and transfer updated datasets (data + log files) to a data server (1.e. data servicing). 2) OPeNDAP Cloud Server services the datasets from the previous component and provides an interface to all IASDT data (input, output, metadata, and log files). The server also serves as an interface for third-party applications to access information contained in the IASDT. 3) IAS Joint Species Distribution Model is the modelling block of IASDT that uses input data to estimate gridded IAS numbers per habitat type. 4) IASDT dashboard presents aggregated results of IASDT in a simplified and intuitive manner to BioDT users and stakeholders and serves as a communication tool.

Study area & geospatial projection


Figure 2: The study area is defined by the extent of the EEA Reference Grid and is divided into 10x10 km grid cells, projected in the ETRS89-LAEA projection (EPSG:3035).
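To illustrate this projection, the following minimal sketch uses the pyproj Python library (an assumption for illustration; not necessarily a dependency of these workflows) to convert a WGS84 longitude/latitude pair into ETRS89-LAEA coordinates:

from pyproj import Transformer

# Build a transformer from WGS84 (EPSG:4326) to ETRS89-LAEA (EPSG:3035).
# always_xy=True makes the input order (longitude, latitude).
transformer = Transformer.from_crs("EPSG:4326", "EPSG:3035", always_xy=True)

# Example point: roughly central Europe (illustrative coordinates).
x, y = transformer.transform(10.0, 50.0)
print(f"EPSG:3035 easting={x:.0f} m, northing={y:.0f} m")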

Datasets

Each dataset is listed with its spatial and temporal resolution (in parentheses), followed by details and source.

  • Reference grid (10 km): The European Environment Agency's (EEA) reference grid at 10 km resolution in the Lambert Azimuthal Equal Area projection (EPSG:3035). All data listed below were processed onto this reference grid. Source: https://www.eea.europa.eu/en/datahub/datahubitem-view/3c362237-daa4-45e2-8c16-aaadfb1a003b
  • Species observations (points; > 1981):
    • Global Biodiversity Information Facility (GBIF): The most up-to-date version of the occurrence data is dynamically downloaded from GBIF using the rgbif R package (Chamberlain et al. 2023) (> 8 million occurrences, March 2024; Figure 2a). Doubtful occurrences and occurrences with high spatial uncertainty are excluded. Source: https://www.gbif.org/
    • European Alien Species Information Network (EASIN): EASIN provides spatial data on 14,000 alien species. Species occurrences were downloaded using EASIN's API. Thirty-four partners shared their data with EASIN (including GBIF); only non-GBIF data from EASIN were considered in the models (> 692 K observations for 483 IAS; March 2024; Figure 2b). Source: European Commission - Joint Research Centre - European Alien Species Information Network (EASIN), https://easin.jrc.ec.europa.eu/
    • Integrated European Long-Term Ecosystem, Critical Zone and socio-ecological Research (eLTER): eLTER is a network of sites collecting ecological data for long-term research within the EU. Vegetation data from 137 eLTER sites were processed and homogenised. The final eLTER dataset comprises 5,265 observations from 46 sites, representing 110 IAS (Figure 2c). Source: https://elter-ri.eu/
  • Habitat information (100 m; 2017-2018):
    • Corine Land Cover (CLC): The CLC dataset is a pan-European land-cover and land-use inventory with 44 thematic classes, ranging from broad forested areas to individual vineyards. We currently use version V2020_20u1 of the CLC data, but the data workflow is flexible enough to use future CLC versions.
  • Climate data (30 arc seconds, ~ 1 km; 1981-2010, 2011-2040, 2041-2070, 2071-2100):
    • Climatologies at high resolution for the Earth's land surface areas (CHELSA): CHELSA provides global high-resolution data on various environmental variables, both for current conditions and for different future climate scenarios. Six ecologically meaningful and weakly correlated bioclimatic variables are used in the models:
      • temperature seasonality (bio4)
      • mean daily minimum air temperature of the coldest month (bio6)
      • mean daily mean air temperature of the wettest quarter (bio8)
      • annual precipitation amount (bio12)
      • precipitation seasonality (bio15)
      • mean monthly precipitation amount of the warmest quarter (bio18)
    In addition to current climate conditions, there are nine future options from multiple CMIP6 climate models (3 shared socioeconomic pathways [ssp126, ssp370, ssp585] × 3 time slices [2011-2040, 2041-2070, 2071-2100]).
  • Road intensity (lines; most recent): The total length of roads per grid cell was computed from the most recent version of the GRIP (Global Roads Inventory Project) global roads database (Meijer et al. 2018). Source: https://www.globio.info/download-grip-dataset
  • Railway intensity (lines; most recent): The total length of railways per grid cell was computed from the most recent version of OpenRailwayMap. Source: https://www.openrailwaymap.org/
  • Sampling bias (points; > 1981): The total number of vascular plant observations per grid cell in the GBIF database was computed (> 230 million occurrences, March 2024). Source: https://www.gbif.org/

Folder Descriptions

  • assets/ --> static assets (images, videos, etc.)
  • datasets/ --> datasets divided into raw, interim, and processed sub-folders
  • docs/ --> software documentation
  • logs/ --> logs for workflow runs
  • notebooks/ --> Jupyter notebooks as a playground and testing environment
  • references/ --> reference files
  • workflows/ --> PyDoit workflows
    • feedbackloop --> feedback loop tasks for "listening" to data changes and downloading datasets
    • process --> data processing tasks
    • state --> state management tasks
    • service --> downstream data servicing and HPC management tasks
    • dodo.py --> Main entry point for PyDoit workflows

Usage

  • Clone the repository to your local or cloud development environment.
  • Create and configure the .env file with the necessary credentials and settings.
  • Install all dependencies from requirements.txt and renv.lock files.
  • Use the workflows/ directory as the current working directory.
  • Run the following command in the CLI to list the available tasks: doit list
  • Run all tasks and actions with the doit command, or run an individual task with doit <task-name> in a shell.
  • Parallel task execution can be enabled by running doit -n 4 (the -n flag sets the number of parallel processes attached to the PyDoit runtime).

Alternatively, simply run the following command from the root folder to run all tasks:

bash entry.sh

Create Documentation

Run the following code to create Sphinx documentation.

cd docs
make html

Logging

Note: LUMI timestamps are in Finnish time (EET/EEST).

IASDT workflows use Unix-style logging with the following levels:

  • Debug: detailed diagnostic information
  • Info: informational messages about normal operation
  • Warning: potential problems that do not stop execution
  • Error: errors that caused a task or action to fail
  • Critical: severe errors that may halt the workflow

Logging is mostly done using the logging module in Python. However, some tasks run R scripts that are submitted to the HPC Slurm queue; in such cases, the logs are stored in .out and .err files in the logs directory.
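As a minimal sketch (an illustrative configuration, not the repository's exact settings), a Python task might set up the logging module like this:

import logging

# Write workflow logs to the logs/ directory with timestamped entries.
# File name and format are illustrative placeholders.
logging.basicConfig(
    filename="logs/workflow.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)

logger = logging.getLogger("iasdt.process")
logger.info("Started CHELSA processing")            # Info: normal operation
logger.warning("Cache is stale, re-downloading")    # Warning: potential problem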

Environment Variables

Workflow parameter naming convention

The IASDT workflows use environment variables to pass parameters to the workflows. This convention is defined in the references/parameter_naming_conventions.txt file. The naming convention for the environment variables is as follows:

Workflow layers

  • FL=Feedback loop
  • DP=Data Processing
  • DA=Data Assimilation
  • SM=State Management
  • MM=Metadata Management
  • DS=Data Servicing
  • MC=Model Communication

Programming languages and tools

  • R=R Lang
  • Py=Python Lang
  • Do=Docker
  • PyDo=PyDoit

Convention

<layer>_<programming tools>_<data source>_<parameter name>=<parameter value>

Example

DP_R_CHELSA_Gridsize=10

All the required environment variables can be found in the references/env-var-list.csv file.
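As an illustration of the convention (a minimal sketch; DP_R_CHELSA_Gridsize comes from the example above, and the helper function is hypothetical, not part of the workflows):

import os

def read_param(layer: str, tool: str, source: str, name: str) -> str:
    """Hypothetical helper: look up a parameter following the
    <layer>_<programming tools>_<data source>_<parameter name> convention."""
    key = f"{layer}_{tool}_{source}_{name}"
    value = os.environ.get(key)
    if value is None:
        raise KeyError(f"Missing environment variable: {key}")
    return value

# Example from the convention above: DP_R_CHELSA_Gridsize=10
os.environ["DP_R_CHELSA_Gridsize"] = "10"   # normally set via the .env file
grid_size = int(read_param("DP", "R", "CHELSA", "Gridsize"))
print(grid_size)  # -> 10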

Model and Data processing

The model and data processing code is developed separately in an R package called IASDT.R, which can be found in the IASDT.R GitHub repository.

IASDT.R package: https://github.com/BioDT/IASDT.R

Data Storage and Availability

The IASDT uses a custom-built and self-hosted Open-source Project for a Network Data Access Protocol (OPeNDAP) Catalog to serve data to any application. The OPeNDAP Catalog is hosted on a virtual machine (VM) and serves data to third-party applications, such as the IASDT dashboard, while providing an interface for users to access input/output data stored in the IASDT.

The OPeNDAP Catalog clones a defined subset of data from the HPC into the VM using Docker and serves it via the Data Access Protocol (DAP), a data model for accessing remote scientific datasets. A key advantage is that DAP lets users query subsets of the data files, while automatically providing variable-level access (see example) and automatically assigning metadata to the contents of each file (see example).
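For instance, a client can open a dataset over DAP and read only a slice of it; the sketch below uses the xarray Python library against a hypothetical catalog URL (the server address, file name, variable name, and slice are all placeholders):

import xarray as xr

# Hypothetical OPeNDAP URL; replace with a real dataset path from the catalog.
url = "https://opendap.example.org/iasdt/chelsa_bio4.nc"

# xarray opens DAP endpoints lazily: only the requested slice is transferred.
ds = xr.open_dataset(url)                      # variable-level access + metadata
subset = ds["bio4"].isel(x=slice(0, 100), y=slice(0, 100))
print(subset.mean().values)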

Dashboard

The IASDT dashboard is created using R Shiny and is linked to the DT's OPeNDAP server. The dashboard will be used to present the results of the IASDT to users and stakeholders in a simplified and intuitive manner.

Sample screenshot:

Metadata and RO-Crates

The IASDT uses the Research Object Crate (RO-Crate) metadata standard to describe its data and workflows. RO-Crate is a community-driven specification for packaging research data with associated metadata. It is both machine-readable and human-readable and supports a wide range of research data types, including datasets, software, and workflows.

We will use the PyDidIt software (developed in-house) for generating workflow crates and the RO-Crate Python library for generating RO-Crate metadata for the data. The RO-Crate metadata will be stored in the same directory as the data and will describe both the data and the workflows that generated it.
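As a minimal sketch of the RO-Crate Python library (assuming a recent version of ro-crate-py; the file path and descriptive properties are illustrative, not the IASDT's actual metadata):

from rocrate.rocrate import ROCrate

crate = ROCrate()  # empty crate with a skeleton ro-crate-metadata.json

# Register a data file with descriptive properties (names are illustrative).
crate.add_file(
    "datasets/processed/chelsa_bio4.nc",
    properties={
        "name": "CHELSA bio4 on the 10 km EEA reference grid",
        "encodingFormat": "application/x-netcdf",
    },
)

# Write the crate (data plus ro-crate-metadata.json) to a directory.
crate.write("chelsa_crate")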

Containerization

Parts of the IASDT (specifically the modelling) are containerized using Singularity. The containers package the IASDT modelling code and its dependencies, and they are used to run the modelling code on the HPC system in a consistent environment across different systems.
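For illustration, a workflow task could launch the containerized modelling code roughly as follows (a sketch; the container image and script names are hypothetical placeholders):

import subprocess

# Placeholder image and script; actual names depend on the deployment.
cmd = [
    "singularity", "exec",           # run a command inside the container
    "iasdt_model.sif",               # hypothetical Singularity image
    "Rscript", "run_hmsc_model.R",   # hypothetical modelling entry point
]
subprocess.run(cmd, check=True)      # raise if the containerized run fails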

License

Cite as

@misc{biodt_iasdt_2025,
  author       = {Khan, Taimur},
  title        = {BioDT: Invasive Alien Species Digital Twin (IASDT) Workflows},
  year         = {2025},
  month        = {02},
  note         = {Biodiversity Digital Twin (BioDT), funded by the European Union},
  doi          = {10.5281/zenodo.14756907},
}

Author

  • Taimur Khan, Helmholtz Centre for Environmental Research (UFZ) | Workflows, Architecture, HPC, Data Processing, Containerization, OPeNDAP server

Contributors

  • Ahmed El-Gabbas, Helmholtz Centre for Environmental Research (UFZ) | IASDT.R, Modelling, Data Processing
  • Dylan Kierans, German Climate Computing Center (DKRZ) | Containerization
  • Julian Lopez Gordillo, Naturalis Biodiversity Center | Metadata and RO-Crates
  • Oliver Wooland, University of Manchester | RO-Crates (PyDidIt)
  • Allan Souza, Institute for Atmospheric and Earth System Research (INAR) | Dashboard
  • Tuomas Rossi, CSC – IT Center for Science Ltd. | HPC, Containerization
  • Jakub Konvicka, IT4Innovations | HPC (LEXIS)
