This repository contains the open‑source code, data, and examples supporting the paper:
Make soil healthy again: Construction of ontology‑compliant soil health knowledge graph with large language models
B. Wang, L. Moreira de Sousa & A. Fensel
Soil health is fundamental to environmental sustainability and food security, yet relevant knowledge remains fragmented across diverse sources, hindering its effective application. Knowledge graphs (KGs) offer a robust solution by integrating disparate information into a structured, semantically rich format. Addressing this need, we present an ontology-compliant soil health knowledge graph (SHKG) derived from domain literature. Our KG comprises 10,996 RDF triples that represent 2,023 entities (including 1,791 soil-related concepts), aligned with established ontologies. We employed a KG construction pipeline that utilizes large language models (LLMs) to accelerate the construction of such a KG. The resulting KG was validated through its ability to answer a series of competency questions reviewed by soil science experts, and the KG's factual representation was reviewed and confirmed by them as well. Finally, we propose several potential applications for our KG. The KG, ontology schema, and associated datasets are made publicly available in this repository.
This work draws on the following primary resources:
- EEA (2023). Soil monitoring in Europe – Indicators and thresholds for soil health assessments.
- EEA (2024). The state of soils in Europe – Fully evidenced, spatially organised assessment of the pressures driving soil degradation.
The high‑level structure of the SHKG follows the conceptual model from the EEA 2023 report (Figure 1.1). We have RDF‑ized this model into our top-level schema:
- RDF representation: see top_level_KG.ttl
Illustration of the high-level structure of the soil health KG:
We used a pipeline that incorporates LLMs to extract relevant information from the source text, followed by post-processing and alignment with established ontologies.
.
├── LICENSE
├── README.md
├── requirements.txt # Python dependencies
├── KGC_pipeline.ipynb # Jupyter notebook demonstrating the full KG‑construction pipeline
├── uk2us.py # Utility script (UK ↔ US spelling normalizer)
├── widoco.properties
│
├── top_level_KG.ttl # High-level structure of the SHKG, derived from the conceptual model (RDF/Turtle)
├── soil_health_KG.ttl # Full Soil Health KG (RDF/Turtle)
├── shKG_metadata.ttl # Metadata describing the KG
├── example_SWR.trig # Example SoilWise knowledge repository (TriG)
│
├── example_sparql_queries/ # Example SPARQL queries
├── ex_ontovocabs/ # Linked external vocabularies & thesauri
├── in_ontovocabs/ # Imported ontologies & schemas
├── benchmarks/
│ ├── text_RDF_gs.json # Text-to-RDF gold standard benchmark
│ └── CQs_SPARQL_ea.json # Competency question, SPARQL query, and expected answer dataset for KG validation
├── imgs/
└── …
- Clone this repository
    git clone https://github.com/soilwise-he/soil-health-knowledge-graph.git
    cd soil-health-knowledge-graph
- Install dependencies
    pip install -r requirements.txt
- Explore the KG
  - Load the main graph in Python or any RDF tool:
      from rdflib import Graph
      # Parse the full soil health KG from its Turtle serialization
      g = Graph().parse("soil_health_KG.ttl", format="turtle")
      print(len(g), "triples loaded")
  - Run the example SPARQL queries in example_sparql_queries/ or via the public endpoint at: https://repository.soilwise-he.eu/sparql/
- Run the pipeline
  Open and run KGC_pipeline.ipynb to see:
  - LLM‑driven triple generation (via GPT‑4o prompts)
  - Turtle syntax check & repair
  - Ontology alignment, entity normalization & relation disambiguation
  - KG enrichment (invertible relations, external vocabularies)
  - KG validation
  - Example SoilWise knowledge repository (interlink with harvested Zenodo metadata records)
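The enrichment step with invertible relations can be illustrated without any RDF library. The following is a minimal sketch, not the notebook's actual implementation: triples are plain tuples, and the property names in `INVERSE_OF` (as well as the function name `add_inverse_triples`) are hypothetical placeholders rather than terms taken from the SHKG schema.

```python
# Hypothetical inverse-property pairs; the real pairs would come from the SHKG schema.
INVERSE_OF = {
    "ex:hasIndicator": "ex:isIndicatorOf",
    "ex:affects": "ex:isAffectedBy",
}


def add_inverse_triples(triples, inverse_of):
    """Return the input triples plus one inverse triple per invertible relation."""
    # Make the mapping symmetric so that either direction produces the other.
    symmetric = dict(inverse_of)
    symmetric.update({inv: rel for rel, inv in inverse_of.items()})

    enriched = set(triples)
    for subject, predicate, obj in triples:
        if predicate in symmetric:
            enriched.add((obj, symmetric[predicate], subject))
    return enriched


# Illustrative triples only (not taken from the KG).
triples = {
    ("ex:soilHealth", "ex:hasIndicator", "ex:soilOrganicCarbon"),
    ("ex:soilCompaction", "ex:affects", "ex:soilHealth"),
    ("ex:soilHealth", "skos:prefLabel", '"soil health"'),
}
enriched = add_inverse_triples(triples, INVERSE_OF)
print(len(triples), "->", len(enriched), "triples")
```

Only properties listed in the mapping are inverted, so label triples with literal objects are left untouched. Applying the function a second time adds nothing new, which makes the step safe to re-run.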
- Interactive Browser: https://soilwise-he.github.io/soil-health
- SPARQL Endpoint: https://repository.soilwise-he.eu/sparql/
- Searchable Vocabulary Browser: https://voc.soilwise-he.containers.wur.nl/
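The public SPARQL endpoint listed above can be queried from Python with the standard library alone. The sketch below assumes the endpoint follows the SPARQL 1.1 Protocol (the query is passed as the `query` parameter of a GET request) and can return SPARQL JSON results; neither is confirmed by this README. The helper names are illustrative, and the query is deliberately schema-agnostic.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://repository.soilwise-he.eu/sparql/"

# Generic query that assumes nothing about the SHKG schema.
QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5"


def build_request(endpoint, query):
    """Build a SPARQL 1.1 Protocol GET request that asks for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def bindings_to_rows(payload):
    """Flatten a SPARQL JSON results document into a list of plain dicts."""
    return [
        {var: cell["value"] for var, cell in binding.items()}
        for binding in payload["results"]["bindings"]
    ]


def run_query(endpoint, query, timeout=30):
    """Send the query and return the result rows (requires network access)."""
    request = build_request(endpoint, query)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return bindings_to_rows(json.load(response))


# Example call, left commented out because it needs network access:
# for row in run_query(ENDPOINT, QUERY):
#     print(row["s"], row["p"], row["o"])
```

The queries in example_sparql_queries/ can be passed to `run_query` in place of `QUERY` once their contents are read from file.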
To ensure our soil health KG aligns with recognized standards, we incorporate a variety of well-established ontologies and schemes.
- SKOS Core
- Dublin Core
- RDF Schema
- Agrontology
- Semanticscience Integrated Ontology (SIO)
- Open Biological and Biomedical Ontology (OBO)
- QUDT
- Ontology of Units of Measure (OM)
- PROV-O
- Schema.org
- SWEET ontology
- Wikidata
- Biolink Model
- Allotrope Foundation Ontology
- REPRODUCE-ME Ontology
- BioAssay Ontology (BAO)
- Time Ontology
The KG uses 20 classes and 205 properties to formally define the types of entities and their relationships. All 20 classes come from the existing ontologies listed above. Of the 205 properties, 45 are defined by us and the remaining 160 come from existing ontologies.
The KG is enriched by interlinking with controlled vocabularies and thesauri in the field of soil science, which aligns it with standard terminologies.
- Semantic Backbone for a broader SoilWise knowledge repository; an example of interlinking with harvested Zenodo metadata records is provided in example_SWR.trig.
- Natural‑language Question Answering over the KG via NL → SPARQL
- Benchmark for text2KG: converting scientific text → ontology‑compliant RDF
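For the text2KG benchmark use case, one common way to score a system is exact-match precision, recall, and F1 over triples. The sketch below shows that metric only as an illustration: the README does not state which metric the authors use, and no assumption is made about the schema of benchmarks/text_RDF_gs.json. The function name and the sample triples are hypothetical.

```python
def triple_prf(predicted, gold):
    """Exact-match precision, recall and F1 between two collections of triples."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative triples only (not taken from the benchmark files).
gold = {
    ("ex:soilHealth", "ex:hasIndicator", "ex:soilOrganicCarbon"),
    ("ex:soilCompaction", "ex:affects", "ex:soilHealth"),
}
predicted = {
    ("ex:soilHealth", "ex:hasIndicator", "ex:soilOrganicCarbon"),
    ("ex:soilErosion", "ex:affects", "ex:soilHealth"),
}
precision, recall, f1 = triple_prf(predicted, gold)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

Exact matching is strict: a predicted triple that uses an equivalent but differently named property counts as an error, so a real evaluation may first need the normalization and alignment steps of the pipeline.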
- Concept-specific comments
  To comment on an individual concept, visit the VocView portal, search for the concept of interest, then scroll down to the Comments section and post your feedback there.
- Missing concepts
  If you believe a soil‑health concept is missing from the SHKG, please open a new GitHub issue to let us know.
This work was supported by the EU's Horizon Europe research and innovation programme within the SoilWise project (grant agreement ID: 101112838).
See Issues for planned tasks and enhancements.
- Code: MIT License (see LICENSE)
- Data & Ontologies: CC BY 4.0 (Creative Commons Attribution 4.0 International)