A component that fetches metadata from remote sources, as documented at https://soilwise-he.github.io/SoilWise-documentation/technical_components/ingestion/.
Harvesting tasks are best triggered from a task runner, such as a CI/CD pipeline. Configuration scripts for running the various harvesting tasks in a GitLab CI/CD environment are available in CI. Tasks are configured using environment variables. The results of a harvest are ingested into Postgres storage, where follow-up processes pick up the results.
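As a minimal sketch, a scheduled GitLab CI job triggering one task could look like the following; the job name, variables, and schedule rule are hypothetical, the actual configuration scripts live in CI:

```yaml
# Hypothetical .gitlab-ci.yml job; names and variables are illustrative only
harvest-csw:
  image: soilwise/harvesters          # image built from this repository (see below)
  variables:
    HARVEST_SOURCE: "example-csw"     # hypothetical variable selecting the source
  script:
    - python csw/metadata.py          # entry point used in the Docker example below
  rules:
    - if: '$CI_PIPELINE_SOURCE == "schedule"'   # run from a pipeline schedule
```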
```mermaid
flowchart LR
  c[CI-CD] -->|task| q[/Queue\]
  r[Runner] --> q
  r -->|deploys| hc[Harvest container]
  hc -->|harvests| db[(temporary storage)]
  hc -->|data cleaning| db
  db -->|triplify| TS[(Triple store)]
  db -->|indexing| CT[Catalogue]
```
This component is tightly related to the triple store and catalogue components. Harvested records are stored in the triple store as well as in the catalogue storage.
```mermaid
flowchart LR
  i[inspire] -->|iso19139| hc
  o[OpenAire] -->|oaf| hc
  f[fao] -->|iso19139| hc
  de[data.europa.eu] -->|DCAT| hc
  hc[Harvest process] -->|rdf| dbtemp[(db-temp)]
  dbtemp -->|sql| aug[augmentation]
  aug -->|sql| dbtemp
  dbtemp -->|sql| pycswingest[pycsw-ingest]
  pycswingest -->|sql| dbrecords[(db-records)]
  dbrecords -->|sql| solrize
  solrize -->|xml| solr[(SOLR)]
  solr -->|json| solrui[SOLR UI]
  dbrecords -->|sql| pycsw[pycsw]
  pycsw --> csw(csw)
  pycsw --> html[catalogueUI]
  dbtemp -->|rdf| vrt[(virtuoso)]
  vrt -->|rdf| sparql(SPARQL)
  sparql -->|json| html
```
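Once records are triplified into the virtuoso store, the catalogue UI retrieves them over the SPARQL endpoint. A minimal query sketch, assuming the records use the DCAT vocabulary of GeoDCAT-AP (see the iso-triplify task below):

```sparql
PREFIX dcat: <http://www.w3.org/ns/dcat#>
PREFIX dct:  <http://purl.org/dc/terms/>

# List a few harvested datasets with their titles
SELECT ?record ?title
WHERE {
  ?record a dcat:Dataset ;
          dct:title ?title .
}
LIMIT 10
```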
The following harvesting tasks are available:

- CSW (for example BonaRes, EJP Soil, islandr, INSPIRE); see the sketch after this list
- ESDAC: a dedicated API
- Cordis/OpenAire: a combination of SPARQL and APIs
- Prepsoil: a dedicated API
- Newsfeeds: imports news feeds from soil mission websites
- iso-triplify: exports iso19139 records to GeoDCAT-AP for inclusion in the SWR triple store
- record-to-pycsw: exports records to the catalogue (as iso19139 or Dublin Core)
- translate: triggers a translation of non-English records
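As an illustration of what the CSW task does, the sketch below pages through a catalogue with OWSLib; the endpoint is hypothetical, and this is not necessarily how csw/metadata.py is implemented:

```python
from owslib.csw import CatalogueServiceWeb

# Hypothetical endpoint; the real sources (BonaRes, EJP Soil, ...) are
# configured via environment variables and the harvest.sources table.
csw = CatalogueServiceWeb("https://example.org/csw")

start = 1
while True:
    # Fetch one page of Dublin Core records
    csw.getrecords2(esn="full", startposition=start, maxrecords=10)
    for identifier, record in csw.records.items():
        print(identifier, record.title)
    nxt = csw.results.get("nextrecord", 0)
    # Stop when the server reports no further records
    if not nxt or nxt > csw.results["matches"]:
        break
    start = nxt
```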
Run the scripts with Docker. Create a `.env` file with the harvester details, then build and run the image:

```bash
docker build -t soilwise/harvesters .
docker run --env-file csw/.env soilwise/harvesters python csw/metadata.py
```
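The variable names in the example below are hypothetical placeholders; check the task scripts for the variables they actually read:

```
# csw/.env -- hypothetical variable names, for illustration only
# endpoint to harvest
HARVEST_URL=https://example.org/csw
# target Postgres storage
POSTGRES_HOST=localhost
POSTGRES_DB=soilwise
POSTGRES_USER=harvester
POSTGRES_PASSWORD=secret
```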
Create the harvest tables with the following script:
```sql
-- Ensure the target schema exists
CREATE SCHEMA IF NOT EXISTS harvest;

CREATE SEQUENCE IF NOT EXISTS harvest.sources_source_id_seq
    INCREMENT 1
    START 1
    MINVALUE 1
    MAXVALUE 2147483647
    CACHE 1;

CREATE TABLE IF NOT EXISTS harvest.sources
(
    source_id integer NOT NULL DEFAULT nextval('harvest.sources_source_id_seq'::regclass),
    name character varying(99) NOT NULL,
    description character varying(255),
    url character varying(99),
    schedule character varying(99),
    type character varying(99),
    filter character varying(255),
    turtle_prefix text,
    CONSTRAINT source_pkey PRIMARY KEY (source_id),
    CONSTRAINT source_name_key UNIQUE (name)
);

CREATE TABLE IF NOT EXISTS harvest.items
(
    identifier text NOT NULL DEFAULT ''::text,
    identifiertype character varying(50),
    itemtype character varying(50),
    resultobject text NOT NULL,
    resulttype character varying(50),
    uri text NOT NULL DEFAULT ''::text,
    insert_date timestamp without time zone,
    title text,
    source text,
    hash text NOT NULL DEFAULT ''::text,
    turtle text,
    date character varying(10),
    error text,
    language character varying(9),
    project text,
    downloadlink text,
    downloadtype text,
    CONSTRAINT item_pkey PRIMARY KEY (identifier, uri, hash),
    CONSTRAINT item_sources_name_fkey FOREIGN KEY (source)
        REFERENCES harvest.sources (name) MATCH SIMPLE
        ON UPDATE CASCADE
        ON DELETE NO ACTION
);

CREATE INDEX IF NOT EXISTS ix_items_itemtype
    ON harvest.items USING btree
    (itemtype ASC NULLS LAST);

CREATE INDEX IF NOT EXISTS ix_items_source
    ON harvest.items USING btree
    (source ASC NULLS LAST);

CREATE TABLE IF NOT EXISTS harvest.item_duplicates
(
    identifier text NOT NULL,
    identifiertype character varying(50),
    source text NOT NULL,
    hash text NOT NULL,
    CONSTRAINT item_duplicates_pkey PRIMARY KEY (identifier, hash),
    CONSTRAINT duplicate_sources_name_fkey FOREIGN KEY (source)
        REFERENCES harvest.sources (name) MATCH SIMPLE
        ON UPDATE CASCADE
        ON DELETE NO ACTION
);
```
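As a usage sketch against this schema (the source name, URL, and type values are illustrative):

```sql
-- Register a hypothetical CSW source
INSERT INTO harvest.sources (name, url, type)
VALUES ('example-csw', 'https://example.org/csw', 'csw');

-- Inspect the most recently harvested items for that source
SELECT identifier, title, insert_date
FROM harvest.items
WHERE source = 'example-csw'
ORDER BY insert_date DESC
LIMIT 10;
```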