Skip to content

soilwise-he/harvesters

Repository files navigation

SWR Harvesters

DOI

A component to fetch metadata from remote sources as documented at https://soilwise-he.github.io/SoilWise-documentation/technical_components/ingestion/.

Harvesting tasks can best be triggered from a tast runner, such as a CI-CD pipeline. Configuration scripts for running various harvesting tasks in a Gitlab CI-CD environment are available in CI. Tasks are configured using environment variables. The result of the harvester are ingested into a PostGres storage, where follow up processes pick up the results.

flowchart LR
    c[CI-CD] -->|task| q[/Queue\]
    r[Runner] --> q
    r -->|deploys| hc[Harvest container]
    hc -->|harvests| db[(temporary storage)]
    hc -->|data cleaning| db[(storage)]
    db -->|triplify| TS[(Triple store)]
    db -->|indexing| CT[Catalogue] 
Loading

This component is tightly related to the triple store component and catalogue component. Harvested records are stored on the triple store as well as the catalogue storage.

Harvest in full data flow

flowchart LR
 i[inspire] -->|iso19139| hc
 o[OpenAire] -->|oaf| hc
 f[fao] -->|iso19139| hc
 de[data.europa.eu] -->|DCAT| hc
 hc[Harvest process] -->|rdf| dbtemp[(db-temp)]
 dbtemp -->|sql| aug[augmentation]
 aug -->|sql| dbtemp
 dbtemp -->|sql| pycswingest[pycsw-ingest]
 pycswingest -->|sql| dbrecords[(db-records)]
 dbrecords -->|sql| solrize
 solrize -->|xml| solr[(SOLR)]
 solr -->|json| solrui[SOLR UI]
 dbrecords -->|sql| pycsw[pycsw]
 pycsw --> csw(csw)
 pycsw --> html[catalogueUI]
 dbtemp -->|rdf| vrt[(virtuoso)]
 vrt -->|rdf| sparql(SPARQL)
 sparql -->|json| html
Loading

The following harvesting tasks are available.

Fetch records

  • CSW (for example Bonares, EJP Soil, islandr, inspire)
  • ESDAC a dedicated API
  • Cordis/OpenAire combination of SPARQL and API's
  • Prepsoil a dedicated API
  • Newsfeeds imports newsfeeds from soil mission websites

Process records

  • iso-triplify exports iso19139 records to GeoDCAT-AP to be included in SWR triplestore
  • record-to-pycsw exports records to catalogue (as iso19139 or Dublin Core)
  • translate triggers a translation of non english records

Docker

Run script as docker. Create a .env file with harvester details.

docker build -t soilwise/harvesters .
docker run --env-file csw/.env soilwise/harvesters python csw/metadata.py

Database

Create script for harvest tables

CREATE SEQUENCE IF NOT EXISTS harvest.sources_source_id_seq
    INCREMENT 1
    START 1
    MINVALUE 1
    MAXVALUE 2147483647
    CACHE 1;

CREATE TABLE IF NOT EXISTS harvest.sources
(
    source_id integer NOT NULL DEFAULT nextval('harvest.sources_source_id_seq'::regclass),
    name character varying(99)  NOT NULL,
    description character varying(255) ,
    url character varying(99) ,
    schedule character varying(99) ,
    type character varying(99) ,
    filter character varying(255) ,
    turtle_prefix text ,
    CONSTRAINT source_pkey PRIMARY KEY (source_id),
    CONSTRAINT source_name_key UNIQUE (name)
)

CREATE TABLE IF NOT EXISTS harvest.items
(
    identifier text  NOT NULL DEFAULT ''::text,
    identifiertype character varying(50) ,
    itemtype character varying(50) ,
    resultobject text  NOT NULL,
    resulttype character varying(50) ,
    uri text  NOT NULL DEFAULT ''::text,
    insert_date timestamp without time zone,
    title text ,
    source text ,
    hash text  NOT NULL DEFAULT ''::text,
    turtle text ,
    date character varying(10) ,
    error text ,
    language character varying(9) ,
    project text ,
    downloadlink text ,
    downloadtype text ,
    CONSTRAINT item_pkey PRIMARY KEY (identifier, uri, hash),
    CONSTRAINT item_sources_name_fkey FOREIGN KEY (source)
        REFERENCES harvest.sources (name) MATCH SIMPLE
        ON UPDATE CASCADE
        ON DELETE NO ACTION
);

CREATE INDEX IF NOT EXISTS ix_items_itemtype
    ON harvest.items USING btree
    (itemtype  ASC NULLS LAST);

CREATE INDEX IF NOT EXISTS ix_items_source
    ON harvest.items USING btree
    (source  ASC NULLS LAST);

CREATE TABLE IF NOT EXISTS harvest.item_duplicates
(
    identifier text  NOT NULL,
    identifiertype character varying(50) ,
    source text  NOT NULL,
    hash text  NOT NULL,
    CONSTRAINT item_duplicates_pkey PRIMARY KEY (identifier, hash),
    CONSTRAINT duplicate_sources_name_fkey FOREIGN KEY (source)
        REFERENCES harvest.sources (name) MATCH SIMPLE
        ON UPDATE CASCADE
        ON DELETE NO ACTION
)

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Contributors 3

  •  
  •  
  •