
Overview

(This information is also available in Google doc format)

We have an infrastructure that allows for scheduling the extraction of metadata from multiple metadata providers. The resulting metadata is mapped to the DataWorks schema, which is an extension of the DataCite schema, and then mapped to Solr documents in our DataWorks index. An experimental prototype UI allows for reviewing the contents of the index.

The DataWorks ETL pipeline currently supports:

  • The automated weekly extraction of metadata from these metadata providers using their APIs: SDR, SearchWorks (ICPSR records), Zenodo, Dryad, Redivis, and DataCite. We are also able to retrieve metadata from OpenAlex by explicitly requesting it via the command line. This provider list is based on the project team’s suggestions; we added OpenAlex because this source had multiple Stanford-affiliated datasets.
  • The use of pre-specified queries or lists of identifiers to retrieve metadata from these providers. Queries and lists vary by provider, with SDR extraction currently relying primarily on a pre-specified list of identifiers for large datasets for which access was requested via Globus. In this manner, we support the inclusion of datasets based on three different scenarios:
    • When the datasets we wish to add are identified via queries
    • When we don’t have a query to retrieve datasets of interest but do know which datasets we want to include and an API exists for retrieving their metadata
    • When the metadata provider has both query and individual access API mechanisms, but we want to add datasets that are not returned in our queries
  • The addition of dataset metadata modeled in files. This ability supports the addition of datasets of interest that may exist outside any registry or repository, may not have an affiliated API, and may be websites or collections of datasets.
  • Deduplication of metadata records representing the same dataset (i.e., same DOI or URL) by selecting the metadata record from a primary provider based on an order of preference. In this manner, our index only has one metadata record per DOI/URL. We do capture which other providers also have records for that same dataset. We should then be able to pass this information along to the UI, where users can access the different provider landing pages for the same dataset.
  • Merging specific information between providers. Specifically, we are able to capture information about Redivis variables even if we are using the preferred DataCite metadata record. We should be able to expand the properties that can be merged if we need to later.
  • Associating specific datasets with (fake) course ids, to demonstrate the ability to capture information about which datasets are used for teaching.
  • Transforming metadata from the different providers into DataWorks schema-conformant records, generating Solr documents based on these records, and then loading those documents into the Solr index.

The diagram below provides an overview of the main tasks and capabilities of the ETL pipeline. This image shows the full set of processes used to retrieve metadata from a single provider, transform it, and then load it into our Solr index. The rest of this document provides more explanation of the ETL processes supported by this infrastructure. (View full diagram in Miro).

[Diagram: DataWorks architecture overview]

Infrastructure capabilities:

  • The system extracts metadata from SearchWorks, Redivis, DataCite, Zenodo, Dryad, and OpenAlex based on queries we have defined to retrieve datasets that may be affiliated with Stanford in some fashion. For SDR, we extract metadata for a hard-coded list of identifiers (this mechanism is explained further below). Note: OpenAlex does not have a scheduled extraction job, but we are able to extract the metadata from OpenAlex manually.
  • Affiliation queries vary by provider. Relationships for which we query include: creators or contributors that are affiliated with Stanford, datasets whose funders are Stanford, datasets published by SUL, and datasets whose creators or contributors are specified as Stanford. Some queries use text string matching against “Stanford University” while others use ROR ids representing Stanford University.
  • De-duplication occurs by selecting the dataset from the designated primary provider from a group of datasets with the same DOI or URL. Deduplication allows the resulting UI to only have one dataset record per DOI or URL even though we have metadata from multiple providers for that dataset. Based on feedback from the metadata group, we identified an order of preference for providers which we use during deduplication. For example, since DataCite is preferred over Redivis for metadata, when the system encounters a Redivis dataset and a DataCite dataset with the same DOI, the system uses the DataCite metadata as the primary record. Since the dataset is available via two different providers, the system also saves information about the providers and their related identifiers with the resulting deduplicated dataset.
  • The system also supports nascent metadata merging capabilities. Given a list of properties that can be merged, once the primary dataset record has been chosen, the mergeable properties can be copied over from other providers’ metadata for that same DOI or URL. Currently, we are using this ability only to support the merging of variables from Redivis records in the case where we have a DataCite record equivalent. This method allows us to use the DataCite metadata for a Redivis DOI and also preserve the Redivis variable information as variables are not present in the DataCite schema.
  • The infrastructure also supports hard-coding lists of identifiers per provider. In this manner, we are able to add datasets which may not result from specific queries but that we know are important. We use this mechanism for identifying SDR datasets we would like to add to the system. We were also able to identify which datasets listed in the OpenAlex query results were also in DataCite. Since DataCite provides descriptions via their API but OpenAlex does not provide any abstract information, we hard-coded the list of OpenAlex DOIs we knew were also in DataCite.
  • Additionally, we are able to support the inclusion of datasets that are not available via any API. We define these datasets in YAML files conforming to the DataWorks schema. These YAML files are then loaded into the system. The provider is listed as “Local” for these cases.
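
As an illustration of the last point, a local dataset file might look roughly like the sketch below. The field names follow the DataWorks schema (a DataCite extension) as shown in the mapping examples later on this page; the dataset, the values, and the exact file layout are hypothetical, and the system records the provider for such records as “Local”.

```yaml
# Hypothetical example of a locally described dataset (values and layout are illustrative).
titles:
  - title: "Example Stanford dataset hosted on a project website"
creators:
  - name: "Stanford University"
    name_type: "Organizational"
publication_year: "2024"
url: "https://example.stanford.edu/datasets/example-dataset"
identifiers:
  - identifier: "https://example.stanford.edu/datasets/example-dataset"
    identifier_type: "URL"
descriptions:
  - description: "A dataset published outside any repository, described locally for DataWorks."
    description_type: "Abstract"
subjects:
  - subject: "example subject"
access: "Public"
```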

Architecture, workflow, and data

Below, we describe how metadata is extracted, transformed, and loaded in the Solr index through the execution of various jobs.

Extraction

[Diagram: DataWorks architecture, extraction]

The Extract job retrieves metadata from the provider and then stores the full metadata records in the database. Each provider has its own extraction class, which is responsible for executing queries and retrieving full metadata records for the datasets in the results. Some providers’ queries return only identifiers, so the extractor must make a follow-up request to retrieve the full metadata record for each identifier. Other providers, like SearchWorks, return all the metadata needed from the original query. Database details are included in a section further below.
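
A highly simplified sketch of the shape of an extraction job follows. The class, method, and argument names here are illustrative assumptions rather than the project’s actual code; the affiliation argument and the dataset_records columns are taken from the examples later on this page.

```ruby
# Illustrative sketch only -- not the actual extraction job implementation.
class ExampleExtractJob < ApplicationJob
  def perform
    # The extractor delegates query execution to a provider-specific client and
    # returns the full source metadata for each dataset in the results.
    Extractors::Example.new(affiliation: 'https://ror.org/00f54p054').extract.each do |metadata|
      # Store the full provider metadata in the dataset_records table.
      DatasetRecord.create!(provider: 'example',
                            dataset_id: metadata.fetch('id'),
                            source: metadata)
    end
  end
end
```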

Classes to look for in our code:

  • app/services/clients: base implementation and provider-specific query logic.
  • app/services/extractors: provider-specific logic that passes parameters to the provider-specific clients and returns the results.
  • app/jobs/*_extract_job.rb: Provider-specific extraction job which calls the extractors, retrieves the results, and saves the metadata for that provider to the database.
  • config/datasets: Lists of identifiers to add for a given provider. For example, config/datasets/sdr.yml lists the SDR druids we want to add to DataWorks.
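
A file such as config/datasets/sdr.yml might look roughly like the sketch below; the key name and the druids shown are made up, and the actual file structure may differ.

```yaml
# Hypothetical sketch of config/datasets/sdr.yml: a list of SDR druids to add to DataWorks.
druids:
  - druid:bb123cd4567
  - druid:cc234de5678
```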

Extraction jobs are scheduled via config/recurring.yml. They are spaced out on different days to allow for a full sequential set of extractions to occur before transformations are applied.
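
For context, SolidQueue recurring task entries in config/recurring.yml take roughly the following shape. The job class names, days, and times below are illustrative assumptions, not the project’s actual schedule.

```yaml
# Illustrative sketch of SolidQueue recurring entries; actual names and times may differ.
production:
  dryad_extract:
    class: DryadExtractJob
    schedule: every monday at 2am
  datacite_extract:
    class: DataciteExtractJob
    schedule: every tuesday at 2am
  transform_load:
    class: TransformLoadJob
    schedule: every sunday at 2am
```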

Transformation and loading

[Diagram: DataWorks architecture, transformation and loading (part 1)]

The Transform and load job retrieves metadata stored in the database, deduplicates records, converts the metadata from the designated primary provider into the DataWorks schema, and then generates Solr documents that are loaded into the index. Datasets are grouped by DOI or URL, designated as the external identifier in the code. If datasets from multiple providers have the same DOI, we use an order of provider preference to select the preferred provider’s metadata. We capture provider and identifier information at this stage and send it along with the core metadata for that dataset. We can also merge specific fields from other providers into the metadata captured from the primary provider. Currently, this functionality exists only for variables, so we can keep variables from Redivis even when we are using DataCite as the primary provider.
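
A minimal sketch of the deduplication idea follows. The provider order shown is an example only (the real order of preference was set with input from the metadata group, e.g. DataCite is preferred over Redivis), and the record accessors are assumptions.

```ruby
# Illustrative sketch of deduplication by external identifier (DOI or URL); not the actual code.
PROVIDER_PREFERENCE = %w[datacite dryad zenodo sdr redivis searchworks openalex].freeze

def deduplicate(dataset_records)
  dataset_records
    .group_by { |record| record.doi || record.url } # records for the same dataset share a DOI/URL (url accessor assumed)
    .map do |_external_id, records|
      # Use the metadata from the most-preferred provider as the primary record...
      primary = records.min_by { |r| PROVIDER_PREFERENCE.index(r.provider) || PROVIDER_PREFERENCE.size }
      # ...while keeping track of every provider (and identifier) that has this dataset.
      { primary: primary, providers: records.map { |r| [r.provider, r.dataset_id] } }
    end
end
```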

[Diagram: DataWorks architecture, transformation and loading (part 2)]

Once datasets have been deduplicated and metadata merging has occurred, the metadata can be transformed into the DataWorks schema. Each provider has its own code for transforming from its metadata schema into the DataWorks schema. After the transformed metadata has been validated against our schema, the process continues with converting the DataWorks schema metadata into a Solr record and loading that record into the index.
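
In compact form, the per-record flow looks roughly like the sketch below. The dataworks_mappers directory and validator are named in the class list that follows, but the specific class constants, method names, and the Solr helper here are assumptions.

```ruby
# Rough sketch of the per-record transform step (class and method names are assumptions).
dataworks_metadata = DataworksMappers::Dryad.call(source_metadata) # provider schema -> DataWorks schema
if DataworksValidator.new(dataworks_metadata).valid?               # validate against the DataWorks schema
  solr_document = solr_mapper.call(dataworks_metadata)             # DataWorks schema -> Solr document
  solr_connection.add(solr_document)                               # load into the index
end
```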

Validation on each transformed record can succeed or fail, with different consequences depending on whether the dataset DOI was already identified as one we should ignore. In the simplest case, when validation succeeds and the dataset is not in our ignore list, we continue on to the creation of a Solr document based on this metadata. If validation fails and the dataset is in our ignore list, we ignore the dataset and do not create a Solr document for it. If validation succeeds and the dataset is in our ignore list, we trigger a Honeybadger notification without failing the job. If validation fails and the dataset is not in our ignore list, the job fails.
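
These four outcomes can be summarized as in the sketch below; the method and its return values are purely illustrative.

```ruby
# Sketch of the validation vs. ignore-list outcomes described above (not the real code).
def handle_validation(doi, valid:, ignored:)
  if valid && !ignored
    :index            # normal path: build a Solr document for this dataset
  elsif !valid && ignored
    :skip             # known-bad dataset: ignore it quietly
  elsif valid && ignored
    Honeybadger.notify("Dataset #{doi} is on the ignore list but now validates")
    :skip             # notify, but do not fail the job
  else
    raise "Validation failed for #{doi}" # not on the ignore list: fail the job
  end
end
```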

Classes to look for in our code:

  • app/services/dataworks_mappers: Provider-specific mapping from provider metadata schema to the DataWorks schema
  • app/services: transformer_loader, dataworks_transformer, and dataworks_validator help with deduplication, transforming metadata to the DataWorks schema, validating the metadata, and metadata merging.
  • app/jobs/transform_load_job: the actual job that calls transformer_loader to begin the process described above
  • config/settings.yml: the ignore lists are found under provider names. E.g. “datacite:ignore:” lists all the DOIs we do not wish to include and that are returned from DataCite queries in the extraction step.
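
An ignore list entry in config/settings.yml therefore looks roughly like the following; the DOIs shown are made up.

```yaml
# Sketch of an ignore list in config/settings.yml (example DOIs only).
datacite:
  ignore:
    - 10.0000/example-ignored-dataset-1
    - 10.0000/example-ignored-dataset-2
```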

Queries for retrieving Stanford affiliated data

Provider Retrieval criteria
DataCite Retrieve datasets where all of the following criteria are met (an illustrative API query appears after this table):
  • resource-type-id = 'dataset'
  • Any of the following are true:
    • creator or contributor is Stanford University: creator/contributor name identifier = Stanford ROR OR creator/contributor name string = "Stanford University"
    • creator/contributor is affiliated with Stanford University: creator or contributor affiliation identifier = Stanford ROR OR creator/contributor affiliation name string = "Stanford University"
    • client id = 'sul.openneuro' to retrieve OpenNeuro datasets
    • provider id = "sul" to retrieve datasets uploaded by SUL
Zenodo Retrieve datasets where both of these conditions are true:
  • type = 'dataset'
  • creator's affiliation = 'Stanford University'
Dryad Retrieve datasets where affiliation = Stanford ROR
SearchWorks Retrieve datasets where all these conditions are met:
  • Author = "Inter-university Consortium for Political and Social Research."
  • Access = Online
  • Format = Dataset
Redivis Retrieve datasets for each of these organizations:
  • organization = "SUL"
  • organization = "StanfordPHS"
  • organization = "SDSS"
SDR Retrieve datasets from a hard-coded list of druids
OpenAlex The works endpoint retrieves the datasets using the following filters:
  • type = "dataset"
  • institutions.id = Stanford ROR
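
As a concrete illustration of the DataCite criteria above, an affiliation query against the public DataCite REST API might look roughly like the snippet below. The parameter names and query field path are assumptions based on DataCite’s documented API, not necessarily the exact query this project runs.

```ruby
# Rough illustration of a Stanford-affiliation query to DataCite (parameters are assumptions).
require 'faraday'
require 'json'

STANFORD_ROR = 'https://ror.org/00f54p054'

response = Faraday.get(
  'https://api.datacite.org/dois',
  'resource-type-id' => 'dataset',
  'query' => %(creators.affiliation.affiliationIdentifier:"#{STANFORD_ROR}"),
  'page[size]' => 100
)
datasets = JSON.parse(response.body).fetch('data', [])
```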

Example mapping from provider to DataWorks schema

Here is an example of mapping from Dryad to a DataWorks schema record, with the column on the left showing the original Dryad values and the column on the right showing the resulting DataWorks schema values, also in JSON format for consistency. Some values have been left out or truncated for readability.

Dryad values DataWorks schema metadata
"identifier": "doi:10.5061/dryad.q573n5tqm" "identifiers": [ { "identifier": "doi:10.5061/dryad.q573n5tqm", "identifier_type": "DOI" } ]
"storageSize": 35974273 "sizes": [ "35974273 KB" ]
"title": "Data from: Geography and ecology shape the phylogenetic composition of Amazonian tree communities" "titles": [ { "title": "Data from: Geography and ecology shape the phylogenetic composition of Amazonian tree communities" } ]
"authors": [ { "firstName": "Bruno", "lastName": "Garcia Luize", "email": "luize.bg@gmail.com", "affiliation": "Universidade Estadual de Campinas (UNICAMP)", "affiliationROR": "https://ror.org/04wffgt70", "orcid": "0000-0002-8384-8386", "order": 0 } ] "creators": [ { "name": "Bruno Garcia Luize", "name_type": "Personal", "name_identifiers": [ { "name_identifier": "0000-0002-8384-8386", "name_identifier_scheme": "ORCID" } ], "affiliation": [ { "name": "Universidade Estadual de Campinas (UNICAMP)", "affiliation_identifier": "https://ror.org/04wffgt70", "affiliation_identifier_scheme": "ROR" } ] } ]
"abstract": "<p><strong>Aim:</strong> Amazonia hosts more tree species, from numerous evolutionary lineages both young and ancient, than any other biogeographic region. Previous studies have shown that tree lineages colonised multiple edaphic environments and dispersed widely across Amazonia</p>", "descriptions": [ { "description": "<p><strong>Aim:</strong> Amazonia hosts more tree species, from numerous evolutionary lineages both young and ancient, than any other biogeographic region. Previous studies have shown that tree lineages colonised multiple edaphic environments and dispersed widely across Amazonia</p>", "description_type": "Abstract" },... } ]
"funders": [ { "organization": "São Paulo Research Foundation", "identifierType": "ror", "identifier": "https://ror.org/02ddkpn78", "awardNumber": "2021/11670-3", "awardDescription": "", "order": 0 } ], "funding_references": [ { "funder_name": "São Paulo Research Foundation", "award_number": "2021/11670-3", "funder_identifier": "https://ror.org/02ddkpn78", "funder_identifier_type": "ROR" } ]
"keywords": [ "Community assembly", "Neotropics", "dispersal limitation", "Environmental selection", "niche conservatism", "Evolutionary Principal Component Analysis", "Moran's Eigenvector Maps", "Indicator Lineage Analysis", "tropical rain forests", "Ecology", "Evolution", "behavior and systematics" ], "subjects": [ { "subject": "Community assembly" }, { "subject": "Neotropics" }, { "subject": "dispersal limitation" }, { "subject": "Environmental selection" }, { "subject": "niche conservatism" }, { "subject": "Evolutionary Principal Component Analysis" }, { "subject": "Moran's Eigenvector Maps" }, { "subject": "Indicator Lineage Analysis" }, { "subject": "tropical rain forests" }, { "subject": "Ecology" }, { "subject": "Evolution" }, { "subject": "behavior and systematics" } ]
"methods": "<p><strong>Geography and ecology shape the phylogenetic composition of Amazonian tree communities</strong></p>\n<p>Data files derived from the abundance of 5,082 tree species in 1,989 plots across Amazonia and used to support the findings described in Luize et al.,</p>", "descriptions": [ ..{ "description": "<p><strong>Geography and ecology shape the phylogenetic composition of Amazonian tree communities</strong></p>\n<p>Data files derived from the abundance of 5,082 tree species in 1,989 plots across Amazonia and used to support the findings described in Luize et al., </p>", "description_type": "Methods" } ]
"relatedWorks": [ { "relationship": "article", "identifierType": "DOI", "identifier": "https://doi.org/10.1038/s42003-023-05514-6" }, { "relationship": "primary_article", "identifierType": "DOI", "identifier": "https://doi.org/10.1111/jbi.14816" } ], "related_identifiers": [ { "related_identifier": "1365-2699", "related_identifier_type": "ISSN" } ]
"versionNumber": 8, "version": "8"
"publicationDate": "2024-03-07", "publication_year": "2024", "dates": [ { "date": "2024-03-07", "date_type": "Issued" },.. ]
"lastModificationDate": "2024-03-07", "dates": [ …, { "date": "2024-03-07", "date_type": "Updated" } ]
"visibility": "public",
"license": "https://spdx.org/licenses/CC0-1.0.html" "access": "Public"
"provider": "Dryad"
"sharingLink": "http://datadryad.org/dataset/doi:10.5061/dryad.q573n5tqm", "url": "http://datadryad.org/dataset/doi:10.5061/dryad.q573n5tqm"

More details about how fields map across providers to the DataCite schema can be found in this spreadsheet: Metadata and Data Curation: Iteration 1 on the tab labeled ‘Schema Mapping V1’.

Index mapping overview

Here is a breakdown of how information is mapped to various fields. Our indexing approach for this workcycle was aggressive in trying to map as many fields to facets as possible. With the UI/UX workcycle, these decisions will likely change.

The table below shows how the information is mapped to one or more Solr fields, whether the field will be available on the item page (i.e., show), whether it is used as a facet, and whether the value is copied to the all-text search field, which allows matches through the keyword search box. (An example Solr document appears after the table.)

Info Solr field Show Facet copied to all text search
id Id (this is the DOI or URL with handling for special characters) x
access access_ssi x x
provider provider_ssi x x
provider identifiers provider_identifier_ssi x
doi doi_ssi x x
title title_tsim x x
subtitle subtitle_tsim x x
alternative title alternative_title_tsim x x
translated title translated_title_tsim x x
other title other_title_tsim x x
creators creators_ssim x x x
creators_struct_ss x
creators ids creators_ids_sim x
funders funders_ssim x x x
funding_references_struct_ss x
funders ids funders_ids_sim x x
url url_ss x
contributors contributors_ssim x x x
contributors_ids_ssim x x x
contributors_struct_ss
publisher publisher_ssi x x x
publisher_id_sim x
related_identifiers related_ids_sim x
related_identifiers_struct_ss x
publication year publication_year_isi x x
subjects subjects_ssim x x x
descriptions descriptions_tsim (abstract or no type) x x
methods_tsim (methods) x x
other_descriptions_tsim (other descriptions) x x
language language_ssi x x
size sizes_ssm x
formats formats_ssim x x x
version version_ss x
rights rights_uris_sim x
rights_list_struct_ss x
affiliations of creators and contributors affiliation_names_sim x
temporal_coverage temporal_isim x x
variables variables_tsim x x x
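
To make the field names above more concrete, a Solr document for a single dataset has roughly the shape below. The values are taken from the Dryad example earlier on this page; the real documents contain additional fields, and the handling of special characters in the id is omitted here.

```ruby
# Illustrative shape of a Solr document using field names from the table above.
solr_doc = {
  id: '10.5061/dryad.q573n5tqm',
  doi_ssi: '10.5061/dryad.q573n5tqm',
  provider_ssi: 'dryad',
  access_ssi: 'Public',
  title_tsim: ['Data from: Geography and ecology shape the phylogenetic composition of Amazonian tree communities'],
  creators_ssim: ['Bruno Garcia Luize'],
  publication_year_isi: 2024,
  subjects_ssim: ['Community assembly', 'Neotropics'],
  url_ss: 'http://datadryad.org/dataset/doi:10.5061/dryad.q573n5tqm'
}
```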

Database overview

This section provides a quick view of the table structure for the dataworks_etl_rails_development database, with some example data. Other databases used for the SolidCable/SolidQueue configuration are not described here.

Dataset records

This table contains the metadata from the provider for any given dataset. The modified token is provider-specific and lets the system decide whether to replace the stored metadata with an updated version when the extract job runs again. For some providers the modified token is a version number; for others it is the source_md5 value that is also stored in the table.
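
A sketch of how the modified token can gate re-saving metadata during extraction is shown below; the DatasetRecord model name and the method are illustrative assumptions based on the table structure described here.

```ruby
# Illustrative sketch only -- not the actual extraction code.
def store_if_changed(provider:, dataset_id:, modified_token:, source:)
  record = DatasetRecord.find_or_initialize_by(provider: provider, dataset_id: dataset_id)
  # If the provider's copy is unchanged (same version / source_md5), keep what we have.
  return if record.persisted? && record.modified_token == modified_token

  # Otherwise store the freshly extracted metadata.
  record.update!(modified_token: modified_token, source: source)
end
```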

id provider dataset_id modified_token doi source_md5 source created_at updated_at
18714 dryad doi:10.5061/dryad.q573n5tqm 8 10.5061/dryad.q573n5tqm fbea105e33a3f4692e7819da21f764e2 { "id": 126259, "title": "Data from: Geography and ….} 2025-05-14 18:36:01.348421 2025-05-14 18:36:01.348421

Dataset record sets

This table captures information about the extraction job for a particular provider, along with the arguments that were sent to the extractor. For example, in the table below, we see that the Dryad extraction job ran using the Extractors::Dryad class with the Stanford University ROR id passed as the affiliation argument. We capture the arguments because the same extractor can be passed different arguments for different extraction jobs. This information is used in the transformation step, when we retrieve datasets from different providers and need to see which datasets are associated with which provider and which set of arguments.

id provider complete job_id created_at updated_at extractor list_args
68 dryad true "166297d5-c35b-4c7e-af49-f22b2c010efa" 2025-05-14 18:35:18.39373 2025-05-14 19:03:01.181667 Extractors::Dryad {"affiliation":"https://ror.org/00f54p054"}

Dataset record associations

This table associates a particular “dataset record set” with a given dataset record.

dataset_record_set_id dataset_record_id created_at updated_at
68 18714 2025-05-14 18:36:01.370631 2025-05-14 18:36:01.370631

SolidQueue/ActiveJobs

As noted above, we are using SolidQueue in this project. In our staging environment, the Mission Control jobs page can be reached via “/jobs”. We use Active Job and SolidQueue to handle scheduling and execution of our extraction and transform/load jobs. config/recurring.yml has the list of jobs to be executed.

An emergent dependency between jobs is that we would like all of the extraction jobs to finish before we execute the transform job. We therefore schedule the extraction jobs one per day, with the transform and load job scheduled for the final day of the sequence.