Approaches to Solr Indexing

Ankit Lohani edited this page Jun 28, 2018 · 8 revisions

The JSON-LD that we have to index is a nested document, as we have seen in [1]. Such nested trees have two notable features:
Point A. The tree can go very deep, with new "@type"s being introduced at various depths.
Point B. A tree node can have an indefinite number of parallel child nodes (e.g., multiple properties under the additionalProperty @type).
The example I gave earlier [1] shows both: the tree is 5 levels deep, with the leaf (last node) being @type:"CategoryCode" (Point A), and it has multiple child nodes under "mainEntity" > "additionalProperty" (Point B).

In Solr, we have to index one JSON-LD document at a time. The possible approaches I considered were:

  1. Create a separate Solr core for each type - Dataset, DataRecord, Beacon, BioChemEntity, etc. For a given JSON-LD, we break it into parts based on @type, put each part into the core for its type, and link the parts with a shared ID. Then, say I run the query "homo sapiens": it queries all cores, and if it gets a result X from some core C, it retrieves the corresponding parts from the other cores using the ID that links those parts. One benefit of this approach is that it indexes the data in the most structured way, because each core has a defined schema. It also has demerits: a single query turns into multiple requests (one per core), and how many cores are we ready to create? The bioschemas.org [2] website has 12 specifications defined, i.e., at least 12 cores. Beyond these, schema.org itself has a great many specifications, and making a separate core for each one complicates and slows things down.
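
The stitching step of approach 1 can be sketched as follows. This is a minimal in-memory illustration, not real Solr client code: the core names (`datarecord`, `biochementity`), the `link_id` field, and the canned responses are all hypothetical stand-ins for the results of one query per core.

```python
# Sketch of approach 1: after querying every per-type core, reassemble the
# parts of each original JSON-LD via the shared link id.
def merge_across_cores(responses):
    """responses: {core_name: [doc, ...]}, each doc carrying a 'link_id'
    (hypothetical field) that ties the fragments of one JSON-LD together."""
    merged = {}
    for core, docs in responses.items():
        for doc in docs:
            # Group every fragment under the id that links the parts.
            merged.setdefault(doc["link_id"], {})[core] = doc
    return merged

# Simulated per-core results for the query "homo sapiens".
responses = {
    "datarecord": [{"link_id": "rec-1", "identifier": "SAMEA123456"}],
    "biochementity": [{"link_id": "rec-1", "name": "homo sapiens"}],
}
hits = merge_across_cores(responses)
```

The demerit shows up directly: one user query already required one request per core before the merge could even start.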

  2. The second approach was flattening the JSON-LD we are operating on. But for deeply nested JSON-LDs, flattening the nested documents is a challenge because we do not know how many parallel child nodes a particular node has (Point B), i.e., we do not know how many columns our table will have.
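
To make the column-explosion concrete, here is a minimal flattening sketch (the sample document and its field values are illustrative). List positions become part of the key, so every extra parallel child node (Point B) mints a brand-new "column":

```python
# Sketch of approach 2: flatten a nested JSON-LD document into dotted
# key/value pairs. The number of resulting keys is unbounded, because
# each list index becomes part of the key.
def flatten(node, prefix=""):
    flat = {}
    if isinstance(node, dict):
        for key, value in node.items():
            flat.update(flatten(value, f"{prefix}{key}."))
    elif isinstance(node, list):
        for i, item in enumerate(node):
            flat.update(flatten(item, f"{prefix}{i}."))
    else:
        flat[prefix.rstrip(".")] = node
    return flat

doc = {
    "@type": "DataRecord",
    "mainEntity": {
        "@type": ["BioChemEntity", "Sample"],
        "additionalProperty": [
            {"@type": "PropertyValue", "name": "organism", "value": "homo sapiens"},
            {"@type": "PropertyValue", "name": "tissue", "value": "liver"},
        ],
    },
}
flat = flatten(doc)
# Two parallel PropertyValue children already produce two disjoint key sets:
# "mainEntity.additionalProperty.0.*" and "mainEntity.additionalProperty.1.*".
```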

  3. The next approach I thought of was going schemaless [3] in Solr, i.e., starting Solr with no schema and creating Solr fields on the fly as new data fields are observed in the JSON-LD documents. This poses the same challenge as the previous point and may introduce too many columns dynamically.

  4. This is the best approach I have thought of so far (and the one I am currently working on): create a nested schema in Solr for handling nested JSON-LDs [4]. This handles both Point A and Point B.

The biosamples in [1] have the following schema:

SCHEMA 1:

JSONLD@type:DataRecord
|
|__Identifier
|__dateModified
|__dateCreated
|__isPartOf@type:Dataset
|  |
|  |__@id
|
|__datasetPartOf@type:Dataset
|  |
|  |__@id
|
|__mainEntity@type:['BioChemEntity', 'Sample']
   |
   |__dataset
   |__name
   |__URL
   |__identifiers
   |__additionalProperty@type:PropertyValue
      |
      |__name
      |__Value
      |__ValueReference
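
Under approach 4, a record following SCHEMA 1 would be indexed as a Solr parent/child document. The sketch below builds such a document using Solr's legacy `_childDocuments_` key for anonymous nested children; the field names, id scheme, and sample values are illustrative assumptions, not a fixed mapping:

```python
import itertools

# Running counter to mint unique child-document ids (illustrative scheme).
_counter = itertools.count(1)

def to_solr_doc(node, doc_id="doc"):
    """Turn a JSON-LD tree into a Solr nested document: nested dicts become
    anonymous child documents under "_childDocuments_"; scalars stay as
    (multi-valued) fields on the enclosing document."""
    solr_doc = {"id": doc_id}
    children = []
    for key, value in node.items():
        values = value if isinstance(value, list) else [value]
        for item in values:
            if isinstance(item, dict):
                children.append(to_solr_doc(item, f"{doc_id}.{key}.{next(_counter)}"))
            else:
                solr_doc.setdefault(key, []).append(item)
    if children:
        solr_doc["_childDocuments_"] = children
    return solr_doc

# A trimmed-down biosample shaped like SCHEMA 1 (values are made up).
biosample = {
    "@type": "DataRecord",
    "identifier": "SAMEA123456",
    "mainEntity": {
        "@type": ["BioChemEntity", "Sample"],
        "name": "homo sapiens sample",
        "additionalProperty": [
            {"@type": "PropertyValue", "name": "organism part", "value": "liver"},
        ],
    },
}
solr_doc = to_solr_doc(biosample, "biosample-1")
```

The depth of the tree (Point A) simply becomes depth of child documents, and any number of parallel children (Point B) becomes more entries in a `_childDocuments_` list.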

We can create a Solr core with this schema and we are done with biosamples indexing. However, what about new samples coming in, say from PDBe or Beacon? I have therefore started thinking of a more general indexer. Bioschemas.org has defined/redefined specifications for the following types - Beacon, DataCatalog, DataRecord, Dataset, Event, LabProtocol, Protein, ProteinAnnotation, ProteinStructure, Sample, Tool, TrainingMaterial, BioChemEntity. Can we define a nested schema for Solr containing all of these specifications? Yes!

SCHEMA 2:

Document
|
|__JSONLD@type:Beacon
|  |
|  |__.....
|
|__JSONLD@type:DataCatalog
|  |
|  |__.....
|
|__JSONLD@type:DataRecord
|  |
|  |__.....
|
|__JSONLD@type:Dataset
|  |
|  |__.....
|
|__JSONLD@type:Event
|  |
|  |__.....
|
|__JSONLD@type:Protein
|  |
|  |__.....
|
|__JSONLD@type:ProteinAnnotation
|  |
|  |__.....
|
|__JSONLD@type:ProteinStructure
|  |
|  |__.....
|
|__JSONLD@type:BioChemEntity
|  |
|  |__.....
|
|__JSONLD@type:LabProtocol
|  |
|  |__.....
|
|__JSONLD@type:Tool
|  |
|  |__.....
|
|__JSONLD@type:TrainingMaterial
   |
   |_.....

Any JSON-LD we try to index will be one of these types (assuming the user is only interested in bioschemas.org markup data). The main document further uses nodes of other types from bioschemas.org or schema.org (e.g., PropertyValue and CategoryCode from schema.org; BioChemEntity and LabProtocol from bioschemas.org).

Say we need to index a biosample (following SCHEMA 1) into a Solr core built upon SCHEMA 2. The biosample has multiple parts within it - DataRecord, Dataset, BioChemEntity. We can break the biosample into these three parts and send each part to its respective Solr field. The problem comes later: additionalProperty is of @type PropertyValue, which is described by schema.org but not separately by bioschemas.org. We have created fields only for the bioschemas.org specifications, so how do we accommodate the additionalProperty type in our current Solr core? In fact, this will be the case for many other incoming documents that use not only bioschemas.org specifications but also some schema.org specifications.
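
The splitting step above can be sketched as a walk over the tree that buckets every typed node by its @type. The record and its values are illustrative; the point is that a type like PropertyValue shows up in the buckets even though SCHEMA 2 has no field prepared for it:

```python
# Sketch of the splitting step: walk a JSON-LD tree and group every node
# that carries an "@type" into a bucket per type. Types outside the
# bioschemas.org list (e.g. schema.org's PropertyValue) land in buckets
# that the SCHEMA 2 core has no field for.
def collect_by_type(node, buckets=None):
    if buckets is None:
        buckets = {}
    if isinstance(node, dict):
        node_type = node.get("@type")
        if node_type is not None:
            # Multi-typed nodes like ['BioChemEntity', 'Sample'] get a joint key.
            key = node_type if isinstance(node_type, str) else "/".join(node_type)
            buckets.setdefault(key, []).append(node)
        for value in node.values():
            collect_by_type(value, buckets)
    elif isinstance(node, list):
        for item in node:
            collect_by_type(item, buckets)
    return buckets

# An illustrative biosample shaped like SCHEMA 1.
record = {
    "@type": "DataRecord",
    "isPartOf": {"@type": "Dataset", "@id": "https://example.org/dataset"},
    "mainEntity": {
        "@type": ["BioChemEntity", "Sample"],
        "additionalProperty": [
            {"@type": "PropertyValue", "name": "organism", "value": "homo sapiens"},
        ],
    },
}
buckets = collect_by_type(record)
# "PropertyValue" is now a bucket of its own - the part with no home in SCHEMA 2.
```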

The problem thus remains the same:

How do we create a Solr schema that can store deeply nested documents with a variable, and potentially very large, number of @type child nodes?

[1] - https://github.com/buzzbangorg/bsbang-crawler-ng/wiki/Thoughts-on-Solr-Indexing
[2] - http://bioschemas.org/specifications/
[3] - https://lucene.apache.org/solr/guide/7_0/schemaless-mode.html
[4] - https://www.slideshare.net/anshumg/working-with-deeply-nested-documents-in-apache-solr
