
Approaches to Solr Indexing

Justin Clark-Casey edited this page Jun 28, 2018 · 8 revisions

The JSON-LD that we have to index is a nested document, as we have seen in [1]. Nested trees have the following features:
Point A. The tree can go very deep, with new "@type"s being introduced at various depths.
Point B. A tree node can have an indefinite number of parallel child nodes (e.g. multiple properties inside an additionalProperty @type).
The example I gave earlier [1] shows both: the tree is five levels deep, with the leaf (last node) being @type:"CategoryCode" (Point A), and it has multiple child nodes under "mainEntity" > "additionalProperty" (Point B).

In Solr, we have to index each JSON-LD as one document. The approaches I have considered so far are:

  1. Create a separate Solr core for each type - Dataset, DataRecord, Beacon, BioChemEntity, etc. For a given JSON-LD, we break it into parts based on @type, put each part into the core for its type, and link the parts with a shared ID. So if I query for "homo sapiens", the query goes to all cores; if it gets a result X from some core C, we then retrieve the corresponding parts from the other cores using the linking ID. One benefit of this approach is that it indexes the data in the most structured way, because each core has a defined schema. It also has demerits: one query turns into multiple requests (to different cores), and how many cores are we prepared to create? The bioschemas.org [2] website has 12 specifications defined, i.e. at least 12 cores. Beyond those, schema.org itself has many more types, and making a separate core for each one complicates and slows things down.

Yeah, I'm not a fan of this approach because one will need to keep adding cores. I feel this can be done within one core with faceting or nested documents, but I can't tell you yet how to do that! (justincc)
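To make the fan-out-and-join cost of approach 1 concrete, here is a minimal sketch using in-memory dicts as stand-ins for per-type Solr cores. In a real deployment each lookup would be a query against a separate core (e.g. via Solr's JSON request API); the field names and the shared `link_id` are assumptions of this sketch, not anything defined by the project.

```python
# In-memory stand-ins for per-type Solr cores (approach 1). The field
# names and the shared "link_id" used to reassemble a document from its
# parts are hypothetical.
CORES = {
    "DataRecord": [
        {"id": "rec1", "link_id": "doc42", "identifier": "SAMEA123"},
    ],
    "BioChemEntity": [
        {"id": "ent1", "link_id": "doc42", "name": "homo sapiens"},
    ],
    "Dataset": [
        {"id": "ds1", "link_id": "doc42", "name": "biosamples"},
    ],
}

def search_all_cores(term):
    """Fan the query out to every core, then pull in the sibling parts
    that share a link_id with any hit (one extra request per core)."""
    hits = [doc for docs in CORES.values() for doc in docs
            if term in " ".join(str(v) for v in doc.values())]
    link_ids = {doc["link_id"] for doc in hits}
    # Second round trip: retrieve every part of each matched document.
    return {lid: [doc for docs in CORES.values() for doc in docs
                  if doc["link_id"] == lid]
            for lid in link_ids}

result = search_all_cores("homo sapiens")
```

Even in this toy version, one user query costs one search per core plus one retrieval pass per core, which is the multiple-requests demerit described above.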

  2. The second approach I thought of was flattening the JSON-LD we are operating on. But for deeply nested JSON-LDs, flattening is a challenge because we do not know how many parallel child nodes a particular node has (Point B), i.e. we do not know how many columns our table will have.

Yeah, this is going to be difficult. I feel that flattening itself is okay at this point, but as below, there is the problem of deciding which of those fields is the most important.
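For reference, flattening as in slide 8 of [4] amounts to collapsing the tree into dotted field paths, roughly as in this sketch (the path convention is an assumption; Solr itself doesn't mandate one):

```python
def flatten(node, prefix=""):
    """Collapse a nested JSON-LD tree into dotted field paths.
    Parallel child nodes (Point B) get a numeric index, which is exactly
    the unbounded-column problem described above."""
    flat = {}
    if isinstance(node, dict):
        for key, value in node.items():
            flat.update(flatten(value, f"{prefix}{key}."))
    elif isinstance(node, list):
        for i, value in enumerate(node):
            flat.update(flatten(value, f"{prefix}{i}."))
    else:
        flat[prefix.rstrip(".")] = node
    return flat

doc = {
    "@type": "DataRecord",
    "mainEntity": {
        "@type": ["BioChemEntity", "Sample"],
        "additionalProperty": [
            {"name": "organism", "value": "homo sapiens"},
            {"name": "sex", "value": "female"},
        ],
    },
}
flat = flatten(doc)
```

Each extra additionalProperty entry mints a new field path (`mainEntity.additionalProperty.0.value`, `...1.value`, ...), so the column set grows with the data rather than being fixed up front.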

  3. The next approach I thought of was going schemaless [3] in Solr, i.e. starting Solr with no schema and creating fields on the fly as new data fields are observed in the JSON-LD documents. This poses the same challenge as above and may introduce too many columns dynamically.

It might be possible if the JSON-LD in MongoDB is completely normalized and we ignore any fields we don't recognize as being part of that type. But I'm not sure this helps with the general problem of deciding which fields are relevant to the user's request.
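As a concrete picture of what approach 3 amounts to: schemaless indexing effectively names fields from the data plus a type suffix so that dynamic-field rules absorb them. The suffixes below (`*_s` for strings, `*_i` for ints, etc.) follow the dynamic fields shipped in Solr's default managed-schema, but check the actual schema in use; the renaming helper is a sketch, not Solr's own logic.

```python
# Map Python value types to Solr dynamic-field suffixes as found in the
# default managed-schema (*_s string, *_i int, *_d double, *_b boolean).
SUFFIX = {str: "_s", int: "_i", float: "_d", bool: "_b"}

def to_dynamic_fields(flat_doc):
    """Rename each flattened field with a type suffix so an unmodified
    dynamic-field schema can absorb it. Every new path in the JSON-LD
    still becomes a new column, so this only defers the problem."""
    out = {}
    for path, value in flat_doc.items():
        suffix = SUFFIX.get(type(value), "_s")
        out[path.replace(".", "_") + suffix] = value
    return out

fields = to_dynamic_fields({"mainEntity.name": "sample 1", "depth": 5})
```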

  4. This is the best approach I have come up with so far (and the one I am currently working on): create a nested schema in Solr for handling nested JSON-LDs [4]. This handles both Point A and Point B.

The biosamples in [1] have the following schema -

SCHEMA 1:

```
JSONLD@type:DataRecord
|
|__identifier
|__dateModified
|__dateCreated
|__isPartOf@type:Dataset
|  |
|  |__@id
|
|__datasetPartOf@type:Dataset
|  |
|  |__@id
|
|__mainEntity@type:['BioChemEntity', 'Sample']
   |
   |__dataset
   |__name
   |__url
   |__identifiers
   |__additionalProperty@type:PropertyValue
      |
      |__name
      |__value
      |__valueReference
```
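A tree like SCHEMA 1 can be expressed directly as Solr parent/child block documents, using the `_childDocuments_` key from Solr's JSON update format (the same block-join indexing covered in [4]). In this sketch the `type_` discriminator field, the IDs, and the field values are placeholders I've made up, not part of any agreed schema:

```python
import json

# A SCHEMA 1 biosample as a Solr parent/child block. _childDocuments_ is
# Solr's JSON update convention for nested children; "type_" is a
# hypothetical discriminator field and all values are placeholders.
biosample = {
    "id": "SAMEA123",
    "type_": "DataRecord",
    "identifier": "SAMEA123",
    "dateModified": "2018-01-01",
    "_childDocuments_": [
        {"id": "SAMEA123-entity", "type_": "BioChemEntity",
         "name": "homo sapiens sample",
         "_childDocuments_": [
             {"id": "SAMEA123-prop1", "type_": "PropertyValue",
              "name": "organism", "value": "homo sapiens"},
         ]},
        {"id": "SAMEA123-dataset", "type_": "Dataset",
         "url": "https://www.ebi.ac.uk/biosamples/samples"},
    ],
}

# This is roughly the payload one would POST to /solr/<core>/update
# with Content-Type: application/json.
payload = json.dumps([biosample])
```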

We can create a Solr core with this schema and we are done with biosamples indexing. But what about new sources coming in, say PDBe or a Beacon? I have now started thinking about a more general indexer. Bioschemas.org has defined or re-defined specifications for the following types: Beacon, DataCatalog, DataRecord, Dataset, Event, LabProtocol, Protein, ProteinAnnotation, ProteinStructure, Sample, Tool, TrainingMaterial, BioChemEntity. Can we define a nested Solr schema containing all of these specifications? Yes!

SCHEMA 2:

```
Document
|
|__JSONLD@type:Beacon
|  |
|  |__.....
|
|__JSONLD@type:DataCatalog
|  |
|  |__.....
|
|__JSONLD@type:DataRecord
|  |
|  |__.....
|
|__JSONLD@type:Dataset
|  |
|  |__.....
|
|__JSONLD@type:Event
|  |
|  |__.....
|
|__JSONLD@type:Protein
|  |
|  |__.....
|
|__JSONLD@type:ProteinAnnotation
|  |
|  |__.....
|
|__JSONLD@type:ProteinStructure
|  |
|  |__.....
|
|__JSONLD@type:BioChemEntity
|  |
|  |__.....
|
|__JSONLD@type:LabProtocol
|  |
|  |__.....
|
|__JSONLD@type:Tool
|  |
|  |__.....
|
|__JSONLD@type:TrainingMaterial
   |
   |__.....
```

Any JSON-LD we try to index will be of one of these types (assuming the user is only interested in bioschemas.org markup). This main document then uses nodes of other types from bioschemas.org or schema.org (e.g. PropertyValue and CategoryCode from schema.org; BioChemEntity and LabProtocol from bioschemas.org).

Say we need to index a biosample (following SCHEMA 1) in a Solr core built upon SCHEMA 2. The biosample has multiple parts within it - DataRecord, Dataset, BioChemEntity. We can break the biosample into these three parts and send each part to its respective Solr field. The problem comes later: additionalProperty is of type PropertyValue, which is described by schema.org but not separately by bioschemas.org. We have created fields only for the bioschemas.org specifications, so how do we accommodate the PropertyValue type in our current Solr core? This will be the case for many other incoming documents, which use not only bioschemas.org specifications but also schema.org ones.
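The break-it-into-parts step can be sketched as a walk that collects every typed subtree, which also makes the PropertyValue problem visible: any @type without a matching Solr field falls out as "unknown". The set of known types below is illustrative, not a finished schema.

```python
def split_by_type(node, parts=None):
    """Walk a JSON-LD tree and collect every typed subtree under its
    @type, so each part can be routed to the matching Solr field. Types
    we have no field for (e.g. PropertyValue) surface here and have to
    be handled case by case."""
    if parts is None:
        parts = {}
    if isinstance(node, dict):
        t = node.get("@type")
        if t is not None:
            key = t if isinstance(t, str) else "/".join(t)
            parts.setdefault(key, []).append(node)
        for value in node.values():
            split_by_type(value, parts)
    elif isinstance(node, list):
        for value in node:
            split_by_type(value, parts)
    return parts

doc = {
    "@type": "DataRecord",
    "mainEntity": {
        "@type": ["BioChemEntity", "Sample"],
        "additionalProperty": [{"@type": "PropertyValue", "name": "organism"}],
    },
}
parts = split_by_type(doc)
# Types with a field in the SCHEMA 2 core (illustrative subset).
known = {"DataRecord", "Dataset", "BioChemEntity/Sample"}
unknown = set(parts) - known
```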

The problem thus remains the same:

How do we create a Solr schema that can store deeply nested documents with a variable, potentially large number of child nodes of differing @types?

[1] - https://github.com/buzzbangorg/bsbang-crawler-ng/wiki/Thoughts-on-Solr-Indexing
[2] - http://bioschemas.org/specifications/
[3] - https://lucene.apache.org/solr/guide/7_0/schemaless-mode.html
[4] - https://www.slideshare.net/anshumg/working-with-deeply-nested-documents-in-apache-solr

Response by justincc

Yeah, you're coming up to the really hard part of the Bioschemas search problem - how do we do search over the markup and get relevant results? And honestly, it's not a problem that I've spent much time thinking about. I really don't expect you to solve it in the GSoC timeframe - it's the kind of problem people spend years thinking about.

My hope was that we could start with a naive approach that yields results that are good enough. Initially, I thought we could flatten the entire JSON-LD structure before inserting it as a single document in Solr (as per slide 8 in [4]), then give greater weight to fields that are more likely to be relevant in search results, such as name and description.
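In Solr terms, that weighting idea maps onto the eDisMax query parser's `qf` parameter, where each field gets a boost factor. The field names and boost numbers in this sketch are placeholders to be tuned against real data, not recommendations:

```python
from urllib.parse import urlencode

# The naive weighting idea as Solr query parameters: edismax searches a
# flattened document across several fields with boosts, so a hit on
# "name" outranks a hit on a deeply nested value. Field names and boost
# factors here are placeholders.
params = {
    "defType": "edismax",
    "q": "homo sapiens",
    "qf": "name^5 description^2 mainEntity_name^3 text",
}
query_string = urlencode(params)
# Append to e.g. /solr/<core>/select? to run the search.
```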

But as you've seen, this has various problems. For instance, how do you distinguish which name field is the most important in a flattened document? For example, name fields in the DataRecord, the BioChemEntity and the PropertyValue in a single structure. And not all structures are the same - the Biosamples example in [1] has DataRecord as the top field, but other pages may simply have a bare BioChemEntity.

I think here, one might inspect the JSON-LD and somehow prioritize name fields that were in a BioChemEntity structure no matter where they were in the hierarchy (but not too deep). Of course, that might not be what the user wants, maybe they're really looking for dataset entries with that string (e.g. searching for "dataset xxxx"). This is another hard problem - one might need to compromise on the user interface by not having a simple search box, but rather forcing the user to choose a facet (are you looking for datasets, genes, events, etc?).

I like the nesting stuff in [4], though, and if it makes sense to pursue it then please do. You may also want to check out the Solr graph query parser which looks like a different take on the same idea.

For fields that are in schema.org but not in Bioschemas, like PropertyValue, I think the only thing to do is to add them on a case-by-case basis.

Eventually, one probably does want to search this stuff as part of a graph - if a user finds a BioSamples entity then we want to link back to the Dataset it comes from (though admittedly for biosamples, it's just one big dataset). In a sense, the websites themselves are already giving us this graph - BioSamples will (hopefully!) have only one Dataset embedded somewhere, and we can retrieve it by looking for Datasets with the @id "https://www.ebi.ac.uk/biosamples/samples" in MongoDB. This might remove the need for an explicit graph, such as something in Neo4j - I don't know yet.
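That @id-based link-back could be as simple as the following sketch, with a plain list standing in for the MongoDB collection of crawled JSON-LD (in production it would be a `find()` on that collection; the document shapes are a guess):

```python
# Stand-in for the crawled JSON-LD collection in MongoDB. The @id is the
# one mentioned above; document shapes are illustrative.
jsonld_collection = [
    {"@type": "Dataset",
     "@id": "https://www.ebi.ac.uk/biosamples/samples",
     "name": "BioSamples"},
    {"@type": "DataRecord",
     "isPartOf": {"@id": "https://www.ebi.ac.uk/biosamples/samples"}},
]

def dataset_for(record):
    """Follow a record's isPartOf @id back to its parent Dataset node,
    i.e. the implicit graph edge the page already gives us."""
    target = record["isPartOf"]["@id"]
    return next(doc for doc in jsonld_collection
                if doc.get("@type") == "Dataset" and doc.get("@id") == target)

parent = dataset_for(jsonld_collection[1])
```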

I'm not sure that helps or answers your question! Please keep asking stuff. I guess what I'm saying is that it's worth investigating interesting things like Solr nested documents but in the GSoC timeframe it's fine to have something messy and not particularly accurate, like a completely collapsed Document, just for Biosamples if necessary. However, there's still a question as to whether even that is sufficient for returning minimally useful search results, particularly when one does go beyond a single website. It's also okay to ignore some values for now (e.g. PropertyValue) or lose the generality of the search interface (e.g. force the user to choose whether they are looking for a Dataset, BioChemEntity, etc.) if that's what it takes to have something that works.
