As per #7, I have been looking at canonicalization of all data that goes into MongoDB.
The obvious way to do this is by expanding the JSON-LD, so that all variation disappears (e.g. some markup uses full URLs in keys, some uses a context, and various contexts use different prefixes for the same namespaces). See the JSON-LD playground for an illustration. All other approaches (flattening, etc.) require a context, and we would need a uniform context across all the JSON-LD crawled, which I feel would be very difficult.
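As a minimal sketch of what expansion buys us (the sample document and its context are invented for illustration):

```python
from pyld import jsonld

# A hypothetical fragment of crawled markup, using a prefix context.
doc = {
    "@context": {"schema": "http://schema.org/"},
    "@type": "schema:Dataset",
    "schema:name": "Organism"
}

expanded = jsonld.expand(doc)
# All context-dependent variation is gone; every key is a full URL:
# [{'@type': ['http://schema.org/Dataset'],
#   'http://schema.org/name': [{'@value': 'Organism'}]}]
```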
However, it turns out that whilst MongoDB allows periods in keys, they prevent query navigation from working, because the dot notation treats every period as a path separator. This is a major problem for doing any queries, e.g. counting all the crawled entities of type Sample: db.samples.find({ 'mainEntity.@type': 'Sample'}).count().
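To make the failure concrete, a rough pymongo sketch (database and collection names are illustrative):

```python
from pymongo import MongoClient

samples = MongoClient().crawl.samples

# Dot notation works fine when no key contains a period:
samples.count_documents({'mainEntity.@type': 'Sample'})

# But once keys are expanded URLs, a query like
#     {'http://schema.org/isPartOf.@id': '...'}
# is split on every period, so MongoDB looks for the path
# ['http://schema', 'org/isPartOf', '@id'] and never matches
# the literal key 'http://schema.org/isPartOf'.
```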
How to solve this? So far I've thought of five approaches (one with a variant):
1) Replace all periods in key names with a character not valid in URLs
I think the major candidates are <, > or ^, so that
http://schema.org/isPartOf
becomes
http://schema^org/isPartOf
Issues
- Fugly
- All queries to the database will need to substitute ^ for . on the way in, and . for ^ on the way out (see the sketch below). This could be reduced by using a standard context for all normalized markup to remove some of the full-URL keys, e.g. for schema.org
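A rough illustration of the mangling involved (the function names and the choice of ^ are just for the sketch):

```python
def mangle_keys(value):
    """Recursively replace '.' with '^' in every key before storing.

    Only safe because '^' is not a valid URL character, so it cannot
    already occur in an expanded JSON-LD key.
    """
    if isinstance(value, dict):
        return {k.replace('.', '^'): mangle_keys(v) for k, v in value.items()}
    if isinstance(value, list):
        return [mangle_keys(v) for v in value]
    return value

def demangle_keys(value):
    """Inverse transformation, applied to every document read back."""
    if isinstance(value, dict):
        return {k.replace('^', '.'): demangle_keys(v) for k, v in value.items()}
    if isinstance(value, list):
        return [demangle_keys(v) for v in value]
    return value
```

Query keys then navigate over the mangled names with ordinary dot notation, e.g. {'http://schema^org/isPartOf.@id': ...}, which is exactly the substitution burden described above.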
1b) Accept that we won't be able to navigate through keys that contain periods.
Issues
- This will severely curtail the querying that can be done
- Some MongoDB drivers don't support storing such keys at all (e.g. pymongo rejects keys containing periods by default)
2) Use a transformation of the JSON-LD which doesn't require periods in key names or a context.
The pyld library can generate a normalized RDF dataset representation like this:
```
{
    "_id" : ObjectId("5b8802c97a05a824a3cf030b"),
    "schema" : {
        "@default" : [
            {
                "subject" : {
                    "type" : "IRI",
                    "value" : "https://www.ebi.ac.uk/biosamples/samples"
                },
                "predicate" : {
                    "type" : "IRI",
                    "value" : "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
                },
                "object" : {
                    "type" : "IRI",
                    "value" : "http://schema.org/Dataset"
                }
            },
            {
                "subject" : {
                    "type" : "blank node",
                    "value" : "_:c14n0"
                },
                "predicate" : {
                    "type" : "IRI",
                    "value" : "http://schema.org/name"
                },
                "object" : {
                    "type" : "literal",
                    "datatype" : "http://www.w3.org/2001/XMLSchema#string",
                    "value" : "Organism"
                }
            },
            ...
```
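Something along these lines produces it (a sketch; as I understand pyld, when no output format is requested, normalize returns the dataset as a dict of quads rather than as N-Quads text):

```python
from pyld import jsonld

doc = {
    "@context": {"schema": "http://schema.org/"},
    "@type": "schema:Dataset",
    "schema:name": "Organism"
}

# URDNA2015 is the canonicalization algorithm; blank nodes come out
# with stable _:c14n0-style labels, as in the document above.
dataset = jsonld.normalize(doc, {'algorithm': 'URDNA2015'})
# dataset == {'@default': [{'subject': {...}, 'predicate': {...},
#                           'object': {...}}, ...]}
```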
Issues
- Extremely fugly
- Possibly inefficient
- A pain to query (see the sketch below)
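For instance, the Sample count from earlier becomes something like this (a pymongo sketch against the structure above; the collection name is illustrative):

```python
from pymongo import MongoClient

samples = MongoClient().crawl.samples

RDF_TYPE = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type'

# Counting entities of type Sample now means matching individual
# quads inside the '@default' array instead of navigating keys.
samples.count_documents({
    'schema.@default': {
        '$elemMatch': {
            'predicate.value': RDF_TYPE,
            'object.value': 'http://schema.org/Sample',
        }
    }
})
```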
3) Abandon the idea of canonicalization
Issues
- This will make certain queries effectively impossible, due to the performance cost of extracting every JSON-LD document and querying across them all
- It doesn't solve the problem of some websites' JSON-LD having periods in keys, where they aren't using a context for everything
4) Use a different document database
This is a fairly extreme response but would avoid the issues of mangling periods or similar. My primary candidate would be PostgreSQL, which has JSON handling capabilities (a sketch of the syntax follows the issues list below).
Issues
- Changing the crawl database will be a pain, though hopefully not too bad since I think only buzzbang-ng and indexer are using it at the moment (?)
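A rough sketch of what the queries could look like (the table and column names are hypothetical; this assumes the crawled documents land in a JSONB column):

```python
import psycopg2

conn = psycopg2.connect('dbname=crawl')
cur = conn.cursor()

# JSONB path operators treat keys as opaque strings, so periods
# in keys are harmless and need no mangling.
cur.execute("""
    SELECT count(*)
    FROM pages
    WHERE data -> 'mainEntity' ->> '@type' = 'Sample'
""")
print(cur.fetchone()[0])

# A full-URL key works just as well:
cur.execute("""
    SELECT count(*)
    FROM pages
    WHERE data -> 'http://schema.org/isPartOf' IS NOT NULL
""")
```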
5) Put the JSON-LD into a triplestore instead
Issues
- This is a different storage paradigm, so more rework is required than for switching to a different document database. Probably still not all that much, though it will be more of a pain to store metadata, since you have to bend your brain into the less intuitive way that RDF wants to do things
- Triplestore development is stagnant.
- This is probably what most people would suggest, but the lack of triplestore development really puts me off.
Currently
I'm inclined towards option 4 - changing the document database to PostgreSQL. I don't think this is yet as major a disruption as it might seem. This may be combined with putting schema.org in the standard context to cut down on the length of keys, maybe even largely eliminating the full-URL keys.
Option 1 is also a possibility, though I intuitively don't like all the mangling that would be involved.