Statistical Criteria
LODStats analyzes RDF datasets from different CKAN instances according to a set of configured criteria, using a stream-based approach. Datasets from CKAN are available either serialised as a file (in RDF/XML, N-Triples and other formats) or via SPARQL endpoints. Serialised datasets containing more than a few million triples (i.e. data items) tend to be too large for most existing analysis approaches, because the dataset or its representation as a graph exceeds the available main memory, where the complete dataset is commonly stored for statistical processing. LODStats' main advantage over existing approaches is its superior performance, especially for large datasets with many millions of triples, while it remains straightforward to extend with new analytical criteria.
The descriptions of the existing criteria below re-use the following notation:
//A triple is given by
s p o //wherein "s" is a subject, "p" a predicate and "o" an object
//A triple pattern is given by
?s ?p ?o //wherein ?s is a variable for the subject, ?p for the predicate and ?o for the object
G = directed graph,
M = map,
S = set,
i, len = integer.
+= and ++ denote standard additions on those structures,
i.e. adding edges to a graph, increasing the value of the key of a map,
adding elements to a set and incrementing an integer value.
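The notation above maps naturally onto a small streaming loop. The following is a minimal sketch, not LODStats' actual code: the class names `Criterion`, `Counting`, `Triples`, `SubclassUsage` and the function `run` are hypothetical, and triples are assumed to arrive as already-parsed `(s, p, o)` string tuples.

```python
from typing import Iterable, List, Tuple

Triple = Tuple[str, str, str]  # (s, p, o); parsing is assumed to happen elsewhere


class Criterion:
    """Hypothetical base class: a filter rule, an action and a result."""

    def accepts(self, s: str, p: str, o: str) -> bool:
        return True  # no filter rule: every triple is accepted

    def action(self, s: str, p: str, o: str) -> None:
        raise NotImplementedError

    def result(self):
        raise NotImplementedError


class Counting(Criterion):
    """Shared 'i++' action for criteria that only count accepted triples."""

    def __init__(self) -> None:
        self.i = 0

    def action(self, s, p, o):
        self.i += 1

    def result(self):
        return self.i


class Triples(Counting):
    pass  # criterion 14: count every triple


class SubclassUsage(Counting):
    def accepts(self, s, p, o):
        return p == "rdfs:subClassOf"  # criterion 13


def run(triples: Iterable[Triple], criteria: List[Criterion]) -> list:
    # One pass over the stream; only the criteria's own state is kept in memory.
    for s, p, o in triples:
        for c in criteria:
            if c.accepts(s, p, o):
                c.action(s, p, o)
    return [c.result() for c in criteria]
```

Because `run` consumes any iterable, the triples can come from a generator that reads a file line by line, so the dataset itself never has to fit in memory.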
###1. Used Classes
This criterion creates the set of classes that are used by instances in the analyzed dataset. An example of a triple accepted by the filter is aksw:Ivan rdf:type lodstats:Developer. If a triple is accepted, its object (the IRI of the class) is added to the set of classes.
Filter rule
?p=rdf:type && isIRI(?o)
Action
S += ?o
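As an illustration, the filter rule and action could look as follows in Python. This is a sketch under assumptions: terms are taken to be N-Triples-style strings (IRIs written as `<...>`), and `is_iri` and `used_classes` are hypothetical helper names.

```python
RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"


def is_iri(term: str) -> bool:
    # Assumption: terms are N-Triples-style strings, so IRIs look like <...>.
    return term.startswith("<") and term.endswith(">")


def used_classes(triples):
    classes = set()                       # S
    for s, p, o in triples:
        if p == RDF_TYPE and is_iri(o):   # filter rule
            classes.add(o)                # S += ?o
    return classes
```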
###2. Class usage count
To count how often each class of a dataset is used, the same filter rule as in the first criterion is applied. The action maintains a map with class IRIs as keys and their usage counts as values. If a triple conforms to the filter rule, the value for its class is increased by one. The postprocessing step returns the 100 most-used classes of the dataset.
Filter rule
?p=rdf:type && isIRI(?o)
Action
M[?o]++
Postprocessing
top(M,100)
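One way to sketch the map and the `top(M,100)` step in Python is `collections.Counter`, whose `most_common` method returns the entries with the highest counts. The term representation (N-Triples-style strings) and the names `is_iri` and `class_usage` are assumptions made for this example.

```python
from collections import Counter

RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"


def is_iri(term: str) -> bool:
    # Assumption: IRIs are written N-Triples style, i.e. <...>.
    return term.startswith("<") and term.endswith(">")


def class_usage(triples, n=100):
    usage = Counter()                     # M
    for s, p, o in triples:
        if p == RDF_TYPE and is_iri(o):   # filter rule
            usage[o] += 1                 # M[?o]++
    return usage.most_common(n)           # top(M, n)
```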
###3. Classes defined
This criterion collects the set of classes that are defined within a dataset. In RDFS and OWL, a class is usually defined by a triple with the predicate rdf:type and either rdfs:Class or owl:Class as object. The filter rule shows the condition used to analyze the triple. If the triple is accepted by the rule, the IRI used as subject is added to the set of classes.
Filter rule
?p=rdf:type && isIRI(?s) && (?o=rdfs:Class || ?o=owl:Class)
Action
S += ?s
###4. Class hierarchy depth
Description is coming soon
Filter rule
?p = rdfs:subClassOf && isIRI(?s) && isIRI(?o)
Action
G += (?s,?o)
Postprocessing
hasCycles(G) ? inf. : depth(G)
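Since the description is still pending, the following is only a sketch of one plausible reading of the postprocessing step: `depth(G)` is assumed to be the number of edges on the longest `rdfs:subClassOf` chain, and a cycle yields infinity. The function name `hierarchy_depth` is hypothetical; the same routine would apply to the `rdfs:subPropertyOf` graph of criterion 12.

```python
import math
from collections import defaultdict


def hierarchy_depth(edges):
    """Longest chain (counted in edges) in the graph, or math.inf on a cycle."""
    graph = defaultdict(set)              # G
    for sub, sup in edges:
        graph[sub].add(sup)               # G += (?s, ?o)

    WHITE, GREY, BLACK = 0, 1, 2          # unvisited / on current path / finished
    state = defaultdict(int)
    longest = {}

    def visit(node):
        if state[node] == GREY:           # node is already on the current path
            raise ValueError("cycle")
        if state[node] == BLACK:
            return longest[node]
        state[node] = GREY
        best = 0
        for parent in graph.get(node, ()):
            best = max(best, 1 + visit(parent))
        state[node] = BLACK
        longest[node] = best
        return best

    try:
        return max((visit(n) for n in list(graph)), default=0)
    except ValueError:
        return math.inf                   # hasCycles(G)
```

The recursive depth-first search keeps the sketch short; for very deep hierarchies an iterative traversal would avoid Python's recursion limit.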
###5. Property usage
This criterion counts how often each property is used in triples. For this purpose a map is created with property IRIs as keys. When a triple is analyzed, its predicate is added to the map (if it is not there already) and the corresponding value (usage count) is increased by one. The postprocessing step returns the 100 most-used predicates of the dataset.
Action
M[?p]++
Postprocessing
top(M,100)
###6. Property usage distinct per subject
Description is coming soon
Action
M[?s] += ?p
Postprocessing
sum(M)
###7. Property usage distinct per object
Description is coming soon
Action
M[?o] += ?p
Postprocessing
sum(M)
###8. Properties distinct per subject
Description is coming soon
Action
M[?s] += ?p
Postprocessing
sum(M)/size(M)
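The description of this criterion is still pending, so the sketch below rests on an interpretation: `M[?s] += ?p` is read as adding the predicate to a per-subject set, `sum(M)` as the total size of all those sets and `size(M)` as the number of subjects. The function name is hypothetical. Criterion 6 would use the same map and return only the sum; criteria 7 and 9 would key the map by object instead.

```python
from collections import defaultdict


def avg_distinct_properties_per_subject(triples):
    props = defaultdict(set)              # M: subject -> set of predicates
    for s, p, o in triples:
        props[s].add(p)                   # M[?s] += ?p
    if not props:
        return 0.0                        # assumption: empty input yields 0
    total = sum(len(v) for v in props.values())   # sum(M)
    return total / len(props)             # sum(M) / size(M)
```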
###9. Properties distinct per object
Description is coming soon
Action
M[?o] += ?p
Postprocessing
sum(M)/size(M)
###10. Outdegree
Description is coming soon
Action
M[?s]++
Postprocessing
sum(M)/size(M)
###11. Indegree
Description is coming soon
Action
M[?o]++
Postprocessing
sum(M)/size(M)
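Read literally, the outdegree and indegree criteria count triples per subject and per object and then average those counts. A minimal sketch of that reading, with the hypothetical function name `average_degrees`, could be:

```python
from collections import Counter


def average_degrees(triples):
    out_deg, in_deg = Counter(), Counter()
    for s, p, o in triples:
        out_deg[s] += 1                   # criterion 10: M[?s]++
        in_deg[o] += 1                    # criterion 11: M[?o]++

    def mean(counts):
        # sum(M) / size(M); assumption: an empty map yields 0
        return sum(counts.values()) / len(counts) if counts else 0.0

    return mean(out_deg), mean(in_deg)
```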
###12. Property hierarchy depth
Description is coming soon
Filter rule
?p=rdfs:subPropertyOf && isIRI(?s) && isIRI(?o)
Action
G += (?s,?o)
Postprocessing
hasCycles(G) ? inf. : depth(G)
###13. Subclass usage
Description is coming soon
Filter rule
?p = rdfs:subClassOf
Action
i++
###14. Triples
This criterion measures the number of triples in a dataset: for every analyzed triple, a counter is increased by one.
Action
i++
###15. Entities mentioned
This criterion counts the entities (resources / IRIs) that are mentioned within a dataset. The action extracts all IRIs from the analyzed triple (iris({?s,?p,?o})) and increases a counter by the number of extracted IRIs.
Action
i+=size(iris({?s,?p,?o}))
###16. Distinct entities
To get the set of distinct entities of a dataset, all IRIs are extracted from each triple and added to the set of entities. Because a set holds each element only once, adding an IRI (entity) that is already present does not create a duplicate.
Action
S+=iris({?s,?p,?o})
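The two entity criteria can share one pass over the data. In the sketch below the helper names `is_iri`, `iris` and `entity_stats` are hypothetical, terms are assumed to be N-Triples-style strings, and the braces in `iris({?s,?p,?o})` are taken literally as a set, so an IRI that appears twice in the same triple is counted once for that triple.

```python
def is_iri(term: str) -> bool:
    # Assumption: IRIs are written N-Triples style, i.e. <...>.
    return term.startswith("<") and term.endswith(">")


def iris(terms):
    return {t for t in terms if is_iri(t)}


def entity_stats(triples):
    mentioned = 0                         # i   (criterion 15)
    distinct = set()                      # S   (criterion 16)
    for s, p, o in triples:
        found = iris({s, p, o})
        mentioned += len(found)           # i += size(iris({?s,?p,?o}))
        distinct |= found                 # S += iris({?s,?p,?o})
    return mentioned, distinct
```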
###17. Literals
To count the triples that attach a literal to a subject, the filter rule below is applied to each triple. If the object of a triple is recognized as a literal, a counter is increased by one.
Filter rule
isLiteral(?o)
Action
i++
###18. Blanknodes as subject
To count the blank nodes used as subjects, the filter rule below is applied to each triple. If the subject of a triple is recognized as a blank node, a counter is increased by one.
Filter rule
isBlank(?s)
Action
i++
###19. Blanknodes as object
To count the blank nodes used as objects, the filter rule below is applied to each triple. If the object of a triple is recognized as a blank node, a counter is increased by one.
Filter rule
isBlank(?o)
Action
i++
###20. Datatypes
In RDFS and OWL, literals used as objects of triples can be specified more precisely. On the one hand, a datatype can be declared using ^^ followed by a datatype IRI, for example: aksw:Ivan foaf:name "Ivan"^^xsd:string. On the other hand, the language of the literal can be declared using an @ followed by a language tag, for example: aksw:AKSW foaf:name "Agile Knowledge Engineering and Semantic Web research Group"@en. If a language tag exists, the datatype is inferred automatically as xsd:string.
To get the datatypes used in a dataset and their respective counts, the filter rule below is applied to each triple. If the object of a triple is recognized as a literal, its datatype is extracted, as shown in the Action step. The extracted datatype is added to a map (if it is not there already) and its usage counter is increased by one. If the literal does not carry a datatype, nothing happens and the next triple is analyzed.
Filter rule
isLiteral(?o)
Action
M[type(?o)]++
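A small sketch of the datatype count follows. The `Literal` tuple is a hypothetical stand-in for whatever literal type the RDF parser provides; `datatype_usage` is likewise an illustrative name. Literals without a datatype are skipped, as the description states.

```python
from collections import Counter
from typing import NamedTuple, Optional


class Literal(NamedTuple):
    # Hypothetical literal representation used only for this example.
    value: str
    datatype: Optional[str] = None
    lang: Optional[str] = None


def datatype_usage(triples):
    usage = Counter()                                 # M
    for s, p, o in triples:
        if isinstance(o, Literal) and o.datatype is not None:   # isLiteral(?o)
            usage[o.datatype] += 1                    # M[type(?o)]++
    return usage
```

The language criterion would follow the same pattern, keyed on `o.lang` instead of `o.datatype`.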
###21. Languages
In RDFS and OWL, literals used as objects of triples can be specified more precisely. On the one hand, a datatype can be declared using ^^ followed by a datatype IRI, for example: aksw:Ivan foaf:name "Ivan"^^xsd:string. On the other hand, the language of the literal can be declared using an @ followed by a language tag, for example: aksw:AKSW foaf:name "Agile Knowledge Engineering and Semantic Web research Group"@en. If a language tag exists, the datatype is inferred automatically as xsd:string.
To get the language tags used in a dataset and their respective counts, the filter rule below is applied to each triple. If the object of a triple is recognized as a literal, its language tag is extracted, as shown in the Action step. The extracted language tag is added to a map (if it is not there already) and its usage counter is increased by one. If the literal does not carry a language tag, nothing happens and the next triple is analyzed.
Filter rule
isLiteral(?o)
Action
M[language(?o)]++
###22. Average typed string length
Description is coming soon
Filter rule
isLiteral(?o) && datatype(?o)=xsd:string
Action
i++;
len+=len(?o)
Postprocessing
len/i
###23. Average untyped string length
Description is coming soon
Filter rule
isLiteral(?o) && datatype(?o) = NULL
Action
i++;
len+=len(?o)
Postprocessing
len/i
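Both string-length criteria keep a counter and a running length and divide at the end. The sketch below computes the typed and the untyped average in one pass; the `Literal` tuple and the function name are hypothetical, and returning `None` when no matching literal was seen is an assumption made to avoid a division by zero.

```python
from typing import NamedTuple, Optional

XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"


class Literal(NamedTuple):
    # Hypothetical literal representation used only for this example.
    value: str
    datatype: Optional[str] = None


def average_string_lengths(triples):
    typed_i = typed_len = untyped_i = untyped_len = 0
    for s, p, o in triples:
        if not isinstance(o, Literal):    # isLiteral(?o)
            continue
        if o.datatype == XSD_STRING:      # criterion 22
            typed_i += 1
            typed_len += len(o.value)
        elif o.datatype is None:          # criterion 23
            untyped_i += 1
            untyped_len += len(o.value)
    typed_avg = typed_len / typed_i if typed_i else None
    untyped_avg = untyped_len / untyped_i if untyped_i else None
    return typed_avg, untyped_avg         # len / i for each criterion
```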
###24. Typed subjects
Description is coming soon
Filter rule
?p = rdf:type
Action
i++
###25. Labeled subjects
Description is coming soon
Filter rule
?p = rdfs:label
Action
i++
###26. Usage of owl:sameAs
Description is coming soon
Filter rule
?p = owl:sameAs
Action
i++
###27. Links
Description is coming soon
Filter rule
ns(?s) != ns(?o)
Action
M[ns(?s)+ns(?o)]++
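The rule above depends on a namespace function `ns` that is not defined on this page. The sketch below assumes one common convention, namely that the namespace is the IRI up to and including its last `#` or `/`. It also assumes that only IRI-to-IRI triples count as links and uses a `(ns(?s), ns(?o))` tuple as map key in place of the concatenation. All helper names are hypothetical.

```python
from collections import Counter


def is_iri(term) -> bool:
    # Assumption: IRIs are plain strings starting with an http(s) scheme.
    return isinstance(term, str) and term.startswith(("http://", "https://"))


def ns(iri: str) -> str:
    # Assumed convention: cut after the last '#' or '/'.
    cut = max(iri.rfind("#"), iri.rfind("/"))
    return iri[:cut + 1] if cut >= 0 else iri


def link_counts(triples):
    links = Counter()                     # M
    for s, p, o in triples:
        if is_iri(s) and is_iri(o) and ns(s) != ns(o):   # filter rule
            links[(ns(s), ns(o))] += 1    # M[ns(?s)+ns(?o)]++
    return links
```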
###28. Maximum per property {int,float,time}
Description is coming soon
Filter rule
datatype(?o)={xsd:int|xsd:float|xsd:dateTime}
Action
M[?p]=max(M[?p],?o)
###29. Average per property {int,float,time}
Description is coming soon
Filter rule
datatype(?o)={xsd:int|xsd:float|xsd:dateTime}
Action
M[?p]+=?o;
M2[?p]++
Postprocessing
M[?p]/M2[?p]
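The maximum and the average per property can be tracked side by side. The following sketch handles only `xsd:int` and `xsd:float`; `xsd:dateTime` values are left out because averaging them would first need a conversion to a numeric timestamp. The `Literal` tuple, the `PARSERS` table and the function name are hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple, Optional

XSD = "http://www.w3.org/2001/XMLSchema#"
PARSERS = {XSD + "int": int, XSD + "float": float}   # datatype IRI -> parser


class Literal(NamedTuple):
    # Hypothetical literal representation used only for this example.
    value: str
    datatype: Optional[str] = None


def per_property_stats(triples):
    total = defaultdict(float)            # M  in criterion 29
    count = defaultdict(int)              # M2 in criterion 29
    maximum = {}                          # M  in criterion 28
    for s, p, o in triples:
        parse = PARSERS.get(o.datatype) if isinstance(o, Literal) else None
        if parse is None:                 # filter rule not satisfied
            continue
        value = parse(o.value)
        maximum[p] = max(maximum.get(p, value), value)   # M[?p] = max(M[?p], ?o)
        total[p] += value                 # M[?p] += ?o
        count[p] += 1                     # M2[?p]++
    # Postprocessing: (maximum, average) per property
    return {p: (maximum[p], total[p] / count[p]) for p in count}
```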
###30. Subject vocabularies
Description is coming soon
Action
M[ns(?s)]++
###31. Predicate vocabularies
Description is coming soon
Action
M[ns(?p)]++
###32. Object vocabularies
Description is coming soon
Action
M[ns(?o)]++