
Database Information

Bernice-B edited this page Jun 30, 2016 · 25 revisions

Graph Databases

If you have any database experience, you have most likely worked with relational databases, which are more common than graph databases. However, relational databases have limitations (such as the difficulty of querying complex, highly connected information stored in columns), and graph databases are becoming increasingly popular as a result. Graph databases use graph structures such as nodes, edges, and properties to represent and store data. They can scale well to large data sets because they do not require large join operations. Because of their flexibility, they are well suited to data sets whose schemas are still evolving. They are often very fast at graph-like operations such as computing the shortest path between two nodes (or genomes).

A graph database is any storage system that provides index-free adjacency. This means that every element in the database contains a direct link to its adjacent elements, so no index lookups are required: every node knows which node or nodes it is connected to. Each such connection is called an edge. Because of this design, graph databases can use graph theory to find the connections between nodes very rapidly.
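
As a rough illustration of index-free adjacency, the sketch below stores direct neighbour references on each node and finds a shortest path by following them. The `Node` class, the `shortest_path` function, and the four-node graph are hypothetical and only model the idea; they are not how Blazegraph or any particular graph database is implemented.

```python
from collections import deque


class Node:
    """Toy node that holds direct references to its neighbours."""

    def __init__(self, name):
        self.name = name
        self.neighbours = []  # direct links to adjacent Node objects

    def connect(self, other):
        # Add an undirected edge between the two nodes.
        self.neighbours.append(other)
        other.neighbours.append(self)


def shortest_path(start, goal):
    """Breadth-first search that only follows neighbour references."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node is goal:
            return [n.name for n in path]
        for nxt in node.neighbours:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no route between the two nodes


# Hypothetical graph: A-B, B-C, C-D, A-D
a, b, c, d = (Node(n) for n in "ABCD")
a.connect(b)
b.connect(c)
c.connect(d)
a.connect(d)

print(shortest_path(a, c))  # ['A', 'B', 'C']
```

Each hop reads `node.neighbours` directly, so the cost of a traversal depends on the part of the graph that is visited rather than on the total size of the data set.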

What is a Triplestore database?

A triplestore is a type of graph database that is purpose-built and optimized for the storage and retrieval of triples through semantic queries.

What is a Triple?

A triple is a data entity composed of three parts: subject (node) - predicate (edge) - object (node), like "Bob's age is 35". The subject identifies the resource we are about to describe. The predicate is a property of that resource, and the object is the value of that property. In the example, "Bob" is the subject, "age" is the predicate, and "35" is the object.

Triples can usually be imported and exported using the RDF data model. For example, one way to represent the notion "The sky has the color blue" in RDF is as the triple: a subject denoting "the sky", a predicate denoting "has the color", and an object denoting "blue". RDF schemas (OWL ontologies) can themselves be thought of as collections of triples, and so we can run SPARQL queries against them.
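
To make the subject-predicate-object shape concrete, here is a minimal sketch that holds the two example triples from this section in plain Python. The `Triple` type and the `objects_of` helper are hypothetical names used only for illustration; a real triplestore identifies subjects and predicates with URIs rather than bare strings.

```python
from collections import namedtuple

# A triple is simply three parts: subject, predicate, object.
Triple = namedtuple("Triple", ["subject", "predicate", "object"])

triples = [
    Triple("Bob", "age", 35),
    Triple("the sky", "has the color", "blue"),
]


def objects_of(subject, predicate):
    """Return every object stored for the given subject and predicate."""
    return [
        t.object
        for t in triples
        if t.subject == subject and t.predicate == predicate
    ]


print(objects_of("the sky", "has the color"))  # ['blue']
print(objects_of("Bob", "age"))                # [35]
```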

Turtle (.ttl) files

RDF is a data model, and it can be represented using different formats. One of these formats is Turtle, which is easy to read and is used in the SuperPhy project. Turtle files contain triples in the RDF model. They usually start with @prefix declarations, which define short-hand names for URIs (Uniform Resource Identifiers); the URIs are used to identify the resources being described. The subject and the predicate are written with a URI short-hand, while the object does not have to be (it can be a plain value such as a string).

The RDF model represents data in a form that a computer can easily read and process.
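
The short-hand described above can be sketched as a simple lookup: a prefixed name is split at the colon and the prefix is replaced with its full URI. The `PREFIXES` table and the `expand` function are hypothetical and only approximate what a Turtle parser does; real parsers also handle relative URIs, escapes, and the empty prefix.

```python
# Hypothetical prefix table, mirroring what @prefix lines declare in a .ttl file.
PREFIXES = {
    "ab": "http://examplewebsite.com/addressbook#",
    "d": "http://examplewebsite.com/data#",
}


def expand(prefixed_name, prefixes=PREFIXES):
    """Turn a short-hand name such as 'ab:firstName' into a full URI."""
    prefix, _, local = prefixed_name.partition(":")
    if prefix not in prefixes:
        raise KeyError(f"undeclared prefix: {prefix!r}")
    return prefixes[prefix] + local


print(expand("ab:firstName"))  # http://examplewebsite.com/addressbook#firstName
print(expand("d:i2001"))       # http://examplewebsite.com/data#i2001
```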

SPARQL

The query language we use to retrieve data is SPARQL, which stands for "SPARQL Protocol and RDF Query Language". It is an RDF query language that can retrieve and manipulate data stored in RDF (Resource Description Framework) format. SPARQL is used to formulate semantic queries in a syntax similar to SQL; these queries can answer open-ended questions through pattern matching and digital reasoning. SPARQL endpoints are web services that accept SPARQL queries over the web, process them, and send the results back, e.g. dbpedia.org or a Blazegraph server.

**Important points to help understand SPARQL**

Look at the data stored in Turtle format below:

@prefix ab: <http://examplewebsite.com/addressbook#> .
@prefix d:  <http://examplewebsite.com/data#> .

d:i2001 ab:firstName "Pierre" ;
        ab:lastName  "Nom" ;
        ab:homeTel   "222-111-111" .

d:i1555 ab:firstName "Rosa" ;
        ab:lastName  "Nom" ;
        ab:homeTel   "111-111-111" .

d:i8301 ab:firstName "Anil" ;
        ab:lastName  "Darg" ;
        ab:email     "anildarg@yahoo.com" .

The subject of the first triple is d:i2001, its predicate is ab:firstName, and its object is the value "Pierre". A semicolon continues with another predicate for the same subject, and a period ends the description of that subject. The query below returns "Rosa Nom", because Rosa is the only person whose home phone number matches the one specified in the query, and ?first and ?last are the two variables chosen to be displayed.

PREFIX ab:  <http://examplewebsite.com/addressbook#>

SELECT ?first ?last #these will be column names of returned data
WHERE
{
  ?person ab:homeTel    "111-111-111" .    # binds ?person to the subject (the person) that has this phone number
  ?person ab:firstName  ?first . 
  ?person ab:lastName   ?last . 
} 

# ?first and ?last are variables that hold the values we want to get back; their names become the column names of the result.
# ?person is a variable bound to the subject that matches the conditions; using it in all three patterns ties them to the same person.
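
The pattern matching that the query performs can be approximated in a few lines of Python. This is a sketch under simplifying assumptions: the `match` and `select` functions are hypothetical, every term is a plain string, anything starting with "?" is treated as a variable, and no distinction is made between URIs and literals. A real SPARQL engine is considerably more involved.

```python
AB = "http://examplewebsite.com/addressbook#"
D = "http://examplewebsite.com/data#"

# The address-book data as (subject, predicate, object) tuples.
triples = [
    (D + "i2001", AB + "firstName", "Pierre"),
    (D + "i2001", AB + "lastName", "Nom"),
    (D + "i2001", AB + "homeTel", "222-111-111"),
    (D + "i1555", AB + "firstName", "Rosa"),
    (D + "i1555", AB + "lastName", "Nom"),
    (D + "i1555", AB + "homeTel", "111-111-111"),
    (D + "i8301", AB + "firstName", "Anil"),
    (D + "i8301", AB + "lastName", "Darg"),
    (D + "i8301", AB + "email", "anildarg@yahoo.com"),
]


def match(pattern, binding):
    """Yield extensions of `binding` for each triple that fits one pattern."""
    for triple in triples:
        new = dict(binding)
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A variable must keep the value it was bound to earlier.
                if new.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield new


def select(variables, patterns):
    """Join the patterns one after another, then project the variables."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [b for old in bindings for b in match(pattern, old)]
    return [tuple(b[v] for v in variables) for b in bindings]


rows = select(
    ["?first", "?last"],
    [
        ("?person", AB + "homeTel", "111-111-111"),
        ("?person", AB + "firstName", "?first"),
        ("?person", AB + "lastName", "?last"),
    ],
)
print(rows)  # [('Rosa', 'Nom')]
```

Because ?person appears in all three patterns, only triples that share the same subject survive the join, which is why a single row comes back.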

SPARQL FUNCTIONS

Some useful SPARQL features to know include FILTER, OPTIONAL clauses inside WHERE, exclusions, unions, sorting/ordering, COUNT, SUM/AVG, etc.

  • OPTIONAL returns a property's value if it has one and leaves the variable unbound (blank) if it does not.
PREFIX ab:  <http://examplewebsite.com/addressbook#>

SELECT ?first ?last ?workTel
WHERE
{
  ?s ab:firstName ?first ;
     ab:lastName ?last .
  OPTIONAL 
  { ?s ab:workTel ?workTel . }
}

  • UNION combines the results of two alternative graph patterns. Each pattern is matched independently, and the solutions of both are returned in one result set.

  • OFFSET x skips the first x rows of the result and begins displaying data from the row after them.

  • LIMIT x limits the number of results returned to x.

  • COUNT(?x) counts the number of results in which ?x has a value.

  • SUM(?x) adds up all the values of ?x in the results returned from the query. AVG(?x) works the same way but returns their average.

  • Subqueries allow you to write a SELECT statement within an outer SELECT statement. They have their own syntax and differ from UNION.

  • Blank nodes are a way to group related triples in .ttl files. They can also be used in query patterns when searching.
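
The effect of several of the features listed above can be pictured with ordinary Python over a list of result rows. This is only a rough model: the rows, the "score" values, and the helper names are hypothetical, and None stands in for a variable left unbound by OPTIONAL.

```python
# Hypothetical result rows; None marks a variable that OPTIONAL left unbound.
rows = [
    {"first": "Pierre", "score": 10, "workTel": None},
    {"first": "Rosa", "score": 20, "workTel": "333-111-111"},
    {"first": "Anil", "score": 30, "workTel": None},
]


def offset_limit(rows, offset=0, limit=None):
    """OFFSET skips rows from the start; LIMIT caps how many are returned."""
    end = None if limit is None else offset + limit
    return rows[offset:end]


def count(rows, var):
    """COUNT(?var): number of rows in which the variable has a value."""
    return sum(1 for r in rows if r.get(var) is not None)


def total(rows, var):
    """SUM(?var): add up the bound values of the variable."""
    return sum(r[var] for r in rows if r.get(var) is not None)


def average(rows, var):
    """AVG(?var): mean of the bound values, 0 when there are none."""
    n = count(rows, var)
    return total(rows, var) / n if n else 0


def union(left, right):
    """UNION: solutions of both patterns together, duplicates kept."""
    return left + right


print(offset_limit(rows, offset=1, limit=1))  # only the "Rosa" row
print(count(rows, "workTel"))                  # 1
print(total(rows, "score"))                    # 60
print(average(rows, "score"))                  # 20.0
print(len(union(rows[:1], rows[1:])))          # 3
```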

See this link for some detailed video explanations.

Overview: SuperPhy Ontologies

Ontologies are used to describe concepts and terms used in the Blazegraph database. The ontologies can be viewed and examined in detail using Protégé.

....

<img src="https://github.com/superphy/version-1/blob/master/App/Pictures/blazegraph.png" width="150" align="left">

Introduction to Blazegraph

Blazegraph is a high-performance graph database platform that supports RDF/SPARQL APIs and the Apache TinkerPop™ stack, with scalable solutions including embedded, HA, scale-out, and GPU-accelerated deployments. It is standards-based and written entirely in Java.

According to the website, the database is an open source platform that supports multi-tenancy and can be deployed as an embedded database, or as a standalone server. It has been under development since 2006 and it offers support subscriptions for both commercial and open-source users.

How do we store data in Blazegraph?

For the SuperPhy project, our deployment mode is a standalone database, and our operating mode is triples. We use Blazegraph to store all the information on Escherichia coli genomes and any associated analysis done on these genomes. The design of our graph database is explained further below.

We use a script that downloads data from various sources (e.g. NCBI GenBank) and then maps the data to triples that follow our ontology. These triples are then stored in our database.

How do we add new data?

The process for uploading data to the database is handled by scripts developed by the team. When the database is running, requests are submitted to it, to either retrieve or add data.

Currently, the SuperPhy platform is the only way to upload genomes. This uploading process is a bulk upload that populates the database with the selected data. There is currently no functionality to upload or add new genomes "on the fly". The data currently consists of genomes, their associated metadata, and sequences.

Here is an example, using books, that shows how triples are inserted into a database with SPARQL.

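Generally, in SPARQL 1.1 Update, the syntax takes the form:

INSERT DATA { triples }

To insert into a named graph instead of the default graph, wrap the triples in GRAPH <uri> { triples }.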

The snippet below describes two RDF triples to be inserted into the default graph of the RDF store.

PREFIX dc: <http://purl.org/dc/elements/1.1/>
INSERT DATA
{ <http://example/book3> dc:title    "A new book";
                         dc:creator  "A.N.Other" .
}
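
For illustration, the update above could be submitted to a SPARQL endpoint over HTTP. The sketch below only builds the request, following the SPARQL 1.1 Protocol convention of a form-encoded `update` parameter sent by POST. The endpoint URL is an assumption (a local Blazegraph server is often reached at an address like this, but check your own installation), and `build_update_request` is a hypothetical helper, not part of the SuperPhy scripts.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed endpoint of a locally running server; adjust to your deployment.
ENDPOINT = "http://localhost:9999/blazegraph/sparql"

UPDATE = """PREFIX dc: <http://purl.org/dc/elements/1.1/>
INSERT DATA
{ <http://example/book3> dc:title   "A new book" ;
                         dc:creator "A.N.Other" .
}"""


def build_update_request(endpoint, update):
    """Build (but do not send) a POST request carrying a SPARQL update."""
    body = urlencode({"update": update}).encode("utf-8")
    return Request(
        endpoint,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


request = build_update_request(ENDPOINT, UPDATE)
print(request.get_method(), request.full_url)
# To send it against a running server: urllib.request.urlopen(request)
```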

How do we query data from Blazegraph?

When you use the SuperPhy application and click on a link that calls a view function on the page, data stored in your local copy of the database is fetched and displayed on the screen. You can also load our Blazegraph workbench and write queries in the "Query" tab.

Similar to SQL, the general form of a query looks something like this:

 SELECT * WHERE { pattern } 
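
A query can also be sent programmatically. As a sketch, the code below builds a query URL and reads a response in the standard SPARQL JSON results format. The endpoint URL is an assumption, `build_query_url` and `rows_from_json` are hypothetical helpers, and `SAMPLE_RESPONSE` is a hand-written example of the format rather than real output from our database.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint of a locally running server; adjust to your deployment.
ENDPOINT = "http://localhost:9999/blazegraph/sparql"
QUERY = "SELECT * WHERE { ?s ?p ?o } LIMIT 10"


def build_query_url(endpoint, query):
    """Encode the query as the `query` parameter of a GET request."""
    return endpoint + "?" + urlencode({"query": query})


# Hand-written example of a SPARQL JSON result (not real server output).
SAMPLE_RESPONSE = """{
  "head": {"vars": ["first", "last"]},
  "results": {"bindings": [
    {"first": {"type": "literal", "value": "Rosa"},
     "last":  {"type": "literal", "value": "Nom"}}
  ]}
}"""


def rows_from_json(text):
    """Turn a SPARQL JSON result into a list of tuples, one per row."""
    doc = json.loads(text)
    names = doc["head"]["vars"]
    return [
        tuple(b[n]["value"] if n in b else None for n in names)
        for b in doc["results"]["bindings"]
    ]


print(build_query_url(ENDPOINT, QUERY))
print(rows_from_json(SAMPLE_RESPONSE))  # [('Rosa', 'Nom')]
```

When sending the request, asking for the media type application/sparql-results+json in the Accept header is the usual way to get results in this format.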

For more information, see the following links.

[1] Blazegraph website, [Graph Databases](http://www.seguetech.com/blog/2013/02/04/what-are-the-differences-between-relational-and-graph-databases), Ontotext

[2] SPARQL; SPARQL Insert; SPARQL Queries

[3] Triplestore, RDF, Semantic queries, Ontologies in information sciences

[4] Very good video series on Ontologies
