Database Information
If you have any database experience, you have most likely worked with relational databases, which tend to be more common than graph databases. However, because of the limitations of relational databases (such as the difficulty of querying complex, highly connected information stored in columns), graph databases are becoming increasingly popular. Graph databases use graph structures such as nodes, edges, and other graph properties to represent and store data. They scale naturally to large data sets because they do not require huge join operations. Because of their flexibility, they are well suited to data sets whose schemas are still evolving. They are often very fast at computing graph-like operations such as the shortest path between two nodes (or genomes).
A graph database is any storage system that provides index-free adjacency. This means that every element in the database contains a direct link to its adjacent elements, so no index lookups are required: every node knows which node or nodes it is connected to. Each such connection is called an edge. Because of this design, graph databases can use graph theory to find the connections between nodes very rapidly.
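To make index-free adjacency concrete, here is a minimal Python sketch. It is not how Blazegraph is implemented: the dictionary simply stands in for the direct links each node holds to its neighbours, and the node names are hypothetical. A breadth-first search then finds a shortest path by following those links only.

```python
from collections import deque

# Hypothetical toy graph: each node stores its neighbours directly,
# so following an edge never consults a separate index.
adjacency = {
    "genome_A": ["genome_B", "genome_C"],
    "genome_B": ["genome_A", "genome_D"],
    "genome_C": ["genome_A", "genome_D"],
    "genome_D": ["genome_B", "genome_C", "genome_E"],
    "genome_E": ["genome_D"],
}

def shortest_path(graph, start, goal):
    """Breadth-first search; returns the nodes on one shortest path, or None."""
    if start == goal:
        return [start]
    previous = {start: None}  # maps each visited node to the node it was reached from
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, []):
            if neighbour in previous:
                continue  # already visited
            previous[neighbour] = node
            if neighbour == goal:
                # Walk the chain of predecessors back to the start.
                path = [goal]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(neighbour)
    return None

path = shortest_path(adjacency, "genome_A", "genome_E")
print(path)
```

Because every edge has the same cost here, breadth-first search is enough; weighted edges would need a different algorithm such as Dijkstra's.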
The query language we use to retrieve data is SPARQL, a recursive acronym for "SPARQL Protocol and RDF Query Language". It is an RDF query language that can retrieve and manipulate data stored in RDF (Resource Description Framework) format. SPARQL is used to formulate semantic queries, in a syntax similar to SQL, that can answer open-ended questions through pattern matching and digital reasoning.
A triplestore is a type of graph database that is purpose-built and optimized for the storage and retrieval of triples through semantic queries.
A triple is a data entity composed of three parts, subject-predicate-object, like "Bob's age is 35". The subject identifies the resource being described. The predicate is a property of that resource, and the object is the value of that property.
Triples can usually be imported and exported using the RDF data model. For example, one way to represent the notion "The sky has the color blue" in RDF is as the triple: a subject denoting "the sky", a predicate denoting "has the color", and an object denoting "blue".
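As a rough illustration, the two example triples above can be modelled in Python as plain three-field records. This is only a sketch of the data shape; the `Triple` type and the helper functions are hypothetical names and not part of RDF or of SuperPhy.

```python
from typing import NamedTuple

class Triple(NamedTuple):
    """One subject-predicate-object statement (hypothetical helper type)."""
    subject: str
    predicate: str
    object: str

# The two examples from the text, written as triples.
triples = [
    Triple("Bob", "age", "35"),
    Triple("the sky", "has the color", "blue"),
]

def describe(triple):
    """Render a triple as a small arrow diagram."""
    return f"{triple.subject} --[{triple.predicate}]--> {triple.object}"

def objects_of(store, subject, predicate):
    """Return every object recorded for the given subject and predicate."""
    return [t.object for t in store
            if t.subject == subject and t.predicate == predicate]

for t in triples:
    print(describe(t))
```

In real RDF the subject and predicate would be URIs rather than free text, which is what the prefix short-hand in Turtle files is for.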
.ttl files (Turtle)
RDF is a data model, and it can be represented in different formats. One of these is the Turtle format, which is easy to read and is used in the SuperPhy project. Turtle files contain triples in the RDF model and usually have @prefix declarations at the top. These define short-hand names for URIs (Uniform Resource Identifiers), which are used to identify the resources being described. The subject and the predicate are written with the URI short-hand; the object does not have to be.
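The effect of a prefix declaration can be sketched in a few lines of Python. This is a simplified illustration and not a Turtle parser: the `PREFIXES` table and the `expand` helper are hypothetical, and only angle-bracket URIs, quoted literals, and simple prefixed names are handled.

```python
# Hypothetical prefix table, as might be declared with @prefix lines in a .ttl file.
PREFIXES = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "ex": "http://example.org/book/",
}

def expand(term, prefixes=PREFIXES):
    """Expand a prefixed name such as 'dc:title' into a full URI.

    Full URIs in angle brackets and quoted literals are returned unchanged.
    """
    if term.startswith("<") or term.startswith('"'):
        return term
    prefix, sep, local = term.partition(":")
    if sep and prefix in prefixes:
        return "<" + prefixes[prefix] + local + ">"
    raise ValueError(f"unknown prefix in term: {term!r}")

# Subject and predicate use the short-hand; the object is a plain literal.
triple = ("ex:book1", "dc:title", '"SPARQL Tutorial"')
expanded = tuple(expand(term) for term in triple)
print(expanded)
```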
The RDF model represents data in a way that a computer can easily read and process.
Ontologies are used to describe concepts and terms used in the Blazegraph database...
....
<img src="https://github.com/superphy/version-1/blob/master/App/Pictures/blazegraph.png" width="150" align="left">
Blazegraph is a high-performance graph database platform that supports RDF/SPARQL APIs and the Apache TinkerPop™ stack, with scalable solutions including embedded, HA, scale-out, and GPU-accelerated deployments. It is a standards-based, scalable graph database written entirely in Java.
According to the website, the database is an open source platform that supports multi-tenancy and can be deployed as an embedded database, or as a standalone server. It has been under development since 2006 and it offers support subscriptions for both commercial and open-source users. [1]
For the SuperPhy project, our deployment mode is a standalone database, and our operating mode is triples. We use Blazegraph to store all the information on Escherichia coli genomes and any associated analyses done on these genomes. The design of our graph database is explained further below.
We use a script that downloads data from various sources (e.g. NCBI GenBank) and then maps the data to our triples ontology. These triples are then stored in our database.
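The mapping step might look roughly like the sketch below. It is not the SuperPhy script: the namespaces, the accession "ABC123", and the metadata fields are all hypothetical placeholders, chosen only to show how one downloaded record could become a list of triples.

```python
# Hypothetical namespaces; the real SuperPhy ontology uses its own URIs.
BASE = "http://example.org/genome/"
VOCAB = "http://example.org/vocab/"

def record_to_triples(accession, metadata):
    """Turn one metadata record into (subject, predicate, object) term strings.

    Fields whose value is None are skipped; values become quoted literals.
    """
    subject = f"<{BASE}{accession}>"
    triples = []
    for field, value in sorted(metadata.items()):
        if value is None:
            continue
        # Escape backslashes and double quotes so the literal stays well-formed.
        escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
        triples.append((subject, f"<{VOCAB}{field}>", f'"{escaped}"'))
    return triples

# Hypothetical record, standing in for data downloaded from an external source.
record = {"organism": "Escherichia coli", "host": None, "strain": "K-12"}
genome_triples = record_to_triples("ABC123", record)
for triple in genome_triples:
    print(" ".join(triple), ".")
```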
The process for uploading data to the database is handled by scripts developed by the team. When the database is running, requests are submitted to it to either retrieve or add data.
Currently, the SuperPhy platform is the only way to upload genomes. This uploading process is a bulk upload that populates the database with the selected data. There is currently no functionality to upload or add new genomes "on the fly". The data currently consists of genomes, their associated metadata, and sequences.
Here is an example, using books, that shows how triples are inserted into a database using SPARQL.
Generally, the syntax will take the form:
INSERT DATA [ INTO <uri> ] *
{ triples }
The snippet below describes two RDF triples to be inserted into the default graph of the RDF store.
PREFIX dc: <http://purl.org/dc/elements/1.1/>
INSERT DATA
{ <http://example/book3> dc:title "A new book";
dc:creator "A.N.Other" .
}
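For readers who want to submit such an update from a script, the following Python sketch builds an equivalent INSERT DATA string and prepares an HTTP POST for it using only the standard library. The endpoint URL is an assumption about a default local Blazegraph installation and may differ on your machine; the helper names are hypothetical. The request is only constructed here, not sent.

```python
import urllib.parse
import urllib.request

# Assumed local endpoint; adjust to match your own Blazegraph installation.
ENDPOINT = "http://localhost:9999/blazegraph/sparql"

def build_insert(triples, prefixes=None):
    """Return a SPARQL INSERT DATA string for already-formatted triple terms."""
    lines = [f"PREFIX {name}: <{uri}>" for name, uri in (prefixes or {}).items()]
    lines.append("INSERT DATA {")
    for s, p, o in triples:
        lines.append(f"  {s} {p} {o} .")
    lines.append("}")
    return "\n".join(lines)

def build_update_request(update, endpoint=ENDPOINT):
    """Prepare (but do not send) a form-encoded POST carrying the update."""
    body = urllib.parse.urlencode({"update": update}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )

update = build_insert(
    [
        ("<http://example/book3>", "dc:title", '"A new book"'),
        ("<http://example/book3>", "dc:creator", '"A.N.Other"'),
    ],
    prefixes={"dc": "http://purl.org/dc/elements/1.1/"},
)
request = build_update_request(update)
print(update)

# To actually send it against a running server:
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```

Writing each triple on its own line, rather than using the ";" short-hand, keeps the string builder simple; both forms describe the same two triples.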
When you click a link in the SuperPhy application that calls a view function on the page, data stored in your local copy of the database is fetched and displayed on the screen. You can also load up our Blazegraph workbench and write queries in the "Query" tab.
Similar to SQL, the general form of a query looks something like this:
SELECT * WHERE { pattern }
Continuing from the example about adding data, the book example below shows a SPARQL query that finds the title of a book from the information in the given RDF graph. The query consists of two parts: the SELECT clause and the WHERE clause. The SELECT clause identifies the variables to appear in the query results, and the WHERE clause has one triple pattern.
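What matching a single triple pattern means can be sketched in Python as follows. This is a toy matcher for illustration, not how a SPARQL engine is implemented; `match_pattern` is a hypothetical helper, and the data is the same book triple used in the example that follows.

```python
# One triple, written as (subject, predicate, object) term strings.
DATA = [
    ("<http://example.org/book/book1>",
     "<http://purl.org/dc/elements/1.1/title>",
     '"SPARQL Tutorial"'),
]

def match_pattern(pattern, data):
    """Yield one {variable: value} binding per triple that matches the pattern.

    Terms starting with '?' are variables; all other terms must match exactly.
    """
    for triple in data:
        bindings = {}
        for want, have in zip(pattern, triple):
            if want.startswith("?"):
                # A variable repeated in the pattern must bind to the same value.
                if bindings.setdefault(want, have) != have:
                    break
            elif want != have:
                break
        else:
            # No term failed to match, so this triple is a solution.
            yield bindings

# Equivalent of: SELECT ?title WHERE { <book1> <dc:title> ?title . }
pattern = ("<http://example.org/book/book1>",
           "<http://purl.org/dc/elements/1.1/title>",
           "?title")
results = [binding["?title"] for binding in match_pattern(pattern, DATA)]
print(results)
```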
Data:
<http://example.org/book/book1> <http://purl.org/dc/elements/1.1/title> "SPARQL Tutorial" .
Query:
SELECT ?title
WHERE
{
<http://example.org/book/book1> <http://purl.org/dc/elements/1.1/title> ?title .
}
Query Result:
| title |
|---|
| "SPARQL Tutorial" |
[2] [Graph Databases](http://www.seguetech.com/blog/2013/02/04/what-are-the-differences-between-relational-and-graph-databases)
[3] Ontotext
[4] SPARQL; SPARQL Insert; SPARQL Queries
[5] Triplestore
[6] RDF
[7] Semantic queries