In a twist that has inevitable written all over it, the database industry has at last begun to take heed of the power of consumerization. The once mighty RDBMS is now obliged to make room for an emerging and increasingly important partner in the data center: the graph database. Twitter’s doing it, Facebook’s doing it, even online dating sites are doing it; what they are doing is tracing relationship graphs. After all, social is social, and ultimately it’s all about relationships.
There are all kinds of graph databases and projects on the market right now, and most of them are purpose-built for a particular workload or platform. Some technologies are not even databases at all, but rather graph analytics engines that pull data from any convenient source, even Hadoop, and analyze it in the graph engine.
Twitter developed its graph database in house and released it as FlockDB to the open source community. Twitter designed it specifically to store relationships and activity between users. Facebook rolled out Graph Search to allow its users to query its Social Graph to discover connections that extend beyond “so-and-so is a friend of so-and-so.” As for online dating, it’s primarily a matter of who dated whom and what worked out well, but it’s possible to create complex graphs of who liked or disliked which movies, bands, foods, sports teams or whatever, to recommend potential matches on that basis, and also to analyze what worked and what didn’t.
For a refresher, in case you’ve forgotten, graph databases are based on the mathematics of graph theory. They store data in terms of nodes, edges and properties. Nodes are entities, which can have attributes just as in an RDBMS. The edges are the relationships (i.e., the connections between nodes), and the properties are attributes of the relationship. A simple example: John likes to listen to The Beatles. John (person entity) likes to listen to (property) The Beatles (band entity). John (person entity) watches videos of (property) The Beatles (band entity). So in this simple example, the nodes are person and band, and the edge is the connection between person and band, which carries the properties. Here, the properties of the edge linking person John to band The Beatles are “likes to listen to” and “watches videos of.”
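To make the vocabulary concrete, here is a minimal sketch of the John and The Beatles example in Python. The class names Node and Edge and their fields are hypothetical, chosen only for illustration; they do not correspond to any particular graph database's API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    """An entity, such as a person or a band (illustrative names)."""
    kind: str
    name: str


@dataclass
class Edge:
    """A connection between two nodes, carrying its own properties."""
    source: Node
    target: Node
    properties: set = field(default_factory=set)


john = Node("person", "John")
beatles = Node("band", "The Beatles")

# One edge links John to The Beatles; the relationship's properties live on it.
edge = Edge(john, beatles, {"likes to listen to", "watches videos of"})

print(edge.source.name, "->", edge.target.name, sorted(edge.properties))
```

The relationship is stored as a first-class object here, rather than being implied by matching key values in two tables.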
The point is that graph databases gather information about relationships between entities, and standard relational databases are not built to do that. Now, you can model your way around this problem by (in our example) defining a person-band table (i.e., entity) and recording the properties in that table along with the keys of the person and band they relate to. Unfortunately, when you store data like that and you ask simple questions such as “what are the bands that John and Eric like but Rebecca and Iris do not like?” the RDBMS takes forever, or at least a long time, to get you an answer. Graph databases and technologies are built to answer such queries and serve up the answers quickly.
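The John, Eric, Rebecca and Iris question can be sketched as a walk along each person's outgoing edges followed by a few set operations. This is only an illustration in plain Python. The data and the helper name liked_by_all_but_none are made up, and the sketch says nothing about how a real engine would execute the query.

```python
# Hypothetical "likes" edges, stored as person -> set of bands.
likes = {
    "John": {"The Beatles", "The Kinks", "Cream"},
    "Eric": {"The Beatles", "Cream", "The Who"},
    "Rebecca": {"Cream", "The Who"},
    "Iris": {"The Kinks"},
}


def liked_by_all_but_none(likes, fans, non_fans):
    """Bands every person in `fans` likes and nobody in `non_fans` likes."""
    liked = set.intersection(*(likes[p] for p in fans))
    excluded = set.union(*(likes[p] for p in non_fans))
    return liked - excluded


result = liked_by_all_but_none(likes, ["John", "Eric"], ["Rebecca", "Iris"])
print(result)
```

Each person's edges are reached directly from that person, so the query never has to scan or join one large person-band table.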
The Graph Database and the RDF Database
If, when you read the words “John likes to listen to The Beatles,” your mind said: “hey that’s a subject-predicate-object data triple,” you’re right and you’ve been spending far too much time in the land of RDF databases. The point is that RDF databases logically store data as triples, and hence, like graph databases they are also very good at answering queries that require the database to navigate its way around a graph.
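As a rough sketch of the triple idea, the facts from the earlier example can be written as (subject, predicate, object) tuples and queried by pattern. The match function and its wildcard convention are assumptions made for this illustration, not the interface of any RDF database.

```python
# Each fact is a (subject, predicate, object) triple; the data is illustrative.
triples = {
    ("John", "likes to listen to", "The Beatles"),
    ("John", "watches videos of", "The Beatles"),
    ("Eric", "likes to listen to", "Cream"),
}


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return sorted(
        (s, p, o)
        for (s, p, o) in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    )


# Who likes to listen to what?
listeners = match(triples, predicate="likes to listen to")
# Everything that connects John to The Beatles.
john_beatles = match(triples, subject="John", obj="The Beatles")
print(listeners)
print(john_beatles)
```

Leaving a position open in the pattern plays the role that a variable plays in a triple-pattern query.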
Both RDF databases and graph databases do such things exceptionally well. However, they are not exactly the same kind of engine, either at the logical level or at the physical level. By definition, RDF databases standardize on the SPARQL query language (SPARQL is a recursive acronym for SPARQL Protocol and RDF Query Language). These triple store databases needed a query language that went much further than SQL so that the semantic querying of data would be possible, which in turn would bring the world closer to the much-heralded-but-yet-to-actually-arrive semantic web.
SPARQL is, of course, only just emerging from its infancy, but it is already powerful. It is not just capable of semantic queries. Paired with an RDF store that supports entailment (RDFS or OWL reasoning, for example), it can also answer queries over facts that are inferred rather than explicitly stored, which sets it apart from SQL. In time this may be a killer capability. You won’t just retrieve information, you will also be able to use the database to deduce new information by examining facts (assertions) in the data.
To lean on a famous example, an RDF database might contain the datum “All men are mortal,” and also the datum “Socrates is a man.” By applying an inference rule to this data it would be able to declare new data: “Socrates is mortal.” Moving to another famous example: “Epimenides says that all Cretans are liars” and “Epimenides is a Cretan.” The database might (if programmed to avoid getting into endless loops) point out that here there is a contradiction in the data. Graph databases cannot do such things. But right now that probably doesn’t matter much, because it is still an exploratory area of software.
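The Socrates deduction can be sketched as a tiny forward-chaining loop over triples. This is a toy under stated assumptions: the predicate names "is a" and "subclass of" and the infer function are invented for the example. The single rule loosely mirrors the subclass rule of RDFS, and real RDF reasoners support far richer rule sets.

```python
def infer(facts):
    """Forward-chain one rule until no new facts appear:
    (x, "is a", A) and (A, "subclass of", B)  =>  (x, "is a", B).
    """
    facts = set(facts)  # work on a copy so the caller's data is untouched
    while True:
        new = {
            (x, "is a", b)
            for (x, p1, a) in facts if p1 == "is a"
            for (a2, p2, b) in facts if p2 == "subclass of" and a2 == a
        } - facts
        if not new:
            return facts
        facts |= new


facts = {
    ("man", "subclass of", "mortal"),  # "All men are mortal"
    ("Socrates", "is a", "man"),       # "Socrates is a man"
}
derived = infer(facts) - facts
print(derived)
```

The loop stops as soon as a pass produces nothing new, which is the simplest guard against the endless loops mentioned above.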
Where the RDF databases really score is when you want to do set processing (à la SQL) at the same time that you want to do graph processing. Consider a query such as “Who are the biggest influencers on Twitter over the past six months?”
Both RDF and graph databases would handle such a query and return the same results quickly. But if you ask the very different question, “Which influencers have had the same pattern of influence on Twitter over the last six months?” you are asking for both graph processing and set processing at the same time to get to the answer, and the RDF databases do both well. Not only that, but this is an area of analytics that was virtually untapped until recently, because there was no software that could easily do it.
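To illustrate what combining the two kinds of processing might look like, here is a small sketch in plain Python. It rests on several assumptions that are not in the discussion above: influence is measured as the number of retweets received per month, the data is invented, and two months stand in for six to keep the example short.

```python
from collections import Counter, defaultdict

# Hypothetical retweet edges: (retweeter, original_author, month).
retweets = [
    ("u1", "alice", 1), ("u2", "alice", 1), ("u3", "alice", 2),
    ("u1", "bob", 1), ("u4", "bob", 1), ("u2", "bob", 2),
    ("u1", "carol", 2), ("u2", "carol", 2),
]
months = [1, 2]

# Graph step: count incoming edges per author per month.
counts = Counter((author, month) for _, author, month in retweets)

# Set step: group authors whose month-by-month counts are identical.
by_pattern = defaultdict(set)
for author in {a for _, a, _ in retweets}:
    pattern = tuple(counts[(author, m)] for m in months)
    by_pattern[pattern].add(author)

# Keep only the patterns shared by more than one author.
same_pattern = {p: g for p, g in by_pattern.items() if len(g) > 1}
print(same_pattern)
```

The first step is a measurement on the graph, the second is a grouping over the resulting set, and the question can only be answered by doing both.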
My personal belief is that this is what will separate the RDF databases from the graph databases in the end. Analytics. Graph analytics.