It is no secret that the database world is in upheaval. It is no secret that a new generation of database products has emerged. And it is no secret that traditional RDBMS products can no longer claim to satisfy all the appetites of Big Data. These RDBMS products are, to give them their due, competent for the storage and retrieval of structured data in moderate volumes, but once the data arrives in real-time streams or rapidly piles up to terabyte levels, the limits of their scalability, flexibility and performance are tested and found wanting.
They fall particularly short when it comes to the management of so-called “unstructured” or “semi-structured” data, where much of the data can be textual and the meaning of both the data and the relationships between data elements is critical to the applications that use the data. And this can also be the case with many of the new databases and technologies that have recently emerged.
Since the dawn of Big Data, we have been presented with a plethora of possibilities for a variety of processing problems. We’ve observed the march of the column-store databases like Vertica, Infobright and Vectorwise. We’ve experienced the advent of NoSQL solutions such as MongoDB or Cassandra or Aerospike. We’ve seen distributed technology from vendors like NuoDB, Pneuron and EnterpriseWeb. We’ve witnessed the extraordinary growth and adoption of Hadoop with its extensive ecosystem of components. And to be fair, the traditional RDBMSs have to some extent been re-engineered for increasing data volumes and complexity.
What few, if any, of these solutions do well, however, is cater for the meaning of data, by which we mean the discovery, capture and management of metadata. Of course, metadata stored explicitly in traditional databases presents few problems. It’s metadata that is buried in data (for example, via XML) or even buried in the software code that presents most of the problems. When we consider the expanding number of data sources and the fact that they can change on a regular basis, managing the metadata resource becomes complex. Moreover, if there is a particular interest in understanding relationships between data elements beyond just the first few hops, that too becomes very cumbersome to pull off with most existing systems.
Ultimately, the solution to this problem lies in the application of semantic technology. And by the way, if the words “semantic technology” fill you with fear, they shouldn’t. This powerful and useful technology is coming of age.
What Makes Semantic Technology Useful?
To understand how a semantic database – they are also called triplestores – works, you must first understand two concepts: the Resource Description Framework (RDF) and the triple. Both are more familiar than you might think.
In simple terms, RDF is a data model that represents data and the relationships within it as a graph. It is a W3C standard for data interchange on the Web, and it views data in a simple semantic way, in the form of subject-predicate-object expressions; hence the term “triple.”
If we harken back to grammar school days, we may recall a sentence construction example such as, “John has a cat,” where “John” is the subject, “has” is the predicate and “a cat” is the object. In a triplestore, such a sentence is stored as a single record with the meaning attached.
Easy enough, but let’s say that we also have another record, “Maria is the child of John.” A semantic database can map Maria to John, with the inherent relationship intact. Essentially, it preserves the metadata rather than burying it in a database schema.
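As a rough illustration, the two statements above can be held as subject-predicate-object tuples in plain Python. This is only a sketch of the data model, not how any particular triplestore is implemented; the `Triple` type, the `childOf` predicate name and the `match` helper are assumptions made for the example.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# The two example statements from the text, stored as triples.
# "childOf" is a made-up predicate name for "is the child of".
store = {
    Triple("John", "has", "cat"),
    Triple("Maria", "childOf", "John"),
}


def match(store, subject=None, predicate=None, obj=None):
    """Return the triples that match the given fields; None acts as a wildcard."""
    return sorted(
        t for t in store
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    )


# Everything the store knows that mentions John, as subject or as object.
about_john = sorted(set(match(store, subject="John")) | set(match(store, obj="John")))
```

Note that no table or schema is declared anywhere: the relationship lives in each record's predicate, so a new kind of statement can be added without restructuring what is already stored.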
A triplestore therefore persists not just data, but a representation of the data that includes meaning. This is significant. Being able to gauge the analytic value of new data sets without having to pin down potential queries and build schemas is a far more thorough approach to capturing meaning and performing analytics, and it is the reason why semantic technology is now becoming a prominent player in knowledge management applications.
The lingua franca for triplestores is SPARQL, a SQL-like standard query language designed specifically for data stored in RDF form, including data spread across disparate sources. SPARQL offers many of the analytic query functions familiar from SQL, such as aggregation, grouping and sorting, and it can also traverse graph data. Because RDF resources are identified by URIs and a SPARQL query can be directed at remote endpoints, you can join RDF data with other RDF data held elsewhere.
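The way a SPARQL-style query joins triples can be sketched in a few lines of Python. This is a simplified stand-in for the idea of matching triple patterns that share variables, not the SPARQL language or any engine's actual algorithm; the `?name` variable convention is borrowed from SPARQL, while `match_pattern`, `query` and the `childOf` predicate are hypothetical names chosen for the example.

```python
def is_var(term):
    """Terms starting with '?' are variables, as in SPARQL syntax."""
    return term.startswith("?")


def match_pattern(pattern, triple, binding):
    """Extend binding so that pattern matches triple; return None on a conflict."""
    new = dict(binding)
    for p, value in zip(pattern, triple):
        if is_var(p):
            # Bind the variable, or check it against its earlier binding.
            if new.setdefault(p, value) != value:
                return None
        elif p != value:
            return None
    return new


def query(triples, patterns):
    """Evaluate patterns left to right, joining on shared variables."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [
            extended
            for binding in bindings
            for triple in triples
            if (extended := match_pattern(pattern, triple, binding)) is not None
        ]
    return bindings


triples = [
    ("John", "has", "cat"),
    ("Maria", "childOf", "John"),
]

# "Whose parent has a cat?" expressed as two patterns that share ?parent.
results = query(triples, [("?child", "childOf", "?parent"), ("?parent", "has", "cat")])
```

The shared variable `?parent` does the work that a join condition does in SQL: the second pattern only matches triples whose subject agrees with the value bound by the first.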
SPARQL can query a collection of triples in much the same way that SQL queries a collection of rows in a table. But it can also carry out graph analysis, allowing users to view and analyze data as the nodes (or points) and edges (or lines) of a graph. The basic triple can be thought of as two nodes (“John” and “cat” in our example) joined by an edge (“has” in our example). SPARQL queries can reveal entities and relationships or individuals and their connections. And, very usefully, they can pose queries to which the answer is a graph: for example, a graph of “who retweets whom more than once a week.”
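To make the idea of a query whose answer is a graph more concrete, here is a small Python sketch of the retweet example. The event data is invented for illustration, the function name `frequent_retweet_graph` is hypothetical, and “more than once a week” is interpreted here as at least two retweets of the same author within a single week; other readings are possible.

```python
from collections import Counter

# Made-up sample events: (retweeter, original_author, week_number).
retweets = [
    ("ann", "bob", 1),
    ("ann", "bob", 1),
    ("ann", "carol", 1),
    ("bob", "ann", 1),
    ("bob", "ann", 2),
    ("carol", "ann", 2),
    ("carol", "ann", 2),
    ("carol", "ann", 2),
]


def frequent_retweet_graph(events, min_per_week=2):
    """Return edges (retweeter, author) seen at least min_per_week times in some week."""
    per_week = Counter(events)  # counts identical (retweeter, author, week) tuples
    edges = {
        (who, whom)
        for (who, whom, _week), n in per_week.items()
        if n >= min_per_week
    }
    return sorted(edges)


# The result is itself a small graph: a list of directed edges.
graph = frequent_retweet_graph(retweets)
```

Here `bob` retweets `ann` twice overall but only once in any given week, so that edge is left out of the resulting graph.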
It’s easy to see the potential of graph analysis in research areas, where documentation and resources are largely text-based and all the data can be broken down into triples, stored and later queried. But a growing number of application areas are also surfacing, such as insider-trading detection or pharmaceutical development, where analysts need to see relationships between people, behaviors, interactions, etc. These kinds of analysis are rapidly developing into a promising area of Big Data analytics.
Our point is, of course, that there is yet another type of database technology waiting in the wings to enter the mainstream. It does what other databases cannot do: it specializes in harvesting meaning and allows for easy link traversal.
To the Finish Line
The race among RDF databases is already in progress, and several vendors are heavily invested in solutions that scale to the Big Data level. SPARQL City is one such vendor, and SPARQLverse is its database. SPARQLverse is a scalable Hadoop-based massively parallel processing (MPP) analytics engine. It is built around the SPARQL standard, targets analytics use cases, is extremely fast on enormous volumes of data, and right now (although it is early days) it appears to be leaving the competition in the dust.
SPARQLverse works like this. It sits between data sources and analytical applications, running in-memory on commodity clusters. Although it can function as the underlying system of record, it doesn’t have to. And it can scale up to 1024 nodes.
SPARQL City recently ran the well-known SP2 Benchmark against 2.5 billion triples on a sixteen-node Amazon EC2 cluster. According to the results, the average query run time was 6 seconds and the aggregate run time for all 17 queries was 102 seconds.
On the surface, this may not seem impressive until you consider that SPARQLverse used 100x more data than any other vendor with reported benchmark results. Among the graph analytics engines with reported results, SPARQLverse outperformed every other by orders of magnitude, both in data volume and query performance.
Some lucky vendor is going to win this database race, and SPARQL City is currently out ahead of the pack.