In our blog post Is Another Database Race About to Start? we presented an overview of Resource Description Framework (RDF) technology and why it is significant. So now let’s delve deeper into the magic behind using a semantic database: SPARQL.
We can start by bowing down to SQL as the default query language for analytics. Despite its limitations and shortcomings, it will not be dethroned in the foreseeable future, if ever. The relational databases that occupy the bulk of the world’s data centers are built for SQL queries, and nowadays even many NoSQL databases offer SQL-like query layers.
SQL is great at retrieving data stored in regular sets (i.e., relational tables), but it has no ability to derive meaning from data, and it is weak, often hopeless, at navigating networks of data via the relationships between records (such as who knows whom, and whom those people know in turn).
SPARQL (a recursive acronym for SPARQL Protocol and RDF Query Language) is the query language for data stored in RDF format, which represents data as a collection of subject-predicate-object triples. We will address this in depth in another article, but at a high level it is enough to note that when data is stored in RDF format, all the relationships inherent in the data are preserved, including its meaning. In general, SPARQL is to the semantic web and graph analytics (more on this later, too) what SQL is to the relational database.
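To make the triple idea concrete, here is a minimal sketch of RDF data in Turtle syntax. The `ex:` identifiers are hypothetical placeholders; `dbo:` follows the style of the public DBpedia ontology:

```turtle
@prefix ex:  <http://example.org/> .
@prefix dbo: <http://dbpedia.org/ontology/> .

# Subject      Predicate            Object
ex:Chicago     dbo:populationTotal  2697000 .
ex:Chicago     dbo:country          ex:United_States .
```

Each line is one fact, and the shared subject (`ex:Chicago`) is what links the facts into a graph.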
Anyone who knows SQL can easily learn SPARQL, as the two languages are similar. Both use keywords such as SELECT, WHERE and GROUP BY, and both use similar naming conventions for aggregate and string functions. But the similarities end when it comes to how the languages work.
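The family resemblance is easy to see in a short query. The sketch below counts cities per country, assuming a DBpedia-style vocabulary (`dbo:`); note the familiar SELECT / WHERE / GROUP BY shape:

```sparql
PREFIX dbo: <http://dbpedia.org/ontology/>

SELECT ?country (COUNT(?city) AS ?cityCount)
WHERE {
  ?city a dbo:City ;          # ?city is a variable, bound by pattern matching
        dbo:country ?country .
}
GROUP BY ?country
ORDER BY DESC(?cityCount)
```

The key difference from SQL is the WHERE clause: instead of joining tables, it matches triple patterns against the graph, and any matching data binds to the variables.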
Even with a complex data warehouse in place, performing a SQL query against multiple data sources is difficult. Data sources must be mapped, and schemas must be written. Metadata needs to be managed, and when the data itself changes or when new data sources are added, the whole model needs to be updated.
SPARQL, on the other hand, is inherently schema-free and allows for federation. Users can immediately start working with new data sources without having to go through IT, and with a single query, request data from multiple locations at once. Such locations can include Web data, internal enterprise databases or external sources such as partner data, public data sets or social media streams. This federated approach allows users to query any collection of databases as if it were one, and because it requires no schema, when data or data sources change, nothing needs to be done.
This is achieved by the use of SPARQL endpoints, which are basically URLs that accept a SPARQL query and return the results. Most large public data sets provide APIs to the endpoints, and there are several publicly available resources for finding them. Internal data can also be converted to RDF format with relative ease, making enterprise assets SPARQL-ready. This makes sharing data sets extremely flexible and accessible. By allowing access to new data sources on the fly and enabling queries to federate across multiple data sources, SPARQL achieves an enormous advantage over SQL, with respect to both time to insight and ease of use.
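Federation across endpoints is built into SPARQL 1.1 via the SERVICE keyword. The sketch below joins hypothetical local enterprise data (the `ex:` vocabulary is illustrative) with DBpedia’s public endpoint in a single query:

```sparql
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX ex:  <http://example.org/>

SELECT ?customer ?cityPop
WHERE {
  ?customer ex:livesIn ?city .            # matched against local data
  SERVICE <https://dbpedia.org/sparql> {  # remote public endpoint
    ?city dbo:populationTotal ?cityPop .  # fetched at query time
  }
}
```

The query engine sends the inner pattern to the remote endpoint and merges the results with the local matches; no schema mapping or ETL step is required.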
For example, imagine that you want to find the population, area and median income of all U.S. cities to determine if there is a relationship between population density and income. Using SQL, you would have to query each city database separately, ensure that you have the right data set, then perform complex and time-consuming joins on the data. With SPARQL, one query can target all the data sources and pull the results into a single set. No joins, no coding.
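A sketch of that single query might look like the following. The vocabularies are assumptions for illustration: `dbo:`/`dbr:` follow DBpedia conventions, while `ex:medianIncome` stands in for a hypothetical census data source:

```sparql
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX ex:  <http://example.org/stats#>

SELECT ?city ?population ?area ?medianIncome
WHERE {
  ?city a dbo:City ;
        dbo:country         dbr:United_States ;
        dbo:populationTotal ?population ;
        dbo:areaTotal       ?area .
  ?city ex:medianIncome ?medianIncome .   # from a separate (hypothetical) source
}
```

Because every source describes its data as triples, the patterns simply match wherever the data lives, and the results come back as one table.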
Let’s consider a more business-critical example. In the financial services sector, regulations regarding risk and compliance present an enormous challenge to data management teams. Fraud detection and fraud prevention, in particular, are rendered nearly impossible when data is dispersed across departments and external sources. This is where SPARQL excels. By providing the ability to combine trading data with other data, such as social network feeds and geospatial data, the SPARQL user can quickly find oft-hidden patterns or connections between people and/or entities, thus exposing threats like fraud much sooner.
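A pattern-matching query makes that kind of hidden connection easy to express. This sketch, using an entirely hypothetical `ex:` fraud-detection vocabulary, surfaces pairs of traders who share a contact point such as a phone number or address:

```sparql
PREFIX ex: <http://example.org/fincrime#>   # hypothetical vocabulary

SELECT ?traderA ?traderB ?sharedContact
WHERE {
  ?traderA ex:contactPoint ?sharedContact .
  ?traderB ex:contactPoint ?sharedContact .
  FILTER (?traderA != ?traderB)             # exclude self-matches
}
```

Expressing the equivalent in SQL would require self-joins across several tables; here the shared node in the graph does the work.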
RDF and SPARQL work hand in glove to deliver a versatile analytics environment. There is no need to consider schema, since RDF data is built on data structures that are flexible by design. SPARQL can work alongside existing database investments, and it offers far richer insight into the data it queries. This type of technology helps organizations turn big data challenges into big data opportunities.