Computers run software that processes data. It sounds deceptively simple, but everyone knows it is not. You don’t need a Ph.D. in computer science to figure that out—a little bit of googling will get you there.
- There is a plethora of different types of data: structured records, text, hypertext, images, video, audio, XML data, program code and more.
- The volume of data is mountainous. We’ve traveled far beyond petabytes, into exabytes and zettabytes, and we’re marching proudly towards yottabytes as I write. If data were made of rock, it would be the Himalayas.
- There is a vast population of individual data files. Hundreds are stored on every cell phone and thousands on every desktop. Many more are squirreled away on websites, but those numbers pale into insignificance compared to the teeming multitudes of files loitering in corporate data centers or scattered like dust among the clouds.
The management of data has always been a big issue: its storage, its transport, and the whole problem of scale.
The March of the Data Platform
Let’s not waste much time itemizing the early evolution of data platforms, from the dinosaur diet of flat files that fed the mainframes to the various database systems that have since become archaeological curiosities. The contemporary era of data platforms began life a mere ten years ago when parallel processing rode into town, strode into the saloon, and grabbed the database industry by the scruff of its neck.
It was a tectonic shift. Big databases could be had for big dollars, sure, but suddenly big data could be assembled for small dollars and processed in a speedy parallel manner. And again, let’s not dwell too much on the teething problems. Yes, Hadoop with all its toys was hopelessly immature, only just out of diapers, its champions made ridiculously optimistic claims, and thus many businesses built data swamps rather than data lakes. But the direction was right and, as we have seen time and again, technology evolves away from failure towards success.
So this bright new open-source revolution soon gave us Apache Spark, a beautiful, high-performance analytics engine, and Kafka, a brilliant data distribution and event streaming capability, and, to be honest, a lot more besides. Still, Spark and Kafka were the two big stars in the firmament.
So let’s step back a little and ask the question: Did something happen to change the whole concept of what a data platform could be and should be?
It sure did.
It was a confluence of several dynamics. First of all, BI looked around and discovered it had a big brother called AI, who had spent its younger years lurking in the shadows and keeping company with statisticians. As if by magic, these statisticians morphed into data scientists and acquired a healthy interest in data sources that had never previously been on the menu. You know the kind of thing: open government data, partner data, log file data, social media data, streaming data, rentable data, and so on.
The way they liked to process data was, in the main, mathematical and usually involved extensive calculations. Traditional databases were built for querying data, and there’s no denying they were good at it; that, by the way, was exactly the kind of service the BI applications had come to love. But they were never designed for the heavy numerical workloads the data scientists now wanted to run.
A Data Platform in Concept
Let’s stop there and paint a picture of a data platform, spicing it up with a few nuances not previously mentioned.
The diagram paints a beguiling and, in some ways, complex picture. On the left-hand side, we have a whole set of what we can think of as housekeeping software doing things that need to be done to data to benefit most applications. Let’s walk through them:
- Data governance is about implementing corporate data policy (which can be complex) and, in particular, satisfying regulatory requirements
- Data security (such as encryption routines) is also a kind of governance
- Data cleaning is a perennial need
- Metadata management is about creating a coherent metadata resource and keeping it in order
- MDM (master data management) is about creating a meaningful business glossary and perhaps sophisticated user data services
- Data lineage is necessary for any analytical processing
- Data lifecycle management is what it always was: an essential chore.
There’s a fair amount of processing that needs to be done just to keep the data ship-shape. And that’s aside from the data platform software itself, which has to take care of performance management, job scheduling, backup and recovery, and other such niceties.
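To make the housekeeping side a little more concrete, here is a minimal sketch of a cleaning step that also masks a direct identifier (a governance concern) and records a simple lineage entry. It assumes a small tabular feed; the column names, dataset name, and source label are hypothetical, not taken from any particular product.

```python
import hashlib
from datetime import datetime, timezone

import pandas as pd

# Hypothetical raw feed: column names are illustrative only.
raw = pd.DataFrame({
    "customer_id": [101, 102, 102, None],
    "email": ["a@example.com", "b@example.com", "b@example.com", "c@example.com"],
    "amount": [19.99, None, 5.00, 12.50],
})

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Basic cleaning: drop duplicates, drop rows missing a key, fill gaps."""
    df = df.drop_duplicates()
    df = df.dropna(subset=["customer_id"])
    df["amount"] = df["amount"].fillna(0.0)
    return df

def mask_pii(df: pd.DataFrame) -> pd.DataFrame:
    """Governance step: hash direct identifiers before wider use."""
    df = df.copy()
    df["email"] = df["email"].map(
        lambda e: hashlib.sha256(e.encode()).hexdigest()[:16]
    )
    return df

cleaned = mask_pii(clean(raw))

# Minimal lineage / metadata record, kept alongside the data set.
lineage_entry = {
    "dataset": "customer_payments",        # hypothetical name
    "source": "crm_export",                # hypothetical source
    "steps": ["drop_duplicates", "dropna(customer_id)", "mask_pii"],
    "row_count": len(cleaned),
    "processed_at": datetime.now(timezone.utc).isoformat(),
}
print(lineage_entry)
```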
On the right-hand side, we show a general picture of the typical applications: search and query capabilities, AI and BI (all applications that don’t need a specialized engine for the sake of speed), and any other apps that need to feed on raw data.
All the line-of-business applications and office applications are elsewhere, feeding data into the data platform. The data platform is, or should be, the “system of record,” and thus it captures all log files from transactional applications, data which it may need anyway for BI and AI apps.
We show an ETL ingest application at the top, but of course, this is far more complex than just one application. In practice, it is most likely several ETL apps, part of a Kafka network, transporting data into the data platform from both streams and other data sources.
We have not even tried to show one awkward complexity on the diagram: some data streams need to be processed in real time and hence may interact with other business applications before entering the data platform. And that data usage needs its own audit trail.
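As a rough illustration of that ingest path, here is a minimal sketch of a streaming consumer that stamps each event with an audit record before landing it in the lake. It uses the kafka-python client; the topic name, broker address, event fields, and landing path are all assumptions made for the sake of the example.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

from kafka import KafkaConsumer  # pip install kafka-python

# Hypothetical topic and broker address.
consumer = KafkaConsumer(
    "orders",                                  # assumed topic name
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="earliest",
)

landing_dir = Path("/data/lake/landing/orders")  # assumed landing zone
landing_dir.mkdir(parents=True, exist_ok=True)

for message in consumer:
    event = message.value
    # Audit-trail stamp: where the event came from and when we took it in.
    event["_ingest_audit"] = {
        "topic": message.topic,
        "partition": message.partition,
        "offset": message.offset,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }
    out_file = landing_dir / f"{message.partition}-{message.offset}.json"
    out_file.write_text(json.dumps(event))
```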
Finally, data may need to be extracted from the Data Platform and pushed to specialized engines (fast databases) for some applications that the data platform does not have the muscle to execute directly.
Who Are the Players in This Game?
Let’s first be downright honest about the fact that what we painted was a conceptual picture. First off, even if there were data platforms that could do absolutely everything we have mentioned, all the big organizations of the world would still be far distant from this kind of platform simply because of their legacy technology and legacy applications.
Secondly, the picture we have painted leans more towards the world of data lakes than the world of big data engines. And, of course, data lakes are a relatively recent innovation.
Step back and think about it. Why did no one think to build data lakes before?
Perhaps it’s like the electric car, which seems so evident in the rear-view mirror—I mean fuel at a third of the cost, 20 moving parts rather than 2000 in the engine, almost zero maintenance; what’s not to love? But it took Elon Musk with all his bravado to usher it into the world. And it took an open-source parallel processing initiative to establish the data lake.
To identify the players in the game, we need to distinguish between traditional data warehouse technology and the more modern data lake.
So we need to include the likes of Oracle, Teradata, and Vertica, which support massive data warehouses. The data lake put paid to the hegemony of big data engines because, for most businesses, it proved too expensive to hold masses of data in those engines when most of the data you want to process arrived in the last three months.
Nevertheless, if you put a competent data lake architecture in front of a big data engine, you may be able to create a very versatile Data Platform.
If we look at the vendors who offer impressive Data Lake technology, Databricks is currently the clear leader. Their approach seems to align with our conceptual model of a Data Platform. They speak in terms of a data lakehouse rather than a data warehouse. Conceptually we approve. A great deal of what needs to happen to data, namely the housekeeping activities detailed on the left-hand side of our diagram, consists of data lake rather than database applications, so long as the data lake has a definition layer that supports such activity.
Databricks’ lakehouse qualifies. It has a three-layer architecture comprising raw data as the lowest layer, a business layer that defines the data from a business-level perspective, and a specialization layer into which applications can be slotted. It can run AI, BI, and SQL workloads on the data, and it claims excellent price/performance.
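To illustrate the layering idea (this is not Databricks’ own implementation, just a minimal sketch in plain PySpark): raw data is landed as-is, a business-level table is derived from it, and a SQL workload then runs on top. The paths, table names, and columns are hypothetical.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("lakehouse-layers-sketch").getOrCreate()

# Raw layer: land the source files as-is (path and schema are assumptions).
raw = spark.read.json("/data/lake/raw/orders/")
raw.write.mode("overwrite").saveAsTable("raw_orders")

# Business layer: a cleaned, business-level view of the same data.
business = (
    spark.table("raw_orders")
    .filter(F.col("order_id").isNotNull())
    .withColumn("order_date", F.to_date("order_timestamp"))
    .select("order_id", "customer_id", "order_date", "amount")
)
business.write.mode("overwrite").saveAsTable("orders")

# Specialization layer: a SQL/BI workload slots in on top of the business table.
daily_revenue = spark.sql("""
    SELECT order_date, SUM(amount) AS revenue
    FROM orders
    GROUP BY order_date
    ORDER BY order_date
""")
daily_revenue.show()
```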
Dremio describes itself as a cloud data lake engine. It doesn’t use the term lakehouse, perhaps because Databricks got there first, but it makes similar claims in terms of configurability, extensibility, and price/performance.
You can contrast both of those vendors if you like with Snowflake, which, it’s true, is generally thought of as data warehouse technology. Nevertheless, it talks in terms of providing data lake capability. It supports the claim by offering an exceptionally versatile ingest capability, including data cleansing, governance capabilities, and data self-service for analytics and BI. And of course, let’s not forget that Snowflake is cloud-only.
Finally, we think it worth mentioning Ahana, which can best be described as SaaS for Presto. PrestoDB, as I’m sure you are aware, is an open-source distributed SQL engine. Ahana does not offer anything like the comprehensive data platform capability of those we’ve already mentioned, but it could be a valuable component of a Data Platform.
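As a flavour of what such a component looks like from the application side, here is a minimal sketch of querying Presto from Python using the presto-python-client package; the host, catalog, schema, and table are placeholders rather than anything Ahana-specific.

```python
import prestodb  # pip install presto-python-client

# Connection details are placeholders for whatever the Presto endpoint is.
conn = prestodb.dbapi.connect(
    host="presto.example.com",
    port=8080,
    user="analyst",
    catalog="hive",      # assumed catalog backed by the data lake
    schema="default",
)

cur = conn.cursor()
# A SQL query over lake data; the table and columns are hypothetical.
cur.execute("""
    SELECT customer_id, SUM(amount) AS total_spend
    FROM orders
    GROUP BY customer_id
    ORDER BY total_spend DESC
    LIMIT 10
""")
for customer_id, total_spend in cur.fetchall():
    print(customer_id, total_spend)
```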