Old-guard data warehousing stalwarts tend to react viscerally when the subject of the data lakehouse comes up. Watch them closely and you’ll see their brows knit sternly and their neck muscles begin to twitch, their jaws clenching and their pursed lips whitening, until, almost imperceptibly, their mouths contort into a kind of gruesome rictus, as if thinking to themselves: “Oh, the humanity! Go to your happy place. Find your power animal. This, too, shall pass.”
But will it? Is the data lakehouse another in a series of much-hyped “disruptions” – cough, Hadoop, Blockchain – or is it something else again? Certainly, the bourgeois patness of the term counts as one strike against it. (It could be worse. Why not a data lake dacha?) Quite aside from its too-pat labeling, however, doesn’t the debut of the data lakehouse get at something that’s at once novel and kind of inevitable: the reimagination – the translation – of data warehouse architecture?
“The really interesting part from my perspective is that we used to have this monolithic-style architecture for data warehouses … and now, in this next generation … each component is modular, so the query engine is its own entity, and the storage, well, [that is] Amazon S3 for example,” observed Eric Kavanagh, host of DM Radio, a weekly data management-themed radio show, during a recent episode exploring the data lakehouse.
The data lakehouse, as Kavanagh explained, is usually implemented as a cloud SQL query or DBMS-like service. It is layered atop a cloud data lake service, which is itself superimposed over a cloud object storage substrate such as Amazon S3, Google Cloud Storage, or Microsoft Azure Blob Storage. This makes sense. Not only have most enterprise data lakes shifted from the on-premises data center to cloud infrastructure, but cloud object storage is home to a rapidly increasing share of all enterprise data, as well.
Right now, Kavanagh noted, businesses are maintaining both data warehouses and data lakes. Could new cloud data lakehouse services tempt some of them to pull the plug on their data warehouses? Maybe. At a minimum, he argued, the emergence of the data lakehouse invites that most persistent of questions: Why do we still have our data warehouse?
“The forward-looking companies … are going to realize, ‘Well, wait a minute, we don’t want to have to pay for both,’” he pointed out. “‘If I can get the performance I need from a lakehouse architecture, let’s just go down that road and it saves so much on the ETL side.’”
One of Kavanagh’s guests was Dipti Borkar, chief evangelist with Ahana, a company that develops a managed SQL query service based on the open source Presto SQL query engine.
Borkar noted that while the idea of querying data in S3 and similar cloud storage services is not new, the ability to perform inserts or updates is. “This is a pretty fundamental change in the data lake architecture and the stack, because now you can actually run … almost all workloads [that run] on the data warehouse on the lake. And that’s … game changing,” she asserted.
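To make that “fundamental change” concrete, here is a minimal sketch of what DML against the lake can look like, assuming a SQL engine paired with an open table format (Apache Iceberg, Delta Lake, or Apache Hudi) that supports transactional writes on object storage; the catalog, schema, table, and column names are hypothetical, and the exact syntax varies by engine.

```sql
-- Hypothetical table backed by an open table format, living in cloud object storage.

-- Append new rows directly to the lake table.
INSERT INTO lake.sales.orders (order_id, customer_id, status, amount)
VALUES (1001, 42, 'pending', 129.99);

-- Correct a row in place -- something a classic Hive-style data lake could not do
-- without rewriting whole partitions by hand.
UPDATE lake.sales.orders
SET status = 'shipped'
WHERE order_id = 1001;
```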
Die Datenwarenhausdämmerung! (Twilight of the data warehouse!)
Then again, Borkar would say that, wouldn’t she? After all, Ahana’s is one of a chorus of vendor voices that proposes to disintermediate (an ungainly, but apt, word) the seemingly ineradicable data warehouse.
But aren’t she and Ahana also on to something? The company’s pitch is certainly straightforward enough: instead of (first) ingesting data into the data lake and (second) engineering it for use with a data warehouse, why not just query it in situ – in the lake itself?
“Now you have the benefit of running a SQL engine in place without moving [or] ingesting the data into another system,” Borkar pointed out.
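What “running a SQL engine in place” looks like in practice is roughly the following sketch, written in Presto/Trino-style syntax over a Hive-connector catalog; the bucket, schema, and column names are hypothetical, and the available table properties differ from engine to engine.

```sql
-- Register existing Parquet files in S3 as an external table. No data is moved
-- or ingested; the engine reads the files where they already live.
CREATE TABLE hive.weblogs.page_views (
    user_id    bigint,
    url        varchar,
    viewed_at  timestamp
)
WITH (
    external_location = 's3://example-bucket/raw/page_views/',  -- hypothetical bucket
    format = 'PARQUET'
);

-- Query the data in situ, straight out of object storage.
SELECT url, count(*) AS views
FROM hive.weblogs.page_views
WHERE viewed_at >= date '2021-06-01'
GROUP BY url
ORDER BY views DESC
LIMIT 10;
```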
On the one hand, the simplicity of this vision elides a great deal of nuance. So, yes, one reason you engineer data prior to loading it into the warehouse is so that you can model it. And one reason you model data is so that you – or, more precisely, a database – can efficiently store, retrieve, and perform operations on it.
However, another reason you model data is so you can manage, govern, and (a function of both) make better use of it: so you understand what it is, where it came from, how it was engineered, who accessed it, what they did to it, etc. You manage and govern data so you can more effectively disseminate it, so consumers can more reliably share it, and so non-expert consumers, especially, can dependably access or discover the data they need when they need it. Lastly, you manage and govern data so that people trust it. This is the nuanced explanation as to why you model data; modeling to improve data warehouse performance is just one part of it. The upshot is that data modeling is indissolubly bound up with data management and governance.
On the other hand, Ahana and similar vendors do not presume to eliminate data modeling as such; rather, they argue in favor of changing the context in which businesses model data, as well as the kind of modeling they must perform. This is a quite different argument.
In conventional data warehouse architecture, businesses perform several modeling steps. They begin by engineering data so that it conforms to a predefined 3NF, Data Vault, or dimensional data model. They then load this data into the warehouse. In most cases, too, businesses opt to design denormalized views that they customize for different types of use cases. These views, instantiated in what used to be called a BI layer, also help simplify access for BI tools.
Data lakehouse proponents have two things to say about this. The first (as we’ve seen) is that businesses can now query and update data in the lake itself without also performing heavy-duty data engineering. The second is that businesses can shift most data modeling logic into the BI or semantic layer. In either case, there is no longer a requirement to model data for the warehouse. This argument reduces to the following logic: we used to have to model data twice – first so we could store, retrieve, and process it efficiently; second, so we could make use of it for BI and analytics – but we don’t have to do that anymore. Modern compute resources are fast and scalable enough to eliminate the need for data warehouse-specific data modeling.[1]
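One way to picture the second claim: instead of conforming data to a dimensional model before loading it, the denormalized, consumer-friendly shape is expressed as a view (or an equivalent semantic-layer definition) over the raw lake tables, and the joins happen at query time. A hedged sketch, with hypothetical schema and column names:

```sql
-- The "modeling" lives in a view definition rather than in an upstream ETL job
-- that conforms data to a star schema before loading it into a warehouse.
CREATE VIEW lake.analytics.sales_flat AS
SELECT
    o.order_id,
    o.order_date,
    o.amount,
    c.customer_name,
    c.customer_segment,
    p.product_name,
    p.product_category
FROM lake.raw.orders    AS o
JOIN lake.raw.customers AS c ON o.customer_id = c.customer_id
JOIN lake.raw.products  AS p ON o.product_id  = p.product_id;

-- BI tools query the view; the engine performs the joins at query time.
SELECT customer_segment, sum(amount) AS revenue
FROM lake.analytics.sales_flat
GROUP BY customer_segment;
```

Whether this holds up at scale is, of course, exactly the bet the argument asks businesses to make.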
Behold: the data lakehouse
Borkar sees this insight, or something like it, as the beginning of the post-data warehouse era – or, if that’s too extreme, as a totally new phase in the evolution of what used to be called data warehouse architecture.
“We are seeing users who are even pre-pre-data warehouse,” she said. “They’re still running analytics on MySQL and Postgres in the cloud, and now they have an option … where [they] say ‘Hey, I could move to a lake … or I could still use a warehouse.’”
She speculated that “some of them will just skip the warehouse and move to a … lakehouse because [they] finally have the option to actually do that.”
Li Kang, vice president of North American operations with Kyligence, a vendor that develops an analytic services and management platform for big data based on the Apache Kylin project, expanded on this idea.
“We’re definitely seeing the convergence of the data platform architectures,” he told Kavanagh, referring to the consolidation of data and analytic processing workloads into a single context – the data lake – which supports a diversity of practices, such as data engineering, data science, ML/AI engineering, and decision support. As Kang sees it, even established data warehousing vendors are hip to this phenomenon: “We’re also seeing the data warehouse vendors … expanding their traditional data warehouse concept to cover the more broad use cases and workloads.”
He acknowledged the inherent ambiguity of this “convergence” – e.g., is the data warehouse converging toward the data lake, or vice-versa? – but argued that the answer to this question doesn’t actually matter. “The data lake now … has much better data support, data quality support, workload management, security … and [the] data warehouse [is] now expanding … to support unstructured data, or external files,” Kang said, suggesting that (because of this convergence) many businesses are coming to see in the data lakehouse the logical synthesis of both.
His larger point was not that the data warehouse is obsolete, but, rather, that it is undergoing a kind of evolutionary supersession. In this sense, the data lakehouse marks the translation of data warehouse architecture into cloud-native concepts and principles. “We’re seeing this trend of convergence … from [the] traditional lake and the traditional warehouse, and that’s why you’re seeing this [data] lakehouse concept right here, the coming [at it] from both angles,” he said.
Lost – and found – in translation
This doesn’t mean the data warehouse is poised to slip the surly bonds of the enterprise data center, never to return. It means that data warehouse architecture is getting translated into – adapted to, retrofitted and reimagined for – a very different paradigm: cloud-native design. The data lakehouse is just one example of this.
In this regard, it is useful to distinguish between “cloud” in its concrete aspect – e.g., the public cloud; the market for SaaS, PaaS, IaaS, FaaS, etc. cloud services – and “cloud” in the abstract: i.e., cloud as designating a basic set of cloud-native concepts, methods, and enabling technologies. This cloud-native “cloud” is actually equally at home in either the hyperscale public cloud or the enterprise data center. The advent of on-premises hyper-converged infrastructure is a great example of this: a way of achieving cloud-like resource elasticity in the on-premises environment.
All of this is to say that businesses will likely deploy data lakehouse-like services wherever they host or maintain data lakes – in the public cloud and in the on-premises environment. Irrespective of where it is deployed, however, the architecture of the data lakehouse will little resemble the archetypal data warehouse. Unlike that system, the lakehouse is engineered to run atop pooled, virtualized compute, storage, and network resources. To exploit the elasticity that is the defining feature of this software-defined infrastructure, the lakehouse’s constitutive functions are decomposed and loosely coupled. So, for example, instead of consolidating most application functions – e.g., data storage and data management; query optimization and query processing; etc. – into a software monolith, as with an RDBMS, the data lakehouse implements each as a discrete service. In their totality, these services constitute the lakehouse and its functions; however, no one service is dependent on any other. This is basic cloud-native design.
Abstraction at this level permits a kind of functional elasticity, with the result that services can be orchestrated independently and provisioned dynamically, e.g., in response to specific events, such as an API call, a trigger or alert, a pattern of activity, etc. These are logical and necessary – indeed, evolutionary – changes. There is much to like in this.
So we’ve seen that the architecture of the lakehouse little resembles that of the classic data warehouse – but what about its functions? Its roles and responsibilities? Do these change, too? For the most part, no, they do not. Because they do not cease to be necessary.
And this is what worries me. This is what causes my brow to knit sternly and my neck muscles to twitch, my jaw to clench and my pursed lips to whiten. This is why I suspect there is something wrong with my liver – old-guard data warehousing stalwart that I am.
Epilogue: Let’s not [censored] this up
“The table drawn up after the dreams of every day, week, and month have been collected, classified, and studied must always be absolutely accurate. To this end not only is there an enormous amount of work to be done in processing the raw material, but it is also of the utmost importance that the [Palace of Dreams] should be closed to all external influence.” – Ismail Kadare, The Palace of Dreams
If the data lakehouse is to be more than just an awkward pun in translation, if it is to supersede and supplant the data warehouse, it must do justice to the core principles of data warehouse architecture, starting with the warehouse’s foundational role in managing and governing data, as well as its instrumentality in producing consistent, replicable results.
Like it or not, the Ur-role of the data warehouse is as the arbiter of ground truth for all business-critical information in the enterprise. This has less to do with a chimaeric “single version of the truth” than with its role in authorizing a kind of qualified objectivity: in the context of financial reporting, strategic decision-making, and long-term planning – or with respect to any of the countless operational applications and services that obtain results from the warehouse tens of thousands of times each day – the raison d’être of data warehouse architecture is to ensure that the business is supplied with consistent, objectively valid data. Objectivity in this sense “implies nothing about truth to nature [and] has more to do with the exclusion of judgment, the struggle against subjectivity,” as Theodore M. Porter aptly put it in his 1995 book, Trust in Numbers. I do not think we do an injustice to Porter if we amend his definition to exclude arbitrariness and accident, too.
In this respect, an institution like the data warehouse also plays a special role in producing and indemnifying business knowledge. Its imprimatur is that of objectivity: the facts and figures that it produces are accessible to, and can be replicated by, everybody. This objectivity is the ultimate ground for discussion, consensus, and action. It makes the business legible to itself – i.e., amenable to control, administration, and oversight – and also permits oversight by outside bodies; at the same time, it shields decision-makers from arbitrary censure (or, worse, sanction) by interests of all kinds – shareholders, juridical organs, regulatory agencies, and so on. “Inevitably, the goal of managing phenomena depends also on convincing an audience,” wrote Porter, referring to the power of the “stamp of objectivity to certify … figures.” He sees objectivity as a construct, a tool, that evolved in businesses and bureaucratic institutions of all kinds: “[T]he relative rigidity of rules … ought to be understood in part as a way of generating a shared discourse, or of unifying a weak community.”
The business knowledge, the facts and figures, produced by the warehouse are “objective” because everybody has the same data. And the reason everybody has the same data is because (a) this data is itself the product of governed, replicable data engineering processes, and (b) data warehouse architecture enforces safeguards which ensure that dirty reads, lost updates, and other anomalies cannot occur during data processing: that n concurrent users always obtain valid, consistent results. So, for example, if 15 operational applications submit exactly the same query at exactly the same time, each gets exactly the same result – even if (a fraction of a second later) a batch update attempts to change this data.
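In SQL terms, that guarantee is what transaction isolation is for. The following is a generic, ANSI-flavored sketch, not tied to any particular product; the table and column names are hypothetical, and the isolation levels an engine actually supports (and their exact semantics) vary.

```sql
-- Session A (one of the 15 applications): a consistent read.
START TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT sum(amount) AS daily_total
FROM orders
WHERE order_date = DATE '2021-06-30';
COMMIT;

-- Session B (the batch job), running at the same moment:
START TRANSACTION;
UPDATE orders
SET amount = amount * 1.02
WHERE order_date = DATE '2021-06-30';
COMMIT;

-- Under snapshot-style isolation, every reader that began before Session B
-- committed sees the same pre-update total; none sees a half-applied batch
-- (no dirty reads, no lost updates).
```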
This need for something like a data warehouse in producing, managing, and authorizing objectively valid business knowledge does not go away – especially if a would-be successor aspires to replace it. To do so is (ipso facto) to assume all of its core responsibilities.
This means, one, that any viable data lakehouse implementation must bring forward data warehouse-like data management and data governance capabilities, and, two, that it must effectively balance the overriding demand for performance against the more muted – but still foundational – requirement for strong consistency.
[1] This scheme actually has some precedent in data warehouse design, believe it or not: with the advent of massively parallel processing (MPP) data warehouse appliances, for example, businesses started experimenting with so-called “one-big-table” schemas that eschewed conventional 3NF or dimensional modeling: basically, this schema denormalizes all data in the warehouse into a single “wide” table.
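For the curious, a hypothetical sketch of such a “one-big-table” schema follows; in effect, the dimension attributes ride along on every fact row instead of living in separate, joined tables.

```sql
-- Everything denormalized into a single wide table: no dimension tables and
-- no joins at query time -- just (a great deal of) redundant storage.
CREATE TABLE sales_obt (
    order_id          bigint,
    order_date        date,
    amount            decimal(12, 2),
    customer_id       bigint,
    customer_name     varchar(200),
    customer_segment  varchar(50),
    product_id        bigint,
    product_name      varchar(200),
    product_category  varchar(100),
    store_id          bigint,
    store_region      varchar(100)
);
```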
About Stephen Swoyer
Stephen Swoyer is a technology writer with more than 25 years of experience. His writing has focused on data engineering, data warehousing, and analytics for almost two decades. He also enjoys writing about software development and software architecture – or about technology architecture of any kind, for that matter. He remains fascinated by the people and process issues that combine to confound the best-of-all-possible-worlds expectations of product designers, marketing people, and even many technologists. Swoyer is a recovering philosopher, with an abiding focus on ethics, philosophy of science, and the history of ideas. He venerates Miles Davis’ Agharta as one of the twentieth century’s greatest masterworks, believes that the first Return to Forever album belongs on every turntable platter everywhere, and insists that Sweetheart of the Rodeo is the best damn record the Byrds ever cut.