R.I.P. – the data warehouse. You had a good run, from your origins in the 1980s to your obsolescence in the second decade of the new millennium. Like all products of human ingenuity, you addressed several critical problems and you provided a definite set of useful functions. Ultimately, however, you outlived your usefulness. Turns out, the problems you addressed and the functions you provided were specific to a certain time and place. Once conditions changed, you were dead tech, a relict remainder.
This is the epitaph advocates of decentralized data architectures are eager to write about the data warehouse. And, lest there be any confusion, these people have come to bury the data warehouse, not to praise it. “We have to accept that we cannot move all data to one place anymore,” commented Gavin Robertson, CTO with WhamTech, a Dallas-based vendor that markets a data fabric service it calls SmartData Fabric, on an episode of DM Radio. “It’s just a part of life now. And the idea of having a central data warehouse that’s going to answer all your questions is just – I mean, that’s decades old.”[i]
Robertson makes a very good point: why do we still expect to move data from the edge to the center? Especially in an era in which data tends to be widely distributed in time and space? Isn’t there a better way? And isn’t the data warehouse itself inextricably bound up with a restrictive “shopkeeper” governance model that makes it onerous for people to get the data they need? Isn’t it difficult to add new sources to the warehouse? To make this data available to consumers? In sum: can’t we do better?
“We don’t want IT to be the bottleneck anymore,” said Eric Kavanagh, host of DM Radio, a weekly, nationally syndicated, data management-themed radio show. During his remarks, Kavanagh aptly summarized the views of the data warehouse’s discontents – i.e., the executives, directors, managers, analysts, data scientists, software engineers, and others who (43 years on) want as little to do with the warehouse as possible. “We don’t want any bottlenecks. You want to have a fluid experience with data and analysis as you’re leveraging the power of insights to change your business. A lot of times you want that stuff to be lights out, you don’t even want to have a human who must be in the loop.”
‘Tis a consummation devoutly to be wished.
Data fabric and data mesh as two different takes on decentralization
There are two primary forces for decentralization: data mesh architecture and data fabric technology.
Proponents of both schemes tend to make common cause against centralized data architectures – and their figural avatar, the data warehouse. In practice, however, data fabric and data mesh are two quite different things, at least conceptually. Complicating matters is the fact that several core data fabric technologies – data virtualization, metadata cataloging, and knowledge discovery – are also used in data mesh architecture. Within limits, it is accurate to say that the technological building blocks of data mesh architecture are similar – and, for most purposes, identical – to those of the data fabric.
Conceptually, data mesh architecture is a radically different way of thinking about how an organization manages, governs, and uses its data: the data mesh is a decentralized lifecycle architecture that aims to formalize the set of concepts, policies, processes, and practices that are bound up with the creation, maintenance, and reuse of data. The data fabric, by contrast, is less a lifecycle architecture than an essential enabling technology: similar, in its way, to JBOD as compared to RAID storage.
To this end, the data fabric is usually positioned as a decentralized means of facilitating access to data.
Each scheme is biased in its own way. Data mesh is biased in favor of managing data in a distributed, decentralized architecture; data fabric is biased in favor of distributed, decentralized data access. These are two very different priorities. True, both schemes valorize a self-service ethic: e.g., an emphasis not just on local control, but on consumer autonomy. And, yes, the priorities of access and reuse are implicit in the logic of data management, just as that of management is implicit in the logic of data access. Ultimately, however, their basic biases determine the aims and priorities of both schemes.
Data mesh as a radical reimagination of data architecture
There’s another important difference, too: data mesh architecture has an ideological dimension that the data fabric does not. Like the data fabric, it is decentralized by design: if you grok the basics of mesh networking, or the concept of a service mesh, you basically grok what a data mesh is and how it’s different. Unlike the data fabric, however, data mesh architecture is politically decentralized: it draws on the concepts of domain-driven design to reapportion responsibility for the ownership and provisioning (or production) of data. In other words, data that is produced (and primarily consumed) by finance is owned and served by finance. Ditto for sales and marketing, HR, etc. Data no longer flows from the periphery to the center; rather, data resides where it gets produced and is accessed in situ.
If a consumer in HR wants data that is produced and owned by finance, she goes to finance. If an analyst wants data that is produced and owned by finance, sales, and/or logistics, she goes to these domains. In practice, she doesn’t actually “go” anywhere: data mesh, too, relies on data virtualization, metadata cataloging services, and other technologies (e.g., automated, ML-powered data discovery, profiling, and cleansing/preparation tools) to knit together a far-flung constellation of data producers.
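The access pattern described here can be sketched in a few lines of Python. This is a minimal illustration rather than any vendor's API: the `DataProduct` and `Catalog` classes, the product names, and the sample rows are all assumptions made up for the example. The point is only that the catalog knows who owns what and delegates each read to the owning domain instead of holding a central copy.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, object]


@dataclass
class DataProduct:
    """A dataset that stays in, and is served by, its owning domain."""
    name: str
    owner_domain: str
    fetch: Callable[[], List[Row]]  # reads the data where it lives


class Catalog:
    """Hypothetical metadata catalog: maps product names to their owners."""

    def __init__(self) -> None:
        self._products: Dict[str, DataProduct] = {}

    def register(self, product: DataProduct) -> None:
        self._products[product.name] = product

    def owner_of(self, name: str) -> str:
        return self._products[name].owner_domain

    def query(self, name: str) -> List[Row]:
        # No copy is kept centrally; the call is delegated to the owning domain.
        return self._products[name].fetch()


# Each domain publishes its own product (the rows are made-up sample data).
catalog = Catalog()
catalog.register(DataProduct(
    name="finance.invoices",
    owner_domain="finance",
    fetch=lambda: [{"invoice_id": 1, "amount": 120.0}],
))
catalog.register(DataProduct(
    name="sales.orders",
    owner_domain="sales",
    fetch=lambda: [{"order_id": 7, "invoice_id": 1}],
))

# An analyst in HR asks the catalog, not a central warehouse.
invoices = catalog.query("finance.invoices")
orders = catalog.query("sales.orders")
```

In a real deployment the `fetch` callables would presumably be replaced by a data virtualization layer issuing queries against each domain's systems; the lambdas here only stand in for that step.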
Moreover, the “domains” that populate the data mesh need not correspond to canonical business function areas. A “domain” can be any functional unit that is attached to or affiliated with the business. So, for example, practice areas (e.g., data science, machine learning (ML) and artificial intelligence (AI) engineering, DevOps), product groups, and individual business units (or groups within individual business units) can constitute their own domains, each with its own customs, rules, definitions, etc.
The emphasis, as with multi-domain master data management (MDM), is on accommodating diversity – in the form of local, domain-specific knowledge, customs, values, priorities, etc. – while at the same time making each domain’s data accessible to consumers affiliated with other, non-local domains.
To this end, the data mesh conceives of data itself as a product. So, for example, individual domains create and maintain their own data “products” – basically, data pipelines that are instantiated in a semantic layer, in apps and services, or (e.g., as views) in local relational database systems.
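As a rough illustration of the last case, a data product exposed as a view in a domain's local relational database, here is a small sketch using Python's built-in `sqlite3` module. The `ledger` table, the `spend_by_cost_center` view, and the sample figures are hypothetical; an in-memory SQLite database merely stands in for whatever system the finance domain actually runs.

```python
import sqlite3

# In-memory database standing in for the finance domain's local system.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ledger ("
    "entry_id INTEGER, cost_center TEXT, amount REAL, internal_note TEXT)"
)
conn.executemany(
    "INSERT INTO ledger VALUES (?, ?, ?, ?)",
    [
        (1, "HR", 100.0, "draft"),
        (2, "HR", 50.0, "ok"),
        (3, "Sales", 75.0, "ok"),
    ],
)

# The data "product": a stable view that other domains can query.
# Internal columns such as internal_note are not exposed through it.
conn.execute(
    """
    CREATE VIEW spend_by_cost_center AS
    SELECT cost_center, SUM(amount) AS total_spend
    FROM ledger
    GROUP BY cost_center
    """
)

rows = conn.execute(
    "SELECT cost_center, total_spend "
    "FROM spend_by_cost_center ORDER BY cost_center"
).fetchall()
```

The view gives consumers a fixed contract while leaving the finance domain free to restructure the underlying table, which is one plausible reading of what "data as a product" means in practice.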
An abstraction layer – for example, the data fabric itself – arbitrates between local (i.e., domain-specific) and organization-wide rules, customs, definitions, etc. In this way, data mesh architecture proposes to make each domain’s data legible to consumers affiliated with other, non-local domains.
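One simple way to picture this arbitration is as a translation from each domain's local field names to a shared, organization-wide vocabulary. The sketch below assumes exactly that and nothing more: `FIELD_MAPPINGS`, `to_shared_terms`, and all field names are invented for illustration, and a real data fabric would also have to reconcile types, units, and business definitions.

```python
from typing import Dict

# Hypothetical per-domain mappings from local field names
# to organization-wide names.
FIELD_MAPPINGS: Dict[str, Dict[str, str]] = {
    "finance": {"cust_no": "customer_id", "amt": "amount"},
    "sales": {"client": "customer_id", "deal_value": "amount"},
}


def to_shared_terms(domain: str, record: Dict[str, object]) -> Dict[str, object]:
    """Rename a domain's local fields to the shared vocabulary.

    Fields with no mapping pass through unchanged, so local detail is kept.
    """
    mapping = FIELD_MAPPINGS[domain]
    return {mapping.get(key, key): value for key, value in record.items()}


finance_row = to_shared_terms("finance", {"cust_no": 42, "amt": 120.0})
sales_row = to_shared_terms(
    "sales", {"client": 42, "deal_value": 300.0, "region": "EMEA"}
)
```

After translation, both records expose `customer_id` and `amount`, so a consumer in another domain can read them without learning either domain's local conventions.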
“[The data mesh] opens up flexibility and agility for the teams themselves. They don’t have to climb up the ladder to get some approved schema change, for example, they don’t have to beg for their data source to be loaded somewhere – you’re offloading a lot of that heavy lifting to the different domain groups,” Kavanagh summed up. “And that allows them to become self-sufficient while still feeding the centralized data store. Such that you still have that strategic view from a senior executive perspective.”
Data mesh as a “solution” architecture for distributed data and analytics?
Bruno Aziza, head of data and analytics with Google Cloud, elaborated on the distinction between the data mesh and the data fabric during a discussion with DM Radio host Kavanagh. Aziza sees the mesh as akin to what he calls a “solution” architecture – and not, primarily, as an architecture for managing data. The mesh gives us “a way to think about the problem of centralizing data and federating analytics,” he told Kavanagh. The data fabric, by contrast, is explicitly technological: “I feel like the data fabric is that technology that enables the data mesh, because the data mesh is an organizational context. You don’t buy a data mesh, right? But you could buy a data fabric solution,” Aziza said.
That said, Aziza doesn’t see the data mesh – or the data fabric that underpins it – as solely a tool for decentralization. Rather, he positions it as an enabling technology in a hybrid data architecture.
To illustrate his point, Aziza cited conversations he said he’s had with companies that are balancing the distribution of their data between the periphery and the center. “How do you manage data at scale? The answer is not everything to the edge, and [not] everything to the center. It is probably 50/50. And the data mesh allows you to think about that from not just a technology deployment standpoint, but from an organizational standpoint, which is why I really like the concept itself,” he told Kavanagh.
This vision appears to be consistent with the premise of the data fabric, which is architecturally neutral.
As I wrote in a separate context, “the data fabric gives data architects a way both to tie together otherwise dispersed data resources and to accommodate the unpredictable data access needs of specialized consumers, such as data scientists, ML/AI engineers, and software engineers.”
Very good. But is this “50/50” scheme consistent with the premise of the data mesh, which aims to formalize the lifecycle of producing, maintaining, and reusing data in a decentralized architecture?
Pick a lane: centralized or decentralized
I pose this last as a quasi-open-ended question. Personally, my gut tells me that hybridity of this kind will be difficult to sustain in practice. This is because data management requires a controlling authority – a “sovereign” – of some sort. In practice, this is the nominal role of data architecture, which specifies the requirements and functions needed to “manage” data. However, and to my point, architecture should always have a clear, well-defined purpose.[ii] It should have clear, well-defined boundaries. In the same way, it should clearly define its constraints. Above all, it should be clear to itself about its ground-truth assumptions, and – please note – should not expect to exercise its sovereignty if these assumptions cannot be met. Most of these goals seem impossible to reconcile with hybridity.
Absent an architectural sovereign – e.g., either a centralized, top-down or a decentralized, bottom-up regime – the hybrid situation could metastasize into what medieval historians have called the Zweikaiserproblem: i.e., the problem of two (or more!) emperors. With respect to the problem of data management, the end result might end up looking a lot like what used to be called spreadmart hell, with multiple, simultaneous data silos claiming authority and legitimacy: an Avignon data-ocracy.
This is not to say that a data mesh cannot accommodate centralized data repositories, or vice-versa.[iii]
It is to say that only one architectural regime can be sovereign, be it centralized or decentralized.
This is a big problem that requires a great deal of thought. What we absolutely do not need is organizations going off and building poorly thought-out data mesh and data lake/house implementations, all the while convinced that both regimes can coexist simultaneously.
We should also recognize that siloing is a natural tendency in human organizations, and that – as we saw again and again in the data warehouse era – the phenomenon of siloing has an essential political aspect. That is, people – individuals, groups, and larger constituencies – tend to express dissatisfaction via different kinds of challenges to established authority. Siloing – the use of unauthorized tools, processes, practices, etc. – was one of the most common challenges to top-down authority. It was not always (or mostly) spiteful, willful, or capricious; in many cases, its occurrence was symptomatic of genuine frustration and dissatisfaction. With this in mind, it is naïve to expect that the factors bound up with siloing – e.g., different kinds of business, economic, social, or technological changes; a divergence between local and organization-wide priorities; anger, envy, resentment, confusion, frustration, and other (usually negative) emotions – will cease to be problems in a decentralized architecture.
My point is two-fold. First, a bifurcated, centralized-decentralized architecture is going to be especially vulnerable to disruption of this kind, such that siloing that would otherwise be avoidable becomes inevitable. Second, when it comes to centralization versus decentralization, organizations need to go ahead and pick a lane.
And stick with it. I’ll close with a quote that, to me, really gets at what is essential about architecture:
“There are no buildings that have been built by chance, remote from the human society where they have grown and its needs, hopes and understandings, even as there are no arbitrary lines and motiveless forms in the work of the masons. The life and existence of every great, beautiful and useful building, as well as its relation to the place where it has been built, often bears within itself complex and mysterious drama and history.” – Ivo Andrić, The Bridge on the River Drina
[i] The data warehouse was not conceived as a central repository for all data, but, rather, for a derived subset of data – specifically, business-critical data, e.g., the data required to create different kinds of shared, composite representations (analytic views) across systems. In the old days, the bulk of OLTP data never made it into the warehouse. Nor should it have.
[ii] This is implicit in the Greek root arkhé, which connotes a sense of beginning, along with a sense of principle or purpose.
[iii] For example, in a scheme in which data mesh architecture is sovereign, a centralized repository – e.g., a subject-specific data mart, a data warehouse, even a data lake – could be treated as, in effect, just another data producer: i.e., a practice area that constitutes its own autonomous domain and which is vested with local control over the data it produces and manages.