Some enterprises are still limping along on legacy data warehouse architecture, partying like it’s 1989.
They’re used to doing things a certain, specific way, to observing certain, specific customs and conventions. Their data management and data engineering, even critical aspects of their BI and analytics – everything is stuck in time. The names of the products they use, along with most of their enabling technologies, may have changed. The contexts in which these IT assets live may have changed, too – i.e., shifting from the on-premises data center to cloud infrastructure.
Conceptually, however, these enterprises are still ingesting data in a certain, specific way, still engineering it in certain, specific ways, and still persisting it into the same types of repositories.
An enterprise of this type is poorly positioned if it wants to transform itself digitally.
“If you want to do digital transformation – and you better do digital transformation, or you’re not going to be around much longer – you have to focus on the data, you have to find some mechanism to get it [into the cloud], and it needs to be put into a marshaling area that you can govern and manage, because otherwise you’re going to be chasing down loose ends for the rest of your life,” argued analyst Eric Kavanagh, in the most recent edition of the Bloor Group’s Briefing Room webcast series.
Kavanagh was joined by Vinayak Dave, professional services manager with cloud-native data integration specialist Equalum, and Yves Mulker, founder of 7wData, a data management firm that specializes in solving data integration and management problems for digital transformation projects.
“[Data is] the key pillar for digital transformation, because [everything] we do in the digital world generates data,” Mulker told Kavanagh. Okay, that much is obvious. As Mulker explained, however, getting at this data can be surprisingly difficult. A modern digital workflow not only spans multiple contexts – e.g., disparate geophysical locations, cloud infrastructure regions, and cloud services – but also generates data that is unique to each context. So, for example, a person who accesses a retailer’s e-commerce website from her mobile phone generates useful location data on that device. As she browses products on the retailer’s site, she generates useful clickstream data. If she changes something in her account settings, she generates essential customer data. And so on, and so forth.
A subtler point is that the software enterprises create to underpin their digital transformation efforts also generates data – viz., diagnostic data. (Organizations are increasingly instrumenting their software to create and transmit diagnostic data. This is consistent with the logic of observability, a foundational concept in next-gen software architecture.) This data, like that created by the e-commerce workflow described above, is usually dispersed across distinct geophysical and virtual contexts. Success with digital transformation hinges on the ability to combine data from different sources into analytic views that can yield useful insights. Conveniently enough, this is also the key to success in retail and e-commerce; in financial services, insurance, and telecommunications – in all markets, at all times.
So, companies need to be able to consolidate data from disparate sources into different kinds of analytic views. This includes not only the data generated by observable, digitally transformed business processes – finance, HR, sales and marketing, etc. – but also data created by different kinds of B2B and B2C processes. The key to successful digital transformation is also the key to success in business.
Full stop.
Digital transformation reconsidered
Digital transformation poses a set of difficult data management problems. How do you integrate data that originates in separate, sometimes geographically far-flung locations? Or, more precisely, how do you integrate data that is widely distributed in geophysical and virtual space in a timely manner?
This last is one of the most misunderstood problems of digital transformation.
Software vendors, cloud providers, and, not least, IT research firms talk a lot about digital transformation. Much of what they say can safely be ignored. In an essential sense, however, digital transformation involves knitting together jagged or disconnected business workflows and processes. It entails digitizing IT and business services, eliminating the metaphorical holes, analog and otherwise, that disrupt their delivery. It is likewise a function of cadence and flow: i.e., of ensuring that the digital workflows which underpin core IT and business services function smoothly and predictably; that processes do not stretch – grind to a halt as they wait for data to be made available or for work to be completed – or contract, i.e., that steps in a workflow are not skipped if resources are unavailable.
“We’re talking about … the ways of getting [data] in the right spots and creating those insights, that’s still a big challenge. Having that digital data allows you to create better baselines and benchmarks to see where are we with our processes, where are we with our organization,” Mulker told Kavanagh.
In this respect, too many enterprises have one foot planted in the future and one foot stuck in the past.
Even as they’re focusing on digitizing manual processes and workflows, they’re relying on outmoded data architectures – themselves premised on outmoded assumptions – to support their digital transformation efforts. Not only do these enterprises lack the ability to tightly knit together the operations in their business workflows, but they also lack insight into their operations in both breadth and depth.
They cannot intelligently direct their digital transformation efforts because they can measure neither the progress nor the effectiveness of these efforts. “Trying to get a holistic view is still a big challenge,” Mulker said. “What we keep on hearing from the business is ‘Why does it take so long? We ask for something [from] IT, and it takes us several months … before we can get access to the data.’”
The nuts and bolts of digital transformation
To meet this challenge, some enterprises are doubling down on outmoded technologies.
“Outmoded” is a loaded term. It is less that the technologies in question are “outmoded” than that they have been superseded for certain purposes. A data warehouse is still a foundational system. It is still the single source for the governed data that is essential for decision-making. It is the business’s memory: a repository of its history, of its changes over time. It provides a panoptic view of the business and its operations. But data warehouse architecture was designed with certain core assumptions in mind.
Those assumptions are no longer valid for all consumers and all use cases at all times.
“Currently, we have on-prem data sources, we have batch processing, we are writing into a data warehouse. You can still go there – you can still have your on-prem sources … they are not going away,” stressed Equalum’s Dave, whose company develops a change data capture (CDC) data integration platform. “But how do we combine writing into the data warehouse, as well as writing into the cloud, so you can write into [Amazon] Redshift, write into Snowflake or any other S3 bucket, and have a solution that allows you to do everything?”
Dave’s description of the problem only scratches its surface. In fact, Equalum and other vendors are focusing on a complex of related changes. The first is that useful data, compute, and storage resources are always distributed: this is a feature, not a bug, of loosely coupled software architecture – as well as a bedrock principle of cloud infrastructure. The second is that analytics no longer comprises a homogeneous set of practices centered on a monolith, the data warehouse; new analytic practices have emerged that make use of new tools and techniques. Third, some of these new tools and techniques require access to right-time, as distinct from periodic, data: they expect to consume data immediately after it is created.
Other key changes are that, fourth, a fluxional, data-in-motion paradigm has supplanted the legacy (static) data-at-rest paradigm; and that, fifth, this new data-in-motion paradigm is bound up with the emergence of a related shift – one in which orchestrated, event-driven process workflows have replaced the traditional reliance on scripted, tightly choreographed workflows. (This is really the essence of what we mean by “digital transformation.” It is also consistent with a conceptual shift from service choreography to service orchestration, e.g., in cloud-native software design.) This leads to a sixth key change: namely, that organizations now depend on new technologies, such as stream processing buses, data fabrics, and API-based data exchange mechanisms to query, acquire, and/or engineer data.
Forward-thinking organizations are already designing distributed data architectures on these principles, Dave noted. “According to a Gartner report, by 2023, most of the organizations, a majority of them, will be using multiple types of data delivery, meaning they may be doing some ETL [and] a lot more streaming, near-real-time data ingestion,” he said. “Also, they’ll be doing some classic CDC replication, writing data from [for example] files to Snowflake or from SQL Server to MySQL.”
The tools and techniques of legacy data warehouse architecture cannot get the job done
As Dave explained, conventional ETL- or ELT-based data integration is dictated by the cadence of the batch processing model: every n seconds, minutes, or hours, a data integration tool acquires data from OLTP databases and other upstream producers, engineers it to conform to a relational data model, and loads it into the data warehouse. Until recently, this warehouse was usually instantiated in a relational database. Herein lies the first outmoded assumption: that useful data is always, or primarily, strictly structured, as with relational data extracted from OLTP systems. In reality, businesses now generate and expect to analyze large volumes of non-relational data, too: the semi-structured data embedded in software messages, device logs, sensor events, and the like. Even if it were cost-effective to store this data in an RDBMS, other changes militate against doing so.
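As a rough illustration of that cadence, here is a minimal batch ETL sketch in Python. It is not any vendor's implementation: sqlite3 stands in for both the OLTP source and the warehouse, the table and column names are hypothetical, and both tables are assumed to exist. The point is simply that nothing moves between cycles, however stale the warehouse becomes.

```python
import sqlite3
import time

BATCH_INTERVAL_SECONDS = 3600  # hypothetical hourly batch window


def run_batch_cycle(source_db: str, warehouse_db: str) -> None:
    """One ETL cycle: extract from the OLTP source, conform, load into the warehouse."""
    src = sqlite3.connect(source_db)
    dwh = sqlite3.connect(warehouse_db)

    # Extract: pull the rows of interest from the operational system.
    rows = src.execute(
        "SELECT order_id, customer_id, amount, created_at FROM orders"
    ).fetchall()

    # Transform: conform to the warehouse's relational model
    # (here, just normalizing the amount to integer cents).
    conformed = [
        (order_id, customer_id, int(round(amount * 100)), created_at)
        for order_id, customer_id, amount, created_at in rows
    ]

    # Load: write into the warehouse fact table.
    dwh.executemany(
        "INSERT OR REPLACE INTO fact_orders VALUES (?, ?, ?, ?)", conformed
    )
    dwh.commit()
    src.close()
    dwh.close()


if __name__ == "__main__":
    while True:  # the batch scheduler: nothing moves between cycles
        run_batch_cycle("oltp.db", "warehouse.db")
        time.sleep(BATCH_INTERVAL_SECONDS)
```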
The second outmoded assumption has to do with how, where, and why data engineering takes place.
In legacy data warehouse architecture, data engineering usually takes the form of sequenced extract, transform, and load (ETL) operations. (A refinement on this is extract, load, transform, or ELT, which, as we shall see, is subtly different.) In ETL, the work of data engineering is orchestrated by a separate tool – i.e., an ETL engine – and takes place in an interim repository. The ETL engine usually extracts all relevant data from upstream OLTP databases and other producers, even if it is unchanged. It loads this data into the interim repository, engineers it, and then loads it into a staging area on the target database. From there, the database ingests the new data. In ELT processing, data engineering shifts into the target database itself, e.g., a staging area in a temporary table. ELT was touted as a solution to several problems, among them the challenge of ingesting data at increasingly tighter batch intervals.
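To make the ETL/ELT distinction concrete, here is a minimal ELT sketch under the same hypothetical schema: the raw rows are loaded first, and the engineering runs as set-based SQL inside the target database itself rather than in a separate engine.

```python
import sqlite3


def elt_cycle(raw_rows, warehouse_db: str) -> None:
    """ELT: load raw extracted rows first, then transform inside the target database."""
    dwh = sqlite3.connect(warehouse_db)

    # Load: raw, untransformed rows go straight into a staging table.
    dwh.execute(
        "CREATE TEMP TABLE stage_orders (order_id, customer_id, amount, created_at)"
    )
    dwh.executemany("INSERT INTO stage_orders VALUES (?, ?, ?, ?)", raw_rows)

    # Transform: the engineering happens in the target itself, as set-based SQL,
    # rather than in a separate ETL engine.
    dwh.execute(
        """
        INSERT OR REPLACE INTO fact_orders (order_id, customer_id, amount_cents, created_at)
        SELECT order_id, customer_id, CAST(ROUND(amount * 100) AS INTEGER), created_at
        FROM stage_orders
        """
    )
    dwh.commit()
    dwh.close()
```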
There are a few problems with this approach. The first is that batch processing is still batch processing. (Even micro-batch processing shrinks the batch interval only so much.) A second, bigger problem is that ETL/ELT processing was not designed for a world in which data sources are widely distributed – not just between local (on-premises) contexts, but across distinct geophysical contexts, too. It is time-consuming and costly to transmit the large data volumes associated with ETL/ELT processing over WAN connections.
The third is that other technologies, such as CDC and streaming, are better suited for right-time access. For example, Dave explained, Equalum’s platform uses CDC technology to capture deltas in real time and stream them (in right time) to multiple target repositories in either the on-premises or cloud contexts. Instead of extracting all relevant data, changed and unchanged, moving it en bloc, engineering it, and then moving it again, CDC transmits only changed data. Instead of moving data according to a pre-defined schedule (i.e., the batch interval), CDC transmits changes as they occur.
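On the consuming side, applying a CDC feed might look something like the sketch below. The event shape and table names are hypothetical (this is not Equalum's wire format); the point is simply that only deltas travel, and each one is applied as it arrives.

```python
import sqlite3


def apply_change_event(dwh: sqlite3.Connection, event: dict) -> None:
    """Apply a single change event (insert/update/delete) to the target table."""
    op, row = event["op"], event["row"]
    if op in ("insert", "update"):
        # Upsert the changed row; unchanged rows never travel.
        dwh.execute(
            "INSERT OR REPLACE INTO fact_orders VALUES (?, ?, ?, ?)",
            (row["order_id"], row["customer_id"], row["amount_cents"], row["created_at"]),
        )
    elif op == "delete":
        dwh.execute("DELETE FROM fact_orders WHERE order_id = ?", (row["order_id"],))
    dwh.commit()


# Usage sketch: drain a change feed as events arrive, e.g.
#   for event in change_feed:
#       apply_change_event(dwh, event)
```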
“You can get the change data capture done in a most efficient way … by reading only the changes from the source, like redo logs, or if you don’t have [local] access to [the database] then you can do something like query-based options and you can query the source database on something like a timestamp and capture the new data that arrives, obviously less efficient, but there are ways to do it,” he explained. “Change Data Capture is also impactful because you do not want to impact your production OLTP servers in any way, because that’s [something] companies will not allow. If you have more than [a] single-digit percentage of overhead on the OLTP systems, they’ll probably not like it.”
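The query-based option Dave mentions can be sketched roughly as follows, again with a hypothetical schema: keep a watermark of the last modification timestamp seen and poll for anything newer. A production implementation would also have to handle deletes, clock skew, and in-flight transactions.

```python
import sqlite3


def poll_changes(src: sqlite3.Connection, last_seen: str):
    """Query-based CDC: fetch only rows modified since the last watermark."""
    rows = src.execute(
        "SELECT order_id, customer_id, amount, updated_at "
        "FROM orders WHERE updated_at > ? ORDER BY updated_at",
        (last_seen,),
    ).fetchall()

    # Advance the watermark to the newest timestamp captured in this poll.
    new_watermark = rows[-1][3] if rows else last_seen
    return rows, new_watermark
```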
CDC, ETL, ELT, and stream processing in a single platform
Another technology that achieves similar results is stream processing.
In production usage, stream processing requires a dedicated substrate, such as Apache Kafka, the free/open-source software (FOSS) stream-processing bus. However, Dave argued, deploying straight-from-GitHub Kafka poses a set of non-trivial technology challenges, particularly for greenfield adopters. For example, should you deploy a Kafka cluster across bare-metal servers or virtualized instances? If the latter, what kind of virtualization should you use – hardware or application?
True, application virtualization – i.e., using an orchestration platform such as Kubernetes (K8s) to manage Kafka containers – is more complex, but it’s more flexible, too. In addition to managing the persistent instances of Kafka that provide core stream-processing services for the application and service workflows that underpin essential business services, K8s can spawn new instances of Kafka in response to specific events, such as an API call initiated by a data scientist or a data/ML/AI engineer.
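For illustration, the kind of on-demand scaling described here might look something like the following sketch, which uses the official Kubernetes Python client to resize a Kafka StatefulSet. The StatefulSet name and namespace are hypothetical, and a real deployment would more likely lean on an operator or autoscaler rather than hand-rolled code.

```python
from kubernetes import client, config


def scale_kafka(replicas: int, name: str = "kafka", namespace: str = "streaming") -> None:
    """Scale a Kafka StatefulSet up or down, e.g., in response to an on-demand request."""
    config.load_kube_config()  # or config.load_incluster_config() inside the cluster
    apps = client.AppsV1Api()
    apps.patch_namespaced_stateful_set_scale(
        name=name,
        namespace=namespace,
        body={"spec": {"replicas": replicas}},
    )


# e.g., a data scientist's API call requesting an ad hoc streaming workspace:
# scale_kafka(replicas=5)
```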
CDC, for its part, integrates unobtrusively with data warehouse architecture and stream processing. In fact, Equalum integrates Kafka and other core FOSS components – e.g., Apache Spark and Apache ZooKeeper – into a data engineering platform that supports CDC, stream processing, and conventional ETL/ELT processing: all of this traffic transits Equalum’s Kafka-powered stream-processing bus.
Kafka libraries can perform in-flight operations on data as it transits the Kafka bus. This is useful for different kinds of data engineering requirements; for ETL/ELT-like operations, Equalum uses the Spark compute engine. Its value-add with Spark is similar to its value-add with Kafka: it automates the bootstrapping, configuration, orchestration, and management of Spark compute resources.
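A minimal sketch of that division of labor, expressed in PySpark Structured Streaming rather than Equalum's own tooling: Kafka carries the raw stream, and Spark performs the in-flight, ETL-like transformation. The broker address, topic, and schema are hypothetical, and the job assumes the spark-sql-kafka connector package is available to the Spark runtime.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import DoubleType, StringType, StructType

spark = SparkSession.builder.appName("kafka-inflight-transform").getOrCreate()

# Read the raw event stream off the Kafka bus (broker and topic names are hypothetical).
raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "orders")
    .load()
)

schema = StructType().add("order_id", StringType()).add("amount", DoubleType())

# In-flight, ETL-like transformation handled by the Spark engine.
orders = (
    raw.selectExpr("CAST(value AS STRING) AS json")
    .select(from_json(col("json"), schema).alias("o"))
    .select(col("o.order_id"), (col("o.amount") * 100).cast("long").alias("amount_cents"))
)

# Write the engineered stream onward (console sink here, purely for illustration).
query = orders.writeStream.format("console").outputMode("append").start()
query.awaitTermination()
```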
“We do not rely on any proprietary technology, so our ingestion, for writing into various targets, is done by Kafka, which is our messaging bus. We write into Kafka [and] from there we can write to multiple targets, so [Equalum supports] read-once, write-many-times [operation]. And also, we rely on Spark for our execution, so Kafka and Spark are our main building blocks,” he told Kavanagh, adding that Equalum also uses FOSS Apache ZooKeeper as an overall cluster-management tool.
On top of this, Dave argued, Equalum aims to simplify the development work entailed in designing, testing, and maintaining different types of data engineering pipelines, from the governed, reusable ETL jobs that feed the data warehouse and/or data marts, to the data pipelines – many of them also reusable – designed by data scientists, data engineers, ML/AI engineers, etc.
He compared Equalum to a no-code tool that exposes different types of drag-and-drop objects that experts can use to design data pipelines. “So, if you want to design something very complex, you can use many of the built-in ETL functions – things like merging the disparate data sources, you can do aggregations, you can do windowing functions, you can split the data into multiple channels, and then again arrive at one target,” he said, while demonstrating this scenario for the webinar audience.
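For readers who think in code rather than drag-and-drop canvases, the same class of operations (merging disparate sources, windowed aggregation, splitting into channels, landing on one target) might be sketched in PySpark roughly as follows. The paths, columns, and table names are hypothetical, and this is not Equalum's generated code.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit, window, sum as sum_

spark = SparkSession.builder.appName("pipeline-sketch").getOrCreate()

# Merge disparate sources into a single set of order events (hypothetical paths).
web_orders = spark.read.json("s3://example-bucket/web_orders/")
store_orders = spark.read.json("s3://example-bucket/store_orders/")
orders = web_orders.unionByName(store_orders)

# Windowed aggregation: revenue per customer per 15-minute window
# (assumes an order_ts timestamp column and an amount column).
revenue = (
    orders.groupBy(window(col("order_ts"), "15 minutes"), col("customer_id"))
    .agg(sum_("amount").alias("revenue"))
)

# Split into multiple channels, treat each differently, then arrive at one target.
high = revenue.filter(col("revenue") >= 1000).withColumn("tier", lit("high"))
low = revenue.filter(col("revenue") < 1000).withColumn("tier", lit("standard"))
high.unionByName(low).write.mode("append").saveAsTable("dwh.customer_revenue_windows")
```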
From a reactive, process-driven to an active, event-driven business posture
7wData’s Mulker made an observation – illustrated with an example – that nicely sums up what is most compelling about digital transformation: the logic of digital transformation is the logic of resilience. Mulker didn’t use those words, but the vision he described is one of active, event-driven business workflows that are resilient in the face of glitches, gotchas, and vicissitudes.
“We’re moving more from a process-driven to an event-driven [model], so if you talk to business people, they think of their processes in a very organized way. So, first step one, then step two, step three, step four, step five. But in reality … your process [workflow] kind of jumps from one department to another department,” Mulker told attendees, describing a sales-order workflow that grinds to a halt because one of its constituent operations – viz., the allocation of packaging material – triggers an error condition. The reason? The business does not have the correct-sized box. In the first place, he noted, the business can and should develop rule- and event-driven remediation logic – e.g., a decision-tree rule that automatically allocates a larger box – to ensure the workflow completes.
In the second place, it can also develop proactive event-driven remediations to address this problem.
“If you can have that more in an event-driven basis where you say, ‘Okay, we’re running out of stock of packaging material,’ you can have that information and that data moved to your supplier and trigger that event at the supplier side,” he argued, contrasting the resilience of event-driven processes with the “rigorous [step-by-step] approach to a process that otherwise stops and is kind of broken.”
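A toy sketch of the two remediation styles Mulker describes, with entirely hypothetical stock data: a decision-tree rule that keeps the order moving when the requested box is out of stock, and a proactive event emitted to the supplier before stock runs out.

```python
# Available box sizes in ascending order, with current stock levels (hypothetical data).
BOX_STOCK = {"S": 0, "M": 12, "L": 40}
REORDER_THRESHOLD = 10


def allocate_box(requested: str, stock: dict = BOX_STOCK) -> str:
    """Decision-tree remediation: if the requested size is out, step up to the next size."""
    sizes = list(stock)  # ordered smallest to largest
    for size in sizes[sizes.index(requested):]:
        if stock[size] > 0:
            stock[size] -= 1
            return size
    raise RuntimeError("no packaging available; escalate to a human")


def check_reorder(stock: dict = BOX_STOCK) -> list:
    """Proactive, event-driven remediation: emit supplier events before stock runs out."""
    return [
        {"event": "reorder_packaging", "size": size, "on_hand": qty}
        for size, qty in stock.items()
        if qty <= REORDER_THRESHOLD
    ]


# e.g., allocate_box("S") falls through to "M"; check_reorder() flags "S" for the supplier.
```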
About Vitaly Chernobyl
Vitaly Chernobyl is a technologist with more than 40 years of experience. Born in Moscow in 1969 to Ukrainian academics, Chernobyl solved his first differential equation when he was 7. By the early-1990s, Chernobyl, then 20, along with his oldest brother, Semyon, had settled in New Rochelle, NY. During this period, he authored a series of now-classic Usenet threads that explored the design of Intel’s then-new i860 RISC microprocessor. In addition to dozens of technical papers, he is the co-author, with Pavel Chichikov, of Eleven Ecstatic Discourses: On Programming Intel’s Revolutionary i860.