It’s easy for an enterprise that uses cloud services to feel as if it has been transported back in time.
Back to the early 1980s, to be exact, when the concept of a central, time-variant repository for business-critical data, also known as a data warehouse, had not yet been codified. In this pre-decision-support era, business units owned their own data and were under no obligation to share it with other business units. Until the data warehouse came along, they had no feasible, reusable, governable way to share their data even if they wanted to. On top of this, useful business data was almost always destroyed: whenever operational apps updated their records, they overwrote the existing values.
This data was lost forever.
This is also a good description of the data management status quo in the cloud. In almost all cases, current data is maintained by the cloud services that created it. Some cloud services preserve data history; most do not. Moreover, subscribers have no easy way to contextualize current or historical data across business functions, subject areas, etc. They can see the trees, but not the forest; the synoptic view of the business that the data warehouse provides has no out-of-the-box analog in the cloud.
A recent episode of DM Radio, a weekly, data management-themed radio show, explored this topic.
Host Eric Kavanagh convened a panel of guests to discuss the retro-like feel of data management in the cloud. At times, the content of the discussion seemed straight out of the era of VisiCalc.
“We’re all collecting data … but here’s the thing: we have so many different connectors now, you know, and problems within departments – like, the HR department has their own system, finance has their own system. Everybody has their own little bucket of data. So, whenever we start out trying to put that together is when it becomes difficult,” says Carla Gentry, a product evangelist and data scientist with Zuar, a company that develops Mitto, API-based ETL and data-staging software for the cloud.
Sure, Gentry notes, businesses can sign up for a cloud data warehouse service, but how do they prepare and move data “out of” a diverse menagerie of distributed cloud apps and into this data warehouse? In the cloud, after all, the default mechanism for data interchange is the API, not a SQL interface, and SQL interfaces (such as ODBC and JDBC) are standardized across vendors and implementations in precisely the way that APIs are not. There’s also the fact that the same APIs an enterprise might use to access, move, prepare and load data into its cloud data warehouse are accessible to other types of consumers. So citizen data scientists, business analysts and similarly tech-savvy users can extract and prepare their own data sets from the cloud. They create and share these data sets without regard to where the data came from, when it was created, what was done to it, etc.
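To make the API-versus-SQL contrast concrete, here is a minimal, hypothetical sketch of the extract-and-stage step Gentry describes: pull records from one cloud app’s REST endpoint and flatten them into a CSV file that a cloud data warehouse bulk loader could ingest. The endpoint URL, token, pagination scheme and field names are all invented for illustration; every SaaS app does this differently, which is precisely the problem.

```python
import csv
import requests

# Hypothetical REST endpoint and token: every SaaS app exposes its own,
# non-standard variant of this pattern (unlike ODBC/JDBC, there is no shared contract).
API_URL = "https://api.example-crm.com/v2/opportunities"
API_TOKEN = "..."  # placeholder

def extract(url: str, token: str) -> list[dict]:
    """Pull every record, following the app's (assumed) cursor-based pagination."""
    records, params = [], {}
    while True:
        resp = requests.get(
            url, headers={"Authorization": f"Bearer {token}"}, params=params, timeout=30
        )
        resp.raise_for_status()
        payload = resp.json()
        records.extend(payload["results"])  # "results" and "next_cursor" are assumptions
        if not payload.get("next_cursor"):
            return records
        params["cursor"] = payload["next_cursor"]

def stage_as_csv(records: list[dict], path: str) -> None:
    """Flatten the records into a CSV file a warehouse bulk loader (e.g., COPY) can ingest."""
    if not records:
        return
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=sorted(records[0]), extrasaction="ignore")
        writer.writeheader()
        writer.writerows(records)

if __name__ == "__main__":
    stage_as_csv(extract(API_URL, API_TOKEN), "opportunities.csv")
```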
Spreadmart hell revisited
The problem is that other people use this data to make decisions, formulate business plans and shape business strategy. In one sense, it’s a recipe for a return to what analyst Wayne Eckerson famously dubbed “spreadmart hell”: a situation in which everybody has their own spreadsheet (or its equivalent) and nobody can agree on the answers to even basic questions. In another sense, this self-serve free-for-all does not even attempt to provide anything like a synoptic view across business functions or subject areas.
“We’re sharing and passing this information back and forth, and … now we have multiple copies everywhere. What happens when I update this copy? Right? It doesn’t update the rest of the copies,” Gentry told DM Radio listeners. She contrasted this chaotic model with an incrementally more governed model in which business analysts use Excel spreadsheets with preconfigured external data connections to automatically refresh data from a Snowflake data warehouse. The Snowflake warehouse provides a synoptic view across all of the business’s operations. Everybody can access it at the same time; everybody can query against the same data: “If you don’t have the ability to query companywide, if you don’t know what each department is doing … you’re wasting time and money.”
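For readers who want to picture the more governed model, here is a minimal sketch of the same “everybody queries the same data” idea, using Snowflake’s Python connector rather than Excel’s external data connections. The account, credentials and table names are placeholders, not anything discussed on the show.

```python
import snowflake.connector  # pip install snowflake-connector-python

# Placeholder connection details; in practice these come from a secrets manager,
# and the table names below are invented for illustration.
conn = snowflake.connector.connect(
    account="my_account",
    user="analyst",
    password="...",
    warehouse="ANALYTICS_WH",
    database="ENTERPRISE_DW",
    schema="PUBLIC",
)

try:
    cur = conn.cursor()
    # Every analyst runs against the same governed tables, so the numbers agree
    # across departments: the opposite of the "multiple copies" problem.
    cur.execute(
        """
        SELECT d.department_name, SUM(f.amount) AS total_spend
        FROM fact_spend f
        JOIN dim_department d ON f.department_id = d.department_id
        GROUP BY d.department_name
        ORDER BY total_spend DESC
        """
    )
    for department, total_spend in cur.fetchall():
        print(department, total_spend)
finally:
    conn.close()
```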
Exploring new solutions to a very familiar problem
This dysfunction breeds dysfunction of another kind, too. For example, sometimes it is necessary to incorporate historical business data into operational apps. In the on-premises context, this was simple enough: packaged operational apps usually supplied SDKs that enterprises could use to design workflows (or build custom apps) to incorporate data from external sources, such as data warehouses.
In the cloud, this is much more difficult, so much so that organizations have had to make use of counterintuitive schemes, such as reverse ETL, to feed historical business data back into their cloud apps. One compelling solution to this problem is the data fabric. A new wrinkle on this is so-called data mesh architecture. The data mesh tries to sustain the “good” parts of data management in the cloud: business groups continue to own and control their own cloud data, for example. The data mesh uses data fabric technology to knit together the data owned by these groups. It decentralizes data access.
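Reverse ETL sounds more exotic than it is: read a modeled result set out of the warehouse and write it back into an operational app through that app’s API. The sketch below is hypothetical; the query, CRM endpoint and field names are invented, and real reverse ETL tools add batching, retries and change tracking on top of this basic loop.

```python
import requests
import snowflake.connector  # pip install snowflake-connector-python

WAREHOUSE_QUERY = """
    SELECT customer_id, lifetime_value, churn_risk
    FROM ANALYTICS.CUSTOMER_SCORES  -- hypothetical modeled table
"""
CRM_ENDPOINT = "https://api.example-crm.com/v2/contacts/{id}"  # hypothetical app API
CRM_TOKEN = "..."  # placeholder

def reverse_etl() -> None:
    # Placeholder connection details; real pipelines pull these from a secrets manager.
    conn = snowflake.connector.connect(
        account="my_account",
        user="etl",
        password="...",
        warehouse="ETL_WH",
        database="ENTERPRISE_DW",
    )
    try:
        rows = conn.cursor().execute(WAREHOUSE_QUERY).fetchall()
    finally:
        conn.close()

    headers = {"Authorization": f"Bearer {CRM_TOKEN}"}
    for customer_id, lifetime_value, churn_risk in rows:
        # Write warehouse-derived attributes back onto the operational record,
        # so users see historical context inside the app they already work in.
        resp = requests.patch(
            CRM_ENDPOINT.format(id=customer_id),
            json={"lifetime_value": lifetime_value, "churn_risk": churn_risk},
            headers=headers,
            timeout=30,
        )
        resp.raise_for_status()

if __name__ == "__main__":
    reverse_etl()
```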
Zuar CEO Whitney Myers touched on the promise of the data mesh in response to a question from Kavanagh. “[The data mesh] kind of seems to be a new perspective on a problem we’ve been trying to solve for a while, which is, how can we quickly get the right data into the hands of users in a way that they can understand?” she noted. Just as there are different ways to slice a pomegranate, there are different ways to deal with the fact of data distribution in the cloud and to accommodate the priority of local autonomy, Myers argued. “So, what we’ve tried to do is create a very fancy term, ‘skewed schema,’ where [we perform] auto normalization, which means we learn the data as we ingest it.”
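Zuar did not spell out how “skewed schema” works on the air, so the following is not its implementation. It is only a toy illustration of the general idea of learning a schema as data is ingested: inferring column names and widening column types from a stream of JSON-like records.

```python
from typing import Any

# Toy illustration only: infer a tabular schema from semi-structured records
# as they arrive, widening a column's type when the data demands it.
TYPE_ORDER = {bool: 0, int: 1, float: 2, str: 3}  # str is the widest, catch-all type

def widen(current: type | None, value: Any) -> type:
    """Return the narrowest type that can hold the current column type and the new value."""
    new = type(value) if type(value) in TYPE_ORDER else str
    if current is None:
        return new
    return current if TYPE_ORDER[current] >= TYPE_ORDER[new] else new

def infer_schema(records: list[dict[str, Any]]) -> dict[str, type]:
    """Learn column names and types from the records themselves, one record at a time."""
    schema: dict[str, type] = {}
    for record in records:
        for column, value in record.items():
            if value is None:
                continue  # a real implementation would also track nullability
            schema[column] = widen(schema.get(column), value)
    return schema

if __name__ == "__main__":
    rows = [
        {"id": 1, "amount": 10, "region": "EMEA"},
        {"id": 2, "amount": 12.5, "region": None},  # 'amount' widens from int to float
    ]
    print(infer_schema(rows))  # {'id': <class 'int'>, 'amount': <class 'float'>, 'region': <class 'str'>}
```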
To find out more about what Zuar means by “skewed schema,” check out the rest of the DM Radio broadcast. In addition to Zuar’s Gentry and Myers, Kavanagh was joined by Barry Golumbek, director of sales with SportsDataIO, a company that exposes API services for streaming sports-related data in real time. Subscribers can use SportsDataIO’s APIs to create live play-by-play scoreboards (complete with up-to-the-second statistics and other relevant information) or to support real-time gaming. SportsDataIO also provides aggregated split stats, along with daily fantasy sports (DFS) slate and salary feeds. The excerpts discussed in this article only scratch the surface of a fascinating discussion!
About Vitaly Chernobyl
Vitaly Chernobyl is a technologist with more than 40 years of experience. Born in Moscow in 1969 to Ukrainian academics, Chernobyl solved his first differential equation when he was 7. By the early 1990s, Chernobyl, then 20, along with his oldest brother, Semyon, had settled in New Rochelle, NY. During this period, he authored a series of now-classic Usenet threads that explored the design of Intel’s then-new i860 RISC microprocessor. In addition to dozens of technical papers, he is the co-author, with Pavel Chichikov, of Eleven Ecstatic Discourses: On Programming Intel’s Revolutionary i860.