It’s about time to start treating it like one, anyway
If you still approach data engineering as a disconnected series of one-off projects, you’re doing it wrong. Consciously or unconsciously, you’re treating it as its own thing, its own end.
A better frame is to think of data engineering as a means to an end – with the end in question consisting of the product, use case, or practice the engineered data is destined to support.
This is a useful frame for thinking about the practice of DataOps, too: not an end unto itself, but a means to an end.
“The main thing that we want to talk about … is really changing the focus in terms of treating data more like a product rather than a project – and what that means in terms of agility and governance and the way in which we embed automated testing,” said Justin Mullen, co-founder and CEO of DataOps.live. “So, you know, we should never again really be talking about building data warehouses; we should be talking about building data products, and the warehouses and the data platforms – everything – are just the things that … we run them on.”
Mullen made his remarks during a recent episode of DM Radio, a weekly, data management-themed radio program that is hosted by analyst and Bloor Group CEO Eric Kavanagh.
DataOps.live develops a DataOps platform that aims to accelerate the development, testing, delivery, integration, and maintenance of data engineering pipelines, primarily in combination with Snowflake’s software-as-a-service (SaaS) data warehouse.
To support development, the DataOps.live platform includes a code repository that provides versioning for data pipelines and their associated logic. Users define pipeline logic in YAML, which the DataOps.live platform interprets to generate SQL. To encourage reuse, DataOps.live helps automate the sharing of pipeline logic: users can combine existing components – e.g., pre-built data transformations – to assemble larger, more complex pipelines. (In this sense, the DataOps.live platform could be said to approximate the role of a feature store.) And to support operations, organizations can use DataOps.live to run or schedule pipelines in production, to monitor their performance, and to maintain (i.e., modify or deprecate) them as conditions change.
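To make the declare-and-generate pattern concrete, here is a minimal sketch of how a YAML pipeline definition might be compiled into SQL from a library of reusable transformations. The YAML schema, transformation names, and SQL templates are invented for illustration; they are not DataOps.live’s actual formats.

```python
# Hypothetical sketch of the declarative-pipeline pattern described above:
# a YAML spec names reusable transformations, and a small "compiler" turns
# each step into SQL. The schema and templates are invented, not
# DataOps.live's actual format.
import yaml  # PyYAML

PIPELINE_SPEC = """
pipeline: daily_orders
steps:
  - name: stage_orders
    transform: dedupe            # reusable, pre-built transformation
    source: raw.orders
    target: staging.orders
    key: order_id
  - name: publish_orders
    transform: latest_by_key
    source: staging.orders
    target: marts.orders
    key: order_id
    order_by: updated_at
"""

# Library of reusable transformations (SQL templates keyed by name).
TRANSFORMS = {
    "dedupe": (
        "CREATE OR REPLACE TABLE {target} AS "
        "SELECT DISTINCT * FROM {source};"
    ),
    "latest_by_key": (
        "CREATE OR REPLACE TABLE {target} AS "
        "SELECT * FROM {source} "
        "QUALIFY ROW_NUMBER() OVER "
        "(PARTITION BY {key} ORDER BY {order_by} DESC) = 1;"
    ),
}

def compile_pipeline(spec_text: str) -> list[str]:
    """Turn a YAML pipeline spec into an ordered list of SQL statements."""
    spec = yaml.safe_load(spec_text)
    return [TRANSFORMS[step["transform"]].format(**step) for step in spec["steps"]]

if __name__ == "__main__":
    for sql in compile_pipeline(PIPELINE_SPEC):
        print(sql)
```

Because the pipeline is plain text, it can live in the same code repository as everything else, which is what makes versioning, review, and automated testing of pipelines practical in the first place.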
“Think about it from the perspective of … ‘What do I need if I want to treat [data] as a product? What do I need to be able to build and deploy and test this stuff in the same way that we get in [DevOps]?” Mullen said to Kavanagh. “You need to focus on the developer experience. You need to have a code repository behind it. You need to have collaborative experience [so that] developers [can] develop in parallel to develop more [rapidly].” Lastly, Mullen said, “you need the orchestration to run the pipelines and to move the data through” them.
The pipeline jungle
Sean Anderson, head of product marketing with StreamSets, agreed with Mullen: Engineered data is less a finished product in its own right than essential raw material for different kinds of products. The rub, he said, is that it is more difficult to produce this raw material today than it was a decade ago.
Anderson told Kavanagh that two related factors complicate the task of engineering data for use as raw material for different types of “products.” The first is that the relational database is no longer the primary source of useful data: organizations acquire data from key-value stores, sensors, application messages, logs, and so on. What is more, these sources are usually distributed between the on-premises enterprise and public cloud infrastructure.
So instead of scheduling SQL-based data engineering operations on a single platform (an RDBMS) or across multiple instances of the same type of platform (again, an RDBMS), modern data engineering pipelines must coordinate operations among heterogeneous endpoints, some of which are local to the on-premises environment, most of which live in the cloud. Moreover, data pipeline logic must identify which operations can execute in parallel or concurrently and which depend on others, scheduling and sequencing both appropriately.
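As a toy illustration of that kind of sequencing, the sketch below groups pipeline steps into “waves”: steps whose dependencies are satisfied run concurrently, and dependent steps wait for the previous wave. The step names and dependency graph are invented; real orchestrators also handle retries, monitoring, and heterogeneous runtimes.

```python
# Minimal sketch of dependency-aware sequencing: independent steps run
# concurrently, dependent steps wait for the wave before them. The step
# names and graph are invented for illustration.
from concurrent.futures import ThreadPoolExecutor

# step -> set of steps it depends on
DEPS = {
    "extract_sensor_feed": set(),
    "extract_app_logs": set(),
    "extract_orders_kv": set(),
    "conform_events": {"extract_sensor_feed", "extract_app_logs"},
    "load_warehouse": {"conform_events", "extract_orders_kv"},
}

def waves(deps: dict[str, set[str]]) -> list[list[str]]:
    """Group steps into waves that can execute concurrently."""
    remaining, done, plan = dict(deps), set(), []
    while remaining:
        ready = [s for s, d in remaining.items() if d <= done]
        if not ready:
            raise ValueError("cycle detected in pipeline dependencies")
        plan.append(ready)
        done.update(ready)
        for s in ready:
            remaining.pop(s)
    return plan

def run_step(name: str) -> str:
    # Placeholder for real work (API pull, SQL job, file copy, etc.).
    return f"{name}: ok"

if __name__ == "__main__":
    for wave in waves(DEPS):
        with ThreadPoolExecutor() as pool:
            for result in pool.map(run_step, wave):
                print(result)
```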
The second has to do with the emergence of new types of consumers who expect to use new types of tools to work with data that is now distributed by default. These consumers specialize in new practices, such as machine learning (ML) and artificial intelligence (AI) engineering. Businesses are having a difficult time standing up these practices as it is, Anderson told Kavanagh; they’re having at least as much difficulty scaling up the continuous integration (CI) and continuous delivery (CD) practices – DataOps, MLOps, etc. – required to support them in production.
StreamSets’ DataOps platform aims to simplify and, so far as possible, to automate this scaling, he explained.
“What generally doesn’t scale to that same degree is the data engineering,” he told Kavanagh. “Now we’ve entered into an era where data pipelining really needs to be not only a continuous effort – we can’t just build a pipeline and set it and forget it, we need to upgrade over time,” he continued. “As we migrate to cloud platforms … the pipeline needs to be decoupled in a way that allows it to be portable across those different use cases and allows the user to evolve with that.”
Absent a DataOps platform, the sheer scale of supporting CI and CD for tens of thousands – or even millions – of data pipelines in production can easily overwhelm a large organization, Anderson argued: “The companies that we interface with – large pharmaceutical companies like GlaxoSmithKline, big oil and gas companies like Shell – they’re operating millions of pipelines.
“That whole operations part of the picture is now really becoming the tricky part. The development, you know, is a problem that I feel like … has been really well solved for, but [the bigger problem is] how do you actually operationalize that – how do you scale that and operate, you know, to the order of magnitude that some of these … people are seeing.”
Power to the (non-expert) people!
David Mariani, founder and CTO with AtScale, comes at the same problem from a slightly different angle. Whereas DataOps.live, StreamSets, Astronomer, and other vendors focus on automating the development, testing, delivery, and integration of data pipelines for data engineers, data scientists, and similar experts, AtScale focuses on simplifying access to data for different kinds of non-technical consumers.
To this end, AtScale’s platform provides a semantic layer for distributed data. In this sense, it incorporates elements of both a classic business intelligence (BI) platform and so-called data virtualization technology. (Data virtualization is one of the core enabling technologies of a data fabric.) The idea is to mask the complexity involved in getting at data irrespective of where it is stored or how it is exposed. So, for example, AtScale’s Universal Semantic Layer aims to permit transparent access to data whether it is stored in the on-premises enterprise or the off-premises cloud. It likewise masks the interfaces used to access and manipulate this data, as well as the engineering it needs to undergo before different kinds of consumers can work with it. As far as the consumer is concerned, all relevant data gets presented in a single view, even if the elements of that view are integrated from sources spanning both the enterprise data center and the cloud.
“Some of our best clients or customers … got rid of the data analyst job title, because their goal is to have everybody be a data analyst,” Mariani said, “and to really realize that vision … we have to allow everybody to access and … be able to be productive with data, which means you can’t make them understand how to write SQL: it’s much easier to teach a business expert how to use data to make decisions than to teach a data analyst or a data engineer about the business.”
AtScale’s Universal Semantic Layer works a lot like a BI semantic layer. First, it translates technical metadata (information about tables and columns, for example) into relevant business definitions. Second, it permits expert users to design “business views” that encapsulate business logic: e.g., hierarchies, rules, calculations, etc. A report is, in effect, a kind of concrete business view, as is an OLAP cube. Business views can be customized for different types of users, use cases, applications, and so on. AtScale’s data virtualization layer permits experts to build business views consisting of data derived from dispersed sources.
In addition, the semantic layer provides a means of enforcing user access control, creating and tracking data lineage information, and other critical aspects of data governance. “Our whole point is to make everybody a data analyst: forget about making data only accessible by experts,” Mariani said. “That means, allowing them to consume [data] the way that they’re comfortable consuming, but allow them to do it with consistency and governance and performance.”
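As a rough illustration of these ideas – business definitions layered over technical metadata, business views compiled into queries, and governance enforced in the same layer – here is a minimal sketch. The semantic model, table names, roles, and generated SQL are invented for illustration; this is not AtScale’s actual API or metadata format.

```python
# Minimal sketch of a semantic layer: business terms map onto physical
# columns/expressions, a "business view" is compiled into SQL, and a simple
# role check gates who can query it. Names and roles are invented; this is
# not AtScale's actual API.

# Business definitions layered over technical metadata.
SEMANTIC_MODEL = {
    "Revenue": "SUM(f.amount_usd)",
    "Order Count": "COUNT(DISTINCT f.order_id)",
    "Region": "d.region_name",
}

# A business view: its measures, dimensions, physical sources, and the
# roles allowed to query it.
SALES_VIEW = {
    "name": "sales_by_region",
    "from": "warehouse.fact_orders f JOIN warehouse.dim_geo d ON f.geo_id = d.geo_id",
    "dimensions": ["Region"],
    "measures": ["Revenue", "Order Count"],
    "allowed_roles": {"analyst", "finance"},
}

def compile_view(view: dict, role: str) -> str:
    """Compile a business view into SQL, enforcing a simple access rule."""
    if role not in view["allowed_roles"]:
        raise PermissionError(f"role '{role}' may not query {view['name']}")
    dims = [SEMANTIC_MODEL[d] for d in view["dimensions"]]
    meas = [f'{SEMANTIC_MODEL[m]} AS "{m}"' for m in view["measures"]]
    return (
        f"SELECT {', '.join(dims + meas)} "
        f"FROM {view['from']} "
        f"GROUP BY {', '.join(dims)};"
    )

if __name__ == "__main__":
    print(compile_view(SALES_VIEW, role="analyst"))
```

The point of the pattern is that the consumer asks for “Revenue by Region” and never sees the joins, aggregations, or access rules that the layer applies on their behalf.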
About Vitaly Chernobyl
Vitaly Chernobyl is a technologist with more than 40 years of experience. Born in Moscow in 1969 to Ukrainian academics, Chernobyl solved his first differential equation when he was 7. By the early 1990s, Chernobyl, then 20, along with his oldest brother, Semyon, had settled in New Rochelle, NY. During this period, he authored a series of now-classic Usenet threads that explored the design of Intel’s then-new i860 RISC microprocessor. In addition to dozens of technical papers, he is the co-author, with Pavel Chichikov, of Eleven Ecstatic Discourses: On Programming Intel’s Revolutionary i860.