Confidentially, many chief data officers will admit that their companies suffer from what might euphemistically be called “data dyspepsia”: they produce and ingest so much data that they cannot properly digest it.
Like it or not, there is such a thing as too much data – especially in an era of all-you-can-ingest data comestibles. “Our belief is that more young companies die of indigestion than starvation,” said Adam Wilson, CEO of data engineering specialist Trifacta, during a recent episode of Inside Analysis, a weekly data- and analytics-focused program hosted by Eric Kavanagh.
So what if Wilson was referring specifically to Trifacta’s decision to stay focused on its core competency, data engineering, rather than diversifying into adjacent markets? So what if he was not, in fact, alluding to a status quo in which the average business feels overwhelmed by data? Wilson’s metaphor is no less apt when applied to data dyspepsia. It also fits Trifacta’s own pitch, which involves simplifying data engineering – and automating it, insofar as is practicable – in order to accelerate the rate at which useful data can be made available to more and different kinds of consumers.
You load 16 tons and what do you get?
One problem with data, as with just about any raw material, is that it must be engineered before it can be used. A separate problem, one that’s unique to data, is that it tends to accrete so rapidly it becomes practically impossible for unassisted human experts to engineer it.
Imagine a gravity conveyor belt with a capacity of 200 tons per hour … feeding coal to a trio of human stokers who, between them, can shovel at most six tons per hour: the pile grows by roughly 194 tons for every hour they work. This is the Sisyphean task of the modern data engineer or ETL developer: she can never catch up with her work because she is forever falling behind.
Tim Hall, vice president of products with time-series analytics specialist InfluxData, made a similar point during a recent episode of DM Radio, which Kavanagh also hosts. “We have way too much data and I think people are collecting things they don’t actually need, want, or use, and that actually creates a worse problem, which is [businesses think] ‘Now storage is cheap, so I can store it all’ … but figuring out what’s actually important to you is the challenge,” he said.
But Wilson and Trifacta differ from Hall on one point: Hall’s argument is that businesses are ingesting more data than they can process, which makes it difficult for them to figure out which data, in which combinations, is important. This prevents them from using it. To remedy this, Hall thinks businesses need to go on a data diet.
Wilson and Trifacta are saying something different. According to them, it isn’t quite accurate to say that businesses are ingesting more data than they can process; rather, they’re ingesting more data than they can humanly digest. What the average business needs is the equivalent of a food processor for its data.
“We work with a big investment bank, and they have about 236 quants … and they had about 10 people that were responsible for the data lake, and these 10 people had a backlog of work – requests for different kinds of data shapes and structured in different ways – that was probably measured in years,” Wilson told Kavanagh. “They said … we can’t hire our way out of this problem because we don’t have the budget and even if we had the budget, finding the talent is super hard.”
This company’s solution was to open up the raw zone of its data lake to the quants themselves.
“They wanted to let the quants go in, who were very data-savvy, very data-driven individuals, and create their own training data sets on their own by stitching together all kinds of … data,” he explained.
Trifacta’s software is more than just a food processor, however; think of it as a kind of data engineering gastroenterologist. After all, Trifacta invests a portion of its R&D into studying how businesses use data – for example, researching which tasks or operations can be fully automated, which can be partially automated, and which can be accelerated by means of guided features.
“We want to try to figure out how to apply [machine learning] to the data for the purposes of cleansing it, standardizing it, shaping it, structuring it, and really welcoming less technical users … into the process of refining raw data,” Wilson explained. “If we can put the people who know the data best front and center … in that process, regardless of what their level of technical acumen is, and if we can turn that into a user experience powered by machine learning, that really helps automate a lot of the complicated things.”
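Trifacta has not published the internals Wilson alludes to, but the broad pattern he describes – profile the raw data, infer likely types and shapes, and surface suggested fixes for a human to accept or reject – can be sketched with ordinary open-source tooling. The Python below is purely illustrative: the file name, columns, and thresholds are assumptions, not Trifacta’s implementation.

```python
# A minimal sketch (not Trifacta's implementation) of machine-assisted cleanup:
# profile a raw file, infer likely column types, and propose standardizations
# for a less technical user to accept or reject. File and columns are hypothetical.
import pandas as pd

df = pd.read_csv("raw_trades.csv", dtype=str)  # ingest everything as text first

report = {}
for col in df.columns:
    values = df[col].dropna().str.strip()
    as_number = pd.to_numeric(values, errors="coerce")
    as_date = pd.to_datetime(values, errors="coerce")
    report[col] = {
        "null_pct": round(100 * df[col].isna().mean(), 1),
        "distinct": values.nunique(),
        "looks_numeric": as_number.notna().mean() > 0.95,
        "looks_datetime": as_date.notna().mean() > 0.95,
    }

# Surface the inferences as suggestions rather than silently applying them.
for col, stats in report.items():
    if stats["looks_datetime"]:
        print(f"Suggest parsing '{col}' as a date ({stats['null_pct']}% null)")
    elif stats["looks_numeric"]:
        print(f"Suggest casting '{col}' to numeric ({stats['distinct']} distinct values)")
```

The division of labor is the point: the machine proposes, and the person who knows the data best decides.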
Give the people what they want
The idea of equipping savvy consumers to prepare their own data is comparatively non-controversial. A decade into the self-service data prep revolution, that issue, at least, is settled.
That said, the backdrop to this idea is usually the presupposition that consumers will access and prepare data that is in some way the product of governed data engineering processes, as with conventional ETL jobs. According to Wilson, however, this isn’t always or even usually the case. Not anymore. Not that it ever was.
In the first place, he notes, the focus of data engineering has shifted from the on-premises data center to the cloud. Concomitant with this shift, the concepts, methods, and tools of data engineering have changed, too. Strictly structured relational data is no longer regnant; consumers now work with semi- and multi-structured data, too. This data is generated by a diversity of sources, only a small fraction of which are traditional producers, such as relational databases.
These are just a few changes. More fundamentally, the primary site in which data is ingested; the modes by which it is transported and made available for ingest; and, not least, the shapes and sizes into which people expect to wrangle it: all of this has changed, not so much because of the shift to cloud as along with it.
Increasingly, then, businesses are keen to give special consumers access to not-so-strictly-governed data – for example, as in Wilson’s anecdote, data that lives in the raw zone of a data lake. Their reasoning is that these special consumers are also those who are most familiar with the data, its characteristics, and its problems.
This last idea is controversial, however, as Kavanagh’s other guest, Sanjeev Mohan, former Gartner analyst and current CEO of SanjMo, pointed out. “The quants going straight to the raw zone – that scares me a bit,” he told Kavanagh, explaining that a data lake is usually subdivided into logical zones, such as a raw zone (a schema-less landing and staging area for data), a curated zone (for modeling and creating advanced views of data), and a consumption zone. Mohan has misgivings about giving just anyone access to raw data.
“I have been telling my customers not to dip into the raw zone because [the business] may have information there that has not yet been identified, and may have stuff that we cannot really make sense of,” he explained. “So what I’ve been seeing is that you use your raw zone as a landing zone or staging zone to bring all the data in, then you curate it and put your subject-matter experts on the curated zone, because now you have a metadata catalog, you have some sort of lineage, you may have identified data.”
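In practice, the zones Mohan describes are usually just governed prefixes or folders, and promotion from raw to curated is an explicit, auditable step that can attach the metadata and lineage he mentions. The sketch below is a hypothetical illustration on a local file system; the same layout applies to object-store prefixes such as s3://lake/raw/.

```python
# Illustrative only: modeling raw, curated, and consumption zones as governed
# folders, with promotion from raw to curated recording minimal lineage.
# Paths, dataset names, and the lineage format are made up for the example.
from pathlib import Path
import datetime
import json
import shutil

LAKE = Path("lake")
ZONES = {zone: LAKE / zone for zone in ("raw", "curated", "consumption")}

def land(src: Path, dataset: str) -> Path:
    """Drop a file into the raw (landing/staging) zone untouched."""
    dest = ZONES["raw"] / dataset / src.name
    dest.parent.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(src, dest))

def promote(raw_file: Path, dataset: str, owner: str) -> Path:
    """Copy a raw file into the curated zone and record who promoted it, and when."""
    dest = ZONES["curated"] / dataset / raw_file.name
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(raw_file, dest)
    lineage = {
        "source": str(raw_file),
        "promoted_by": owner,
        "promoted_at": datetime.datetime.utcnow().isoformat(),
    }
    (dest.parent / (dest.name + ".lineage.json")).write_text(json.dumps(lineage, indent=2))
    return dest
```

Whether the quants get keys to lake/raw or only to lake/curated then becomes a policy decision rather than an architectural one.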
Mohan’s view is not dogmatic, however; in fact, he sounded out Trifacta’s Wilson for his own opinion on the practice. Wilson, for his part, believes that companies will do what they need to do to solve emergent problems. In some, probably special cases, this could entail giving certain kinds of expert users early access to data at the point of ingestion; in most cases, data in the curated zone will probably suffice. “GlaxoSmithKline give[s] our product to their scientists, not [their] data scientists. Their chemists are doing a lot of last-mile stitching together of clinical trial data with experiment[al data] and asset data and medical records data … coming off of devices like inhalers,” he said.
“And in that particular case, you know, the data has gone through a certain amount of refinement and staging – even putting it into some canonical forms in a model of some sort – but the model … is never quite what somebody needs [in order] to do their last little bit of analysis,” Wilson continued.
“In my financial services example, that was a case where the data wasn’t very sensitive and the time-to-value was incredibly important, and in some sense starting in the raw [zone] had utility given the context and the types of algorithms and training data sets that they were looking to create.”
Get onta my cloud
Trifacta’s own evolution reflects these changes. It started out as a self-service tool for business analysts, data scientists, and other expert users. Its focus was on engineering mostly structured and semi-structured tabular data for use primarily in the on-premises enterprise, typically in combination with on-premises resources. Of course, Trifacta’s software has changed radically over the last decade, but, until recently, the company had not articulated an authentic cloud-native vision for itself. Just a few months ago, however, Trifacta announced its new Data Engineering Cloud.
Wilson describes Data Engineering Cloud as the proverbial idea whose time has come: Trifacta’s customers were demanding it, and a growing share of use cases (as well as analytic practices) now expect it, so the shift to the cloud – and, what is more important, alignment with cloud-native design principles, methods, and technologies – seemed like a logical necessity.
“The sheer variety of use cases that we could tackle, and … both the depth and breadth of what we did had become so much richer and so much broader,” he told Kavanagh. “While at the same time knowing that we are part of a puzzle, and there will be other decisions that are made around us, so we have to play nicely.
“So, if you don’t want to use the orchestration that we build in for your pipelining, that’s fine with us. There’s APIs and you can do it all with [Apache] Airflow or whatever your favorite orchestration engine of choice is. If you want to do all your versioning in Git, that’s fantastic,” Wilson concluded.
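Wilson isn’t describing Trifacta’s actual API here, but the “bring your own orchestration” pattern he sketches is easy to picture: an Airflow DAG triggers an external data preparation job and runs downstream publishing only after it completes. The endpoint, recipe ID, and task bodies below are hypothetical stand-ins, not a real integration.

```python
# A generic sketch of "bring your own orchestration": an Airflow DAG that
# triggers a hypothetical external data prep job over HTTP, then publishes
# its outputs downstream. The URL and job payload are invented for illustration.
from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator

def trigger_prep_job():
    # Hypothetical endpoint for the wrangling/prep service.
    resp = requests.post("https://prep.example.com/api/jobs", json={"recipe_id": 42})
    resp.raise_for_status()
    return resp.json()["job_id"]  # return value is pushed to XCom automatically

def publish_outputs(**context):
    job_id = context["ti"].xcom_pull(task_ids="trigger_prep_job")
    print(f"Publishing outputs of prep job {job_id} to the warehouse")

with DAG(
    dag_id="external_prep_pipeline",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    prep = PythonOperator(task_id="trigger_prep_job", python_callable=trigger_prep_job)
    publish = PythonOperator(task_id="publish_outputs", python_callable=publish_outputs)
    prep >> publish
```

Because the DAG and the prep recipes are just files, versioning them in Git, as Wilson suggests, falls out naturally.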
About Stephen Swoyer
Stephen Swoyer is a technology writer with more than 25 years of experience. His writing has focused on data engineering, data warehousing, and analytics for almost two decades. He also enjoys writing about software development and software architecture – or about technology architecture of any kind, for that matter. He remains fascinated by the people and process issues that combine to confound the best-of-all-possible-worlds expectations of product designers, marketing people, and even many technologists. Swoyer is a recovering philosopher, with an abiding focus on ethics, philosophy of science, and the history of ideas. He venerates Miles Davis’ Agharta as one of the twentieth century’s greatest masterworks, believes that the first Return to Forever album belongs on every turntable platter everywhere, and insists that Sweetheart of the Rodeo is the best damn record the Byrds ever cut.