If you think about it, we should probably fetishize the role of data in relation to machine learning (ML) and artificial intelligence (AI) engineering – or, for that matter, analytics of any type. That we don’t is a function of the fact that we expect too much from analytics – or, rather, from analytic models.
We assume that if we make our models smart enough, we can control for the deficiencies – poor conditioning, inconsistency, implicit bias[1], and so on – of the data we use to train them, or against which we expect them to operate in production. This is based on the belief that a sufficiently complex model (or ensemble of models) can extract or, in effect, invent features that are either not present in the input data at all or, if present, are buried amid a surfeit of noise.
It doesn’t work that way.
“For all of the hype around machine learning and AI … when you get it wrong, it’s the ultimate twenty-first century manifestation of the garbage-in/garbage-out cliché. If you’re training your models on bad data or even just the wrong data for the prediction you’re trying to make … then you will get the wrong result,” Martin Willcox, vice-president of technology for the EMEA region with data warehousing specialist Teradata, told host Eric Kavanagh during a recent episode of the Inside Analysis Podcast.
Willcox proposed an innovative remedy for this, which we’ll get to in a moment.
First, let’s say a bit more about this outsized esteem for models and modeling.
ML luminary Claudia Perlich described a striking example of this problem in a 2017 interview. She pointed out that models that are too good – i.e., too accurate, too uncannily predictive – are almost always too good to be true. In such cases, Perlich said, the model is probably doing what it was trained to do – without actually doing what it was intended to do. “If your model is getting too good, it’s almost always a problem. There was an example where we built a really good model that predicted breast cancer – except that it didn’t,” she told me. According to Perlich, the model had identified a specific “feature” – namely, a gray-scale image produced by an fMRI scan – that exactly correlated with a positive breast cancer diagnosis. As Perlich noted, the model was too accurate for its own good. “The model looked really, really good at identifying breast cancer” but “it really just learned that people in a treatment center are more likely to have breast cancer than people in a screening center.”
In ML engineering, this failure mode is usually discussed under the headings of overfitting and leakage: in the example above, the breast cancer-screening model extracted and selected a single predictive feature that was not just highly specific but exactly correlative – in effect, a proxy for the diagnosis itself. That specificity impedes the model’s ability to “learn” (to make new generalizations) if or when conditions change: short of another such glitch, no new feature it could select would have as much apparent predictive power as the one it latched onto. The upshot is a model that can neither respond to changing conditions nor correct itself.
Common approaches to dealing with overfitting are to tweak the model – e.g., an ML engineer might tune a model’s regularization parameters to control for overfitting – or to attempt to build more intelligence into the model itself: for example, by identifying and specifically controlling for edge cases.
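By way of illustration, here is a minimal sketch of the first approach – tuning a model’s regularization strength – assuming a scikit-learn stack; the dataset is synthetic, and every name in it is illustrative rather than drawn from the discussion above.

```python
# A minimal sketch of regularization tuning with scikit-learn (an assumed
# stack); the dataset is synthetic and stands in for real training data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=2000, n_features=30, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Smaller values of C mean stronger L2 regularization: large weights are
# penalized, which limits how tightly the model can wrap itself around
# quirks of the training set.
search = GridSearchCV(
    LogisticRegression(penalty="l2", max_iter=5000),
    param_grid={"C": [0.001, 0.01, 0.1, 1, 10]},
    cv=5,
)
search.fit(X_train, y_train)

print("best C:", search.best_params_["C"])
print("train accuracy:", search.score(X_train, y_train))
print("test accuracy:", search.score(X_test, y_test))
```

A widening gap between the training and test scores is the telltale sign that a model is memorizing noise rather than learning signal – precisely the symptom this kind of tuning tries to suppress.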
AI specialist Phil Harvey, co-author of Data: A Guide to Humans, a book dealing with data, AI, and empathy, cites the availability of new tools – such as Fairlearn and SMOTE – that promise to help address bias, overfitting, and related problems. As Harvey sees it, however, a focus solely or even primarily on modeling and coding is the ML engineering equivalent of Whac-a-mole: an exercise in identifying and controlling for biases and edge cases only as they’re discovered.
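As a rough sketch of what the class-rebalancing side of that toolchain looks like, the snippet below uses SMOTE from the imbalanced-learn package on a deliberately skewed synthetic dataset; it illustrates the mechanism only, not Harvey’s own workflow.

```python
# SMOTE synthesizes new minority-class rows by interpolating between
# existing ones, rather than simply duplicating them.
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# A 95/5 class split: the kind of imbalance that tempts a model to ignore
# the minority class altogether.
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05],
                           random_state=0)
print("before:", Counter(y))

X_resampled, y_resampled = SMOTE(random_state=0).fit_resample(X, y)
print("after:", Counter(y_resampled))
```

Fairlearn plays a complementary role, assessing and mitigating disparities in model behavior across demographic groups – but, as Harvey’s Whac-a-mole point suggests, neither tool substitutes for understanding the data itself.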
In Harvey’s experience, problems like this are, more often than not, a function of poor data quality: feed your models better, richer, etc. data, and they will extract and select more useful features.
“It’s largely Whac-a-mole with problems like this because most ML engineers just don’t invest in a real understanding of data quality or exploratory data science before diving into ML.”
“If you have good data you can use very simple stuff to get good results. You don’t really need to have really fancy stuff. But that is when the data is ‘good’. Which is very optimistic. All data is but a shadow on the wall of the cave,” he argued, alluding to Plato’s famous allegory of the cave in The Republic.
The quality of the data is, then, the thing. Willcox expanded on this in his discussion with Kavanagh.
“It turns out when you’re building a predictive model [that] actually the input data, the way you arrange those features that build the model, that train the model, that’s far more important to the accuracy of the model than actually the training part itself,” he said. “So if you’ve got a really bad model, typically changing the algorithm or changing the parameters [alone] doesn’t fix it, it’s changing the input data.”
The data pipeline jungle
Willcox is co-author, along with Teradata colleague Chris Hillman, of a new whitepaper called Analytics 123: Enabling Enterprise AI at Scale. In addition to exploring the vital, central role that data plays in ML and AI engineering, Analytics 123 outlines a program for reevaluating the processes, mechanisms, and tools we depend on to produce ML models. Borrowing from an influential 2015 paper on hidden technical debt in machine learning systems, Willcox and Hillman depict a virtual “pipeline jungle” of data feeds, with conditions more nearly reminiscent of an anarcho-capitalist dystopia than of a self-service utopia. In practice, they argue, data-engineering pipelines[2] tend to be at once multitudinous and multiparous, with pipelines giving birth to new pipelines giving birth to new pipelines. The result isn’t just a confusing, redundant, ungoverned, irreducibly fragile mess, but one that (for these very reasons) produces data of inconsistent quality.
“If we want machine learning to be predictable … then we’re going to have to change the productivity statistics that we have today, because right now data science teams in general are horribly inefficient,” Willcox told Kavanagh. “We tend to … build one pipeline for every single predictive model we want to build and … what we end up doing in many organizations is going all the way back to source systems or the data lake every single time to get more or less the same data to torture the data to produce more or less the same features to feed to a variety of very similar models in overlapping spaces.”
Willcox and Hillman propose a unique solution to this: what they call a “feature store.”
In essence, their idea is to capture and preserve known-good data pipelines that permit the engineering of common features. In the first place, they say, this is just good governance; in the second place, it promotes reuse, which should cut down on the wild profusion of what are, in effect, ungoverned data pipelines. In the third place, it stands to reason that many features can also be reused across (complementary) domains. Lastly, they see this as the most pragmatic way to scale ML and AI development in the enterprise. “You cannot run millions of models where you’ve got this kind of artisan[al] process where the data scientists start with the source data and they build a pipeline to transform data and then build a model on that pipeline. That one-pipeline-per-project approach … just doesn’t work at scale,” Chris Hillman, data science director EMEA with Teradata, told Kavanagh.
“One of the key things you can do is this idea of the feature store,” he continued, explaining that “when you’ve found stuff that is predictive, that works well in the model, you save it in this entity called a feature store” – i.e., a persistence layer for known-good data pipelines that have applicability across common scenarios. “And then when I come along as another data scientist, it sits in there waiting to be used. It’s been proven” as a training data set for producing known-good features, Hillman said.
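To make the idea tangible, here is a toy sketch of a feature store as Willcox and Hillman describe it: a registry that persists vetted feature definitions so a second data scientist can reuse them instead of rebuilding a pipeline. The FeatureStore class, its methods, and the retail-flavored features are hypothetical illustrations, not Teradata’s implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

import pandas as pd


@dataclass
class FeatureStore:
    """Maps feature names to the vetted transformations that produce them."""
    _registry: Dict[str, Callable[[pd.DataFrame], pd.Series]] = field(default_factory=dict)

    def register(self, name: str, fn: Callable[[pd.DataFrame], pd.Series]) -> None:
        # Persist a known-good feature definition under a shared name.
        self._registry[name] = fn

    def build(self, names: list, source: pd.DataFrame) -> pd.DataFrame:
        # Rehydrate the requested features from a source table using the
        # previously proven transformations.
        return pd.DataFrame({n: self._registry[n](source) for n in names})


store = FeatureStore()
store.register("weekly_units", lambda df: df.groupby("week")["units"].transform("sum"))
store.register("discount_pct", lambda df: 1 - df["price"] / df["list_price"])

sales = pd.DataFrame({
    "week": [1, 1, 2, 2],
    "units": [10, 5, 8, 12],
    "price": [9.0, 9.5, 8.0, 7.5],
    "list_price": [10.0, 10.0, 10.0, 10.0],
})

# A demand-forecasting model and a price-optimization model can now pull
# the same proven features rather than rebuilding overlapping pipelines.
features = store.build(["weekly_units", "discount_pct"], sales)
print(features)
```

Real products add persistence, lineage, and versioning on top of this bare registry – the kind of services Harvey raises below.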
Eternal recurrence of the same
There is a conceptual precedent for something like a feature store: the data warehouse itself. The data warehouse is a single, centralized repository for known-good business data. The data integration processes that feed the data warehouse are, in effect, strictly formalized data engineering pipelines.
Or, as a veteran data warehouse architect who spoke on background told me, think of a “feature store” in this account as a data mart of features. “This is the one-time-use OLAP versus multi-use data mart argument all over again,” this person said. “And it might make sense in a commercial context like this.”
Hillman described an analogous use case in his discussion with Kavanagh, referring to the complementary examples of demand forecasting and price optimization in the retail vertical.
“Say Martin builds a demand-forecasting model and I want to look at something like price optimization or promotion effectiveness or something like that,” he told Kavanagh. “Probably 95 percent of that data is the same as what was used in the demand forecasting model. So why would you start again? Why would you go back to the source data and build your own features and your own pipeline?”
This is the multi-use data mart argument in a nutshell. The requisite data is already there: cleansed, consistent, conditioned. Specific features can easily be derived from the data; some of these features will have applicability across complementary business functional areas; some will have global applicability. In other words, save the pipeline-building for the one-off, custom-engineered, niche-y problems – use cases that are analogous to the role of BI discovery in data warehouse architecture.
“The question really does boil down to how one manages core: what is core, how does core change, when does derivative, one-off stuff become core?” the veteran data warehouse architect told me.
For his part, AI specialist Harvey says that while the idea of a feature store is admittedly intriguing, it nonetheless strikes him as “somewhat optimistic,” at least from a practical perspective. “To store ‘a feature’ is conceptually ‘possible’, but in reality … it could prove so difficult as to be impractical.”
One obvious problem, Harvey points out, is that “you would spend all your time maintaining that.”
For example, he notes, a data mart is usually instantiated “in” a relational database, which provides a built-in set of essential maintenance services; data warehouse software provides still others. Both of these conceptual pieces would need to be replicated (mutatis mutandis) if something like a feature store were to work. Beyond basic maintenance, a feature store would need to provide services that are not yet available[3] – ensuring that data pipelines remain compatible with the specific version of the feature store software, and updating or deprecating pipelines that aren’t; ensuring compatibility between archived features and updated versions of ML models; and so on.
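Purely as a hedged sketch of one of those missing services – recording which pipeline and source-schema version produced an archived feature, and declining to serve stale ones – consider the following; every name here is invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class FeatureRecord:
    name: str
    pipeline_version: str  # version of the pipeline code that built the feature
    schema_version: int    # version of the source schema it assumes


def is_compatible(record: FeatureRecord, required_schema: int) -> bool:
    """A feature built against an older source schema is flagged as stale."""
    return record.schema_version >= required_schema


archived = [
    FeatureRecord("weekly_units", pipeline_version="1.4.0", schema_version=3),
    FeatureRecord("discount_pct", pipeline_version="0.9.2", schema_version=2),
]

usable = [r.name for r in archived if is_compatible(r, required_schema=3)]
stale = [r.name for r in archived if not is_compatible(r, required_schema=3)]
print("serve:", usable)           # safe to reuse as-is
print("flag for rework:", stale)  # candidates for updating or deprecation
```

Harvey’s point is that this bookkeeping, multiplied across thousands of features, schema changes, and model versions, is where the real maintenance burden lies.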
In sum, the idea has promise, but its fulfillment depends on at least a decade’s worth of technological innovation. Think of it, then, as a conceptual marker – not unlike this paper, first published in 1988.
[1] Don’t forget, however, that data itself can and does encapsulate bias.
[2] For the purposes of this article, think of a data pipeline as a sequence of operations designed to manipulate (or engineer) data, either by transforming it or by combining it with other (transformed) pieces of data.
[3] Although the building blocks for designing services like these are widely used and commonly available.