The use of third-party or “alternative” data in analytics is not a new thing.

There’s a sense, however, in which it is a newly feasible thing. 

Today, more alternative data than ever before is being generated; at the same time, companies have access to cost-effective tools they can use to ingest and analyze this data. For better and worse, then, it is possible to speak of measurable demand for alternative data. “Historically what have companies done to understand their business, their customers, their products, how things are going, that’s typically with … first-party data, meaning we have data on what we sold to whom, when, etc.,” Kavanagh told DM Radio listeners. Broadly speaking, alternative data is external, or third-party data, he explained.

Like it or not, there is a market for this data – one that users or subscribers opt into by default, simply by using an app or subscribing to a particular service. As Kavanagh described it, this data is sold and resold to private companies, non-profits, and sometimes local, state, or federal government agencies.

Some researchers project that the market for saleable data will explode over the next half decade; similarly, Gartner is now advising some of its customers about data monetization strategies. One driver is that companies are collecting more data than ever before from customers who have opted into data collection – usually by default. On the one hand, buying and selling this data is legal, and use of this data – especially when it is integrated with first-party data – is potentially valuable. On the other hand, the onus is on companies to use this data legally, ethically, and – how else to put it? – non-creepily.

Saket Saurabh is co-founder and CEO of Nexla, a data management start-up that specializes in integrating and managing alternative data. He sees interest in and curiosity about alternative data as in part driven by uptake and success with analytics: a company that successfully gleans insights from its own first-party (internal) data naturally gets curious about other potentially useful sources of data.

“The immediate question then become[s] for people … what other data is out there that can help me make those decisions that are helping me run my business better,” he told Kavanagh.

Companies that want to make use of alternative data should expect to encounter a number of familiar challenges, Saurabh told Kavanagh. “When you think about … alternative data, the first thing is … where is it, what am I looking for? The immediate thing after that is: How do I integrate with it? How do I make it usable? How do I bring it into my system? How do we model it and everything else right?”

If you collect and analyze it, the insights will come?

The thing is, alternative data is everywhere – and much of it is free. Marta Lopata is co-founder and chief growth officer with Thinknum Alternative Data, a startup that helps businesses identify and analyze useful sources of alternative data. In response to a question from Kavanagh, Lopata brought up a particularly compelling alternative data use case from just this past spring: the infamous GameStop short squeeze. She says her company uses “crawlers” – basically, automated software robots – to trawl the web harvesting data about hundreds of thousands of different companies. 

As the GameStop kerfuffle demonstrated, social media – and quasi-curated sites such as Reddit, in particular – is a potentially important resource for useful information about stocks. 

It makes sense. During the dot-com implosion and its aftermath, alert investors and some analysts took to crawling proto-social media sites such as [censored], looking for information about vulnerable stocks. In l’affaire GameStop, Reddit cemented its status as a go-to data source.

“We actually started crawling Reddit in August 2020, so like half a year ahead of this entire, you know, situation that was happening [in] January [of 2021] … and we started basically crawling stock mentions on all the subreddits that are related to finance to understand what stocks are being talked about the most, and kind of create on a system to track social-media mentions,” she told Kavanagh.

“This [is] getting real momentum in terms of risk due diligence for the investment community,” Lopata said. “And so early on this year, we received a lot of interest from institutional investors … that are now using Reddit data sets to track what the retail investors are talking about and assessing how that could possibly … influence their positions … and that’s kind of a new application of alternative data.”
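The mention tracking Lopata describes can be sketched in a few lines. What follows is a minimal illustration, not Thinknum’s method: it assumes the post text has already been collected, and the watchlist, the regular expression, and the sample posts are all invented for the example.

```python
import re
from collections import Counter

# Hypothetical watchlist of ticker symbols; a real system would load a full list.
WATCHLIST = {"GME", "AMC", "BB"}

# Match an optional "$" followed by 1-5 uppercase letters as a whole word.
TICKER_PATTERN = re.compile(r"\$?\b([A-Z]{1,5})\b")


def count_ticker_mentions(posts, watchlist=WATCHLIST):
    """Count how many posts mention each watched ticker (once per post)."""
    counts = Counter()
    for text in posts:
        # Use a set so a ticker repeated within one post is counted once.
        found = {m for m in TICKER_PATTERN.findall(text) if m in watchlist}
        counts.update(found)
    return counts


# Invented sample posts, for illustration only.
sample_posts = [
    "GME to the moon, holding $GME and a little AMC",
    "Anyone else watching BB today?",
    "I sold my AMC shares this morning",
]

mentions = count_ticker_mentions(sample_posts)
for ticker, n in mentions.most_common():
    print(ticker, n)
```

Counting once per post keeps a single enthusiastic poster from dominating the tally. A production system would also have to cope with tickers that double as ordinary words, which this sketch sidesteps by filtering against a known watchlist.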

Even though alternative data has a long and not-so-obvious history, 2020 was its break-out year, said Mark Fleming-Williams, host of the Alternative Data Podcast and communications director with Exabel, a company that specializes in helping investors make use of alternative data. This is mostly thanks to SARS-CoV-2, he suggested. “[I]n this crazy year when nothing made sense and you couldn’t compare to previous years at all what was going to happen in the market, … [you could analyze] credit-card transaction data, [which] is released a week after the fact,” he told Kavanagh.

As Fleming-Williams puts it, even though it was obvious that, for example, food-services vendors were losing money, analysis of credit-card data permitted investors to determine just how much they were losing – sometimes months before these companies were due to release their quarterly financial statements: “So you could actually see what’s going on in Chipotle with the fact that nobody’s going to Chipotle … because of the … pandemic, so it was incredibly valuable for … investors during that time.”
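To make the arithmetic concrete, here is a rough sketch of the kind of comparison such data allows: total card spend for one week set against the same week a year earlier. The function names and the transaction figures are invented for illustration; they are not real data for Chipotle or any other company, and real transaction panels need careful normalization first.

```python
from collections import defaultdict
from datetime import date


def weekly_totals(transactions):
    """Sum spend per (ISO year, ISO week) from (date, amount) pairs."""
    totals = defaultdict(float)
    for day, amount in transactions:
        iso = day.isocalendar()  # (ISO year, ISO week, ISO weekday)
        totals[(iso[0], iso[1])] += amount
    return dict(totals)


def year_over_year_change(totals, year, week):
    """Percent change in spend versus the same ISO week one year earlier."""
    current = totals[(year, week)]
    previous = totals[(year - 1, week)]
    return (current - previous) * 100.0 / previous


# Invented toy transactions: ISO week 15 of 2019 and of 2020.
transactions = [
    (date(2019, 4, 8), 600.0),
    (date(2019, 4, 10), 400.0),
    (date(2020, 4, 6), 350.0),
    (date(2020, 4, 9), 250.0),
]

totals = weekly_totals(transactions)
change = year_over_year_change(totals, 2020, 15)
print(f"Week 15 spend changed {change:.1f}% year over year")
```

With these made-up figures, week 15 spend falls from 1,000 to 600, a 40 percent drop. Because the underlying data arrives about a week after the fact, a figure like this could be tracked continuously rather than waiting for a quarterly statement.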

An increasingly level playing field …

Finance was out in front of the alternative data trend, but Nexla’s Saurabh says companies in all verticals are open to purchasing or otherwise acquiring data from third-party sources, in addition to using data available via the web. “We are seeing this across the board,” he told Kavanagh, explaining that two of Nexla’s reference customers – Instacart and Poshmark – both compete in the e-commerce space. What is more, analysis of alternative data is common in the cybersecurity space, too.

In general, Saurabh argued that alternative data is especially useful in the context of experimental or exploratory analytics practices, such as data science and machine learning (ML) engineering.

“[D]ata scientists … come up with an idea and say, you know … I think [demand] is affected by these things outside [the company] as well. That’s an idea you have, but to test that idea, you need the [outside] data,” he continued. “So, the big challenge for many companies then becomes … how do we go from that idea to that data that can help me prove that idea – and, yes, [prove] that it actually is effective? Then, how do I start to do this on a regular basis?”

… with surprisingly low barriers to entry 

In his discussion with Kavanagh, Saurabh echoed the insight of Thinknum’s Lopata, who adverted to the widespread availability of free alternative data, e.g., via social-media sites such as Reddit.

In the future, he predicted, it will be common for business partners to exchange data with one another, for suppliers and their customers to exchange data, and for companies to exploit existing business relationships to acquire and trade data. “[W]hen two companies work together, more and more often I think it will become even more common, you know there will be a flow of data between companies because that’s just going to be how they will work together, right?” he told Kavanagh. 

“Oftentimes data is just [a] natural part of your business agreement or arrangement with companies, you know because you’re a supplier to them you’re a merchant to them or, or they’re in your marketplace.”

But not completely devoid of challenges

Of course, accessing this data is one thing: ingesting it; cleansing it; modeling and storing it for fast, efficient retrieval; and, not least, analyzing it is something else again. So, too, is governing it: not only managing data access, distribution, and manipulation, but producing (and maintaining) data lineage information, generating (and maintaining) metadata, and so on. The last and most important piece has to do with the responsible or ethical use of alternative data: applying masking or anonymization techniques if and when appropriate, identifying both ethical and unethical use cases, and so on.
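As one small example of what masking can look like in practice, the sketch below replaces direct identifiers with keyed hashes before a record is shared or analyzed. It is an illustration under stated assumptions, not a description of any vendor’s product: the field names, the sample record, and the key are hypothetical. Strictly speaking this is pseudonymization rather than full anonymization, since anyone holding the key can re-derive the same tokens.

```python
import hashlib
import hmac


def pseudonymize(value, secret_key):
    """Replace an identifier with a keyed SHA-256 digest (hex string)."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def mask_record(record, sensitive_fields, secret_key):
    """Return a copy of the record with the named fields pseudonymized."""
    masked = dict(record)  # shallow copy; the original record is not modified
    for field in sensitive_fields:
        if field in masked and masked[field] is not None:
            masked[field] = pseudonymize(str(masked[field]), secret_key)
    return masked


# Hypothetical key and record, for illustration only.
SECRET_KEY = b"replace-with-a-key-from-a-secrets-manager"
record = {"email": "jane@example.com", "zip": "30301", "amount": 42.5}

masked = mask_record(record, ["email"], SECRET_KEY)
print(masked)
```

Because the same input and key always yield the same token, masked records can still be joined on the pseudonymized field. That keeps the data useful for analysis, and it is also the reason the key has to be protected as carefully as the identifiers it hides.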

According to Saurabh, the data management dimension of this challenge – i.e., ingesting, storing, modeling, maintaining, and governing the use of alternative data – is the métier of Nexla’s platform. 

“The thing is that the data is not easily consumable in high volume from many places,” he told Kavanagh. “You know, I can look at the records and say ‘Oh yeah, that’s data,’ but for a machine or for a system, there has to be a lot more [intelligence]. So that’s kind of where Nexla comes in.”

But these issues are fodder for a series of more detailed, more technical stories. Stay tuned.

About Stephen Swoyer

Stephen Swoyer is a technology writer with more than 25 years of experience. His writing has focused on data engineering, data warehousing, and analytics for almost two decades. He also enjoys writing about software development and software architecture – or about technology architecture of any kind, for that matter. He remains fascinated by the people and process issues that combine to confound the best-of-all-possible-worlds expectations of product designers, marketing people, and even many technologists. Swoyer is a recovering philosopher, with an abiding focus on ethics, philosophy of science, and the history of ideas. He venerates Miles Davis’ Agharta as one of the twentieth century’s greatest masterworks, believes that the first Return to Forever album belongs on every turntable platter everywhere, and insists that Sweetheart of the Rodeo is the best damn record the Byrds ever cut.