“Reverse ETL” Seems Like Just the Thing to Get at Your Data in the Cloud

Do you ever get confused by all of the new terms and trends in tech, especially in data and analytics?

For example, when you hear a term like “reverse ETL,” do you automatically think: “Um, LTE? Isn’t that a telecommunications standard?” Or when you hear an acronym like “ELT,” does your gut tell you the person must have misspoke? Have you asked yourself: Do these words have real, actual meanings?

Or is all of this just noisy tech-marketing folderol?

How about a little from Column A, a little from Column B?

What if “reverse ETL” is a marketing term, and what if it also describes a common problem for which tech entrepreneurs have devised new solutions? And so what if “ELT” seems like a lapsus linguae, an accidental transposition of letters? Maybe it also designates a new approach to a familiar problem?

Host Eric Kavanagh discussed these issues with several panelists on a recent episode of Inside Analysis, a weekly, data management-themed radio program. One of Kavanagh’s guests was Zack Khan, a founding member of start-up Hightouch, which develops and markets reverse ETL software and services. Khan, who heads up marketing for Hightouch, broke down what’s at stake with reverse ETL.

“The challenge is … [that] you have this awesome warehouse, you’re probably creating awesome data models … [and you’re] combining all that data to make useful insights. But today, they’re pretty much trapped in that warehouse,” Khan explained to Kavanagh, arguing that traditional self-service analytic discovery tools (he cited Google’s Looker) offer, at best, a Band-Aid solution to this problem.

And what, for the love of Elon Musk, is this problem? Khan framed it, succinctly, as follows: the data and modeling logic you instantiate in your data warehouse is accessible to just a small proportion of consumers. Individuals are “using [this data] for dashboards: so you have a Looker dashboard that your teams look at,” he told Inside Analysis listeners. “But really, that data is not really being used outside of that. And so we see [in this] last-mile problem the call for a modern data stack with reverse ETL”.

This “last-mile” problem in analytics is arguably being exacerbated, instead of improved, by easy-to-get-started-with cloud platform-as-a-service (PaaS) data warehouse solutions like Snowflake.

This isn’t to ding Snowflake. Khan’s point is that services like Snowflake make it easier for individuals to integrate data and design data modeling logic to support new types of dashboard views, new types of analytics and so on. The problem is these new assets are available to just a small fraction of potential consumers! The problem has to do with creating, in effect, reverse integration, or reverse ETL, out of the data warehouse in order to make data and modeling logic available to cloud apps.

“Reverse ETL is moving data from your data warehouse, like Snowflake, into all of your business tools,” he said, using the Salesforce cloud as an example. “Let’s say for your sales team, they need product-usage data. They need to understand how customers are using their products in Salesforce so they can customize their outreach …. All of this stuff is powered through this reverse ETL process.”

How does it work?

“Reverse ETL” software such as Hightouch runs on top of (so to speak) a customer’s data warehouse.

Okay, you ask, but how does that work? Khan explains it as follows. First, he notes, the good thing about Hightouch’s scheme is that its software neither stores nor manages the customer’s data. The data warehouse does that, as it should. So, you aren’t duplicating data, and you don’t have to worry about maintaining and paying for separate stores. “We don’t store any of your data. All we’re doing is, first … querying your data. So typically, you enter a SQL query, right? Or we can read your existing models … with a tool like DBT,” he told Inside Analysis listeners. “And the next piece is basically we send that data that you defined in your model to a business tool. There’s no storage happening on our end.”

If this sounds deceptively simple, that’s because it is. In the background, Hightouch does a few things.

It begins, Khan explained, with a basic SQL query. From there, “the second piece is basically defining a mapping towards … the data model, and the fields that you want to update in your [destination] tool.”

In other words, Hightouch queries against one or more predefined models and then surfaces the results of this query “in” the appropriate downstream application – say, Salesforce.

“There’s … a field you want to update, which is like lifetime value of a customer, or how many times they’ve logged into your app in the last week, or what parts of the website they’ve visited,” he said. “You can map … let’s say, I have a column in my SQL query called ‘lifetime value.’ And I map that to the Salesforce field called ‘lifetime value of customer.’ So, we have a simple UI for you to do that.”

The “ETL” part of “reverse ETL” is more than just clever expropriation of a preexisting term. In the first case, Hightouch really is performing the equivalent of reverse ETL in extracting data (the SQL query), transforming it (by forcing it through a model/models), and then loading it into a downstream tool.

In the second place, another critical aspect of ETL is that ETL jobs are supposed to be reusable and (to this end) schedulable. An ETL job is “governed” in the sense that it is supposed to behave exactly the same way each and every time it runs. If it doesn’t behave exactly the same way – if something goes wrong – it is designed to trigger an alert and/or attempt to recover. (By restarting the job from scratch or by reverting to a step in the data-processing pipeline prior to where the error occurred.)

Hightouch implements ETL in this second sense, too. “The last piece is defining how frequently you want this thing to run. Let’s say you wanted to run hourly, daily, every five minutes. And that’s it,” he explained. An added bonus is that Hightouch doesn’t have to persist data with the downstream service. “From a privacy standpoint, you don’t have to worry about any kind of store if you don’t want to store data, we don’t need to,” he said “It’s purely just … having the [data in the] warehouse available to all of your business teams, because [otherwise] business teams wouldn’t have access to it.”

Hey cloud providers: You missed a spot!

If this sounds like the kind of problem that shouldn’t be a problem … well, who could argue with that?

In the on-premises enterprise, business applications and workflows query the warehouse all the time.

In fact, providers of packaged business application software give organizations tools and expose APIs they can use to design what Hightouch calls “reverse ETL” integrations from their software to the data warehouse. They provide software development kits (SDK) organizations can use to build custom apps that query both their own business apps and the data warehouse. They do this because these apps need to get data from the warehouse, whether they’re packaged (ERP or CRM) or custom-built.

In the cloud, this is hit or miss. Salesforce is an especially egregious example: data must be accessible “inside” the Salesforce environment if Salesforce apps are to access it. Hence the need for something like reverse ETL to enable customers to do something they not only were used to doing in their on-premises environments, but which common-sense, the requirements of business, and the incentive structures of a free market (in which customers put pressure on vendors/providers to accommodate their needs) should have produced.

What went wrong – and why? The answer to this question is fodder for another article!

The previous discussion of reverse ETL just scratches the surface of a fascinating episode of Inside Analysis. If what you’ve read so far interests you, be sure to check out the rest of the Inside Analysis broadcast. Host Eric Kavanagh and guest Zack Khan drill down more into reverse ETL, accessing data in the cloud, building analytics, and other issues

About Vitaly Chernobyl

Vitaly Chernobyl is a technologist with more than 40 years of experience. Born in Moscow in 1969 to Ukrainian academics, Chernobyl solved his first differential equation when he was 7. By the early-1990s, Chernobyl, then 20, along with his oldest brother, Semyon, had settled in New Rochelle, NY. During this period, he authored a series of now-classic Usenet threads that explored the design of Intel’s then-new i860 RISC microprocessor. In addition to dozens of technical papers, he is the co-author, with Pavel Chichikov, of Eleven Ecstatic Discourses: On Programming Intel’s Revolutionary i860.

“Reverse ETL” Seems Like Just the Thing to Get at Your Data in the Cloud

About Vitaly Chernobyl

Related Articles: