We live – and thrive – in an era of API-ification. We expose APIs willy-nilly for all sorts of purposes, not all of which are especially well thought out. Nor are we as forward-thinking as we should be about how we use APIs to request data. Like it or not, as information and services continue to shift to the software-as-a-service (SaaS) and platform-as-a-service (PaaS) clouds, it behooves architects, developers, and data consumers to think about and promote the sustainable use of data-access APIs.
“It’s absolutely ridiculous trying to take a terabyte or petabyte of data and put it into a data warehouse through an API,” argued GRAX CEO Joe Gaska during a recent Inside Analysis webinar.
Relax: Gaska isn’t saying it’s ridiculous to exploit cloud APIs in order to request or exchange data.
After all, the only way to access cloud services, much less to extract cloud data, is via the APIs these services expose. Rather, Gaska’s point is that this data integration status quo is unsustainable.
Right now, this status quo is informal and ad hoc – a kind of anarcho-capitalist free-for-all – such that hundreds of consumers might be requesting data from the same cloud APIs at the same time.
So, to cite an everyday example, data scientists, ML engineers, analysts, BI discoverers, etc. are accustomed to connecting directly to Marketo, Salesforce, SAP, Snowflake, Workday, and other services in order to access and extract data. Sometimes, these consumers poll API endpoints for data on an as-needed basis: for example, once a day, or whenever they’re actively working with data. Sometimes, they poll for data continuously – dozens of times per minute. This model is essentially decentralized and ungoverned. It is the symmetrical inverse of the centralized, reusable, governed data integration processes that feed the data warehouse, or, for that matter, the governed zones in a data lake.
“There’s a lot of different consumers downstream,” observed Gaska, who believes that too few businesses are thinking about what this means in practice. He argues that enterprise consumers of cloud services need to start thinking about a more sustainable model for accessing, managing and governing data in the cloud. And because the data-access status quo is inefficient and costly, this is not strictly a technology issue. “That’s more of a business decision than anything,” Gaska said.
What the @#$! are API rate limits and why do they matter?
Today, it is not unusual for multiple machine or human consumers to poll the same cloud service at the same time in order to request data. This is workable in a scenario in which just a few human or machine consumers are requesting data at any one moment; however, in scenarios in which hundreds or potentially thousands of consumers are concurrently requesting data, things quickly break down.
In the first place, this is wasteful: it puts extra strain on cloud infrastructure and forces providers to incur costs (which get passed on to consumers) in order to accommodate a surfeit of demand. In the second place, it is ungoverned: at any given moment, n consumers might be siphoning (the same) data from the same cloud service and transforming it in different ways to suit different purposes. (But where is this data coming from? How is it being transformed? How will it be used? Is it being persisted? Is it being shared? If so, how – and with whom?) In the third place, it is costly: if multiple consumers siphon (the same) data from the same cloud services at about the same time, the organization could incur data egress charges. More likely, because almost all cloud providers bill using API-based metrics, the organization will incur unbudgeted API access overage charges. In the fourth place, this scenario is potentially harmful. After all, not all consumers require up-to-the-minute (i.e., right-time) or up-to-the-second (real-time) data. Right now, however, consumers that do require time-sensitive data wind up competing with consumers that do not. This precipitates a tragedy-of-the-commons-like situation: depending on the provider’s policies, it is conceivable that no consumers will be able to get the data they need.
These third and fourth problems have to do with what are called cloud API rate limits.
Most cloud providers limit the rate at which subscribers can send requests to their API endpoints. As a practical matter, providers usually limit the number of API requests subscribers can make during a fixed period of time. API rate limits can be hard or soft: for example, some cloud providers permit a fixed number of API calls (say, 1,200) during a fixed duration (say, 120 seconds); if or when an app/service exceeds this limit, the cloud service will just stop responding to the subscriber’s requests; alternatively, the provider may continue to accept API calls but will charge the subscriber a fixed amount for overages. Each provider has its own criteria and policies. (Some providers permit a fixed number of API requests each month – for example, a maximum of 25 million.) With this in mind, it is critical that organizations understand the behavior of their apps and services, as well as the requirements of the business processes, use cases, or workloads that these programs support.
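To make these mechanics concrete, here is a minimal Python sketch of a client-side throttle that respects a fixed-window call budget and backs off when a provider answers with HTTP 429. The endpoint, the 1,200-calls-per-120-seconds budget, and the Retry-After handling are illustrative assumptions, not any particular provider’s published policy.

```python
import time

import requests

# Hypothetical endpoint and limits, for illustration only; real providers
# publish their own quotas and overage policies.
API_URL = "https://api.example-saas.com/v1/records"
MAX_CALLS = 1200      # allowed calls...
WINDOW_SECONDS = 120  # ...per fixed window

calls_made = 0
window_start = time.monotonic()


def throttled_get(params: dict) -> dict:
    """Issue a GET while honoring a client-side budget; back off on HTTP 429."""
    global calls_made, window_start

    # Reset the budget when the fixed window rolls over.
    if time.monotonic() - window_start >= WINDOW_SECONDS:
        calls_made, window_start = 0, time.monotonic()

    # If the budget is exhausted, wait out the remainder of the window.
    if calls_made >= MAX_CALLS:
        time.sleep(WINDOW_SECONDS - (time.monotonic() - window_start))
        calls_made, window_start = 0, time.monotonic()

    resp = requests.get(API_URL, params=params, timeout=30)
    calls_made += 1

    # "Soft" limits often answer 429 with a Retry-After hint rather than billing overages.
    if resp.status_code == 429:
        time.sleep(int(resp.headers.get("Retry-After", "60")))
        return throttled_get(params)

    resp.raise_for_status()
    return resp.json()
```

Multiply this kind of budget-keeping across hundreds of uncoordinated consumers hitting the same endpoints and the case for centralizing access becomes clearer.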
Towards a better, more sustainable model
To quickly recap, the data-access status quo is a model in which dozens, hundreds, even thousands of human and machine consumers connect directly to SaaS and cloud PaaS services to access data.
This ad hoc model is wasteful, ungoverned, costly, and unsustainable.
Thankfully, we do not lack for would-be alternatives. One of the most common is the concept of a data fabric, which makes use of technologies (namely, data virtualization and metadata cataloging services, among others) that are mature and fairly well understood. An especially intriguing model is that of data mesh architecture, which – to truly do it justice – would merit a detailed deep dive of its own.[1] For the purposes of this article, and at the risk of reinforcing the very type of centralized status quo that data fabric and data mesh proponents like to criticize (with some justification) as unsustainable, this analysis will posit the use of a central repository – something like a data lake – as a way to address the problems associated with cloud data access, management, and governance.[2] For the majority of potential consumers, and for most use cases, it makes sense to centralize cloud data access: that is, to designate a single upstream repository (e.g., a data lake) to act as a kind of access go-between.
The logic is simple enough: the repository ingests data from disparate SaaS and PaaS sources, optionally engineers or cleanses it, and makes it available to downstream consumers. The repository could be hosted on- or off-premises, although most subscribers would probably opt to host it via cloud object storage. (Yes, service providers still charge for API-based access to cloud storage, but – with respect to API rate limits – subscribers tend to have more headroom. For example, AWS permits up to 5,500 GET requests per second per prefix in an S3 bucket.)[3] It would expose several different kinds of programmatic interfaces – e.g., Python, R, Java, Julia, Scala – in addition to SQL query, so consumers would be able to get at data using their own preferred tools or services. At a nuts-and-bolts level, the repository is populated by a series of governed, repeatable data flows that extract data from cloud services as often as is practicable; in practice, this means fresh data could be made available to consumers almost as quickly as it gets ingested into the repository. This model also permits consumers to create and customize their own data extracts by pre-defining cleansing rules and transformations. Optionally, the repository could be subdivided into different zones that correspond with different data governance and/or data quality regimes: from the less to the more governed. It could also feed other downstream analytics repositories, such as data warehouses and/or data marts.
This description sounds a lot like what is usually called a data lake; again, at a high level, it is.
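As a rough illustration of what one of these governed, repeatable data flows might look like, the Python sketch below lands a batch of extracted records in the raw zone of a hypothetical object-storage repository as compressed Parquet, partitioned by ingestion date. The bucket name, zone prefix, and extract_from_source() helper are stand-ins, not a reference implementation.

```python
import datetime as dt
import io

import boto3
import pyarrow as pa
import pyarrow.parquet as pq

# Placeholder names: the bucket and zone prefix stand in for whatever
# repository layout (and governance regime) the organization adopts.
BUCKET = "acme-data-lake"
RAW_ZONE = "raw/salesforce/opportunity"


def extract_from_source() -> list[dict]:
    """Stand-in for a governed, rate-limit-aware extract from a SaaS API."""
    return [{"id": "006xx0000001", "amount": 125000.0, "stage": "Closed Won"}]


def land_in_raw_zone(records: list[dict]) -> str:
    """Write one batch to the raw zone as Parquet, partitioned by ingestion date."""
    table = pa.Table.from_pylist(records)
    buf = io.BytesIO()
    pq.write_table(table, buf, compression="snappy")

    key = f"{RAW_ZONE}/ingest_date={dt.date.today().isoformat()}/batch.parquet"
    boto3.client("s3").put_object(Bucket=BUCKET, Key=key, Body=buf.getvalue())
    return key


if __name__ == "__main__":
    # Run on whatever cadence is practicable: hourly, daily, or near-continuously.
    print("landed:", land_in_raw_zone(extract_from_source()))
```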
But GRAX’s unique take on this is to make a virtue out of a necessity. To start with, GRAX recasts cloud data backup as a way to centralize, simplify, and govern access to historical business data.
A different spin on sustainable cloud data access
GRAX’s take on a sustainable model for cloud data access is unique; at a very high level of abstraction, however, it is consistent with the data lake-like model described in the previous section.
Albeit with a clever – indeed, a provocative – twist.
To wit: in both the SaaS and PaaS cloud, subscribers are usually responsible for backing up their data. GRAX’s insight is that this backup process isn’t all that different from – to cite an obvious example – the process of synchronizing and replicating data between a data source and a data target. Today, for example, an organization might sync and replicate data from its on-premises OLTP systems to a PaaS analytics platform such as Snowflake or Yellowbrick; in GRAX’s scheme, subscribers sync and replicate data from the Salesforce cloud to their own cloud storage repositories. To this end, GRAX’s Backup and Recovery service supports highly granular data replication, data synchronization, and high frequency backup between Salesforce and cloud object storage hosted in either AWS or Azure.
Think of this as the necessary aspect of GRAX’s pitch: i.e., cloud data backup is a necessity, GRAX helps simplify it – but does so on the customer’s terms. So instead of backing up Salesforce data to its own service, GRAX permits subscribers to create and manage their own cloud storage. So far, so good.
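To make the sync-and-replicate pattern concrete, here is a minimal Python sketch of an incremental pull that lands recently modified Salesforce records as a timestamped snapshot in the subscriber’s own bucket. It illustrates the general pattern only, not GRAX’s implementation; the credentials, bucket name, and object layout are placeholders, and the simple_salesforce client is just one convenient way to call the Salesforce REST API.

```python
import datetime as dt
import json

import boto3
from simple_salesforce import Salesforce  # third-party Salesforce REST client

# Placeholders only; this sketches the general pattern, not GRAX's implementation.
sf = Salesforce(username="user@example.com", password="...", security_token="...")
s3 = boto3.client("s3")
BUCKET = "acme-salesforce-backup"


def incremental_backup(last_run: dt.datetime) -> int:
    """Pull Accounts modified since the last run and land them as a point-in-time snapshot."""
    since = last_run.strftime("%Y-%m-%dT%H:%M:%SZ")
    soql = f"SELECT Id, Name, LastModifiedDate FROM Account WHERE LastModifiedDate > {since}"
    records = sf.query_all(soql)["records"]

    # Each run lands under its own timestamped prefix, preserving data history
    # rather than overwriting the previous copy.
    snapshot = dt.datetime.utcnow().strftime("%Y%m%dT%H%M%SZ")
    s3.put_object(
        Bucket=BUCKET,
        Key=f"account/snapshot={snapshot}/records.json",
        Body=json.dumps(records, default=str),
    )
    return len(records)
```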
The virtue aspect of GRAX’s pitch is grounded in the fact that most organizations use their Salesforce data to feed different types of analytic practices, whether it’s a data warehouse (still the workhorse of core operational and production reporting); any of several self-service use cases (e.g., business analysis, BI discovery); or ops-like practices such as ML/AI engineering. This goes to the ingenuity of GRAX’s vision, which is two-fold: first, the backup use-case is mostly identical with (and can be used to feed) the analytics use case; second, an incremental backup service – which preserves different versions or point-in-time snapshots of data history – can be used to power historical business analysis.
In other words, GRAX is proposing a modified-limited end-run around the data warehouse, which, in most organizations, is still the de facto repository for “time-variant” (i.e., historical) business information.
In practice, the warehouse not only ingests and stores operational data, but maintains a history of how this data has changed over time. This historical dimension permits analysts, managers and line-of-business decision-makers, high-level executives, etc. to ask different kinds of business questions about what has changed and why. GRAX isn’t proposing to eliminate the data warehouse, as such, but, instead, offers a pragmatic alternative: namely, a time-variant repository of all Salesforce data, as distinct from just a derived subset of this data. (In practice, just a subset of Salesforce data is captured in the data warehouse. This is a feature, not a bug, of data warehouse architecture.) So instead of going to the warehouse to get the data that they need – data that either is not present (because it is not persisted into the data warehouse) or is unavailable (because it has not yet been provisioned) – BI discoverers, analysts, data scientists, etc. can get it directly from the organization’s own cloud backup archive.
This last is not just a kitchen sink of all Salesforce data, but, rather, a point-in-time archive of the history of this data. In other words, BI discoverers, analysts, data scientists, etc. could go to this archive to extract current data, year-ago data, or (assuming it’s available) five-years-ago data to model and analyze trends over time. The benefits of a scheme like this are two-fold: first, if analysts or data scientists require right-time data, it can be ingested and made available in the cloud archive before it can be engineered for and/or ingested by the data warehouse. Second, the archive is home to all Salesforce data, a majority of which will not be captured and preserved in the data warehouse.
There’s something else, too. GRAX’s technology is both vendor- and data architecture-neutral. On the one hand, then, even if the organization discontinues its use of GRAX’s Backup & Restore, Data Archive, or Time Machine services, the historical Salesforce archive is still owned and maintained by the organization itself – in its own cloud, which usually consists of a scalable object storage service such as Amazon S3 or Azure Blob Storage. As for the format of this archive, GRAX archives Salesforce data in Parquet, an open, compressed columnar format. The upshot is that consumers can extract data directly from Parquet into their preferred tools or engines (Power BI, Salesforce Tableau, Amazon Redshift, Azure SQL Data Warehouse, Snowflake, etc.); expert users – i.e., business analysts, data scientists, data and ML engineers – can build pipelines that either invoke cloud services (AWS Glue, Azure Data Factory, etc.) or spin up compute engines (Amazon EC2, Azure Spark, etc.) to cleanse and engineer this data.
On the other hand, GRAX’s technology is data architecture-neutral. It works just as well with a data lake – for example, the Salesforce archive can also be used to populate the lake – or, even better, as part of a data fabric or data mesh architecture. In either of these scenarios – data fabric or data mesh – the line of business can optionally retain ownership and management of the Salesforce archive, too. (In data mesh architecture, especially, the line of business is expected to own and manage its data.) Once again, the history lives in the customer’s own cloud. The organization uses data virtualization to expose this data to consumers in other business function areas. Consumers use metadata catalog technology to discover data; ETL developers, business analysts, and other expert users can likewise construct unified business views that integrate the Salesforce data with data from other business function areas.
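Because the archive is just Parquet sitting in the customer’s own object storage, any engine that reads Parquet can query it in place, whichever of these architectures surrounds it. The sketch below uses DuckDB as one such consumer-side tool to compare a current snapshot of pipeline value against a year-ago snapshot; the bucket name, partition layout, snapshot dates, and credentials are all hypothetical.

```python
import duckdb  # embedded analytics engine that queries Parquet in place

con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")

# Hypothetical S3 settings; supply real region and credentials for your bucket.
con.execute("SET s3_region='us-east-1'")
con.execute("SET s3_access_key_id='...'")
con.execute("SET s3_secret_access_key='...'")

# Hypothetical archive layout: Opportunity snapshots partitioned by snapshot date.
query = """
    SELECT snapshot, SUM(amount) AS pipeline_value
    FROM read_parquet(
        's3://acme-salesforce-backup/opportunity/snapshot=*/*.parquet',
        hive_partitioning = true)
    WHERE snapshot IN ('2021-06-01', '2020-06-01')  -- current vs. year-ago
    GROUP BY snapshot
"""
print(con.execute(query).fetchdf())
```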
Concluding thoughts
GRAX’s vision isn’t a one-size-fits-all prescription for cloud data access. Right now, GRAX focuses exclusively on Salesforce as a cloud data source and supports only a handful of analytic targets (Amazon Redshift, Amazon QuickSight, Microsoft Power BI, Salesforce Tableau, and Snowflake). Nevertheless, GRAX’s solution constitutes a clever, pragmatic response to several exigent problems that organizations will encounter as a cost of doing business in the cloud.
These are:
SaaS backup and recovery. SaaS customers are responsible for backing up and restoring their own user data. Today, this is a responsibility that many organizations tend to view as a minor annoyance.
Problems with API-based data access. In the data-access status quo, hundreds or even thousands of human and machine consumers connect to API endpoints exposed by SaaS providers to access data. Providers limit the rate or the intervals at which subscribers can call and request data. Not all consumers require real-time data; however, consumers that do can be starved of data by concurrent requests from consumers that don’t. This ad hoc model is ungoverned, costly, and unsustainable.
Data history is essential grist for ML/AI, data mining, and core business analytics. Most organizations likely preserve at least a portion of their overall Salesforce data history. They integrate data from Salesforce into a data warehouse, into sales data marts, etc. But data warehouses and data marts ingest a relatively small fraction of raw Salesforce data. The rest is lost forever. The problem is that this data, though potentially not of use today, could become useful in the future: the chaff becomes wheat, so to speak. This is especially true for data mining, ML/AI, and similar advanced analytics practices, which sift through raw data to discover heretofore unknown patterns, signatures, etc.
Salesforce data history is preserved as a searchable, queryable archive. GRAX’s Time Machine creates the equivalent of a massive time-variant repository of all Salesforce data. If this sounds like a Salesforce-specific data mart, that’s because – in effect – it is. Consumers can use their preferred tools to explore, discover, query, analyze, etc. the historical Salesforce data preserved in this archive.
GRAX’s solution is vendor- and data architecture-neutral. Historical Salesforce data gets persisted into the customer’s own cloud and stored in an open format (Parquet), so it works without GRAX. The Salesforce archive can populate a downstream data lake or be exposed – in situ – via data fabric / mesh architecture. The Salesforce archive can coexist with both a data lake and a data fabric / mesh.
[1] On its face, a data mesh architecture seems similar to a “data fabric,” and data fabric, for its part, seems similar to data virtualization. (This is no accident: think of data fabric as an evolved or higher species of data virtualization.) As if to compound the profusion (and confusion) of terms, data mesh architecture makes use of data virtualization, too. So, is it data virtualization – or data federation? – all the way down? How are we to distinguish between the different models? I think Lawrence Hecht put it aptly when he observed that data mesh and data fabric actually constitute two quite different responses to the data access and management problems that are byproducts of the primordial distributedness of data. (That is, useful data is always already distributed. Full stop.) As Hecht sees it, data mesh architecture anticipates and aims to accommodate (as inevitable) the tendency of organizations to change – sometimes radically. My own take is that data mesh is more than just a data architecture and much more than just a technology prescription: it is a different way of thinking about how an organization uses, manages, and governs its data. In this way, it has an ideological dimension that data fabric does not. Like a data fabric, a data mesh is decentralized by design: if you grok the basics of mesh networking, or the concept of a service mesh, you grok what a data mesh is and how it’s different. Unlike a data fabric, however, a data mesh architecture is also politically decentralized: it draws on the concepts of domain-driven design to reapportion responsibility for the ownership and provisioning (or “serving”) of data. In other words, data that is produced (and disproportionately consumed) by finance is owned and served by finance. Ditto for sales and marketing, HR, etc. Data no longer flows from the periphery to the center; rather, data resides in situ – that is, where it is produced – and gets accessed in situ. If a consumer in HR wants data that is produced and owned by finance, she goes to finance. If a business analyst wants data that is produced and owned by finance, sales, and/or logistics, she goes to these domains. In practice, she doesn’t actually “go” anywhere: data mesh, too, relies on data virtualization, metadata cataloging services, and other technologies (e.g., automated, ML-powered data discovery, profiling, and cleansing/preparation tools) to knit together a far-flung federation of data sources.
[2] This is not to presume that a central repository is the only possible solution to this problem. On its own, however, a data fabric/mesh does not address the API-based access and rate-limit problems described above. In the data fabric model, for example, a business analyst who needs data from Salesforce or Marketo could search for this data using a metadata catalog service and access it via a data virtualization layer. So far, so good. But data virtualization is an abstraction layer – a means of transparent data access – that connects data consumers directly to data producers. In this scenario, then, the business analyst is actually getting the data she needs from the upstream cloud provider – i.e., Salesforce or Marketo. The data virtualization layer just makes it easier for her to get this data. (As well as, optionally, to pre-cleanse and pre-engineer it.) A solution to this problem is to cache data from cloud services, and, in fact, data virtualization almost always uses caching to improve performance and to control for the vagaries of geography and network transport. At some point, however, this scenario starts to look a lot like those that we explore in this article. (For example, a data virtualization layer that caches a real-time feed of cloud data is broadly similar to a model in which cloud data gets ingested into a central repository.) The salient point is that – irrespective of the dogmatics of data-architectural formalists – something like a data fabric is not inimical to a data architecture that includes a data lake and/or data warehouse. In practice, the hypothetical models explored in this article could – and probably would – incorporate data fabric-like tools.
[3] The scale changes, too. At a low level, object storage is optimized for BLOB storage – i.e., large objects (e.g., a multi-gigabyte data set; a virtual machine disk image), as distinct from multiple small files or (as with a database) individual records. So, for example, a GET request would grab the entire contents of a data set or compressed file – potentially consisting of several gigabytes of data. Data lake services, which sit on top of object storage, tend to offer more granular access; for this reason, most providers use different billing metrics, e.g., total volume of storage, per-transaction (i.e., read, write, or delete operation), etc. As always, the devil is in the details.
With both cloud object storage and the cloud data lake, the primary cost concern also shifts from API rate-limit overage charges to data egress charges. To mitigate this, subscribers could deploy a hybrid architecture – with a local (on-premises) data cache – to minimize these charges. Another option is to exploit intra-cloud analytic facilities: e.g., Amazon EC2, Glue, Redshift, etc. running in the context of AWS.
About Stephen Swoyer
Stephen Swoyer is a technology writer with more than 25 years of experience. His writing has focused on data engineering, data warehousing, and analytics for almost two decades. He also enjoys writing about software development and software architecture – or about technology architecture of any kind, for that matter. He remains fascinated by the people and process issues that combine to confound the best-of-all-possible-worlds expectations of product designers, marketing people, and even many technologists. Swoyer is a recovering philosopher, with an abiding focus on ethics, philosophy of science, and the history of ideas. He venerates Miles Davis’ Agharta as one of the twentieth century’s greatest masterworks, believes that the first Return to Forever album belongs on every turntable platter everywhere, and insists that Sweetheart of the Rodeo is the best damn record the Byrds ever cut.