by Stephen Swoyer
The premise of the data warehouse in the cloud is hard to resist. For one thing, it promises to reduce (or to eliminate) the costs associated with acquiring, operating, upgrading, and maintaining the hardware and software used to power the data warehouse. For another, the cloud data warehouse lowers the cost-of-entry for access to technologies such as massively parallel processing (MPP) database software. Lastly, it simplifies the way that data warehouse systems are sized, configured, and deployed, making it possible to create and destroy capacity as needed. What’s not to like?
A recent episode of DM Radio, a syndicated weekly show that focuses on data management, explored this question at some length. During one memorable segment, host Eric Kavanagh and guest Chris Twogood, senior vice president of marketing with Teradata Corp., discussed the difficulty of assessing, much less of predicting, the price/performance of data warehouse services in the cloud.
In practice, Kavanagh argued, the ostensible transparency of cloud pricing – e.g., cloud providers usually publish the per-second, per-minute, or per-hour cost of using their services – is belied by the opacity of cloud price/performance, the cost of moving data out of the cloud, and other charges. The problem is that virtual cloud infrastructure – and virtual cloud storage, in particular – does not behave in the same way that non-virtualized infrastructure behaves. This has implications for data warehousing workloads, Kavanagh noted. In his own remarks, Teradata’s Twogood expanded on Kavanagh’s claim.
“You cannot … say, ‘I want to run a query in a cloud data warehouse and here’s how much it’s going to absolutely cost me.’ That’s a great place to get to … but it all depends on how many other workloads are hitting that same environment, it depends on what [database] software you’re using,” he argued.
“[Customers] might start out with a … demo, and say, ‘Oh … that’s going to be my price …’ and then they start adding users, they start adding concurrency, they start adding more complex queries, they start adding [extra MPP nodes], and next thing you know their cost is four or five or six or seven times what it originally was because the [cloud database] software was not designed for limited resources.”
Twogood has a point, although not exactly for the reasons he describes.
But Twogood is also a product marketing specialist with a vendor that sells cloud data warehouse services. Obviously, Teradata has its own unique story to tell about the data warehouse in the cloud – as do its competitors. (We explore what these messages are and how to interpret them in a companion piece.) Without going too far into the weeds, let’s stipulate that some providers question the cloud-worthiness of competitive platform-as-a-service (PaaS) data warehouses, which, they say, do not make meaningful or authentic use of cloud features such as elasticity or ease-of-use; alternatively, other providers, such as Teradata, question the scalability of their PaaS data warehouse competitors, because, they say, not all PaaS data warehouses are as scalable, or efficient, as others.
Virtual cloud infrastructure, explained
Customers may have the sense that, pound-for-pound, cloud infrastructure is neither as fast nor as efficient as on-premises infrastructure. They might likewise understand that this performance penalty is a function of virtualization – that is, the use of software to abstract physical resources, such that (for example) a single physical microprocessor with 24 cores and 48 threads is virtualized as 48 logical processors – and of the vicissitudes of the cloud hosting model. In practice, however, customers tend to view the public cloud as a practically inexhaustible resource. This fosters the belief that – even if virtual cloud resources are not as performant as physical resources – it is nonetheless possible to provision sufficient virtual resources to more than offset the cloud performance penalty.
Twogood addressed this perception during his interview with DM Radio. “One of the challenges you have in the cloud … you can get the performance. You can scale out with that elasticity, but what does that cost? Because you maybe get the performance, but if you have software that relies on the unlimited resources of the cloud, when elasticity goes up, guess what goes up also? Your cost. You’ve got to make sure that you’ve got a foundation that really manages with limited resources to deliver the best price/performance.”
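Twogood's point about elasticity and cost can be illustrated with simple arithmetic. The sketch below assumes a flat per-node hourly rate, which is a simplification; the function name, the rate, the node counts, and the hours are all hypothetical figures chosen for illustration, not any provider's actual pricing.

```python
def monthly_cost(nodes: int, hourly_rate_per_node: float, hours: float) -> float:
    """Cost of running `nodes` nodes for `hours` hours at a flat per-node rate.

    Assumption: billing is linear in nodes and hours. Real cloud pricing
    often adds storage, data egress, and other charges on top of this.
    """
    return nodes * hourly_rate_per_node * hours


# Hypothetical demo-sized deployment: 4 nodes, 200 hours per month.
baseline = monthly_cost(nodes=4, hourly_rate_per_node=2.0, hours=200)

# Hypothetical production deployment: more users and heavier queries
# lead to 16 nodes running 300 hours per month.
scaled = monthly_cost(nodes=16, hourly_rate_per_node=2.0, hours=300)

print(f"baseline: ${baseline:,.2f}")
print(f"scaled:   ${scaled:,.2f}")
print(f"ratio:    {scaled / baseline:.1f}x")
```

Under these assumed figures, quadrupling the node count and running the cluster half again as long multiplies the bill by six, which falls within the range Twogood describes.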
To appreciate the implications of Twogood’s claims about PaaS data warehouse performance, it is necessary to understand what is new or different about the physical infrastructure that undergirds the virtualized infrastructure on which the PaaS data warehouse runs. As Han Solo himself might put it, scaling a data warehouse in the cloud, like traveling through hyperspace, ain’t like dusting crops.
In the PaaS public cloud, a virtual processor is mostly analogous to a virtual processor as instantiated in, e.g., an on-premises enterprise private cloud. But PaaS cloud storage is different from the virtual storage pools that an IT organization might create in an on-prem enterprise private cloud. More important, neither of these virtual resources is comparable to the high-speed, low-latency storage that an MPP database expects to use in order to access data in an on-premises MPP data warehouse cluster.
All PaaS data warehouse services are hosted by hyperscale public cloud providers such as Alibaba Cloud, Amazon, Google Inc., IBM, Microsoft, Oracle, etc.; in practice, then, the PaaS data warehouse is a hosted service that layers on top of virtualized cloud infrastructure. Hyperscale providers license infrastructure capacity to cloud PaaS providers, who, in turn, license their own services to subscribers.
Even though hyperscale providers usually offer different types of storage, most PaaS data warehouses run on top of an object storage tier, similar to Amazon S3. Object storage has a few advantages, starting with its cost: it is cheaper than alternatives (such as block storage) and it offers relatively fast sequential read and write performance. One obvious problem with this is that many types of data warehouse workloads (such as ad hoc queries) involve non-sequential, that is, random, data access.
A bigger problem has to do with how this storage is accessed. In an on-premises MPP cluster, each node connects to its storage via a high-speed/low-latency interconnect such as InfiniBand. In the cloud, MPP nodes usually connect to object storage via TCP/IP and Ethernet; in a hyperscale hosting context – e.g., Amazon Web Services, Google Cloud, IBM Cloud, Microsoft Azure, Oracle Cloud – these links might run at 25Gb/s or greater. This permits fast throughput (up to 3GB/s), albeit at the cost of the higher latencies associated with Ethernet. So on the one hand, object storage has much higher latency than non-virtualized (direct-attached) storage, or, for that matter, other cloud storage services, such as block storage (e.g., Amazon EBS). On the other hand, object storage is less consistent – its performance less predictable – than direct-attached storage, or cloud block storage, for that matter.
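The relationship between link speed, throughput, and latency described above can be sketched in a few lines. The conversion from 25Gb/s to roughly 3GB/s is plain arithmetic (eight bits per byte). The latency figures, the request size, and the function names are assumptions chosen only to show why latency, rather than bandwidth, dominates when many small random reads are issued one after another.

```python
def gbits_to_gbytes_per_s(gbits_per_s: float) -> float:
    """Convert a link rate in gigabits per second to gigabytes per second."""
    return gbits_per_s / 8.0


def serial_read_time_s(num_requests: int, bytes_per_request: int,
                       latency_s: float, bandwidth_bytes_per_s: float) -> float:
    """Time for requests issued one at a time.

    Each request pays a fixed latency plus the time to transfer its payload.
    """
    per_request = latency_s + bytes_per_request / bandwidth_bytes_per_s
    return num_requests * per_request


link_gb_per_s = gbits_to_gbytes_per_s(25)   # 3.125 GB/s on a 25Gb/s link
bandwidth = link_gb_per_s * 1e9             # bytes per second

# Assumed latencies, for illustration only:
OBJECT_STORE_LATENCY_S = 0.020     # 20 ms per request (hypothetical)
DIRECT_ATTACHED_LATENCY_S = 0.0001  # 0.1 ms per request (hypothetical)

# 10,000 random reads of 64 KiB each, same bandwidth in both cases.
object_s = serial_read_time_s(10_000, 64 * 1024, OBJECT_STORE_LATENCY_S, bandwidth)
direct_s = serial_read_time_s(10_000, 64 * 1024, DIRECT_ATTACHED_LATENCY_S, bandwidth)

print(f"link throughput: {link_gb_per_s} GB/s")
print(f"object storage:  {object_s:.1f} s")
print(f"direct-attached: {direct_s:.1f} s")
```

With identical bandwidth, the assumed higher-latency path takes far longer for the same random reads. In practice, databases issue many requests in parallel, which hides some of this penalty, so the sketch overstates the gap for well-tuned systems.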
Why does this matter? The short answer is that an MPP database needs reliable access to data to work correctly – that is, to break up and distribute a workload across multiple nodes and, in so doing, to reduce the time it takes for the database to perform the operations prescribed by that workload. But workloads are not just broken up and distributed across the individual nodes that constitute the MPP cluster; they are also distributed across the multiple processors, or workers, that populate each node. In this context, a slow worker, starved for data, can trigger the equivalent of a dependency trainwreck: other parallel jobs grind to a halt, waiting for the slow worker to finish its task. This is why predictability – usually expressed as a function of low latency – is so important in databases and parallel computing systems.
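The effect of a single slow worker can be shown with a minimal model. The sketch assumes a parallel step that cannot complete until every worker has finished (a barrier), so the step takes as long as its slowest worker. The worker timings and function name are hypothetical.

```python
def parallel_step_time(worker_times: list[float]) -> float:
    """A step with a barrier finishes only when its slowest worker does."""
    return max(worker_times)


# Eight workers, each taking 1 second (hypothetical timings).
balanced = [1.0] * 8

# Same step, but one worker is starved for data and takes 5 seconds.
with_straggler = [1.0] * 7 + [5.0]

step_time = parallel_step_time(with_straggler)

# Fraction of total worker-time spent doing useful work; the rest is idle waiting.
utilization = sum(with_straggler) / (len(with_straggler) * step_time)

print(parallel_step_time(balanced))   # 1.0
print(step_time)                      # 5.0
print(f"utilization: {utilization:.0%}")
```

In this model, one delayed worker makes the whole step five times slower, and the other seven workers sit idle for most of it. This is the sense in which unpredictable storage latency undermines parallel execution.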
Cloud providers address this in different ways. For example, many now offer on-premises “public-private” clouds: that is, hardware and software bundles, delivered either in a multi-rack appliance form-factor or in the form of a shipping container, that effectively bring the PaaS data warehouse to the enterprise data center. But these SKUs are priced at a premium relative to each provider’s basic cloud service. In the public cloud proper, PaaS providers can usually do a few things to offset the unpredictability of object storage. Some, including Teradata, use block-level storage, which, again, helps reduce latency. And most PaaS providers also use caching (in-memory, NVMe, SSD) to improve performance for common reports, queries, etc. In practice, this is sufficient for a majority of workloads.
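One way to reason about why caching helps for common queries is the standard weighted-average model of access latency: requests served from the cache pay the cache's latency, and the rest pay the backing store's. The sketch below uses that model; the hit ratios and latency figures are hypothetical, and real caches behave less uniformly than this.

```python
def effective_latency_ms(hit_ratio: float, cache_latency_ms: float,
                         backing_latency_ms: float) -> float:
    """Average latency per request given a cache hit ratio between 0 and 1."""
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    return hit_ratio * cache_latency_ms + (1.0 - hit_ratio) * backing_latency_ms


# Assumed figures, for illustration only:
CACHE_MS = 0.1          # local NVMe or memory cache (hypothetical)
OBJECT_STORE_MS = 20.0  # object storage round trip (hypothetical)

for ratio in (0.0, 0.5, 0.9, 0.99):
    avg = effective_latency_ms(ratio, CACHE_MS, OBJECT_STORE_MS)
    print(f"hit ratio {ratio:.2f}: {avg:.2f} ms average")
```

Under these assumptions, a 90% hit ratio brings the average down to about 2 ms. The model also shows the limit of the approach: an ad hoc query that touches data not yet in the cache still pays the full latency of the backing store.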
How a PaaS data warehouse deals with latency, especially, determines how well it will perform in the cloud context. But its ability to wring extra performance out of virtualized processors, virtual memory, and high-speed virtual interconnects is also critical. What is most frustrating is that – notwithstanding all of these remediations – there are practical limits with respect to what PaaS providers can do.
And this gets at something that is implicit in what Twogood said: namely, that some on-premises workloads just will not be able to move to the public cloud; that, at this point – as a function of the features, constraints, and (yes) vicissitudes of public cloud infrastructure – the PaaS data warehouse cannot scale to support the most demanding workloads. Even though Twogood himself would not go so far as to say this, it is implicit in what he did say.
Predicting and assessing price/performance in the cloud
With this as background, what conclusions can we draw about the scalability and price/performance of cloud data warehouse services? Now as ever, the answer is a familiar one: it depends. Which is to say: it depends on what would-be subscribers want to do, on the types and varieties of the workloads they want to run, on the complexity of their queries, and on the number of simultaneous users (i.e., “concurrency”) that they want to host – both on average and during periods of peak demand.
It depends on other things, too, such as the design and complexity of their ETL/ELT processes, on the extent to which their data warehouse system is interpenetrated with their existing (on- or off-premises) business processes, and, of course, on the size or scale of the cloud data warehouse system itself.
At a basic level, it depends (a) on whether a customer is completely new to data warehousing or, conversely (b) has an existing data warehouse. It depends, as well, on (c) what this customer plans to do with this existing data warehouse once it spins up its new data warehouse system in the cloud.
For example, a cloud subscriber that is mostly new to data warehousing enjoys a degree of flexibility that a subscriber which has an existing data warehouse system does not. In the same way, an organization that intends to extend its on-premises data warehouse to the cloud – creating a hybrid data warehouse – or which expects to distribute certain types of data warehouse workloads across two or more cloud services (a hybrid-multi-cloud data warehouse) has considerably more flexibility than an organization that plans to migrate, in toto, its existing data warehouse to a single cloud service.
One common hybrid data warehouse scenario involves shifting specific workloads – typically, test-dev, disaster recovery, and analytic discovery – to the cloud context. An organization that employs a hybrid-multi-cloud scenario might seek to complement its on-premises data warehouse system by exploiting the desirable features or capabilities of two or more PaaS data warehouses. These might include inexpensive on-demand capacity – useful not only for analytic discovery, but for data scientists, machine learning (ML) engineers, data engineers, and other technicians who design pipelines that entail scheduling distributed data processing and data movement operations – or integration with cloud-adjacent software development, data integration, ML, artificial intelligence (AI), etc. services.
Extending the data warehouse in this way does not preclude moving a large share or even a majority of on-premises workloads to the cloud, with the result that, over time, the PaaS data warehouse could draw the on-premises data warehouse (along with a constellation of multi-cloud data warehouse resources) into its orbit. Today, however, the cloud data warehouse is neither an ideal nor, in some cases, a practical destination for all on-premises workloads. On the one hand, moving certain on-premises workloads could require that subscribers over-provision their cloud data warehouses, negating some or all of the cost benefits of the cloud model. On the other hand, organizations with very large and complex data warehouse implementations – with sizes in the hundreds-of-terabytes to multi-petabyte range, for example – will find it difficult to move these systems, in toto, to the cloud.
Lastly, some workloads still cannot move to the cloud for statutory or regulatory reasons.
Data warehousing luminary Bill Inmon spoke to this issue on the same episode of DM Radio. “Some organizations … for various and sundry reasons can’t put their data on the cloud. Military, medical information are two [verticals] that come to mind … that they simply don’t let that data out of their domain, and in some cases there are legal reasons why they have to do that,” Inmon commented, acknowledging that this “represents only a small fraction of the organizations that are out there.”
Inmon also hit on one of the most important factors that is spurring data warehouse migration – or, put differently, extension – to the cloud: its appeal to and usefulness for the line of business. Rather than reducing costs, the data warehouse in the cloud has the potential to enfranchise the lines of business.
“I think underlying the movement of data warehouses to the cloud is the fact that data warehouses when implemented by the IT dept have not been implemented very well and over and over we see the control of data and the data warehouse go to the different functional organizations: marketing, finance, sales, and organizations like that,” he noted. “I think that one of the big drivers of moving information to the cloud is the fact that it’s easier for marketing, finance, and other [functional areas of the] organization to take control of their own data rather than have … IT … build and manage this data.”
Like it or not, the on-premises data warehouse will be with us for a long time to come. And this isn’t such a bad thing. A hybrid on-premises + cloud data warehouse deployment permits an organization to distribute use cases and workloads – as well as costs – on a more or less rational basis.
True, some costly workloads might not be able to move to the cloud: not today, not next year, and maybe not in half-a-decade’s time. But cloud infrastructure will continue to improve, and cloud providers will continue to innovate to offset the performance penalty that is, in a special sense, a feature – not a bug! – of the cloud model. Pragmatic, forward-thinking organizations grok this. Do you?