Storytelling in the Cloud: The PaaS Data Warehouse Wars Reconsidered

In an interview earlier this year with DM Radio, a weekly radio show that focuses on data management, Teradata senior vice-president of marketing Chris Twogood argued that it is difficult to predict the performance of real-world data warehouse workloads in the cloud. Because of this, Twogood claimed, customers could rack up unexpectedly large charges as they ramp up their cloud data warehouse services to match or exceed the performance of their on-premises systems.

Twogood may have a point. But his status as a product marketing spokesperson with a large cloud data warehousing provider complicates this point. Simply put, Twogood and Teradata have a story to tell about the data warehouse in the cloud. So do Teradata’s competitors. It stands to reason that Teradata’s story will put its own cloud data warehouse service in the best possible light – and that its competitors’ stories will be similarly self-serving. So – as regards Twogood’s comments on DM Radio – why is Teradata telling this particular story? Why not some other story? To answer these questions, it is useful to compare and contrast Teradata’s story with the stories of its competitors.

It is illustrative, too. Between them, Teradata and one of its competitors – Snowflake – have staked out positions at opposite ends of the cloud data warehouse market. The thing is, neither vendor’s story contradicts the other’s, exactly; rather, each story gives emphasis to something different.

To understand this difference is to appreciate what’s at stake in the cloud data warehouse market – and in the red-hot platform-as-a-service (PaaS) data warehouse market, especially.

Storytelling in the cloud: Teradata

Not surprisingly, Twogood’s claims align neatly with Teradata’s messaging in the on-premises data center, where Teradata continues to resell its own branded compute and storage hardware, to develop its own firmware for mainboards, add-in cards, etc., and to optimize its RDBMS engine for instructions and features that are specific to Intel’s Xeon microprocessors. The company says all of this improves the efficiency of the Teradata Database – i.e., its ability to exploit available compute, storage, and interconnect resources, as well as on-chip parallelism, various levels of cache, and the special features of Intel CPUs and chipsets.

With this in mind, Twogood and Teradata argue that even though cloud infrastructure is comparatively cheap, it isn’t free. The subtext of this argument is the claim that virtualized cloud infrastructure is less efficient, and, hence, less performant, than non-virtualized infrastructure. According to their reasoning, a company that plans to migrate or extend an on-premises data warehouse system to cloud infrastructure will require more compute, storage, and other resources to offset this virtualization penalty. (See the accompanying article for more on this.) Another consideration is that the size of a single-instance MPP data warehouse is usually capped – even in the cloud. (Teradata Vantage in the Cloud, for example, is capped at 128 nodes for a single data warehouse cluster.) Looked at this way, Twogood’s claim is two-fold: first, he is saying that shifting certain types of workloads to cloud infrastructure could result in unexpected charges that surpass the cost of running these same workloads in a non-virtualized on-premises data warehouse; second, he is saying that because the size (in nodes) of a single-system cloud data warehouse instance is constrained by logical (i.e., a maximum cluster size) or practical (diminishing performance when adding MPP nodes) factors, it may not be possible to migrate especially demanding workloads to just any cloud data warehouse service. As a function of the inefficiency of the service and/or the constraints of software-defined cloud infrastructure, subscribers might not be able to provision sufficient cloud resources to process these workloads.

Twogood makes a third claim, too: namely, that because Teradata Database and its associated software services are designed to do more with less, workloads that do move to its cloud PaaS will, in the main, perform better. On the one hand, this means Teradata Vantage in the Cloud should handle demanding or complex workloads that other PaaS data warehouses cannot; on the other hand, it means that, on average, Teradata Vantage in the Cloud should require fewer resources to process workloads. Teradata claims that this makes its pricing model more straightforward – that is, because customers get the performance they pay for, they are less likely to be surprised by unexpected charges. There is a flip side to this claim, however; Teradata is not necessarily being altruistic: after all, Teradata Vantage in the Cloud is priced at a premium relative to other PaaS data warehouse offerings, so if Teradata’s PaaS is doing more work with fewer resources, customers are also paying more for each of these resources.
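The trade-off described above can be made concrete with a little arithmetic. The following is a minimal sketch, not a pricing model for any real service: the function names, node counts, and per-node hourly rates are all hypothetical, chosen only to show that a premium-priced service comes out cheaper only when its efficiency gain outweighs its price premium.

```python
def hourly_cost(nodes: int, rate_per_node_hour: float) -> float:
    """Cost of running a cluster of `nodes` nodes for one hour."""
    return nodes * rate_per_node_hour


def premium_is_cheaper(
    premium_nodes: int,
    premium_rate: float,
    baseline_nodes: int,
    baseline_rate: float,
) -> bool:
    """True if the premium-priced cluster costs less per hour than the baseline."""
    return hourly_cost(premium_nodes, premium_rate) < hourly_cost(
        baseline_nodes, baseline_rate
    )


if __name__ == "__main__":
    # Hypothetical figures: a baseline service needs 10 nodes at 2.00 per
    # node-hour; a premium service charges 3.00 per node-hour.
    baseline_nodes, baseline_rate = 10, 2.00
    premium_rate = 3.00
    for premium_nodes in (6, 7):
        cheaper = premium_is_cheaper(
            premium_nodes, premium_rate, baseline_nodes, baseline_rate
        )
        print(premium_nodes, hourly_cost(premium_nodes, premium_rate), cheaper)
```

With these made-up numbers, the premium service has to handle the same workload on six or fewer nodes to come out ahead; at seven nodes, the price premium cancels out the efficiency gain. A customer evaluating such claims would need its own measured node counts and quoted rates.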

Storytelling in the cloud: Snowflake

Teradata’s messaging – and Twogood’s framing – is also likely an attempt to counter the messaging of its competitors, especially that of Snowflake, which positions itself as the first cloud-native data warehousing platform. Consistent with this, Snowflake positions competitors such as Amazon^[1], IBM, Microsoft, Oracle, and SAP – in addition to Teradata – as “legacy,” cloud-come-lately players. According to Snowflake, these providers essentially retrofitted components of their existing on-premises data warehouse software for use in the platform-as-a-service (PaaS) cloud. Ironically, to the extent that Twogood invokes the pedigree of Teradata’s on-premises software assets to argue for the superior efficiency and performance of its PaaS data warehouse, he doesn’t exactly contradict Snowflake’s claim.

One of Snowflake’s arguments is that cloud infrastructure is wholly different from non-converged on-premises infrastructure, and that software which is designed for one context (e.g., the non-converged data center) will neither behave nor perform the same way in a radically different context (converged cloud infrastructure). Its marketing tends to give less priority to the importance of resource efficiency or performance in the cloud – this seems to be taken for granted – and, instead, emphasizes Snowflake’s ability to exploit the features that are unique to the PaaS cloud, and which are also integral to the technical, economic, logistical, etc. benefits of cloud infrastructure. One example is the resource abstraction that makes it possible to create, pause, resume, and/or destroy virtual data warehouse instances, and to grow or shrink the size of an MPP data warehouse cluster without impacting its quality of service.

Snowflake’s story is that its cloud-first pedigree gives it a distinct advantage relative to its competitors. It says its service is easier to get started with, easier to use, easier to grow, easier to develop for, and easier to manage than competitive offerings – and, not least, cheaper, too.

These are two very different stories. Again, both Teradata and Snowflake seem to have staked out positions at opposite ends of the cloud data warehousing space. And, again, neither provider’s story contradicts the other’s; each just gives emphasis to something different.

Teradata is not saying that (e.g.) the elasticity and ease-of-use features associated with cloud do not matter; it is saying that utilization and efficiency matter as much in the cloud as they do in the on-premises data center. It is likewise drawing attention to the limitations and vicissitudes of virtualized cloud infrastructure. It claims that its emphasis on efficiency gives it an advantage in this regard. Snowflake is not saying that utilization and efficiency do not matter in the cloud; it is saying that software should not behave the same way in the cloud as it does in the data center, and that, in view of the unique characteristics of the cloud, its software gives it an advantage.

By dint of what it does not say, Teradata’s messaging seems to acknowledge the perceived cost advantages of competitive offerings; in the same way, Snowflake’s messaging seems to acknowledge the technological gravitas that some of its competitors might bring to bear in the cloud data warehousing space. You don’t talk about unexpected charges if you aren’t battling perceptions of high cost; you don’t talk about “legacy” vendors unless you’re trying to deflect criticism that your own database might still be maturing. There’s a sense, too, in which both vendors are erecting strawmen: Compared to Teradata, Snowflake looks like a less mature database product; however, it is in no sense a profligate custodian of cloud resources. Compared to Snowflake, Teradata looks like a newcomer to the cloud; however, its PaaS data warehouse is in no sense a retrofitting or retread of its on-premises software for the cloud.

First things first: the PaaS data warehouse is not a retread of the on-premises data warehouse

For the record, Twogood cannot be saying that Teradata’s on-premises data warehouse stack performs more efficiently in the cloud – for the simple reason that Teradata Vantage in the Cloud is not a port of this stack. Rather, Twogood is saying that – as Teradata redesigned and reimplemented its database engine (and other essential software) as cloud-native services – it also applied its traditional expertise in wringing efficient performance out of available resources – in this case, virtualized cloud resources. So far as redesigning-and-reimplementing-for-cloud goes, Teradata is by no means alone. Amazon, IBM, Microsoft, Oracle, SAP, and other cloud data warehouse providers all market PaaS data warehouse SKUs, too.

By definition, the PaaS data warehouse is not identical to an on-premises data warehouse system: it is a hosted service, managed by the cloud provider, which aims to abstract the complexity – that is, the “nuts and bolts” – involved in designing, deploying, managing, and maintaining a data warehouse system. To its credit, Snowflake pioneered this model, delivering a PaaS data warehouse that exposes guided, self-service facilities which simplify (and, where practicable, automate) common tasks. The PaaS data warehouse likewise aims to automate non-trivial tasks, such as changing the size of a cloud data warehouse system, or creating (and destroying) virtual data warehouse instances. Lastly, cloud infrastructure tends to be elastic in the sense that it is possible to expand or contract compute, storage, and network resources independently of one another.

Loose coupling – i.e., the decomposition of core software functions into (more or less primitive) services that are independently instantiated and which communicate and exchange data with one another via APIs – is the condition of the possibility for elasticity. Traditionally, elasticity was not associated with on-premises MPP databases (such as Teradata Database), which used to emphasize tight coupling between compute and storage.^[2]

The new battlegrounds in the cloud

At this point, cloud providers can dispute which PaaS offerings are PaaS-ier – i.e., easier to use, easier to operate, easier to develop for; faster to spin up, pause, and/or resume virtual instances; faster (that is, with minimal disruption to service) when changing the configuration of an MPP database cluster; “smarter” about automating tasks, about scheduling and managing workloads, about managing concurrency; and so on. They cannot argue that their competitors do not market cloud-worthy PaaS data warehouses, however. To do so is disingenuous marketing.

In the same way, cloud providers can disagree about which PaaS data warehouses perform best – be it in general or for specific use cases, workloads, scenarios, etc. – given the known constraints and vicissitudes of cloud infrastructure. The cloud-worthiness of a PaaS data warehouse is one factor in this calculus: e.g., its ability rapidly to grow its size (or to provision ephemeral capacity) in response to changing conditions or emergent demands. Other factors include its ability to scale to support highly complex queries, to host dozens or even hundreds of simultaneous (i.e., concurrent) users, and to host a wide mix of workloads of different types (e.g., reporting, ad hoc query, machine learning processing, graph traversal, etc.), as well as, what is just as important, its ability to manage these workloads by giving priority to privileged users, use cases, jobs, etc.

These capabilities will matter to different customers for different reasons. The thing for customers to keep in mind is that each PaaS data warehouse vendor is trying to tell the best – the most flattering – story it possibly can. These stories are useful, so far as they go; as this article demonstrates, however, they’re perhaps most useful as indicia of that which vendors do not address, i.e., what they give short shrift to, the issues they seek to obscure or elide.

Listening for the tell-tale sound of silence

At one point in Cormac McCarthy’s Blood Meridian, Tobin, an ex-priest, explains to the novel’s protagonist – a 14-year-old man-boy whom McCarthy calls “the kid” – that it is possible to perceive something without explicitly seeing, hearing, touching, or smelling it. To illustrate this claim, Tobin uses an example he reckons will make intuitive sense to a group of men who’ve spent most of the novel hunting or being hunted by other men: the uncanny sound of silence.

“At night … when the horses are grazing and the company is asleep, who hears them grazing?” he asks the kid. “Dont nobody hear them if theyre asleep,” scoffs the kid, who has little tolerance for foolish questions. All of this seems obvious enough. But Tobin, like Socrates, is just setting up his elenchic trap. “Aye. And if they cease their grazing who is it that wakes?” he asks.

“Every man,” says the kid, beginning to understand. “Aye,” Tobin avers: “Every man.”

The point of this excursus is that it is important to listen for what sales and marketing people do not say about products – their own and those of their competitors –: to take note of the topics they ignore, avoid, or downplay. To listen, in other words, for the tell-tale sound of silence.

Sometimes this reticence can reasonably be attributed to the vendor’s worldview: Snowflake is not just a cloud PaaS vendor, but the pioneer cloud PaaS vendor; its marketing reflects this. Snowflake might have less to say about the specifics of running mixed workloads in its PaaS data warehouse because its customers don’t see this as their biggest concern. Teradata is not just a 40-year-old company, but one known for tuning analytic databases for efficiency; its marketing reflects this. Teradata might have less to say about ease-of-getting-started, ease-of-use, and similar topics because its customers don’t see these as their biggest concerns.

But is that all there is to it? Again, this is not to single out Snowflake or Teradata; both vendors have staked out positions at opposite ends of the cloud data warehousing space. To analyze their messaging is to glimpse – as though through binoculars – a stereoscopic view of the entire PaaS data warehouse field. Amazon, Google, IBM, Microsoft, Oracle, SAP, and other providers occupy positions scattered across the broad in-between that is circumscribed by this field.

With respect to each of these vendors, reticence isn’t just a function of what is not said – i.e., of gaps, omissions, or elisions in their messaging – but of the aspects or facets of topics that a vendor and its spokespeople prefer to downplay. For example, a lot of vendors tend to talk about performance in the cloud without going into concrete specifics. They might cite the practical inexhaustibility of cloud resources, or the proven scalability of MPP databases, or the ease with which a subscriber can provision additional MPP database nodes – all without discussing (for example) the average number of concurrent users they support in their largest customer deployments, or the number of concurrent users these large customers host during periods of peak demand. Neither the scale of the cloud nor the theoretical power of MPP – nor, moreover, the ease of provisioning extra MPP nodes – matters if the database engine at the core of a PaaS data warehouse cannot actually scale to make use of this stuff.

For existing customers, a vendor might position its PaaS data warehouse as the most convenient and most logical migration path to the cloud – especially if it knows that a customer is already a licensee of its other products and services. (This is a powerful enticement in cases in which customers depend on vendor-specific operating system, application, database, middleware, and software development stacks.) However, this same vendor might have much less to say about the relationship it heretofore has had with that same customer, about the type of relationship it expects to have with the customer in the cloud, or about alternatives to using its own infrastructure to host its PaaS data warehouse. (Don’t forget that IBM, Microsoft, and Oracle not only market on-premises operating system, database, application, and middleware stacks, but also operate their own hyperscale cloud infrastructure facilities.) Elsewhere, a vendor that touts the low cost of its PaaS data warehouse might have less to say about nitty-gritty details, such as the granularity with which customers can provision resources to scale that service. To cite one common example, some PaaS data warehouses support more granular units of scale – e.g., subscribers can add compute in increments of, say, just one node at a time – as against the larger increments (e.g., a doubling of nodes) mandated by other PaaSs.^[3]
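To see why the granularity of scaling matters, consider a minimal sketch that compares the two policies mentioned above: adding one node at a time versus doubling the cluster. The function names and node counts are hypothetical and do not describe any particular provider's service or API; the sketch only illustrates how coarse increments can force a subscriber to provision (and pay for) more nodes than a workload requires.

```python
def nodes_with_single_increments(current: int, required: int) -> int:
    """Cluster size when nodes can be added one at a time."""
    if current < 1 or required < 1:
        raise ValueError("node counts must be at least 1")
    return max(current, required)


def nodes_with_doubling(current: int, required: int) -> int:
    """Cluster size when capacity can only grow by doubling the node count."""
    if current < 1 or required < 1:
        raise ValueError("node counts must be at least 1")
    nodes = current
    while nodes < required:
        nodes *= 2  # each expansion step doubles the cluster
    return nodes


if __name__ == "__main__":
    # Hypothetical case: an 8-node cluster whose workload now needs 9 nodes.
    current, required = 8, 9
    granular = nodes_with_single_increments(current, required)
    doubled = nodes_with_doubling(current, required)
    print(granular, doubled, doubled - granular)
```

In this made-up case, the subscriber that can add single nodes grows from 8 to 9 nodes, while the subscriber restricted to doubling must grow from 8 to 16, paying for 7 nodes beyond what the workload needs. Whether that matters in practice depends on per-node pricing and on how long the extra capacity stays provisioned.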

A vendor that downplays the high upfront rates it charges for its own PaaS data warehouse service will naturally invoke the spectre of hidden costs in connection with competitive PaaS offerings. Fair enough. But this same vendor might also be ignoring what its customers actually want to do with the data warehouse in the cloud. A customer that uses a PaaS data warehouse to support the analytic-discovery and data-science use cases probably doesn’t care all that much about the potentially exorbitant cost of (for example) supporting high concurrency levels. If anything, it cares about the ease with which ephemeral virtual data warehouse instances can be created, the APIs they expose for access by non-traditional tools or interfaces, and the fact that – having been easily invoked and transparently accessed – they can easily be destroyed.

Lastly, a vendor that emphasizes its ecosystem of complementary services, or that trumpets the ease with which a subscriber can import or move data into its PaaS environment, or that talks up the convenience with which PaaS subscribers can exploit cloud services that are hosted inside or outside its ecosystem–: this same vendor might be less inclined to offer detailed specifics about the charges it assesses if or when a subscriber moves data away from – that is, outside of – this ecosystem. And it might have much less to say on the subject of interoperability between its own cloud services and those of third-party providers, even though this issue (in particular) is of more than passing interest to subscribers concerned about the possibility of cloud service-provider lock-in. Like it or not, this same issue has more than a little salience for the business-continuity and disaster-recovery-planning use cases, too.^[4]

The takeaway is that each PaaS vendor has its own story to tell. Each is happy to downplay, or to ignore, certain topics. This is a thing that vendors do. And it is the responsibility of customers to call them on it – especially when millions of dollars are on the line. Moreover, honest vendors should welcome questioning, critical or no. The willingness of a sales or marketing person to engage difficult questions without redirecting is also a good indication that they feel fundamentally comfortable about their company’s products and market position. They’re interested in promoting a relationship and managing customer expectations, not just closing a sale. I don’t know about you, but a vendor that feels fundamentally comfortable about stuff like this is a vendor that I would look forward to doing business with. Especially in the PaaS cloud.

^[1] At its inception, Amazon’s Redshift data warehouse service was based on technology Amazon acquired from the former ParAccel, a provider of on-premises data warehouse appliances.

^[2] The node is the basic building block of MPP scalability; you scale an MPP database by adding one or more new nodes. In an on-premises MPP database, the “node,” as such, denotes a fixed amount of compute and storage: neither can be scaled independently of the other. Even if a database needs more compute than storage, each new node must consist of a fixed complement of both resources. See footnote 3 (below) for more on this.

^[3] Sizing and scaling an MPP database is fundamentally different in the PaaS cloud than in the non-virtualized on-premises environment. In the on-premises environment, a DBA might size a system based on the complexity of the most demanding queries, on the sheer number of queries, or on a mix of these things. (Each of these is also a factor in determining the concurrency requirements of the system.) With respect to the on-premises data warehouse, a DBA takes account of these factors to determine the overall size and per-node characteristics of the MPP database cluster – i.e., the number of nodes, each of which has a fixed number of processors and a fixed share of the total storage volume of the data warehouse. The cloud effectively breaks this correspondence. In other words, a subscriber scales the performance of the PaaS data warehouse by adding compute resources irrespective of storage. Depending on the provider, the PaaS data warehouse exposes two mechanisms for scaling performance: (1) adding extra nodes and/or (2) configuring extra per-node compute.

Unlike an on-premises MPP database, the PaaS data warehouse makes it easier to provision extra nodes; however, not all PaaS data warehouses permit DBAs to adjust per-node compute capacity. (With some providers, configuring additional compute on a per-node basis may require upgrading to a more expensive tier or version of the PaaS service.) In practice, a DBA might want to add extra compute in order to manage an especially demanding workload (e.g., a highly complex SQL query); in the same way, a DBA might want to add one or more extra nodes to support a larger number of concurrent users. PaaS providers support and/or charge for these two use cases in different (more or less granular) ways. Some providers (Snowflake, for example) require a strict doubling of nodes to expand capacity.

^[4] Today, failover from a service hosted in one cloud infrastructure context to a service hosted in a different (third-party) cloud infrastructure context is the stuff of a wicked problem. This is thanks to a spate of factors, including: the comparative immaturity of the cloud infrastructure model, at least relative to non-virtualized on-premises infrastructure; the rapid evolution – fueled by competition between hyperscale providers – of cloud infrastructure services; and, not least, a two-fold problem that involves, on the one hand, a lack of useful open standards and, on the other, a lack of meaningful interest in open standards – especially on the part of the most powerful cloud infrastructure providers. The upshot is that business continuity in the context of the PaaS data warehouse could mainly involve planning for workloads to fail over from the on-premises environment to the cloud PaaS data warehouse – which, in fact, is already a popular use case – or from one hyperscale regional hosting center to another. This is acceptable for the purposes of business continuity, but is insufficient vis-à-vis the disaster recovery use case: after all, what happens if a hyperscale provider experiences a catastrophic, multi-region failure? In the same way, this scheme does nothing to address the problem of cloud service-provider lock-in.

Individual PaaS providers may have more compelling stories to tell, however: for example, some providers expose facilities that DBAs can use to “push” (i.e., shift) workloads from one cloud infrastructure context – for example, a Snowflake instance hosted in Amazon AWS – to another cloud infrastructure context, such as a Snowflake instance hosted in Microsoft Azure. (This is specific to providers that host their PaaS data warehouses in more than one hyperscale cloud environment.) Several providers that offer on-premises versions of their data warehouse software also support push options for the hybrid data warehouse: DBAs can push workloads running in an on-premises Oracle or Teradata data warehouse to each vendor’s PaaS, or vice-versa. On the other hand, failing over a generic cloud service running in one provider’s infrastructure (Amazon EMR) to a comparable service running in another provider’s infrastructure (Google Cloud Dataproc) is – for all intents and purposes – a roll-your-own solution. And pushing workloads between dissimilar PaaS services is no less a DIY problem: a subscriber that wants to push a workload that’s hosted in Oracle Autonomous Data Warehouse to Amazon Redshift or Azure Synapse Analytics must design its own software and services to support this use case. This is in spite of early and ongoing efforts (e.g., OpenStack) that aimed to permit portability of precisely this kind.

About Stephen Swoyer

Stephen Swoyer is a technology writer with more than 25 years of experience. His writing has focused on data engineering, data warehousing, and analytics for almost two decades. He also enjoys writing about software development and software architecture – or about technology architecture of any kind, for that matter. He remains fascinated by the people and process issues that combine to confound the best-of-all-possible-worlds expectations of product designers, marketing people, and even many technologists. Swoyer is a recovering philosopher, with an abiding focus on ethics, philosophy of science, and the history of ideas. He venerates Miles Davis’ Agharta as one of the twentieth century’s greatest masterworks, believes that the first Return to Forever album belongs on every turntable platter everywhere, and insists that Sweetheart of the Rodeo is the best damn record the Byrds ever cut.
