The Cloud Native Data Warehouse Comes into Focus

Too many vendors treat the term “cloud native” as little more than a marketing tool.

Data warehousing specialist Yellowbrick Data is a notable exception to this trend.

At the very least, Yellowbrick seems to grok the sense in which cloud native is its own thing: a radically different way of thinking about, designing, deploying, scaling, and maintaining software.

Yellowbrick’s recent moves indicate that it has made demonstrable progress in adapting its data warehouse platform to run in a cloud native context. This month, it introduced new support for core cloud native concepts and technologies, such as operating system-level virtualization and container orchestration. Separately, it announced Yellowbrick Manager, management software that it says permits data, workflow, workload, and traffic portability across different infrastructure contexts.

Yellowbrick promises that customers will be able to deploy its Andromeda PaaS data warehouse in Kubernetes (K8s), the open source container-orchestration platform that has emerged as a de facto standard for deploying and managing containers in cloud native software patterns.

Yellowbrick already offers Andromeda as a managed “private-cloud” appliance that can also be exposed – via private links – to services hosted in Amazon Web Services and Azure. But the company also plans to permit customers to deploy Andromeda on Kubernetes running in AWS and Azure. In other words, customers will now be able to run Andromeda in K8s – i.e., Andromeda-as-a-service – in both the on-premises environment and in the hyperscale public cloud. Separately, Yellowbrick said it aims to deliver its Yellowbrick Manager product – a multi-cloud “control plane” that centralizes the task of deploying (or “pushing”) data warehouse instances, along with associated data, workloads, and workflows, between and among on-premises and cloud contexts – in the second half of this year.

The new support for K8s, in particular, is a visible tell that Yellowbrick is paying much more than just marketing lip service to cloud native design. To understand why this is the case, it is helpful to grasp what is wholly new, subtly different, and, above all, useful about cloud native software.

What’s different about cloud native

The core premise of the cloud is elasticity: subscribers can adjust their use of virtual cloud resources in response to changing conditions. However, cloud infrastructure differs radically from non-converged [1] on-premises infrastructure in that it permits subscribers to adjust the resources they use independently of one another, separately increasing (or decreasing) virtual compute, storage, or networking capacity.

This is cloud computing 101. The thing is, this axiomatic formulation has almost nothing to say about the design of the software that must live and run in the context of virtual cloud infrastructure. To oversimplify, software that is designed to live and run in the on-premises data center will not live and run properly in the cloud. It will not perform as well, behave as expected, scale as reliably, or achieve comparable availability. It certainly will not permit elastic expansion or contraction, dynamic provisioning or deprovisioning, on-demand scaling, resilience, or any of a spate of related cloud-specific benefits.

Cloud native software design aims to produce software that comports with – and benefits from – these characteristics: elastic software that can be provisioned and scaled (or clawed back) as needed; resilient software that is designed to recover rapidly from failure; and adaptive software that can be scaled as needed to support the specific requirements of demanding workloads.

So far so good. Now, what does this have to do with data warehouse systems?

In the first place, it means that an on-premises data warehouse platform will not run, perform, or scale properly in cloud infrastructure. It will not be as reliable or as available in the cloud as it is in the on-premises environment.[2] This is true of a data warehouse deployed in either the cloud infrastructure-as-a-service (IaaS) or, especially, cloud platform-as-a-service (PaaS) contexts. The good news is that basically every extant PaaS data warehouse service has been (re)engineered to run, perform, and scale as expected in the cloud. This includes both cloud-first PaaS data warehouses (such as Snowflake and Yellowbrick) and the PaaS data warehouse services that are offered by established on-premises vendors. (Amazon’s Redshift is a kind of chimera: i.e., a cloud PaaS that – at its inception, in 2012 – incorporated elements of an on-premises platform: the former ParAccel analytic database.)

In the second place, it is useful to distinguish between “cloud-ready” and “cloud native” software. Think of a cloud-ready data warehouse as one that is optimized for both the benefits and limitations of cloud infrastructure. A cloud-ready data warehouse is elastic, capable of scaling up and scaling down as needed; subscribers can independently provision resources to meet changing demands. For all practical purposes, the cloud-ready data warehouse obviates the concept of storage as a gating factor in scaling a system: the primary emphasis shifts to scaling compute by adding virtual compute nodes.

A cloud native data warehouse is one that’s designed and built in accordance with any of several recognizable software design patterns. The term cloud native has a sort of vague cachet in marketing but it means something very specific to architects, developers, and other technicians. Yellowbrick’s recent Andromeda announcements are cloud native in precisely this sense. And, while the cloud native database is by no means a new thing, the cloud native data warehouse is a much less familiar pattern.

Behold: the cloud native data warehouse

What does it mean to reimagine or to reengineer the data warehouse on cloud native principles?

At a minimum, it entails breaking the data warehouse down into its constitutive functions.

A data warehouse actually does several things. First, it provides a mechanism for storing and retrieving data. To this end, it usually expects to persist data to a file system – a conventional data warehouse does this to its own block-level filesystem, which it alone manages – and likewise exposes APIs (via SQL) that can be used to store, retrieve, and – optionally – manipulate data. This is its storage function.

But a data warehouse is also a data processing platform par excellence. This is its compute function. The data warehouse is optimized for manipulating data structures: the RDBMS engine at its core applies relational algebra to join data structures together to create new, logically valid structures. To process an ad hoc query, a data warehouse must join facts together with dozens or potentially hundreds of different dimensions in or close to real time. To process a data engineering workload, the data warehouse must process and transform gigabytes, terabytes, or even petabytes of data – for example, by performing joins, applying data cleansing and data conditioning rules, etc. These are computationally intensive workloads that benefit from being broken up and distributed – that is, processed in parallel. This is the raison d’etre of the massively parallel processing (MPP) database.
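To make the "broken up and distributed" idea concrete, here is a minimal sketch of the scatter-gather pattern that MPP engines rely on. It is an illustration only, not how any particular product works: the sample rows, the round-robin distribution, and the function names are all assumptions, and threads stand in for the nodes of a cluster.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical fact rows: (region, sales_amount).
ROWS = [("east", 10), ("west", 5), ("east", 7),
        ("north", 3), ("west", 8), ("north", 1)]


def partition(rows, n_nodes):
    """Distribute rows round-robin across n_nodes slices (an assumed scheme)."""
    return [rows[i::n_nodes] for i in range(n_nodes)]


def local_aggregate(slice_rows):
    """Each 'node' computes a partial SUM(amount) GROUP BY region on its slice."""
    partial = Counter()
    for region, amount in slice_rows:
        partial[region] += amount
    return partial


def mpp_sum_by_region(rows, n_nodes=3):
    """Scatter slices to workers, then gather and merge the partial sums."""
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:
        partials = list(pool.map(local_aggregate, partition(rows, n_nodes)))
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    return dict(total)


if __name__ == "__main__":
    print(mpp_sum_by_region(ROWS))
```

The result is the same whether the rows are spread over one worker or six; only the amount of work done in parallel changes, which is the property an MPP database exploits.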

The thing is, there’s no hard-and-fast rule that data warehouse architecture requires a tightly coupled compute-storage nexus, in spite of the fact that it is usually deployed as an RDBMS. [3] At a high level, the data warehouse describes a conceptual architecture that is neither platform- nor technology-specific. Q.E.D.: the data warehouse’s storage function is notionally distinct from its compute function.

In fact, a software architect who specializes in cloud native design[4] might think to herself: “If I could separate the data warehouse’s compute from its storage functions, I could create a more portable, scalable, resilient system.” Were she to analyze the MPP database kernel that performs the work of scheduling and managing operations between the nodes in an MPP cluster, she might see in it a built-in model for decomposing and distributing SQL-processing workloads. Similarly, in cloud object storage-as-a-service platforms (e.g., Amazon S3, Google Cloud Storage, or Azure Blob Storage) she might see a pre-abstracted mechanism by which data can be stored for retrieval or manipulation.

So that’s one huge difference. As we’ve seen, however, cloud native design differs from conventional software design in that it presupposes ephemerality: instead of assuming that an application or service should be maximally scaled and available at all times, it presupposes an event-driven model in which software is dynamically provisioned in response to event triggers. To reimagine the data warehouse as a cloud native platform is, then, to recast it as a dynamic resource for data processing: a platform capable of allocating (or reclaiming) compute capacity as needed; a platform in which compute instances spawn, do some work, and terminate; a platform capable of spawning dozens, potentially hundreds, of compute instances to accommodate very large data-processing workloads.

As a thought exercise, it is simple enough to translate data warehouse architecture into equivalent cloud native concepts. Instead of deploying a data warehouse as a clustered MPP database – with its total data volume distributed across a fixed number of nodes, each with fixed amounts of compute, storage, etc. – the functions of the data warehouse could be divided between a data processing tier, orchestrated by Kubernetes, and a storage tier, which might persist data to and retrieve data from cloud storage. This is the pattern Yellowbrick plans to support with Kubernetes and Andromeda.
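The division of labor described above can be sketched in a few lines. This is a toy model under stated assumptions, not Yellowbrick's design: the `ObjectStore` class is a hypothetical in-memory stand-in for cloud object storage, and a thread pool stands in for the pods that an orchestrator such as Kubernetes would schedule. The point it illustrates is that the workers hold no data of their own; they spawn in response to an event, read from shared storage, do some work, and terminate.

```python
from concurrent.futures import ThreadPoolExecutor


class ObjectStore:
    """Hypothetical stand-in for a cloud object store: the storage tier."""

    def __init__(self):
        self._objects = {}

    def put(self, key, rows):
        self._objects[key] = list(rows)

    def get(self, key):
        return list(self._objects[key])

    def keys(self, prefix):
        return sorted(k for k in self._objects if k.startswith(prefix))


def ephemeral_worker(store, key, predicate):
    """Read one object from shared storage, count matching rows, then exit.

    The worker keeps no state once it returns.
    """
    return sum(1 for row in store.get(key) if predicate(row))


def handle_query_event(store, prefix, predicate):
    """React to a (hypothetical) query event.

    The pool is sized to the work at hand and released when the work is done.
    """
    keys = store.keys(prefix)
    if not keys:
        return 0
    with ThreadPoolExecutor(max_workers=len(keys)) as pool:
        counts = pool.map(lambda k: ephemeral_worker(store, k, predicate), keys)
        return sum(counts)


if __name__ == "__main__":
    store = ObjectStore()
    store.put("sales/part-0", [1, 5, 9])
    store.put("sales/part-1", [2, 12])
    print(handle_query_event(store, "sales/", lambda amount: amount > 4))
```

Because the data lives only in the storage tier, the number of workers can differ from one query to the next without any data having to be redistributed.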

It must be stressed that cloud native deployment is not the only way forward for the cloud data warehouse – or for Yellowbrick itself. For most workloads, it is likely that a PaaS MPP data warehouse service – a category that includes Yellowbrick’s own Andromeda platform – will perform and scale better than an MPP data warehouse deployed in Kubernetes. As a software pattern, the cloud native data warehouse comprises a useful tool for deploying, delivering, and maintaining event-driven MPP SQL services. The ability to deploy an MPP SQL engine on a platform such as K8s could permit data and software architects to deliver, maintain, and more easily scale SQL processing capabilities for certain types of practices and use cases, simplifying access to the warehouse – and its best-in-class SQL processing engine – not just for data scientists, ML and AI engineers, and BI discoverers, but for software architects and developers, too. Cloud native deployment recasts the data warehouse’s compute function as a textbook event-driven service – i.e., a service that spawns in response to an event trigger, performs a task, and terminates automatically. Cloud native deployment could make it easier for data and software architects to expose MPP SQL processing to human and machine consumers alike, exposing the data warehouse’s compute function as just one of the hundreds, thousands, or (notionally) tens of thousands of services that constitute an event-driven architecture.[5]

To sum up, a cloud native pattern of this type – a containerized MPP SQL engine that is instantiated as a Kubernetes pod – is akin to a useful adaptation, a Panda’s Thumb, in the evolution of the cloud data warehouse; it is not, however, a formal or binding prescription for that evolution.

Right now, the cloud native data warehouse makes sense for certain specific use cases: for example, as a tool for delivering query-as-a-service (QaaS) or MPP SQL-processing-as-a-service capabilities, or as a bedrock component of an event-driven architecture. In the same way that the panda’s “thumb” is not actually a thumb, but, rather, an exaptation that developed in response to specific environmental requirements, the cloud native data warehouse is an adaptation to the problem of scaling software functions (in this case, SQL processing) in response to unknown or unpredictable demand.

The takeaway

At this point, a cloud native data warehouse based on K8s does not make much sense as a blueprint for a production data warehouse; it likewise makes little sense as a primary option for parallelizing SQL processing to support production decision-support, data integration, advanced analytic, etc. workloads.

Yellowbrick’s own product roadmap demonstrates this. Concomitant with its support for K8s, Yellowbrick announced major hardware changes to its Andromeda data warehouse systems. (Again, Andromeda is the equivalent of an on-premises PaaS appliance that the provider – Yellowbrick – is responsible for managing.) For one thing, Yellowbrick replaced the Intel Xeon chips that used to power Andromeda with significantly more scalable EPYC processors from Advanced Micro Devices, or AMD.[6]

In addition, Yellowbrick enhanced Andromeda with NVMe storage caches to accelerate workload processing. (AMD’s EPYC chips boast more PCI-E lanes to support bandwidth-hungry resources such as NVMe.) And Yellowbrick, like data warehousing vendors Netezza and Kickfire before it, also plans to outfit Andromeda with its own proprietary silicon: namely, its Kalidah “scan accelerators.”[7] Kalidah is an FPGA processor: basically, tabula rasa silicon that a customer (in this case Yellowbrick) programs to perform specific tasks. Yellowbrick says that its Kalidah processors help accelerate common tasks such as large table scans, along with tasks such as data validation, decompression, and filtering.

Yellowbrick’s emphasis is on engineering beefier, more scalable systems in order to boost concurrency and support the most demanding SQL workloads. But the material point is that Yellowbrick’s Andromeda appliances are highly optimized for performance, scalability, and simplified manageability. Their vastly enlarged core densities permit customers to scale these systems “up” – i.e., configure beefier per-node compute and memory configurations – as well as (in an MPP cluster) “out.”

These systems are engineered to support demanding data warehousing use cases and workloads in precisely the way that a cloud native data warehouse deployed in and managed by Kubernetes is not.

And, based on extant technology, probably could not be. Not in the near term, at least.

Coda

We are seeing a similar revolution play out in the cloud infrastructure space, too. Earlier this month, Amazon introduced AQUA, a new Advanced Query Accelerator option for its Redshift PaaS MPP data warehouse. AQUA takes advantage of what Amazon has done with AWS Nitro – custom-designed hardware that undergirds EC2, EBS, and other core AWS services [8] – and what Amazon calls “custom FPGA-based acceleration.” AQUA basically permits Redshift to process certain kinds of (computationally intensive) tasks “closer” to the storage context in which the warehouse data lives.

To understand what this means, and to grok why it matters, it helps to remember that most PaaS data warehouses read data off of an object storage tier, similar to Amazon S3. On- or off-premises, object storage offers relatively fast sequential read and write performance and also tends to be cheaper than alternatives. The problem is that many types of data warehouse workloads (such as ad hoc queries) involve non-sequential, that is, random, data access – and that object storage has much higher latency than non-virtualized (direct-attached) storage, on-premises storage area networks, or, for that matter, other cloud storage services, such as block storage (e.g., Amazon EBS). What is more, object storage is less consistent – its performance less predictable – than direct-attached storage, or cloud block storage, for that matter. Enter AQUA, an optimized hardware solution developed by a PaaS data warehouse vendor to partially offset the built-in limitations of its own public cloud infrastructure.

Even prior to AQUA, Amazon introduced SSD-powered caching services for Redshift. It is not alone. Other PaaS data warehouse players also use SSD caching to compensate for cloud’s limitations.
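The general idea behind this kind of caching can be shown with a small read-through cache. This is a sketch under assumptions, not a description of how Redshift or any other service manages its cache: the least-recently-used eviction policy, the class name, and the `fetch` callback (a stand-in for a slow read from object storage) are all hypothetical.

```python
from collections import OrderedDict


class ReadThroughCache:
    """Small LRU cache in front of a slow backing store.

    A hypothetical stand-in for a fast local cache over object storage.
    """

    def __init__(self, fetch, capacity):
        self._fetch = fetch          # called only on a cache miss
        self._capacity = capacity    # maximum number of cached blocks
        self._blocks = OrderedDict() # ordered oldest to most recently used
        self.hits = 0
        self.misses = 0

    def read(self, block_id):
        if block_id in self._blocks:
            # Hit: serve locally and mark the block as recently used.
            self._blocks.move_to_end(block_id)
            self.hits += 1
            return self._blocks[block_id]
        # Miss: pay the cost of the slow read, then keep a local copy.
        self.misses += 1
        data = self._fetch(block_id)
        self._blocks[block_id] = data
        if len(self._blocks) > self._capacity:
            self._blocks.popitem(last=False)  # evict least recently used
        return data


if __name__ == "__main__":
    cache = ReadThroughCache(lambda block_id: f"block-{block_id}", capacity=2)
    for block_id in [1, 2, 1, 3, 2]:
        cache.read(block_id)
    print(cache.hits, cache.misses)
```

Repeated reads of the same blocks are served locally, so only first-time or evicted reads incur the latency of the object storage tier.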

Like it or not, however, highly optimized data warehouse kit is still a necessary evil. To wit: virtually all PaaS data warehouse providers also offer on-premises appliance versions of their PaaS data warehouse systems. (These are comparable to Yellowbrick’s Andromeda appliances.) These systems are useful for several reasons, not least as platforms for workloads that organizations are either unwilling (because of perceived risk) or unable (because of regulatory statutes) to move to public cloud infrastructure. However, as a function of the innate characteristics, constraints, and vicissitudes of public cloud infrastructure, some proportion of on-premises workloads just will not move to the off-premises cloud. For this reason, a data warehouse system running on highly optimized hardware is and will continue to be the best option for hosting the most demanding decision-support workloads.

As Amazon AQUA demonstrates, cloud infrastructure and service providers are pulling out all of the stops to address and redress the shortcomings and vicissitudes of the cloud model.

And the thing is, they very likely will succeed.

An analogy comes to mind. When Philips introduced the compact cassette in the mid-1960s, it was not originally marketed as a high-fidelity audio product. Instead, it was engineered primarily for convenience: a portable, reasonably accurate recording medium for specific applications. But the cassette’s convenience, low cost, and versatility proved irresistible to consumers. Amazingly, and in the space of just 25 years, it developed into a genuine high-fidelity audio medium, such that – by the time Dolby Labs introduced its Dolby S noise reduction system in 1989 – the lowly compact cassette could approximate the signal-to-noise ratio of the compact disc. Other enhancements, from the use of analog phase-locked loops to monitor bias equalization to the development of ingenious mechanical-electrical schemes to ensure proper azimuth alignment, shored up other obvious shortcomings, as did the use of dedicated microprocessors to regulate drive motors, minimizing analog wow-and-flutter.

The cassette ultimately failed, but cloud will not. And it is likely that – given enough time – cloud infrastructure will evolve into a suitable home for all workloads, irrespective of their characteristics.

_______________________________________________________________________

[1] Converged and, especially, hyper-converged infrastructure apply this same philosophy in an on-premises context.

[2] For more on this, see my book, Automating the Modern Data Warehouse. O’Reilly Media: March 2021.

[3] In a sense, and notwithstanding SQL’s usefulness as a tool for translating declarative statements into relational algebraic expressions, it is historical accident that the RDBMS is closely associated with data warehouse architecture. In the 1980s, the RDBMS was just the most convenient context (i.e., single system) in which to store and perform operations on data.

[4] The logic of cloud native design is in the tradition of minimalist software development philosophies (like Unix), which hold that software programs should be designed to perform specific functions. For more on this and on other aspects of cloud native design, see Chapter 5, Section 3 in my recent book, Migrating Applications to the Cloud. O’Reilly Media: 2021.

[5] The usual caveats apply – starting with the obligatory caveat that the cloud native technology stack is still (comparatively) immature. When Mike Loukides and I looked at the state of microservices last year for O’Reilly Radar, for example, we found that the adopters who had had the most success with microservices tended to use a conventional RDBMS as distinct from either (a) an RDBMS running in/as a container or (b) some other means of managing data persistence and data processing. See Swoyer, Steve & Loukides, Mike. Microservices Adoption in 2020. O’Reilly Media: July 2021.

[6] AMD’s enterprise-class EPYC processors scale to 64 cores and 128 threads; by contrast, Intel’s largest Xeon chips top out at 28 cores and 56 threads on a single processor. The EPYC and Xeon chips support basically the same x86 instruction set as well as many, although not all, of the single-instruction-multiple-data (SIMD) extensions that both companies have implemented to boost on-chip parallelism. For example, Xeon supports Intel’s AVX-512 extensions, which can significantly improve performance for certain kinds of workloads. On the whole, however, the AMD chips tend to offer much greater core and thread densities and – on a per-core basis – consume less energy and dissipate less heat than the Intel chips.

[7] Netezza pioneered the use of optimized silicon (its “snippet processing units”) in its data warehouse appliance systems. Kickfire, acquired by Teradata in 2010, developed proprietary ASICs to accelerate query processing.

[8] The linked-to article does a great job of explaining what AWS Nitro is and – more important – why it matters. Think of Nitro as a means of “de-virtualizing” (so to speak) a server, at least from a subscriber’s perspective: that is, offloading certain components of the virtualization workload to dedicated (Amazon-developed) silicon, as a result freeing local compute, memory, network, storage, etc. for the subscriber’s use. Quoting from Hamilton: “This allows EC2 instances to have access to all cores – none need to be reserved for storage or network I/O. This both gives more resources over to our largest instance types for customer use – we don’t need to reserve resource for housekeeping, monitoring, security, network I/O, or storage.”

About Stephen Swoyer

Stephen Swoyer is a technology writer with more than 25 years of experience. His writing has focused on data engineering, data warehousing, and analytics for almost two decades. He also enjoys writing about software development and software architecture – or about technology architecture of any kind, for that matter. He remains fascinated by the people and process issues that combine to confound the best-of-all-possible-worlds expectations of product designers, marketing people, and even many technologists. Swoyer is a recovering philosopher, with an abiding focus on ethics, philosophy of science, and the history of ideas. He venerates Miles Davis’ Agharta as one of the twentieth century’s greatest masterworks, believes that the first Return to Forever album belongs on every turntable platter everywhere, and insists that Sweetheart of the Rodeo is the best damn record the Byrds ever cut.
