“Reality Doesn’t Fit on Disk!”
So claims Chris Sachs, the CTO of Swim and co-host of SoftWare In Motion. Chris was discussing the important subject of data relevance with Kavanagh and Bloor in a recent episode of Future Proof. Before we delve into what he means by his bold assertion, let’s provide a backdrop to the real-time world that we now inhabit.
The Latency Curve Graphic above provides a historical perspective on how IT has accelerated exponentially since it emerged 70 years ago. The vertical axis of the graph marks out response times on a logarithmic scale stretching from days down to ten-thousandths of a second. The horizontal axis is simply a measure of the workload size an application presents to the computer. The curves that emerge from the top left of the graph mark out, decade by decade, the acceleration of computing.
The graph illustrates that we have gone from applications that took days and didn’t achieve much to right now, where applications can be literally faster than lightning, and perform heavy workloads in small intervals of time. It is important to understand that this continual acceleration has wrought undreamed of changes in both the user interface and how data is processed.
It was this that Chris Sachs was speaking about when he said, “Reality doesn’t fit on disk.”
An Astronomical Parallel
It is sometimes pointed out that the light from the Sun takes about eight minutes to get to us. Even at its closest approach to Earth, light from Jupiter takes four times as long. Go to the limits of the solar system, and the light from the heliopause takes 18 hours to get to us. Light from the nearest star (Proxima Centauri) takes over 4 years to arrive. Looking deeper into the cosmos, we see distant objects whose light takes millions of years or more to arrive.
Nevertheless, if you lie on the ground and look at the night sky, you get no sense of those distances and how our view of the heavens is composed. You think it’s all real-time.
The same is true for a computer processing data. The busy processor crunches the data as directed, but it has no idea of how long the data took to get to it. It doesn’t consider such things. The data it is processing might be stored in cache on the chip, or in memory, or on solid-state disk, or spinning disk, or it may live on some other computing device thousands of miles away.
It is the responsibility of system designers to make sure that the streams of data arrive when needed and are processed at a time that extracts maximum value from the data.
The Value of Data
Who wants yesterday’s papers? Well, archivists might, but almost everyone else has no interest. Data, whether it’s the news that people like to read, or the bits and bytes that chips like to chew on, gets stale and loses value quickly.
When Chris Sachs says “Reality doesn’t fit on disk” he’s painting a whole landscape.
First of all, with current technology, “the data needed to competently drive and validate automated operational decisions is too big to store.” “It doesn’t fit on disk,” is a cute way of saying that any architecture that is based on processing new data at its point of maximum value has to be streams-based and will make no attempt to store the data before it processes it.
If you’re storing everything you receive, you’re missing most of what there is to know. The data of “reality”, by which we mean all the data that should influence a decision, is too big for disk, and very little of it needs to be kept once it has served its primary purpose.
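To make the streams-based idea concrete, here is a minimal sketch, not Swim’s actual architecture: each event is processed the moment it arrives, a small derived summary is updated, and the raw reading is then let go. The class name, window size, and readings are all invented for illustration.

```python
from collections import deque

class StreamSummarizer:
    """Hypothetical sketch: act on data at its point of maximum value,
    retain only a bounded rolling summary, never the raw stream."""

    def __init__(self, window: int = 3):
        # Bounded buffer: once full, the oldest reading falls off automatically.
        self.window = deque(maxlen=window)
        self.events_seen = 0

    def ingest(self, reading: float) -> float:
        # Process the new reading immediately...
        self.events_seen += 1
        self.window.append(reading)
        # ...and keep only the derived state (here, a moving average).
        return sum(self.window) / len(self.window)

s = StreamSummarizer(window=3)
for reading in [10.0, 20.0, 30.0, 40.0]:
    latest_avg = s.ingest(reading)
```

However long the stream runs, memory use stays fixed at three readings and a counter; the full history is never stored, yet every reading influenced a decision when it mattered.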
A Distinctly Different View of Data Loss
Chris believes that 99.99% of all data that’s ever generated is immediately discarded (and has to be) because it’s too big to store. Not being real-time implies that you’re only processing a small fraction of the data that you might otherwise be able to leverage.
He says, “show me a data source, and I’ll show you all the ways it’s throttled and sub-sampled to fit into storage.”
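One common form of the throttling Chris describes is simple decimation: keep one reading in every N and let the rest vanish before they ever reach storage. The sketch below is illustrative only; the 1 kHz rate and keep-one-in-a-hundred ratio are made-up numbers, not figures from the episode.

```python
def subsample(readings, keep_every: int = 100):
    """Keep one reading in every `keep_every`; the rest never reach storage."""
    return readings[::keep_every]

raw = list(range(1000))   # one second of a hypothetical 1 kHz sensor
stored = subsample(raw)   # what actually lands on disk
```

Here 99% of the stream is discarded before anything is “retained”, which is exactly the quiet data loss hiding behind most stored datasets.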
“Data is effectively ‘lost’ during the time in which it’s being ingested and processed; it goes past its ‘sell-by date’ in a jiffy.
“Data not getting where it’s needed in time is completely indistinguishable from data loss due to a hardware failure. Lack of visibility into what’s happening in real-time is data loss, pure and simple.”
The outdated architectures that are based on data retention are a problem. The industry currently accepts ludicrously high rates of data loss in the name of “zero data loss.” The data flow design seems to be to lose as much data as you possibly can until you’ve thrown out so much that what remains fits permanently onto multi-region replicated disks.
You should use storage to… store stuff! And not use it as a consistency and resilience crutch; doing so cripples you more than you realize.
Reality and Real-Time
There is literally no technical possibility of processing the full picture of what’s going on except in real-time. If you wait too long, the data is lost forever.
Chris expresses his real-time philosophy in the following way:
“I don’t remember what I ate for breakfast this morning, or any morning. But I’ve still managed to learn what I like for breakfast.
“I analyze what I eat, and then store what I like. I don’t store everything I’ve ever eaten and then periodically analyze my eating habits to determine what I’m statistically most likely to like. That would be stupid. And unnecessarily imprecise.”
“Long-term analysis might be useful from a nutritional perspective. My stored recollection of what I tend to eat is like a highly compressed jpeg of my actual food intake: It’s close enough. And its small size frees up my capacity to remember other things.”
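The breakfast analogy can be sketched as an exponentially weighted running score: blend each meal’s experience into one number per dish, then forget the meal itself. This is one plausible reading of the idea, not Chris’s stated method; the dish names and the 0.3 blending rate are invented for the example.

```python
def update_preference(scores: dict, dish: str, enjoyed: float,
                      alpha: float = 0.3) -> dict:
    """Blend today's experience into a running score, then forget the meal."""
    old = scores.get(dish, 0.0)
    # Exponentially weighted update: recent meals count more, old ones fade.
    scores[dish] = (1 - alpha) * old + alpha * enjoyed
    return scores

prefs: dict = {}
for dish, enjoyed in [("oatmeal", 0.9), ("toast", 0.4), ("oatmeal", 0.8)]:
    update_preference(prefs, dish, enjoyed)
# prefs now holds one number per dish: a lossy, "compressed jpeg" of
# every breakfast ever eaten, with no per-meal history retained.
```

Like the compressed jpeg, the summary is close enough to drive good decisions, and its tiny size is the whole point.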
If you can pick the bones out of that, you’ve probably got yourself a real-time architecture.