I keep hearing the words “In Memory Database,” and that’s not surprising, because there are now more in-memory database products than there used to be, and it’s beginning to look like some of these products will become mainstream.
Should we care?
I would have thought so, although perhaps we should care more about the overall impact on software architecture than the emergence of a new class of database product.
The simple fact is this: memory access is roughly 100,000 times faster than disk access, and that disparity in speed has been getting bigger over the years. CPU power, bus speeds and memory speeds have been increasing at an exponential rate, roughly doubling every 18 months. By contrast, disk read speeds have been improving linearly rather than exponentially. Consequently, the disparity between the two latencies has been increasing and continues to do so.
It is worth noting here that a well-designed database can cache data in memory so effectively that the disparity in average access speed between memory and disk can be reduced to about 1000 to 1. But even so, 1000 times faster is three orders of magnitude faster, and one order of magnitude in speed is usually disruptive. That is the point: memory-based software is truly disruptive.
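That 1000-to-1 figure is simple weighted-average arithmetic. A minimal sketch, using assumed round numbers (100 ns for RAM, and the article's 100,000-to-1 disk ratio; both illustrative, not measurements):

```python
# Effective average access time under caching, with illustrative figures.
MEM_NS = 100                    # assumed RAM access latency
DISK_NS = 100_000 * MEM_NS      # disk ~100,000x slower, per the article

def effective_latency_ns(hit_ratio):
    """Weighted average latency for a given cache hit ratio."""
    return hit_ratio * MEM_NS + (1.0 - hit_ratio) * DISK_NS

for h in (0.90, 0.99, 0.999):
    avg = effective_latency_ns(h)
    print(f"hit ratio {h:.3f}: avg {avg:,.0f} ns ({avg / MEM_NS:,.0f}x RAM)")
```

At a 99% hit ratio the average works out to roughly 1000 times RAM latency, which is where the article's reduced disparity comes from: the rare misses still pay the full disk price.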
Consider the above representation of hierarchical data storage. Data held in RAM has the lowest possible latency. If the data cannot fit into RAM, the next best option is local solid-state disk (SSD). SSD is roughly three times faster than a local hard disk drive (HDD) for reads, and about five times faster for writes, but it is significantly more expensive: roughly 15 to 20 times the cost (bear in mind that prices fluctuate). SAN (and NAS) storage is slower than local HDD but has many other benefits. Roughly, HDD latency can be a third of SAN latency, both for single I/Os and for throughput, though this depends on how the HDD resource and the SAN are configured. There are situations where a SAN will perform better than local HDD, but SANs are rarely configured for a single workload (such as a single database instance); they normally serve many applications.
Below the SAN in the storage hierarchy come archive devices such as optical disks and tape. Since these are removable media, access times can be very long indeed. But that is also their benefit, since their capacity for data is almost unlimited. We could argue that there is another “invisible layer” in the hierarchy, comprising the data that you eventually throw away and the data that you never stored because of cost. Clearly, this can never be accessed. However, as storage costs decline it becomes economic to store such data, and there may be good reason to do so.
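The rough ratios above can be gathered in one place. A sketch using this article's approximate figures (generalities, not measurements; latency is relative to RAM, cost is relative to local HDD, and None marks tiers where no figure is given):

```python
# The storage hierarchy's rough ratios, as described in this article.
TIERS = [
    # name,      read latency vs RAM,  cost per GB vs HDD
    ("RAM",      1.0,                  None),
    ("SSD",      100_000 / 3,          17.5),  # ~3x faster than HDD; ~15-20x cost
    ("HDD",      100_000.0,            1.0),   # disk ~100,000x slower than RAM
    ("SAN/NAS",  300_000.0,            None),  # HDD latency ~1/3 of SAN latency
]

def slowdown(tier_name):
    """Latency multiple relative to RAM for the named tier."""
    return next(lat for name, lat, _ in TIERS if name == tier_name)

print(f"SAN/NAS sits at ~{slowdown('SAN/NAS'):,.0f}x RAM latency")
```

Laid out this way, the pyramid's shape is obvious: each step down trades several multiples of latency for a lower cost per gigabyte.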
The Migration of Data
It is interesting to consider the migration of data between different storage media from the perspective of hierarchical data storage. The cost of all these storage options falls every year: RAM by about 30% per year on average, disk storage faster at about 35% per year, and SSD faster still at about 50% per year. The situation is made far more complex by the fact that data volumes are increasing at between 50% and 60% per year on average.
More data is being stored, and technology costs, even though declining at quite a rate, are not declining fast enough to pay for the extra data storage. They may be for some organizations, because the growth in data volumes we have given is an average figure, and that average masks significant variance between organizations and even whole industries.
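The interplay of those two trends is easy to check with back-of-envelope arithmetic. A sketch, assuming a 35% annual cost decline and 55% annual data growth (illustrative mid-points from the figures above):

```python
# Does falling storage cost pay for growing data volumes?
# Assumed figures: unit cost falls ~35%/yr, data volume grows ~55%/yr.
COST_DECLINE = 0.35
DATA_GROWTH = 0.55

def annual_spend_factor(decline, growth):
    """Year-over-year change in total storage spend (volume x unit cost)."""
    return (1.0 + growth) * (1.0 - decline)

factor = annual_spend_factor(COST_DECLINE, DATA_GROWTH)
print(f"total spend changes by {factor:.4f}x per year")  # > 1.0: spend rises
```

With these assumed figures the factor comes out fractionally above 1.0, so total spend still creeps upward despite the steep per-unit declines, which is exactly the squeeze described above.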
Neither is it the case that the declines in technology costs mark out a smooth curve. These cost figures indicate general trends which can be disrupted both by economic conditions and by technology innovation.
Nevertheless, because of the continual decline in costs, the general trend is for data to migrate upward through the “pyramid” of hierarchical data storage, and the considerable advantage this delivers is that applications run faster.
Parallelism and Grids
It would be simpler if our diagram presented a comprehensive picture of the situation, but it omits several factors that add to the complications organizations find themselves in. One of these is parallel processing. It is quite possible to assemble a grid of servers that have nothing but copious amounts of RAM, and use them to build very large in-memory stores. If you have no cost constraints and you use the right software (GridGain is a company that provides such software), you can scale out almost indefinitely and hold many terabytes of data in memory. Naturally, you will need to configure a failover capability, but that can be done.
In those circumstances, memory becomes the prime storage medium. Most likely, copies of the data will be maintained on a SAN for back-up and disaster recovery. Nevertheless, this inverts the normal order of things: data is maintained in memory and written to disk for back-up purposes only. The motive for doing this is obviously to get that 1000-to-1 advantage in speed, and it is achievable if the right software is deployed. But why would you want it?
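The partition-plus-failover arrangement just described can be sketched in a few lines. This is a toy illustration only, not how any real grid product (GridGain included) is implemented:

```python
# Toy sketch of a scale-out in-memory grid with a failover replica.

class Node:
    """One grid member holding its slice of the data in RAM (a plain dict)."""
    def __init__(self, name):
        self.name = name
        self.data = {}      # the in-memory store for this node's partitions
        self.alive = True

class Grid:
    """Keys are hash-partitioned across nodes; each write also goes to the
    next node in the ring, which serves as the failover replica."""
    def __init__(self, nodes):
        self.nodes = nodes

    def _owners(self, key):
        i = hash(key) % len(self.nodes)
        return self.nodes[i], self.nodes[(i + 1) % len(self.nodes)]

    def put(self, key, value):
        primary, replica = self._owners(key)
        primary.data[key] = value
        replica.data[key] = value   # synchronous backup copy

    def get(self, key):
        primary, replica = self._owners(key)
        node = primary if primary.alive else replica
        return node.data.get(key)

grid = Grid([Node("n1"), Node("n2"), Node("n3")])
grid.put("order:42", "shipped")
grid._owners("order:42")[0].alive = False   # simulate primary node failure
print(grid.get("order:42"))                 # still served, from the replica
```

Adding nodes grows the total RAM pool, and every key survives the loss of its primary, which is the essence of the scale-out, failover-capable design described above.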
Why You Might Want It
You may be thinking that your transactional systems are fast enough already. So why bother moving to a scale-out memory-based architecture?
The truth is that almost every organization has disk as its prime storage medium, whether that involves SSDs, SANs, NAS, HDDs or a mixture of all of them. As such, the latency imposed by disk access doesn’t affect just some of the applications; it affects them all. So your transactional systems may be fast enough, but their data, which travels from the operational databases through staging areas and data cleansing operations to data warehouses and then to data marts, incurs disk latencies at every step. If memory is the prime storage medium, these steps all work considerably faster, in fact lightning fast, because the data flow is all memory-to-memory.
Migrating a BI architecture into memory could cut latency from days or hours to minutes, and this could have considerable business impact, because it reduces “time to action” significantly.
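The cumulative effect of a disk-bound hop at every stage can be illustrated with assumed numbers. The hourly-batch figure and the 1000-to-1 speedup below are both assumptions for illustration, not benchmarks:

```python
# Rough end-to-end latency for the BI data flow described above, comparing
# a disk-backed hop at each stage with memory-to-memory hops.
STAGES = ["operational DB", "staging area", "cleansing", "warehouse", "data mart"]

DISK_HOP_S = 3600.0   # assume roughly an hourly batch load per stage
MEM_HOP_S = 3.6       # the same hop ~1000x faster, memory-to-memory

hops = len(STAGES) - 1
disk_total = DISK_HOP_S * hops
mem_total = MEM_HOP_S * hops
print(f"disk-backed pipeline: {disk_total / 3600:.1f} h end to end")
print(f"in-memory pipeline:   {mem_total:.1f} s end to end")
```

Even with these modest assumptions, four hops turn hours of accumulated latency into seconds, which is why the “time to action” improvement is measured in whole reporting cycles rather than percentage points.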
The Devil In The Detail
It would be pleasant if the disruptive technology changes that are now occurring could be addressed simply by buying more power at the hardware level (the latest CPUs, more memory, more SSDs and so on). They cannot, for many reasons. The primary reason is that most of our applications were not built using in-memory technology. You can pin a whole database in memory, but if it is one of the traditional relational databases it will still behave as though it is pulling data from spinning disk. It will run a great deal faster, but it does not have an in-memory architecture. It will be a very bad fit.
Think also of virtualization technology, which happily generates whole virtual machines into which individual applications are slotted. Do we really want large numbers of operating systems clogging up the CPUs and the memory? There has to be, and there is, a better way of scheduling these resources.
And what about the multitude of packaged applications that most companies run? Who is going to rebuild those applications to use memory as the prime storage medium? And what about all the service management software: how is that going to work in such a brave new world? And what about the cloud: how does this impact decisions about which applications to migrate to the cloud, and when to migrate them?
A Relatively Slow Boat To China
The move to main memory as the prime storage medium is not going to happen swiftly. Like all technology revolutions of the past, it will be driven by companies picking off the low-hanging fruit first. As usual, this will not be about migrating older applications, so much as identifying new applications that can deliver a real punch if they work at lightning speed. But it will surely happen.
Indeed, this revolution has already begun.