With the release of its Real-time Compression Appliances, IBM threw another variable into the data center equation. Roughly speaking, you can view the situation in the following way:
Data centers are expensive in terms of space, power, cooling and, particularly, labor. And yet the amount of data we store increases by roughly 55% per year. We may talk about Big Data, and indeed many data centers are assembling big pools of data, but the reality is that the data growth rate has stayed fairly constant for decades. The consequence is that every year data centers need more storage capacity (and more processing capacity too, to process all that data).
The idea of pushing all that data into the public cloud to ease the strain on the data center is appealing, but sadly a good deal of data center data cannot be pushed out in that direction, for a variety of reasons that mostly relate to service levels. For most data centers the practical response is to rely on technology to solve the problem, and data compression is naturally a compelling idea.
In simple terms, data compression allows you to store more data on the same disk than before, and the level of compression, depending on what data is being compressed, can be very high.
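To put a rough number on what "very high" means in capacity terms, here is a back-of-the-envelope sketch in Python. The reduction figures are illustrative assumptions, not vendor data; the point is simply how a percentage reduction translates into effective disk capacity.

```python
# Illustrative only: how a given reduction in stored bytes translates into
# effective capacity. The reduction figures are assumptions, not vendor data.

def effective_capacity(raw_tb: float, reduction: float) -> float:
    """Logical data a disk can hold when stored bytes shrink by `reduction` (0-1)."""
    return raw_tb / (1.0 - reduction)

if __name__ == "__main__":
    for reduction in (0.50, 0.70, 0.80):
        print(f"{reduction:.0%} reduction: a 10 TB array holds "
              f"{effective_capacity(10, reduction):.0f} TB of logical data")
```

At 50% the array effectively doubles; at 80% it holds five times the logical data.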
IBM’s Real-Time Data Compression
IBM’s real-time data compression appliances are a neat idea. You add them to the data center network in an appropriate location and they sit there compressing data 24/7 with, IBM claims, no downside. The technology compresses data as it is written to a storage device and decompresses it when it is read, with no need to change applications in any way. This is, by the way, unique patented technology that is baked into IBM’s own Storwize devices but will happily work with competitors’ storage technology as well.
Critically, IBM claims the appliance imposes no overhead in performing the compression: application performance is either unaffected or improved, while storage consumption drops significantly. The level of compression varies with the type of data, but for most data you get somewhere between a 70% and 80% reduction in data volume, and that includes database data as well as office applications and CAD/CAM data. The appliance sits between the application and the storage device, scrunching up the data on its way to disk and unscrunching it on its way back to the application. Conceptually, the trade-off works like this: compressing the stream takes some time, but the disk then has less data to write away, and in the tests IBM has done the outcome is either a wash or a performance gain. The same holds for decompression when data is read.
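To make the mechanism concrete, here is a minimal Python sketch of that write-compressed/read-decompressed round trip. It uses zlib purely as a stand-in for IBM’s proprietary algorithm and a local file as a stand-in for the storage device; the real appliance does this in hardware, inline, between the application and the array.

```python
import os
import zlib

# Minimal sketch of transparent write-path compression / read-path decompression.
# zlib stands in for IBM's proprietary algorithm; a local file stands in for the array.

def write_compressed(path: str, data: bytes) -> None:
    """Compress the application's data as a stream and write it to 'disk'."""
    compressor = zlib.compressobj(level=6)
    with open(path, "wb") as f:
        f.write(compressor.compress(data))
        f.write(compressor.flush())

def read_decompressed(path: str) -> bytes:
    """Read compressed bytes from 'disk' and hand the application plain data."""
    decompressor = zlib.decompressobj()
    with open(path, "rb") as f:
        return decompressor.decompress(f.read()) + decompressor.flush()

if __name__ == "__main__":
    original = b"SELECT * FROM orders WHERE status = 'open';\n" * 10_000
    write_compressed("demo.bin", original)
    restored = read_decompressed("demo.bin")
    assert restored == original  # the application never sees compressed bytes
    ratio = 1 - os.path.getsize("demo.bin") / len(original)
    print(f"stored {len(original):,} logical bytes, {ratio:.0%} smaller on disk")
```

The point of the sketch is the transparency: the application reads and writes plain bytes, while what actually lands on disk is a good deal smaller.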
A Philosophical Note
Previous attempts by the IT industry to build this kind of data compression capability could not avoid a performance penalty and thus found a far more limited area of application. It is important to note, though, that such technology does exist and is already deployed in some data centers. And remember that there are far more databases out there than the popular Oracle, DB2 and SQL Server products: many of the newer scale-out databases (Vertica, ParAccel, etc.) compress data as part of their performance strategy.
Of course this does not invalidate IBM’s technology, since most data center data is not database data at all. But it does raise the question of whether it is better to do data compression in hardware or in software. It would be interesting to know, for example, whether IBM’s appliance with Oracle’s database compression turned off is preferable to using Oracle’s data compression on its own. I don’t know the answer, and I don’t think IBM does either, at the moment.
“Data compression in the iron” is an idea that has legs. IBM will likely make a success of this family of devices and, in the longer term, the industry may well decide to make data compression a hardware rather than a software feature, perhaps to the point where it attains “commodity status.”
Of course, I doubt whether any of this will serve as a long-term solution to handling data growth. Data likes to grow. It always has. It always will.