Big data, if nothing else, is about lots of data. That’s clear enough, even if the Devil lurks somewhere in the details. As a rough rule of thumb, if the people I talk to are a useful measure, big data means tens to hundreds of terabytes and beyond, and it includes scaling up and scaling out to be able to process whatever workloads one expects to run on such data. Naturally we’re led to think of “disk storage” when we think of big data since holding large amounts of data in memory may not be economically practical.
The Use of In-Memory Technology
Currently you can get a terabyte or more of memory onto a server, if you want to pay the price. For big data processing, that means tens or even hundreds of servers if you wanted to hold a genuinely big dataset entirely in memory, and you’d need software that could knit all those resources together well.
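The arithmetic behind that claim is simple enough to sketch. The dataset sizes and the one-terabyte-per-server figure below are illustrative assumptions, not measurements:

```python
import math

# Back-of-envelope sizing: how many servers with 1 TB of RAM each would
# it take to hold a big data working set entirely in memory?
def servers_needed(dataset_tb: float, ram_per_server_tb: float = 1.0) -> int:
    """Round up: even a partially filled server is still a whole server."""
    return math.ceil(dataset_tb / ram_per_server_tb)

for size_tb in (10, 50, 300):                     # hypothetical dataset sizes
    print(f"{size_tb} TB dataset -> {servers_needed(size_tb)} servers")
```

Even before accounting for replication and working space, the server count scales linearly with the data, which is what makes the all-in-memory approach expensive.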
And that’s why most companies think that in-memory technology is best deployed to pin OLTP databases in memory and run BI queries off the same databases. You can mirror the servers involved and you get a huge lift in speed. It may indeed be the best use of in-memory technology.
So does this mean that big data and in-memory technology can’t play well together?
Hold on a moment. Let’s just think about how databases normally get their speed. To begin with there’s an optimizer involved, which calculates the quickest way to get at the data that satisfies a given query. The optimizer knows about the available resources and calculates the “cost” of a query in terms of those resources. It knows that reading data from disk is slow, slow, slow. So the database has a caching strategy whereby it holds the data most likely to be accessed in a cache in memory. If the caching strategy is effective, the database doesn’t need to spend much time reading from disk. It also implements strategies to reduce the time spent waiting for data to be written to disk when it is updated.
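A toy sketch of that caching idea, assuming a simple LRU (least-recently-used) eviction policy; real database buffer managers are far more elaborate, and the page names here are made up:

```python
from collections import OrderedDict

class PageCache:
    """Toy LRU cache of 'disk pages': hot pages are served from memory,
    cold ones fall back to a (simulated) slow disk read."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.pages: "OrderedDict[str, bytes]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def read(self, page_id: str) -> bytes:
        if page_id in self.pages:
            self.hits += 1
            self.pages.move_to_end(page_id)       # mark as recently used
            return self.pages[page_id]
        self.misses += 1
        data = self._read_from_disk(page_id)      # the slow path
        self.pages[page_id] = data
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)        # evict least recently used
        return data

    def _read_from_disk(self, page_id: str) -> bytes:
        return f"contents of {page_id}".encode()  # stands in for real I/O

cache = PageCache(capacity=2)
for p in ("p1", "p2", "p1", "p3", "p1"):
    cache.read(p)
print(cache.hits, cache.misses)   # repeated reads of p1 hit the cache
```

The effectiveness of such a strategy depends entirely on the access pattern: a working set that fits in the cache turns most disk reads into memory reads.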
So now let’s think again about the use of memory for big data applications. Most big data applications are analytics. With OLTP we get a transaction that updates or adds some data, and once it is complete the business result has been achieved. OLTP transactions are short, sharp and fast. With analytics, the data analyst conducts a fairly extensive dialogue with the data, and there may be many steps to it. In modeling a problem the analyst may read just a small sample of the data and then interact with it using various statistical techniques. So the first query might hit all the data to extract the required sample, while the rest of the interactions run against a small amount of data that can be held in memory, with various mathematical routines executing against it.
Once the data analyst is finished with modeling he or she may wish to query all the data, but in doing so may quickly reduce it to aggregated values against which mathematical routines are run. The aggregated values might not consume a vast amount of memory. The important point is that the business result is not achieved until all of this has been done.
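That sample-then-aggregate workflow can be sketched as follows. The synthetic data, the column names, and the sample size are all assumptions for illustration:

```python
import random
import statistics

random.seed(0)

# Stands in for the full on-disk dataset: 100,000 synthetic sales records.
full_data = [{"region": random.choice("NSEW"), "sales": random.uniform(10, 500)}
             for _ in range(100_000)]

# Step 1: one scan over everything to pull a small in-memory sample
# for modeling; subsequent statistical work touches only the sample.
sample = random.sample(full_data, k=1_000)
model_mean = statistics.mean(row["sales"] for row in sample)

# Step 2: a later full scan reduces the data to per-region aggregates,
# a handful of values however large the source was.
totals: dict = {}
for row in full_data:
    totals[row["region"]] = totals.get(row["region"], 0.0) + row["sales"]

print(f"sample mean ~ {model_mean:.1f}; aggregates hold {len(totals)} values")
```

The pattern is the point: after the initial full scans, everything the analyst touches interactively is small enough to live comfortably in memory.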
So if we are to measure the latency of a data analyst transaction, it needs to involve all the steps mentioned (including, perhaps, data preparation and data cleansing—and also the thinking time of the data analyst) to get to the business result.
We cannot do much about the thinking time, but everything else can be sped up, and if we accelerate the processing side of it we get two useful and profitable outcomes:
- The data analyst achieves greater productivity
- The knowledge gained by the data analyst can be actioned faster
The “data analyst transaction” is not a simple one. It varies significantly according to the goal and the nature of the data being analyzed. We cannot model it the way we can model an OLTP transaction. But we know a couple of things for sure. First, the transaction will go faster if the most frequently accessed data is held in memory and only has to be read from disk once. Second, it will go faster if we employ as much parallelism as possible.
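Those two speed-ups can be sketched together: the data is split into partitions, each partition is scanned by its own worker against an in-memory slice, and the partial results are combined. The thread pool below stands in for the parallel workers a real engine would use, and the data is made up:

```python
from concurrent.futures import ThreadPoolExecutor

def scan_partition(partition: list) -> float:
    """Each worker aggregates its own in-memory slice independently."""
    return sum(partition)

def parallel_total(data: list, n_partitions: int = 4) -> float:
    """Split the data into partitions, scan them in parallel, combine."""
    size = -(-len(data) // n_partitions)          # ceiling division
    partitions = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        return sum(pool.map(scan_partition, partitions))

print(parallel_total(list(range(1_000_000))))     # matches sum of the data
```

This divide-scan-combine shape is exactly what distributed analytics engines apply at scale, with partitions spread across the memory of many servers rather than the threads of one.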
And this means that in-memory technology and big data, whether they like it or not, must play nicely together.