
Big Data and In-Memory: Are They Related?

Big data, if nothing else, is about lots of data. That’s clear enough, even if the Devil lurks somewhere in the details. As a rough rule of thumb, if the people I talk to are a useful measure, big data means tens to hundreds of terabytes and beyond, along with the ability to scale up and scale out to process whatever workloads one expects to run on such data. Naturally we’re led to think of “disk storage” when we think of big data, since holding large amounts of data in memory may not be economically practical.

Read Robin Bloor’s white paper, “Why In-Memory Technology Will Dominate Big Data: In-Memory and the New BI.”

The Use of In-Memory Technology

Currently you can get a terabyte or more of memory onto a server, if you are willing to pay the price. As far as processing big data is concerned, this means that if you wanted to hold anything that qualifies as big data entirely in memory, you would need tens or even hundreds of servers, along with software that could integrate those resources well.
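As a rough, hypothetical illustration of that arithmetic, here is a back-of-envelope sketch in Python. The dataset size, memory per server and overhead factor are assumed figures chosen for illustration, not measurements from any real deployment:

```python
import math

# Assumed figures for illustration only -- adjust to your own environment.
dataset_tb = 200          # working set of "big data", in terabytes
ram_per_server_tb = 1.0   # memory you can reasonably fit in one server
overhead_factor = 2.0     # indexes, replicas, OS and runtime overhead

servers_needed = math.ceil(dataset_tb * overhead_factor / ram_per_server_tb)
print(f"Servers needed to hold {dataset_tb} TB entirely in memory: {servers_needed}")
```

Even with these fairly generous assumptions, the answer lands in the hundreds of servers, which is the sobering arithmetic behind the paragraph above.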

And that’s why most companies think that in-memory technology is best deployed to pin OLTP databases in memory and run BI queries against those same databases. Mirror the servers involved and you get a huge lift in speed. It may indeed be the best use of in-memory technology.

So does this mean that big data and in-memory technology can’t play well together?

Hold on a moment. Let’s just think about how databases normally get their speed. To begin with there’s an optimizer involved, which calculates the quickest way to get at the data that satisfies a given query. The optimizer knows about the available resources and calculates the “cost” of a query in terms of those resources. It knows that reading data from disk is slow, slow, slow. So the database has a caching strategy whereby it holds the data most likely to be accessed in a cache in memory. If the caching strategy is effective, the database doesn’t need to spend much time reading from disk. It also implements strategies to reduce the time spent waiting for data to be written to disk when it is updated.
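To make the caching idea concrete, here is a minimal sketch of a least-recently-used (LRU) page cache of the kind a database engine might put in front of its disk reads. It is illustrative only: real engines use far more elaborate buffer managers, and read_page_from_disk below is a hypothetical stand-in for the slow path.

```python
from collections import OrderedDict

class PageCache:
    """Toy LRU cache: keep the hottest pages in memory, evict the coldest."""

    def __init__(self, capacity, read_page_from_disk):
        self.capacity = capacity
        self.read_page_from_disk = read_page_from_disk  # slow path: disk I/O
        self.pages = OrderedDict()                      # page_id -> page data

    def get(self, page_id):
        if page_id in self.pages:
            self.pages.move_to_end(page_id)   # mark as most recently used
            return self.pages[page_id]        # fast path: served from memory
        page = self.read_page_from_disk(page_id)        # cache miss: hit disk
        self.pages[page_id] = page
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)    # evict the least recently used page
        return page
```

The better the cache fits the workload, the less often the slow path is taken, which is exactly the effect the database is counting on.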

So now let’s think again about the use of memory for big data applications. Most big data applications are analytics. With OLTP we get a transaction that updates or adds some data, and once it is complete the business result has been achieved. OLTP transactions are short, sharp and fast. With analytics, the data analyst conducts a fairly extensive dialogue with the data, and there may be many steps to it. In modeling a problem the analyst may read just a small sample of the data and then interact with it using various statistical techniques. So the first query might hit all the data to extract the required sample, and the rest of the interactions could run against a small amount of data that can be held in memory, with various mathematical routines executing against it.
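A sketch of that pattern, assuming the raw data sits in a large CSV file on disk and using simple reservoir sampling; the file name, sample size and column position are placeholders for illustration, not anything prescribed here:

```python
import csv
import random
import statistics

def sample_rows(path, k, seed=42):
    """One full pass over the data on disk, returning a k-row random sample."""
    random.seed(seed)
    sample = []
    with open(path, newline="") as f:
        for i, row in enumerate(csv.reader(f)):
            if i < k:
                sample.append(row)
            else:
                j = random.randint(0, i)      # reservoir sampling
                if j < k:
                    sample[j] = row
    return sample

# The first query hits all the data once to extract the sample...
sample = sample_rows("transactions.csv", k=100_000)

# ...then every subsequent interaction runs against the in-memory sample.
amounts = [float(row[2]) for row in sample]   # assumes column 2 holds a numeric value
print(statistics.mean(amounts), statistics.stdev(amounts))
```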

Once the data analyst is finished with modeling, he or she may wish to query all the data, but in doing so may quickly reduce it to aggregated values against which mathematical routines are run. The aggregated values might not consume a vast amount of memory. The important point is that the business result is not achieved until all of this has been done.
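That second pass might look something like the following sketch: stream every row once, reduce it to per-group aggregates that fit comfortably in memory, and then run whatever mathematics is needed against the aggregates. The file name, column layout and group key are again assumptions made for illustration:

```python
import csv
from collections import defaultdict

totals = defaultdict(lambda: [0.0, 0])        # group key -> [sum, count]

# Full scan of the data, but only the aggregates are retained in memory.
with open("transactions.csv", newline="") as f:
    for row in csv.reader(f):
        key, amount = row[0], float(row[2])   # assumed layout: key, ..., amount
        totals[key][0] += amount
        totals[key][1] += 1

# The aggregates are tiny compared with the raw data, so the subsequent
# mathematics runs entirely in memory.
averages = {key: s / n for key, (s, n) in totals.items()}
print(sorted(averages.items(), key=lambda kv: kv[1], reverse=True)[:10])
```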

So if we are to measure the latency of a data analyst transaction, the measurement needs to cover all of the steps mentioned (including, perhaps, data preparation and data cleansing, and also the thinking time of the data analyst) to get to the business result.

We cannot do much about the thinking time, but everything else can be sped up, and if we accelerate the processing side of it we get two useful and profitable outcomes:

  1. The data analyst achieves greater productivity
  2. The knowledge gained by the data analyst can be actioned faster

And So?

The “data analyst transaction” is not a simple one. It varies significantly according to the goal and the nature of the data being analyzed. We cannot simply model it in the way we can model an OLTP transaction. But we know a couple of things for sure. First, the transaction will go faster if the most frequently accessed data is held in memory and only has to be read from disk once. Second, it will go faster if we employ as much parallelism as possible.
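As a minimal sketch of the second point, the Python standard library is enough to spread work over partitions of data that is already in memory; analyze_partition and the partitioning scheme are hypothetical placeholders for whatever the real analysis does.

```python
from concurrent.futures import ProcessPoolExecutor

def analyze_partition(rows):
    """Hypothetical per-partition work; each chunk is processed independently."""
    return sum(rows) / len(rows)

def parallel_analysis(data, n_partitions=8):
    # Split the in-memory data into roughly equal partitions...
    size = (len(data) + n_partitions - 1) // n_partitions
    partitions = [data[i:i + size] for i in range(0, len(data), size)]
    # ...and run the analysis on the partitions in parallel.
    with ProcessPoolExecutor() as pool:
        return list(pool.map(analyze_partition, partitions))

if __name__ == "__main__":
    data = list(range(1_000_000))   # stand-in for data already held in memory
    print(parallel_analysis(data))
```

The same shape of solution scales out across servers; the principle of keeping the hot data in memory and working on it in parallel does not change.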

And this means that in-memory technology and big data, whether they like it or not, must play nicely together.


Robin Bloor

About Robin Bloor

Robin is co-founder and Chief Analyst of The Bloor Group. He has more than 30 years of experience in the world of data and information management. He is the creator of the Information-Oriented Architecture, which is to data what the SOA is to services. He is the author of several books, including The Electronic B@zaar: From the Silk Road to the eRoad (a book on e-commerce) and three IT books in the Dummies series on SOA, Service Management and The Cloud. He is an international speaker on information management topics. As an analyst for Bloor Research and The Bloor Group, Robin has written scores of white papers, research reports and columns on a wide range of topics, from database evaluation to networking options and comparisons to the enterprise in transition.

4 Responses to "Big Data and In-Memory: Are They Related?"

  • Michael
    March 29, 2013 - 2:23 pm

    The hidden premise here, “In modeling a problem the analyst may read just a small sample of the data and then interacts with it using various statistical techniques.” is operating under a dated paradigm. If we’re to find the proverbial needle in the haystack, which big data purports to accomplish, you can’t throw out any hay just to make an in-memory system work! Building and teaching models based on years of historical data in-memory can’t be done economically for the majority of use-cases. The solutions are typically a poorly executed hack that involves long-term persistence of data in Hadoop – moving it in-memory in chunks, and back out when results are collected. For complex, row-over-row analytics this doesn’t work. Analyze the data where it rests, and get answers in seconds using any number of native Hadoop analytics tools that run on commodity hardware…

  • Mark Diehl
    April 1, 2013 - 2:41 pm

    A couple years ago, when on a panel at a professional meeting, I astonished many in the audience with a prediction that the average person’s data footprint would exceed 1 TB by 2015. By some estimates, we’re already there, so this topic is timely. The In-Memory approach is certainly one option for the physical solution, but it is not a new approach. Thirty years ago our databases were much smaller, our DBMS’s less capable, as was our storage media. We experimented with data access through personal machines, the IBM PC/XT/AT at the time, and big data in that environment was measured in MB to GB. One technique I used then (does that make it a dated approach??) was to buffer-in large segments of a database from hard disk to what we then called RAM-disk, and do a statistical analysis. I used a Mumps environment with dazzling results – the analysis lines on the monitor would fly by. Dated, yes, but still effective even now given our technology advances. I would suggest that an In-Memory solution, like most others, lies in optimum data design rather than throwing more and faster technology on it.
