Of course everyone wants faster analytics without the added expense. The primary barrier to this is that big data really can be very big. If you take a large collection of data and drill into it with even relatively simple statistical calculations, it can take a long time. Everything you do to a terabyte or a petabyte of data, for example, will take time and precious resources. To make matters worse, the analytic calculations themselves can also take a lot of time even on small collections of data. And if it takes a long time on a small chunk of data, you can grow old waiting for it to do the same to a large chunk of data.
You may be thinking, “so buy more iron.” Yes, you can head in that direction, but there’s a cost associated with server sprawl that soon gets to be painful. So, what else can you do when the analytics are slow? One solution that naturally comes to mind is sampling. However, there are several issues with this.
The Limitations of Sampling
When you sample data for statistical analysis you only use a fraction of the full data set. You hope that the sample you’ve taken (probably selected randomly) will reflect the true distribution of data values in the set. However, you have no way of knowing this for sure; you only know for sure when you consult all the data.
Mathematically, results from a data sample are stated with a confidence level and a margin of error, and how tight these can be depends on the size of the sample. If you increase the sample, you can tighten them, but even with high confidence levels (say 99 percent) there is a chance that the sample is unrepresentative. The sample thus provides you with an approximation, but you cannot know for sure how precise the approximation is.
There are two specific situations where random sampling may prove particularly problematic. One is where the distribution of data values includes a few outliers. Consider, for example, a set of amounts that tend to vary between $0 and $10,000. If there are just a few in a set of several million with values between, say, $10,000 and $20,000,000, a fairly small random sample is likely to miss them all. Or if, against the odds, it catches one or two, it will give a distorted impression.
A different and far more complex sampling problem arises in the situation where data is joined between two or more tables. Long academic papers have been written about this so we will not dwell on it. However, you cannot simply sample the result of the join because the join itself may have imposed a kind of sampling. If you try to sample each part of the join, you are likely to get results that are non-random. In fact, it isn’t known whether it’s possible to generate a random sample of a join without first evaluating the join tree completely. And if you don’t have a random sample, determining confidence levels can prove to be a headache.
Infobright, a relative newcomer to Big Data analytics techniques, provides an alternative approach to traditional data sampling. It is interesting, innovative and, from the Data Scientist’s perspective, compelling.
The Infobright Stratagem
I had a long and interesting conversation with Rick Glick of Infobright last week, during which he explained in detail how Infobright Approximate Query (IAQ) works and where it can be applied. IAQ is an analytics acceleration technology and, as such, it is not specifically tied to Infobright’s database capability and could work with other database environments. Provided you can relax the requirement for an exact answer, IAQ proves to be a very powerful solution that provides a precise approximation at a rate 100 to 1,000 times faster than traditional scale-out database solutions. It can also be applied to Machine Learning by means of data modeling and distribution.
What is “Precise Approximation”?
Let me point out that the degree of an approximation can vary. If the weatherman says that the temperature is 72 F, it is clearly an approximation. The exact temperature may be 72.17623 F, but the nearest degree is precise enough for the context. There are approximations in IT too. Computers use the Fast Fourier Transform approximation for many calculations (large integer multiplication, filtering algorithms and so on) for the sake of speed. Again, it is a precise approximation that is suitable for the context. This is what we mean by precise approximation. The approximation can be trusted completely within its context of usage but is nevertheless an approximation rather than an exact result.
Traditional Statistical Modeling
Let’s consider a fairly simple problem. We wish to explore a table of 10,000,000 customer records to identify patterns in the data. The usual statistical approach is to take a random sample of the data and investigate that instead. Statistical methods tell us that with a random sample of 16,613 records we can have 99% confidence that our result will be within 1% of the answer we would get if we used all the data.
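For readers who want to see where a figure like this comes from, here is a minimal sketch of the standard sample size calculation for a proportion, with a finite population correction. The function name and its parameters are illustrative, and the inputs are assumptions: a z-score rounded to 2.58 for 99% confidence and the worst-case proportion p = 0.5. With those assumptions it reproduces the sample size quoted above.

```python
def sample_size(population, margin, z=2.58, p=0.5):
    """Sample size needed to estimate a proportion to within `margin`.

    Assumptions (not taken from the article): z = 2.58 is the rounded
    z-score for 99% confidence, and p = 0.5 is the worst-case proportion.
    """
    # Sample size for an effectively infinite population.
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    # Finite population correction for a table of `population` records.
    n = n0 / (1 + (n0 - 1) / population)
    return round(n)


print(sample_size(10_000_000, 0.01))  # 16613
```

Many calculators round up rather than to the nearest whole record, which would give 16,614. The difference is immaterial here.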
That will save significant computer time, but 99% confidence means that 1% of the time the result will be unacceptably wide of the mark. Imagine that we are investigating something simple such as which of our customers use which social media sites (say Facebook, Twitter, Google+, LinkedIn, YouTube, Pinterest, Tumblr, Instagram, Reddit and Flickr).
We should not use such a small data sample for each social media site because the probability of the sample being unrepresentative of the true distribution of values increases with the number of different attributes we examine. We could compensate and increase the sample size, by a factor of 10, say. Then we could have 99% confidence that the answer will be within 0.31% of the answer we’d get using all the data. But doing that would chew up 10 times as much computer power.
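The effect of a tenfold larger sample on the margin of error can be checked with a short calculation. This is a sketch under the same assumptions as the usual textbook formula: a z-score of 2.58 for 99% confidence, a worst-case proportion of 0.5 and a finite population correction. The function name is illustrative.

```python
import math


def margin_of_error(sample, population, z=2.58, p=0.5):
    """Margin of error for a proportion estimated from a simple random sample.

    Assumptions (not taken from the article): z = 2.58 for 99% confidence
    and the worst-case proportion p = 0.5.
    """
    # Finite population correction: sampling without replacement from a
    # table of `population` records.
    fpc = math.sqrt((population - sample) / (population - 1))
    return z * math.sqrt(p * (1 - p) / sample) * fpc


moe = margin_of_error(166_130, 10_000_000)
print(f"{moe:.2%}")  # 0.31%
```

Note the diminishing return: ten times the work buys a margin of error that is only about three times tighter, because the error shrinks with the square root of the sample size.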
Now consider a situation where we want to know about all the 10,000,000 customers, but particularly the 100 that spend the most with our company. If we take a random sample of 16,613 records, there is more than an 80% chance that none of those customers will appear in the sample. Even if we take a random sample of 166,130 records, we would expect only one or two of them to appear, and there is still nearly a 20% chance that we catch none, so we are very likely to get an unrepresentative selection of the 100 we particularly care about. They are outliers. This is an example of the outlier problem we described earlier. When we examine an unfamiliar set of data with traditional data sampling, we can miss very important patterns in the data. Consequently, in complex explorations of data sets, data scientists may take a much larger sample of the data, often as much as 10%. However, that does not necessarily solve the problem.
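How likely is a simple random sample to miss all 100 top customers? The sketch below works it out for both sample sizes discussed so far, assuming records are drawn uniformly at random without replacement. The function name is illustrative.

```python
def prob_miss_all(population, sample, special):
    """Probability that a simple random sample (drawn without replacement)
    contains none of `special` particular records."""
    prob = 1.0
    for i in range(special):
        # Chance that the i-th special record also falls outside the sample,
        # given that the previous ones did.
        prob *= (population - sample - i) / (population - i)
    return prob


for n in (16_613, 166_130):
    expected = 100 * n / 10_000_000  # expected number of top customers caught
    missed = prob_miss_all(10_000_000, n, 100)
    print(f"sample={n:>7,}  expected hits={expected:.2f}  P(miss all)={missed:.1%}")
```

With 16,613 records the chance of missing all 100 comes out at roughly 85%. With 166,130 records it falls to roughly 19%, but the sample is still expected to contain fewer than two of the 100 customers.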
It is also worth noting that assumptions are usually made as to what constitutes a random selection of data. If the assumptions are wrong, then the sampling will provide skewed results. This starts to be a bigger challenge once data volumes grow large. At the multi-terabyte and petabyte level you will not want to read all the data just to establish a random sample because of the time it will take. However, if you just read a portion of the data, the data distribution within it may be skewed simply because of the way the data was captured.
Sampling is by no means as cut and dried as it may seem if you just read the textbooks.
Infobright’s Statistical Modeling
Let us now switch to what Infobright does and why it is different. Infobright Approximate Query (IAQ) is a database engine that models data by condensing data records of the same type into large chunks. When it does this, it holds histograms of the data values for each attribute (column) in the record. Think of the example we gave of customers and social networks. For each of the social network columns (value Y or N) Infobright holds a count (a two value histogram). For more variable attributes Infobright holds histograms for ranges. So for surnames it might hold a histogram that counted surnames beginning with A, surnames beginning with B and so on. For dates of birth it might base its histogram on ranges by year. For monetary amounts it might hold ranges of dollars, and so on.
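To make the idea concrete, here is a toy sketch of histogram-based modeling. It is not Infobright's implementation. The chunk size, bucket width, column names and function names are all hypothetical, and the range query assumes that values are spread evenly within each bucket.

```python
from collections import Counter


def build_model(records, chunk_size, amount_bucket):
    """Condense records into chunks, keeping only per-column histograms."""
    model = []
    for start in range(0, len(records), chunk_size):
        chunk = records[start:start + chunk_size]
        model.append({
            "rows": len(chunk),
            # Category column (Y or N): exact count per value.
            "facebook": Counter(r["facebook"] for r in chunk),
            # Numeric column: count per fixed-width range of dollars.
            "amount": Counter(int(r["amount"] // amount_bucket) for r in chunk),
        })
    return model


def count_facebook(model, value):
    """Exact answer: category histograms lose no information."""
    return sum(chunk["facebook"][value] for chunk in model)


def approx_count_amount_at_least(model, threshold, amount_bucket):
    """Estimate how many rows have amount >= threshold, assuming values
    are spread evenly within each bucket."""
    total = 0.0
    for chunk in model:
        for bucket, n in chunk["amount"].items():
            low = bucket * amount_bucket
            high = low + amount_bucket
            if low >= threshold:
                total += n  # bucket lies entirely above the threshold
            elif high > threshold:
                # Bucket straddles the threshold: take a proportional share.
                total += n * (high - threshold) / amount_bucket
    return total


# Made-up data: 10,000 customers with a Y/N flag and an amount from 0 to 999.
records = [
    {"facebook": "Y" if i % 3 == 0 else "N", "amount": float(i % 1000)}
    for i in range(10_000)
]
model = build_model(records, chunk_size=2_500, amount_bucket=100)
print(count_facebook(model, "Y"))
print(approx_count_amount_at_least(model, 950, amount_bucket=100))
```

Queries are answered from the four chunk summaries alone, without revisiting the 10,000 records. Every record contributes to some bucket, so a rare high value still shows up as a small count in a high range instead of being missed.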
This mechanism provides it with a stunningly fast and remarkably accurate way of modeling the data. Instead of storing the atomic data, it generates statistical models based on all of the data, which take up relatively little space. This makes it fast. What helps its accuracy is that there are many attributes: gender, state, qualifications, shoe size, nationality and so on, which are just categories for which Infobright will hold a completely accurate distribution of values. For most other attributes, like surname, date of birth, cost, price and so on, Infobright’s histogram will accurately reflect the distribution of data values.
If you run a Machine Learning algorithm using IAQ, for example some form of Linear Regression such as Lasso Regression or Ridge Regression, you will not get an exact answer, since for some attributes the IAQ histograms give only an approximation of the distribution of values. Nevertheless, it is a precise approximation. If, for example, there are a few outliers, their presence will be reflected in the model.
An important detail of Infobright’s approach is that it provides an answer at least 100 times faster than if you read all the data directly, and up to 1,000 times faster. Consider our simple sampling example, where just 16,613 records are read. At the 1,000-fold rate, Infobright will produce a much more accurate model of the whole 10,000,000 records in the time it takes to read about 10,000 records, which is less than two thirds of the time the sample needs. Compared with a 10% sample (1,000,000 records), IAQ at that rate will provide a more accurate result in about one hundredth of the time.
If you consider the problem of sampling joined data that we described above, it is not a problem at all with Infobright, because it isn’t a sampling technique, it’s a precise approximation technique.
The Bottom Line
This is what makes Infobright so compelling. It is far faster than traditional statistical modeling while at the same time being far more accurate. In some situations, depending on what kind of analysis is being performed, its results will be as precise as is needed, and using all the data will cost more, take longer and fail to improve the result. In other situations, a full data scan may eventually be necessary. Either way, Infobright boosts the productivity of the Data Scientist.