I’ve already given my opinion of the term “data science,” but in truth that posting was just a complaint about terminology. I didn’t suggest that data science is not a sensible activity for a business to pursue. In fact, it is a very valuable activity, and perhaps fewer companies are showing an interest in it than ought to be. So here I’ll promote the idea by discussing ten reasons behind the momentum that data science has acquired.
- Improving Hardware Technology: Hardware continues to get faster. It just does, even though we now seem to be heading toward an era when it will be increasingly difficult to get more power from the silicon. According to respected commentators and experts, CPU capabilities will continue to increase for at least another eight years or so. Big data generates big workloads, but as far as I can tell, near-term improvements in hardware will continue to push the boundaries of big data.
- The Open Source Movement: In many ways, big data and the data science that goes with it have been facilitated by the Open Source Movement. Fundamentally, the Open Source Movement is an alternative style and business model for developing useful software tools and capabilities, one that leads to fast and inexpensive adoption. This has had a distinct and welcome impact. In respect of big data and data science, it has made the cost of entry much lower than it would otherwise be, because there have been many open source contributions to this area of IT.
- Hadoop: Hadoop is, of course, an open source product. However, its importance to data analysis is not explained by that alone. Hadoop’s primary contribution is that it provides a scale-out data platform that is schema-free. Because of Hadoop, there is no need to do up-front data modeling work before you collect data, cleanse it, and transform it for use on data science projects. Hadoop’s scalability is, of course, also of great importance.
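The “schema-free” point is often called schema on read: raw records are stored as-is, and structure is imposed only when the data is read. A minimal sketch of the idea, with entirely hypothetical log lines and an ad-hoc parsing pattern standing in for a real Hadoop pipeline:

```python
import re

# Raw, unmodeled records as they might land in a schema-free store.
raw_records = [
    "2014-03-01 12:00:01 GET /index.html 200",
    "2014-03-01 12:00:05 POST /login 302",
    "garbled line with no recognizable structure",
]

# The "schema" is decided at read time, not before the data was collected.
LINE = re.compile(r"(\S+ \S+) (\w+) (\S+) (\d{3})")

def parse(line):
    """Impose structure on read; quietly skip rows that don't fit."""
    m = LINE.match(line)
    if not m:
        return None
    ts, method, path, status = m.groups()
    return {"ts": ts, "method": method, "path": path, "status": int(status)}

parsed = [r for r in (parse(line) for line in raw_records) if r is not None]
print(len(parsed))  # 2 -- the garbled line is simply skipped
```

The design choice this illustrates is that no data-modeling work had to happen before collection; a different analysis could apply a different pattern to the same raw store tomorrow.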
- The Data Analyst Language – R: The R language, another open source contribution, has become the de facto programming language for data analysis. As such, it has become a kind of usage-driven standard that in itself encourages analytical activity, modeling, and prototyping. It now boasts 5 million users, many of them in academia, and it seems to be encouraging analytical investigation in many areas of data.
- Parallelism: Around 2004, CPU manufacturers found they could no longer make chips faster by raising clock speeds, because the chips ran too hot. Consequently, they resorted to adding extra cores to the chips they manufactured, and thus began the “age of parallelism.” Of course, it took time to gather momentum: few software products had ever been built to run in a parallel manner, and few programmers knew how to code for parallelism. The most visible use of parallelism is the MapReduce algorithm that Hadoop runs. However, there are less visible products, such as Actian’s DataRush, SQL Stream and quite a few others, that effectively implement parallel capability. The impact of this is, simply, that we can process more data in less time.
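The map/shuffle/reduce pattern mentioned above can be sketched in miniature. This is an illustrative in-process word count, not Hadoop’s actual implementation: the document list and the thread pool are stand-ins for real input splits and cluster workers (and a real system runs mappers on separate cores or machines, which Python threads alone do not guarantee):

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

documents = [
    "gold in the data",
    "big data big workloads",
    "data science",
]

def map_phase(doc):
    # Map: emit a (word, 1) pair for every word in one input split.
    return [(word, 1) for word in doc.split()]

# Each mapper is independent, which is what makes the phase parallelizable;
# threads stand in for distributed workers here.
with ThreadPoolExecutor(max_workers=3) as pool:
    mapped = list(pool.map(map_phase, documents))

# Shuffle: group every emitted value by its key.
groups = defaultdict(list)
for pairs in mapped:
    for word, count in pairs:
        groups[word].append(count)

# Reduce: sum each key's values (also parallelizable, one reducer per key).
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts["data"])  # 3
```

Because the map and reduce steps never share state, adding more workers scales the job out rather than up, which is exactly the property that lets MapReduce exploit the extra cores.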
- The Cloud: Because of the cloud it is possible to configure and prototype big data projects with very little effort and almost no wait time. Even if you intend to bring the processing back in-house at some later point, the cloud accelerates data analytics projects. There has been some nervousness about data security in the cloud, but since it is usually a simple matter to anonymize data, this is rarely a major stumbling block.
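One common way to make data safe to ship to a cloud environment is to replace identifiers with keyed-hash tokens before it leaves the building. A minimal sketch, assuming a hypothetical customer dataset and a secret key that stays on-premises:

```python
import hashlib
import hmac

# Hypothetical secret; it is kept on-premises and never ships to the cloud.
SECRET_KEY = b"keep-this-on-premises"

def pseudonymize(identifier):
    """Replace an identifier with a stable, non-reversible token (HMAC-SHA256)."""
    return hmac.new(SECRET_KEY, identifier.encode(), hashlib.sha256).hexdigest()[:16]

records = [
    {"customer": "alice@example.com", "spend": 120.0},
    {"customer": "bob@example.com",   "spend": 75.5},
    {"customer": "alice@example.com", "spend": 30.0},
]

# Tokens are consistent, so per-customer aggregation still works in the cloud,
# but the raw identities are not recoverable without the key.
anonymized = [
    {"customer": pseudonymize(r["customer"]), "spend": r["spend"]}
    for r in records
]
print("alice" in str(anonymized))  # False
```

The keyed hash is the important choice: a plain unkeyed hash of a small identifier space can be reversed by brute force, whereas the HMAC is only reversible by whoever holds the key.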
- Embedded Analytical Functionality: One important impact of the enthusiasm for data science is that quite a few databases have added a host of analytical functions and capabilities, so that the database user can include analytical calculations within SQL commands, or by other means. This acts as an enabling capability for analytics. Analytic databases such as 1010data already existed, but this trend broadens the options for fast analytic processing. The point is that the database’s optimizer (if well written) will now optimize the combination of SQL queries and analytical calculations.
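To make the idea concrete, here is a small sketch of an analytical calculation (a moving average, via a SQL window function) expressed inside the query itself, using Python’s built-in SQLite as a stand-in for an analytic database; the table and figures are invented for illustration, and window functions require SQLite 3.25 or later:

```python
import sqlite3

# In-memory database standing in for an analytic store.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (day INTEGER, amount REAL)")
con.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [(1, 100.0), (2, 200.0), (3, 300.0), (4, 400.0)],
)

# The analytical calculation lives inside the SQL command, so the optimizer
# plans the query and the analytics together rather than shipping raw rows
# out to an external tool.
rows = con.execute("""
    SELECT day,
           amount,
           AVG(amount) OVER (ORDER BY day
                             ROWS BETWEEN 1 PRECEDING AND CURRENT ROW)
               AS moving_avg
    FROM sales
    ORDER BY day
""").fetchall()

for day, amount, moving_avg in rows:
    print(day, amount, moving_avg)
```

The alternative, pulling every row into client code and averaging there, forfeits exactly the optimizer co-planning the bullet describes.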
- Machine Learning Algorithms: Machine learning algorithms have been around for quite a while, but their use on large volumes of data was limited by performance constraints. Now that those constraints have diminished, their use has increased. An important impact of this is that they can be employed to simplify the data analyst’s task with a fairly high degree of confidence. As a consequence, it becomes possible for those who have less knowledge of statistics (think business analysts) to engage in useful analytical activity. Machine learning algorithms are now available from many sources, including Mahout and KNIME, both open source products. Actian and Datameer are doing some interesting things in this area as well.
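The “less statistics required” point is that the fitting is automatic: the analyst supplies data and reads off the answer. A toy illustration using the simplest learned model, a least-squares trend line, with made-up monthly sales figures (real toolkits like Mahout or KNIME package far richer algorithms behind the same hands-off interface):

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical monthly sales trending upward.
months = [1, 2, 3, 4, 5]
sales = [10.0, 12.0, 14.0, 16.0, 18.0]

# The model fits itself to the data; no statistical expertise was needed
# to obtain "sales grow by 2.0 per month".
slope, intercept = fit_line(months, sales)
print(round(slope, 2), round(intercept, 2))  # 2.0 8.0
```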
- Machine Generated Data & Data Sharing: Two of the main sources of big data are social network data (Twitter, Facebook, et al.) and machine-generated data (mainly from log files). In the future a major source will undoubtedly be “the internet of things,” which will boost the amount of machine-generated data available by petabytes. Additionally, the advent of Hadoop has made it far easier to manage unstructured data (i.e., non-formally structured data) and combine it with structured data for analysis purposes. In effect, the pool of data that can be analyzed has grown considerably and will certainly continue to grow.
- There Is Gold In The Data: A presenter at a conference I recently attended noted that 80% of big data projects were attacking the same analytical problems that had previously dominated the field of data analytics: trading, banking (fraud), telecoms, pharmaceuticals, retail, web businesses, computer security and so on. The difference is that in these areas companies could now do more detailed analysis and/or get results faster. However, that still leaves about 20% as genuinely new areas of analytic activity, and these are early days. As more success stories emerge, more businesses will be inclined to pursue data science.
So that provides the backdrop to what is currently the hottest area of corporate IT. Data analysts have been working for years to extract gold from data. The combination of the above ten factors simply means that the activity is now strongly enabled by technology: it can be carried out less expensively, on more data than before, and in many instances it will deliver results faster. I expect this trend to run for at least five years, maybe ten.
There is, after all, gold in the data.