Inside Analysis

From Data Mining to Big Data and Beyond

Author Gregory Piatetsky was briefed by Joydeep Das of Sybase, an SAP company, in a recent Briefing Room entitled Predictive Analytics: A New Wealth of Options.

This article looks at the changing trends in analyzing data and the popularity of different terms, such as statistics, data mining, knowledge discovery in databases (KDD), predictive analytics, data science, and Big Data, and examines the gap between marketing and popular perceptions of these terms.

Learning from observations is a very deep-seated human trait. Our ancestors who knew how to avoid predators and search for food had better chances for survival and left us with an instinct for finding patterns in data. However, the methods that worked well for a small number of lions are not adequate for making inferences from a larger number of observations, and our intuition is not well suited to statistical inference – consider various superstitions, the popularity of astrology, or the belief in a lucky streak in gambling.

More formal methods for learning from data appeared in the 17th century, after Blaise Pascal started to investigate a question about gambling from the Chevalier de Méré. The Chevalier asked Pascal which was more likely: rolling at least one 6 in four rolls of a single die, or rolling at least one double 6 (a total of 12) in 24 rolls of two dice. Pascal found that the first outcome was more likely (51.8% versus 49.1%), and this started a correspondence between Pascal and Pierre de Fermat, which led to the development of the theory of probability. [http://www.teacherlink.org/content/math/interactive/probability/history/briefhistory/home.html]
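The two probabilities quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch in Python, assuming fair dice and independent rolls; the helper name prob_at_least_one is an illustrative choice, not something taken from Pascal's correspondence.

```python
from fractions import Fraction


def prob_at_least_one(p_single: Fraction, trials: int) -> Fraction:
    """Chance that an event with per-trial probability p_single
    happens at least once in `trials` independent trials."""
    # Complement rule: 1 minus the chance that it never happens.
    return 1 - (1 - p_single) ** trials


# Bet 1: at least one 6 in four rolls of a single die.
one_six_in_4 = prob_at_least_one(Fraction(1, 6), 4)

# Bet 2: at least one double 6 in 24 rolls of two dice.
double_six_in_24 = prob_at_least_one(Fraction(1, 36), 24)

print(f"At least one 6 in 4 rolls:         {float(one_six_in_4):.1%}")
print(f"At least one double 6 in 24 rolls: {float(double_six_in_24):.1%}")
```

Both bets use the same complement rule; the second falls just short of even odds because 24 chances at a 1-in-36 event do not quite make up for the event being six times rarer.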

Sir Ronald Fisher (1890-1962) is considered the founder of modern statistics. Some of his contributions include design of experiments, analysis of variance, and randomization testing – which was a good way to avoid finding patterns due to chance (a very common mistake in data mining).
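To illustrate how randomization testing guards against patterns that arise by chance, here is a minimal sketch of a two-sample randomization (permutation) test in Python. It is a modern Monte Carlo variant written for illustration, not Fisher's original formulation; the function name, the default number of shuffles, and the sample values are all assumptions.

```python
import random


def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Estimate how often a random relabeling of the pooled data yields a
    difference in group means at least as large as the observed one."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n_a = len(group_a)
    observed = abs(sum(group_a) / n_a - sum(group_b) / len(group_b))

    pooled = list(group_a) + list(group_b)
    n_b = len(pooled) - n_a
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # random reassignment of values to the two groups
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed:
            extreme += 1

    # Add-one correction keeps the estimated p-value strictly above zero.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical measurements, invented for this example only.
control = [1, 2, 3, 4, 5]
treated = [101, 102, 103, 104, 105]
print(permutation_p_value(control, treated))
```

A small p-value means that random relabelings rarely reproduce the observed difference, so the pattern is unlikely to be due to chance alone; a value near 1 means the apparent pattern is what shuffled data would typically show.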

In the 1960s, statisticians used the terms “data fishing” or “data dredging” to refer to what they considered to be a bad practice of analyzing data without a prior hypothesis.

The term “data mining” appeared around 1990 in the database community. Some started to use “database mining”™ but found that this phrase was trademarked by HNC (now part of FICO) for its short-lived Database Mining Workstation. Other terms used at that time include data archaeology, information harvesting, information discovery, and knowledge extraction.

I coined the term “Knowledge Discovery in Databases” (KDD) for the first workshop on the same topic (1989) [ref], and this term became popular in the academic and research community. However, the term “data mining” became more popular in the business community and in the press.

Figure 1. Statistics and Analytics in the 20th Century, taken from the Google Ngram Viewer (http://books.google.com/ngrams), which allows users to search for words and N-grams in books. This analysis uses only English-language data; other languages would need to be included for a fuller picture.

In 2003, the term “data mining” acquired a bad image because of its association with a U.S. government program called TIA (Total Information Awareness). Headlines such as “Senate Kills Data Mining Program” (ComputerWorld, July 18, 2003), referring to a U.S. Senate decision to close down TIA, reinforced the negative image of data mining.

In 2006, the term “analytics” jumped to great popularity, driven by the introduction of Google Analytics (December 2005) and later by the book Competing on Analytics: The New Science of Winning by Thomas H. Davenport and Jeanne G. Harris (March 2007).

Figure 2. Recent History, according to Google Ngram viewer, search for terms: knowledge discovery, data mining, and analytics with smoothing of zero. Google N-gram viewer is case sensitive, but we used lower-case versions of terms as most representative of the term popularity.

We observe that the term analytics has been in use since 1980 but started to rise in 2005. Use of the term data mining jumps around 1996 (soon after the first KDD conference, KDD-95) but declines after 2003, likely due to the TIA controversy and the term’s association in the popular press with government invasion of privacy. Knowledge discovery appears in 1989, jumps in 1996 (after the first KDD Conference in 1995 and the publication of Advances in Knowledge Discovery and Data Mining, Fayyad, Piatetsky-Shapiro, Smyth, eds., 1996), and plateaus after 2000.

Google N-grams currently only has books until 2008, but we can get more recent data from Google Trends, which measures trends in Google searches for different terms.

 

Figure 3. Google Trends for Data Mining and Analytics

We observe that at the end of 2005, searches for analytics spiked dramatically, at about the same time as Google introduced Google Analytics. We also notice an increase around the summer of 2007, which coincides with the release of Competing on Analytics. Other interesting features include sharp drops around the December vacation time – even dedicated analysts need some time off!

However, analytics is a very broad term and there are many kinds of analytics. Looking deeper, we see that although there are relatively few news references to “Google analytics,” they constitute about 52% of all searches for “analytics.” This is an interesting example of what we can call a marketing gap – the difference between the use of the term in news articles and its popular use.

Figure 4. “Analytics” versus “Google Analytics”

Excluding “Google” from analytics, we see that there are about 2.75 times more searches for various kinds of analytics than for data mining. However, data mining mentions dominate the news reference volume.

Figure 5. News References to Term “Data Mining”

Comparing different types of analytics by search volume, we observe: Business Analytics >> Predictive Analytics > Text Analytics.

Figure 6. Evolution of Terminology

Comparing data mining with business and predictive analytics, we see that searches for data mining are much larger than searches for either business analytics or predictive analytics, even in the last 12 months, but news references tend to use the analytics term more. The gap between searches for predictive analytics and data mining narrows from 1:152 to 1:14.5 when we limit the searches to the last 12 months.

Figure 7. Narrowing of the Gap

Figure 8. 1996 Poster Referencing Data Science

Two of the more recent terms are “data science” and “big data.” Although data science was used as long ago as 1996 (thanks to http://whatsthebigdata.com/2012/04/15/data-science-is-so-1996/ for the poster in Figure 8), only in the last two years has this term become popular, after Jeff Hammerbacher (then at Facebook) and D.J. Patil (then at LinkedIn) reintroduced it in 2009.

The big data term grew explosively starting around 2011. We see that the search volume for data mining is still larger, even in the last 12 months, but big data growth is very steep, especially in the 2012 news cycle.

Figure 9. Increased Use of Term “Big Data”

This is also confirmed by the huge growth of big data social media mentions (1,211% according to Visible Technologies) and by the growing number of industry-oriented meetings focused on big data, such as ACM/KDD, Strata, Data 2.0 Summit, Gartner, GigaOM, IEEE, IEG group, INFORMS, and Predictive Analytics World, as well as conferences from vendors, including EMC, SAS, IBM, and others – see the www.kdnuggets.com/meetings/ page.

In several of these cases, we observe the difference between popularity of a term in the news and its popularity in Google searches – e.g., big data growth in 2012 in the news volume is much stronger than in the search volume. This is probably an indicator that the marketing and branding for this term are ahead of its popular perception. The reverse is true for data mining, where the level of searches is quite strong, although the term is not as frequently used in news releases.

Who’s Doing the Searching?

We can get regional information from Google Trends, but a better source of information is Google Insights. Comparing the terms data mining, business analytics, and big data, we find that data mining has high regional interest in India – not surprising, given the success of analytics outsourcing companies such as Opera Solutions and Mu Sigma, which have most of their analytics talent in India.

However, what is surprising is the list of countries that follow India in data mining interest: Kenya, Sri Lanka, Iran, and Taiwan. Business analytics has high regional interest in India, Singapore, the U.S., Australia, and the UK. Big data has high regional interest in India, South Korea, Singapore, Bulgaria, and the U.S.

The top 10 cities with the highest regional interest in big data are Bangalore, San Francisco, Mumbai, Singapore, New Delhi, New York, Sydney, LA, Toronto, and London.

Figure 10. Regional Interest in Big Data

In summary, we see that the process of analyzing data has been called by many different names, depending on various trends in business and marketing. New trends will emerge, and we can expect that the currently fashionable terms of data science and big data will also be replaced in a few years.

About the author: Gregory Piatetsky-Shapiro, Ph.D., is the President of KDnuggets, which provides analytics and data mining consulting. Previously, he led data mining groups at GTE Laboratories, Knowledge Stream Partners, and Xchange. He has extensive experience in applying analytic and data mining methods to many areas, including customer modeling, healthcare data analysis, fraud detection, bioinformatics, and web analytics, and has worked for a number of leading banks, insurance companies, telcos, and pharmaceutical companies.

Gregory is also the Editor of KDnuggets™ News, the leading newsletter on analytics and data mining, and the Editor of the www.KDnuggets.com site, a top-ranked site for analytics and data mining, covering news, software, jobs, companies, courses, education, publications, and more. Follow http://twitter.com/kdnuggets for the latest updates.

Gregory coined the terms “KDD” and “Knowledge Discovery in Databases” when he organized and chaired the first three workshops on KDD (Knowledge Discovery and Data Mining) in 1989, 1991, and 1993. These workshops later grew into the KDD Conferences (www.kdd.org), currently the leading conference series in the field. Gregory was also a founding editor of the Data Mining and Knowledge Discovery Journal.

Gregory Piatetsky-Shapiro

About Gregory Piatetsky-Shapiro

Dr. Gregory Piatetsky-Shapiro serves as the President and Editor of www.KDnuggets.com, the leading newsletter and Web site for data mining and knowledge discovery, and as an independent consultant focusing on Bioinformatics, CRM, and Business Analytics. Dr. Piatetsky-Shapiro is an internationally recognized expert in data mining and knowledge discovery. From March 1997 to March 2000, he served as a Vice-President and Chief Scientist at Knowledge Stream Partners, a consulting and software development company that specialized in advanced data mining and customer analytics. Dr. Piatetsky-Shapiro is the founder of the KDD Conference series. From 1985 to 1997, he served as a Principal Member of Technical Staff at GTE Laboratories, where he started and led the first Knowledge Discovery in Databases project. He serves as a Director of ACM SIGKDD, a professional organization of data miners. He has been a Member of the Scientific Board of Advisors at Relevant Data Corp since May 2012. He serves as a Member of the Scientific Board of KXEN, Inc. (also known as Knowledge Extraction Engines) and as a Member of the Scientific Advisory Committee of AnVil, Inc. He was a co-founder and past chair of the SIGKDD association.

4 Responses to "From Data Mining to Big Data and Beyond"

  • Kevin Gray
    May 3, 2012 - 5:58 pm

    Dear Gregory,
    Just a small historical note. In the statistical community I clearly recall “data mining” being used as a pejorative and more or less synonymously with “data dredging” and “data fishing” at least as far back as the early 1980’s. The term was associated with “shotgun empiricism.” In the mid-90’s, when I began to hear “data mining” used in the modern way, not surprisingly I was confused!
    Regards,
    Kevin

  • Gregory Piatetsky-Shapiro
    May 30, 2012 - 9:41 am

    Kevin, thanks for the note. I think the amount of data increased sufficiently in the 1990s to enable useful data mining, and people began to realize how to adjust for false discoveries. Also, sometimes we do not have any hypothesis and need the data to tell us what is happening.
