A Data Science Rant

What is data science? From the hype in the IT press right now, you might think that it is something excitingly new, destined to determine the future prosperity of a whole swathe of companies big and small.

Exciting it might be. New it is not.

Let’s cut to the chase here. This is a philology rant, not a technology rant. If you are already tired of the term “big data,” but not yet tired of the term “data science,” let me help you get there as swiftly as possible.

There is no such thing as “data science.” It is a solecism. These two words, when conjoined, are utterly misleading. We can start with the word “science.” Science is the systematic study of the world through observation and experiment. The scientific method is well known and reasonably well understood by most people. It involves inquiry based on empirical and measurable evidence. Scientists formulate theories on the basis of their observations, and they test them empirically. If the evidence supports whatever assertion they made, it becomes a credible theory – and it usually has predictive value.

If you’re asking, “Hey, isn’t that what a ‘data scientist’ does?” then in fact it may be exactly what someone with the newly minted title does. But still, there is no such thing as “data science.”

Science studies a particular domain, whether it be chemical, biological, physical or whatever. This gives us the sciences of chemistry, biology, physics, etc. Those who study such domains will gather data in one way or another, often by formulating experiments and taking readings. In other words, they gather data.

If there were a particular activity devoted to studying data, then there might be some virtue in the term “data science.” And indeed there is such an activity, and it already has a name: it is a branch of mathematics called statistics. It doesn’t need a name upgrade, or if it does, we should call it Statistics 2.0.

In the IT industry we are used to marketeers bending, folding, spindling and generally mutilating our language. That’s what they often do; in fact, that’s what they’re paid to do, and some do it well. In response we are supposed to discover the real meaning behind the words and behave rationally, despite the communication travesty.

I can live with that. What I have problems with is when the sabotage of meaning gets so egregious that people are very liable to misunderstand and subsequently exhibit suboptimal or even outright foolish behavior because of it.

8 Steps for Getting it Straight

So here’s a shot from the hip:

  1.  There is nothing new at all about what is being called “data science.” It is the application of statistics to specific activities.
  2.  We name sciences according to what is being studied, and the behavior involved is (or should be) along the lines of the scientific method. If what is being studied is business activity, and that’s usually the case, then it is not “data science,” it is business science. It is a language standard.
  3.  This statistical activity is identical to what we also call data analysis.
  4.  If you are interested in trying to work out what the budget for such activity should be, then you should not be thinking in terms of the usual ROI metrics that IT often relies upon. This scientific business activity is what we have known for decades as R&D – research and development of the business. The amount of money devoted to R&D is pretty much always a board-level decision. If someone wants to initiate data analysis activity within a business, then they should talk to the board about the business’s approach to R&D.
  5.  None of this depends upon whether the data is big or not. Of course, if the data is big, then the IT resources required to carry out the scientific activity will be more expensive.
  6.  You will not be hiring a “data scientist” to carry this out. Here’s why: the combination of skills required to carry out these business science projects rarely resides in one person. Someone could indeed have attained extensive knowledge in the three areas of what the business does, how to use statistics, and how to manage data and data flows. If so, he or she could indeed claim to be a business scientist (a.k.a. “data scientist”) in a given sector. But such individuals are almost as rare as hen’s teeth.
  7.  If you wish to develop such a capability, the sensible way to proceed is to put together a multi-disciplinary team of individuals with a set of well-defined goals who collectively possess the required skills. The one in charge should have a title like Project Director or Research Director. He is not obliged to wear a white coat.
  8.  In some organizations, the results of R&D are poorly implemented. This is an organizational problem. Think Xerox and Xerox PARC – truly great research leading to truly great products, except Xerox didn’t actually make those products. If you carry out wonderful business science and it doesn’t get implemented, it eventually will be. By your competitors.

I have said enough.

Robin Bloor

About Robin Bloor

Robin is co-founder and Chief Analyst of The Bloor Group. He has more than 30 years of experience in the world of data and information management. He is the creator of the Information-Oriented Architecture, which is to data what the SOA is to services. He is the author of several books, including The Electronic B@zaar, From the Silk Road to the eRoad, a book on e-commerce, and three IT books in the Dummies series on SOA, Service Management and The Cloud. He is an international speaker on information management topics. As an analyst for Bloor Research and The Bloor Group, Robin has written scores of white papers, research reports and columns on a wide range of topics, from database evaluation to networking options and comparisons to the enterprise in transition.


31 Responses to "A Data Science Rant"

  • Geoffrey Malafsky
    August 12, 2013 - 9:32 am Reply

    Interesting, enjoyable, true. Reminds me of what should be said in the first class of undergraduate “Science for non-scientists” which may unfortunately describe IT and data mgmt in their entirety.

  • Wayne Kurtz
    August 12, 2013 - 9:48 am Reply

    Robin,
    I couldn’t agree with you more. I also question if what Data Scientists do is really science. This is not to say the fruits of their labor may not be effective in increasing the competitiveness of the enterprises that employ them but I wonder if they really practice science. For example are their findings peer reviewed? Are their conclusions repeatable? Maybe, but these are two essential characteristics for an endeavor to be considered scientific. I think the nature of competitive commerce (not that scientific organizations aren’t competitive, they sure are!) would tend to inhibit the open publishing (with business results as proof of theory) needed to establish a “scientific” body of knowledge about “data”.

  • John Amiry
    August 12, 2013 - 10:59 am Reply

    Thank goodness someone understands the “emperor’s new clothes”.

    Thank you Robin.

    jsa

  • Arun
    August 12, 2013 - 11:11 am Reply

    While I generally agree with the essay, I don’t think machine learning is statistics; in general this role has to go beyond statistics. The science part comes about if and only if part of the job is validating that the results are real. Finding the ground truth (in the remote sensing connotation of the term) is a scientific job.

  • Dean Abbott
    August 12, 2013 - 11:26 am Reply

    Thanks for the rant! I certainly agree there is nothing new here and you expressed this very well.

    The only quibble I have is characterizing Data Science as Statistics 2.0. Data Science tends away from statistics and more toward machine learning in my opinion. If you intended the “statistics” to be a catch-all for analytics, then ignore my comment. If statistics was intended to convey…statistics… then proceed!

    There is a sense in the data science community (from what I read) that big data and data science (which often go hand-in-hand) are driven by the data through induction rather than from a priori model assumptions. This is more of a machine learning than statistics approach to learning from data. One of my Five Predictive Analytics Pet Peeves (from my Predictive Analytics World talk this calendar year) relates to this very issue: the belief that data scientists only need lots of data. They don’t need domain expertise because the data speaks for itself. This kind of inductive modeling is the hallmark of techniques such as decision trees, naive Bayes, and even neural networks or SVMs rather than regression. After all, with 100M records, everything is “statistically significant”, no?

    Thanks again for the rant. I’ve already tired of “big data”, and on my Twitter account my description is “can be called a Data Scientist apparently, but choose not to be”, so I think we are simpatico on this one.

  • RnD
    August 12, 2013 - 11:44 am Reply

    Absolutely agree.
    When I try to explain to sales/marketing that what we do is measured by the economic value of information, I often get blank stares, or “that’s not a product we know how to sell”.
    Many people just want to know if we can make a ‘dashboard’ that will predict the future, and make me want to jump off a ledge.

    • Doug Laney
      September 2, 2013 - 9:05 pm Reply

      Gartner’s actual research on the role shows that it is quite distinct from statisticians or BI analysts: http://blogs.gartner.com/doug-laney/defining-and-differentiating-the-role-of-the-data-scientist/.

      Very impressed though that you’re traveling down the path of quantifying information’s value. Gartner’s “infonomics” research has resulted in valuation models for our clients to measure information’s economic benefits, and a set of principles/disciplines for managing information as an actual (recognized) corporate asset. If interested in more on this see http://en.wikipedia.org/wiki/Infonomics for links to papers, articles, and the Infonomics LinkedIn group. Feel free to reach me via Twitter. –Doug Laney, VP Research, Gartner, @doug_laney

  • Cynthia Hauer
    August 12, 2013 - 12:25 pm Reply

    Robin, I agree with you in principle, I do (and your rant WAS entertaining! I won the rant contest at EDW in May of this year addressing “why Big Data isn’t Big at all”, so you are preaching to the choir when it comes to my perspectives and beliefs). However, what is true is that many data managers do NOT interpret, analyze, characterize, or categorize *DATA* – they see it as an entity that is composed of bits and bytes that have to be moved down a fiber optics cable or in/from a cloud. As much as I don’t like the emphasis or the phrase “data science”, I like the clarification that it’s not just data in an application; it’s not just data in a system of record. It is data relevance, data context, metadata relationships – and sadly, not just tools. There is talent and effort required and exerted to use data correctly – and to provide it successfully to others who use it. So maybe the term “data science” conveys some additional responsibilities, knowledge, and commitment to the DM discipline and “catalog” – and we do need that.

  • Rob Klopp
    August 12, 2013 - 12:56 pm Reply

    Completely agree, Robin. I might add that “science” requires peer review of results. Many data science projects provide results that are not repeatable at best… skewed at worst.

  • Robin Bloor
    August 12, 2013 - 1:55 pm Reply

    In response to some points made above:

    Machine learning (is not statistics): I think of Machine Learning as brute force statistics, in many of the techniques used. However some techniques, neural networks for example, go beyond the normal field of statistics. So, it’s a fair point.

    Science requires peer review: I agree, and I suspect that in time “data science” will adopt this as a necessary part of the activity. It is an auditing process, but no doubt it will be given a confusing name.

    Additional Responsibilities: Data management is, IMO, truly in the thick of this activity, but the appropriate data management discipline for this has yet to be well defined. It is not your father’s DM, because it requires proper data audit trails, metadata audit trails and even statistical model audit trails.

  • Sandy Steeir
    August 12, 2013 - 2:25 pm Reply

    Hear! Hear!

  • Dorothy Hewitt-Sanchez
    August 12, 2013 - 2:49 pm Reply

    So True and well said

    • Dorothy Hewitt-Sanchez
      August 16, 2013 - 10:19 am Reply

      This is only my opinion. I am not the expert.

      Data Science is the merging of computer science and applied mathematics to fulfill business requirements and reveal the present and future direction of the company/product.

      A Data Scientist is someone who has expertise in both fields, combined with an understanding of the business requirements, and who uncovers patterns in the current and future path the company/product is traveling. They uncover corrective paths, new markets, and new plans with mathematical formulas to increase ROI.

      I think Data Science is a true field. Data Scientist – maybe-yes

      The data scientist craze may be short-lived because vendors are building these types of models into the services they provide. So, I will wait and see what will happen. Not to be a wet blanket, but any field that requires so many skilled professionals is going to be automated as much as possible. However, I think data scientists are still needed, but maybe not at the level of demand that some experts are predicting. New applications and database features are being added daily to counteract the shortage of skills. I hope the demand is real because Data Science is an exciting new field.

  • Wayne Kurtz
    August 12, 2013 - 4:03 pm Reply

    I would like to return to one aspect of my earlier comment: repeatability. Typically when a scientist reaches a conclusion and claims that it supports a hypothesis, he/she does so by manipulating one variable across a well-defined range of values, holding as many other variables as constant as possible, and observing a change in the behavior of the observed subject. Will data scientists be able to do this? Ultimately any advice to management based on data science conclusions will need to pass this test for the real value of data science to be realized. Especially if the action prescribed is automated, and I agree this is the ultimate objective of DS at this point. Otherwise why bother with the added cost of DS? Isn’t DS’s true goal to allow the CEO to trust decisions made by machines after they learn the “real world” contingencies, and ultimately release humans from having to grapple with large sets of data that we are not “designed” to handle?

    • Robin Bloor
      August 12, 2013 - 11:02 pm Reply

      Well yes, repeatability is, as you suggest, an issue. But with data analysis, it may not be so simple, because the data scientist is also tasked with identifying trends, and trends don’t persist forever. So the analogy between real science and data science fails to hold for such analytical conclusions. The data analyst builds a model, the model may be implemented, but its efficacy may not persist indefinitely. So there is a need to regularly audit the model to see if the basis of it persists. You could think of this as testing for repeatability.

      You have surfaced a possible distinction between science and data science. Thanks.

  • John Ternent
    August 15, 2013 - 8:32 am Reply

    Great post, Robin. It’s interesting that just because we use new tools, methodologies, algorithms, and even domains, we seem to be infatuated with changing the core of the analytic profession. I’m pretty sure farming, manufacturing, construction, and just about every other profession have evolved over time to use new tools, techniques, and methodologies, but a farmer is still a farmer and a neurologist is still a neurologist regardless.

  • Sean Golliher
    August 15, 2013 - 10:04 am Reply

    You’re correct that it’s not a science. I consider it just a job title: “data scientist”. I also agree that people shouldn’t trample on the work of statisticians. However, you’re not correct in that it’s just pure statistics. It requires an understanding of many topics from computer science: algorithms, machine learning, programming, data at scale, etc. – therefore a pure statistician won’t cut it.

    Realistically you’ll need a team of people to accomplish the tasks. Having a data scientist oversee all of it may be ideal. Ultimately we are trying to derive insight from data and develop some type of predictive model. If a predictive model can’t be derived telling us “what to do next” then there is no point in the work. Predictive analytics is another interesting/similar topic and there are conferences for it as well.

    Big data is an engineering/architecture problem and the differences between it and “data science” are clear. So I don’t think bringing that into the discussion makes sense.

    The type of person that understands all these topics is very rare and, therefore, they are highly recruited. There are many latching on to the title but very few really exist. A better title might be “data engineer”. However, as you described, data engineer isn’t sufficient for the desired hype.

    There is an overlap of skills here and the need for the position is real. It’s unfortunate that the title wasn’t thought about more carefully. At this point the job title is probably here to stay for a while.

    Here is a very complete list of other titles.

    http://www.datasciencecentral.com/profiles/blogs/job-titles-for-data-scientists

    • Sean Golliher
      August 15, 2013 - 10:30 am Reply

      I also found this interesting. There’s a “National Consortium for Data Science” with universities participating and trying to define what this field actually is: http://data2discovery.org/

  • Jon Bloom
    August 16, 2013 - 8:02 am Reply

    Agree. Data Science is a new term applied to existing role(s): Report Writer, ETL, Business Intelligence, Statistician, Business Analyst. The role needs a holistic understanding of the business, the technology and the analytics. The speed of change right now makes it impossible for a single person to know everything. Perhaps a “Chief Data Officer”, with a team of full-timers, contractors, and matrixed people from other internal departments, to solve real-world problems, find insights, find cost savings, increase the customer base and provide value to the org. The ecosystem is based on data.

  • Carla Gentry (@data_nerd)
    August 19, 2013 - 6:55 am Reply

    Data Science is real and it’s not just two words put together -> The first known modern reference to the term ‘data science’ to mean as such was a paper by William Cleveland from Bell Labs in 2001: http://cm.bell-labs.com/cm/ms/departments/sia/doc/datascience.pdf

    There is a lot more science going on than you are stating. Many years of college and a logical or mathematical background make a great foundation for Data Science, but I do agree, this is nothing new. I’ve been at it for 15+ years and don’t see many names from the ASA above. Data Science is more than just analysis and statistics; it is understanding data structures, understanding how to clean data and glean insight…. I can go on and on but hope you read some more information on Data Science before totally ranting on a field that has been bread and butter for a lot of talented people for decades. Have a terrific week!

    • Doug Laney
      September 2, 2013 - 8:57 pm Reply

      Well put, Carla. Those who rant about the term seem not to have actually researched it. –Doug Laney, VP Research, Gartner, @doug_laney

  • Meta Brown
    August 19, 2013 - 12:51 pm Reply

    Nice rant!

    You’ve actually been quite restrained with your criticism. It’s distressing to me to see how often those who come into data science roles from the programming side end up reinventing the statistics wheel. Of course, their new, superior wheel might not be round…

    And you’re so right about using teams! I expressed similar sentiments here:

    It’s a Bird! It’s a Plane! No, It’s Just a Data Scientist.
    http://bit.ly/smartdata024

    Meta Brown

  • MG
    August 27, 2013 - 8:50 am Reply

    “Data science” is to me a practical way of gathering under a common name at least two research communities: machine learning and data mining on one side, and databases/data management on the other. Anyone who has been reading VLDB, KDD, IEEE-PAMI and so on for 15 years, who publishes in all of these communities, and who examines the overlap between the topics in the CFPs and the proceedings cannot just dismiss it as “fashionable statistics”. There is no strict boundary to all this, but there is a consistent area there (there has been for ages, but nicely it now has a name, whatever the limitations of this name).

    The guy who is working on LSH, for instance, is that a database guy? A statistics guy? An information retrieval guy? Oh dear….

  • Doug Laney
    September 2, 2013 - 8:56 pm Reply

    Good thing there are some of us out here doing actual research on the term “data scientist” (ostensibly using a bit of data science). There is a *clear* distinction between “data scientists”, “BI analysts” and “statisticians” according to Gartner’s analysis of hundreds of job postings for each of the three. See a summary of this research: http://blogs.gartner.com/doug-laney/defining-and-differentiating-the-role-of-the-data-scientist/ –Doug Laney, VP Research, Gartner, @doug_laney

  • Hahnemann Ortiz
    September 12, 2013 - 10:32 am Reply

    Although it is true that statistical methods are used by data scientists, a data scientist, depending on the problem, will apply algorithms or even machine learning techniques that have nothing to do with statistics, so you can’t label a data scientist as a statistician for this simple reason.

    The word “science” is used depending on academic context. “Political Science”, for example, studies states, governments, and of course politics. So the reality is that a data scientist studies data, but for different reasons than a data analyst or statistician. However, data science is not considered an academic discipline (yet), although universities have started data science courses, and journals and conferences exist.

    I think it’s a matter of time before academics get together and decide how to correctly label data science and which college it should go under. My vote goes to Computer Science.
