Inside Analysis

Data Profiling – Four Steps to Knowing Your Big Data

“Know thy data” is one of the fundamental principles of sound data science [1]. Another name for this is data profiling. The article “Big Data – Naughty or Nice?” listed six foundational concepts of data science [2]. Along with #2 “Know thy data,” the article listed five other data science “commandments”: 1) Begin with the end in mind; 3) Remember that this *is* science; 4) Data is never perfect, but love your data anyway; 5) Overfitting is a sin against data science; and 6) Honor thy data’s first mile and last mile. We expand on data profiling here by elucidating the following four steps toward knowing your data: data preview and selection; data cleansing and preparation; feature selection; and data typing for normalization and transformation.


Data Preview and Selection

Knowledge of your data begins with a thorough preview of the good, bad and ugly parts of your data collection, and it ultimately leads to a decision about which portions of the data set you will select for your data science analysis. This activity includes checking numeric attributes to see if their values are within the expected range (e.g., the minimum and maximum values are sensible for each attribute). It also includes examining the set of unique values for discrete categorical attributes (to see if the values match your expectations) – e.g., if an attribute is an address field, does the value have the format of an address? If the address includes a postal code (ZIP code), is the code in the right format? If the attribute is a class label, is there only one unique spelling for the label, or are there multiple misspellings?

Invoking the simple SQL command “SELECT DISTINCT *” (or its older Oracle synonym, “SELECT UNIQUE *”) can uncover surprising results in real-world databases. Summarizations and aggregations of the data can also be informative (e.g., by applying aggregate functions with the SQL “GROUP BY” clause on your database). For example, with continuous numeric data or discrete non-numeric data values, a data distribution (histogram) plot can be quite revealing: Does the data histogram have a reasonable distribution? Does it have a long tail? Is it symmetric? Furthermore, multiple X-Y scatter plots of many different attribute pairs present simple visual summarizations of the data that can quickly expose outliers, hot spots, trends or even degenerate attributes, in which either all of the entries are the same for a given attribute or else the values for two different attributes are perfectly correlated (e.g., a list of birthdates for a million customers has a perfect one-to-one correlation with the list of customers’ ages). In addition, if possible, performing external checks through cross-database comparisons (where some of the data values may be replicated) can verify the consistency of the data values.
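
As a concrete illustration of this preview stage, here is a minimal sketch in Python with pandas. The file name and column names (customers.csv, age, zip_code, status) are hypothetical placeholders rather than anything from the original article, and the checks shown are only a starting point.

```python
# A minimal data-preview sketch; file and column names are hypothetical.
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("customers.csv")

# Range checks for numeric attributes: are the min/max values sensible?
print(df.describe())

# Unique values of a discrete categorical attribute (like SELECT DISTINCT),
# which exposes misspelled or unexpected class labels
print(df["status"].value_counts(dropna=False))

# Format check for a postal-code field (a 5-digit US ZIP code is assumed here)
bad_zips = df[~df["zip_code"].astype(str).str.fullmatch(r"\d{5}")]
print(f"{len(bad_zips)} rows have malformed ZIP codes")

# Histograms and pairwise scatter plots expose outliers, long tails, skew,
# and degenerate (constant or perfectly correlated) attributes
df["age"].hist(bins=50)
pd.plotting.scatter_matrix(df.select_dtypes(include="number"), figsize=(8, 8))
plt.show()
```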

Data Cleansing and Preparation

After you have selected the attributes for your analytics project, you must prepare and clean those data values for use. Textbook authors frequently tell us that 40-60% of the effort on a data analytics project goes into cleaning and preparation. Most of us who have done this long enough know that such an estimate can be very wrong – in fact, it is not unusual for 90-95% of the time to be spent in data cleaning and prep. This is not “time wasted.” On the contrary, this is time “well invested” – you will develop greater confidence in, and advocacy for, your analytics project results by working with the cleanest possible data. (Note: some projects focus specifically on the outliers – the novel and surprising parts of the database – intentionally searching for anomalous behavior; in these cases, you most emphatically want to be certain that anomalous data values are properties of the objects being studied and not artifacts of your data collection process or data pipeline.)

During the data cleaning and preparation stage, you may be surprised to discover that you will do even more data profiling than in the data preview stage. For example, you will discover many aspects of your data that need to be “handled” in some way (cleaned, removed or “fixed”), including: NULL values, missing values, errors, noise or unexpected data artifacts. Data prep also includes data normalizations and transformations, which are discussed separately below, since those activities often require subject matter and domain expertise (e.g., converting an attribute into specific physical units, creating a new explanatory variable from the ratio of two specific attributes, or transforming an IP address into a geo-location, such as a latitude/longitude pair).
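
As a concrete illustration of this step, here is a short pandas sketch. The sentinel values, column names (churn_label, income, age) and imputation choices are illustrative assumptions, not prescriptions from the article; in practice, the right handling of each “dirty” value depends on your domain.

```python
# A minimal cleaning-and-preparation sketch; names and choices are hypothetical.
import numpy as np
import pandas as pd

df = pd.read_csv("customers.csv")

# Treat common sentinel values as missing before deciding how to handle them
df = df.replace({-999: np.nan, "N/A": np.nan, "": np.nan})

# Drop rows missing the class label; impute a numeric attribute with its median
df = df.dropna(subset=["churn_label"])
df["income"] = df["income"].fillna(df["income"].median())

# Collapse multiple spellings of the same class label into one canonical value
df["churn_label"] = df["churn_label"].str.strip().str.lower().replace({"y": "yes", "n": "no"})

# Flag out-of-range values as suspect instead of silently deleting them, so you can
# later confirm whether they are real anomalies or artifacts of the data pipeline
df["age_suspect"] = ~df["age"].between(0, 120)
```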

Feature Selection

After selecting and then cleaning data attributes for use, it is time to select the subset of attributes that will be used for each question (i.e., data science hypothesis) that you will pose against your data. Different questions require different attributes in order to reach the most accurate answer. At this stage of data profiling, you select the inputs (feature vector attributes) that will be fed into your data science tasks (e.g., predictive analytics, segmentation, recommendations or link analysis). Selecting the most informative and predictive attributes is critical to the success of that activity. The feature vector may contain a very small subset of the total set of data attributes – such parsimonious models are frequently preferred (i.e., if you cannot explain the behavior of your customers with simple, transparent, explainable models, then who will believe a very complex model?).

Some models are necessarily much more complex, such as the recommendation algorithm that won the Netflix $1 million challenge [3], in which case feature selection is even more critical (to avoid excessively bloated models). Nevertheless, it is essential to select different combinations of explanatory variables for different analytics questions (e.g., this is a defining characteristic of the random forests algorithm).
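
For illustration, here is one common way to carry out feature selection in Python with scikit-learn. It is a sketch under assumed names (customers_clean.csv, churn_label) and an arbitrary choice of k = 10, not the approach used in the Netflix-winning model.

```python
# One common feature-selection sketch; file, target and k are hypothetical.
import pandas as pd
from sklearn.feature_selection import SelectKBest, mutual_info_classif

df = pd.read_csv("customers_clean.csv")
X = df.drop(columns=["churn_label"]).select_dtypes(include="number")
y = df["churn_label"]

# Drop degenerate attributes: constant columns carry no information
X = X.loc[:, X.nunique() > 1]

# Keep the 10 attributes most informative about the target, favoring a small,
# explainable (parsimonious) feature vector over a bloated model
selector = SelectKBest(mutual_info_classif, k=10)
X_selected = selector.fit_transform(X, y)
print(list(X.columns[selector.get_support()]))
```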

Data Typing for Normalization and Transformation

Practical use of data science algorithms requires further adjustments of data values, particularly for attributes that carry units (e.g., monetary, physical, temporal, spatial). Converting such attributes to dimensionless values (e.g., dividing by a characteristic value of the attribute) or converting a set of similar attributes (e.g., currencies) to a common unit enables the algorithm to discover real trends and patterns in your data instead of spurious correlations caused by simple inconsistencies in units. Similarly, it is often very convenient and scientifically sensible to use normalized values, for example: scale all numerical values from their min-to-max range onto a 0-to-1 (or 0-to-100) scale; scale values to zero mean and unit variance; convert discrete categorical data into numeric values or ranked lists (which works particularly well with ordinal [ordered] data values); or discretize continuous data into bins. After such scaling, the different attributes are weighted democratically in the model, instead of giving unfair weight to attributes that naturally have large numerical values. This is especially important in distance (or similarity) metric calculations, where one attribute can otherwise dominate and skew the metric unfairly.

Finally, data typing is important when using algorithms that expect a certain type of data input, such as: continuous numeric data for regression models; discrete data values for association or link analysis; binary data for logistic regression; or sequential data for Markov models. Thorough knowledge of your data informs good data science models. Ultimately, data profiling is the best path to “knowing thy data” for your analytics project.
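
These normalizations and transformations can be sketched in a few lines of Python with pandas and scikit-learn. The column names (income, age, size) and the bin count are hypothetical, and the appropriate choices always depend on your domain and on the algorithm the data will feed.

```python
# A sketch of common normalizations and transformations; names are hypothetical.
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer, MinMaxScaler, OrdinalEncoder, StandardScaler

df = pd.read_csv("customers_clean.csv")

# Min-max scaling to a 0-to-1 range, so no attribute dominates a distance or
# similarity calculation simply because its raw values happen to be large
df["income_01"] = MinMaxScaler().fit_transform(df[["income"]]).ravel()

# Standardization: zero mean and unit variance
df["age_z"] = StandardScaler().fit_transform(df[["age"]]).ravel()

# Map an ordinal (ordered) categorical attribute onto ranked numeric values
sizes = [["small", "medium", "large"]]
df["size_rank"] = OrdinalEncoder(categories=sizes).fit_transform(df[["size"]]).ravel()

# Discretize a continuous attribute into bins for algorithms that expect discrete inputs
binner = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile")
df["income_bin"] = binner.fit_transform(df[["income"]]).ravel()
```

One practical design note: fit any such scaler or encoder on the training data only and reuse the fitted transformer on test data, so that the evaluation does not leak information from the data you are trying to predict.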

References

[1] http://www.statisticsviews.com/details/feature/5459931/Five-Fundamental-Concepts-of-Data-Science.html
[2] http://www.datasciencecentral.com/profiles/blogs/big-data-naughty-or-nice
[3] http://www.techdirt.com/blog/innovation/articles/20120409/03412518422/why-netflix-never-implemented-algorithm-that-won-netflix-1-million-challenge.shtml
[4] http://www.dataminingblog.com/standardization-vs-normalization/

Dr. Kirk D. Borne

About Dr. Kirk D. Borne

Dr. Kirk D. Borne is a Transdisciplinary Data Scientist and an Astrophysicist. He is Professor of Astrophysics and Computational Science at George Mason University. He has a B.S. degree in physics from LSU and a Ph.D. in astronomy from Caltech. He has been at Mason since 2003, where he does research, teaches, and advises students in the Data Science program. Previously, he spent nearly 20 years in positions supporting NASA projects, including an assignment as NASA's Data Archive Project Scientist for the Hubble Space Telescope, and as Project Manager in NASA's Space Science Data Operations Office. He has extensive experience in big data and data science, including expertise in scientific data mining and data systems. He is on the editorial boards of several scientific research journals and is an officer in several national and international professional societies devoted to data science, data mining, and statistics. He has published over 200 articles (research papers, conference papers, and book chapters), and given over 200 invited talks at conferences and universities worldwide. In these roles, he focuses on achieving big discoveries from big data through data science, and he promotes the use of information and data-centric experiences with big data in the STEM education pipeline at all levels. He believes in data literacy for all! Learn more about him at http://kirkborne.net/ and follow him on Twitter at @KirkDBorne, where he has been identified as one of the social network’s top big data influencers.


14 Responses to "Data Profiling – Four Steps to Knowing Your Big Data"

  • Geoffrey Malafsky
    February 17, 2014 - 11:30 am Reply

    This is an excellent exposition of the issues and scientific approach. I applaud it and hope people follow its prescriptions as it will greatly help them. There is another side to the challenge though, one that we deal with in Data Normalization, which is the lack of domain (i.e. business) knowledge to adjudicate conflicts in data coming from multiple sources. This is very common in corporate data environments, which is why we concentrate on it with our DataStar Normalization platform. There is a large need for discovery and low-cycle-time adjustments, as visibility leads to awareness and understanding, then to deliberation, decision making, data changes, and end-result modifications. This is more akin to basic science than engineering, the latter doing forward design based on the assumption that the data feeding it is appropriate. In basic science, no such assurance can be made, and thus it is critical to fully document all collection methods, assumptions, settings, and analysis methods so they can be revisited at a later time. For business-oriented uses of Data Science, I believe this is very important, as is the idea that there must be a dramatic lowering of the time from insight to production. This means eliminating the many months spent on changes to data models and data integration with new, more flexible methods.

  • Brand Niemann
    February 18, 2014 - 7:19 am Reply

    The new book Data Science for Business, used at NYU and by more than twenty other universities for programs in nine countries (and counting), in business schools, in computer science programs, and for more general introductions to data science, uses a useful codification of the data mining process given by the Cross Industry Standard Process for Data Mining (CRISP-DM; Shearer, 2000). See: http://en.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining

    Data Understanding and Data Preparation are two of the five steps that follow Business Understanding; Modeling, Evaluation and Deployment are the rest, and all are needed for successful Data Science for Business. I have illustrated these for several real-world business problems in tutorials that are presented at the new Federal Big Data Working Group Meetup (see Web Site Address).

    Data Science for Business also makes the point that Data Understanding means understanding the strengths and limitations of the data, because rarely is there an exact match with the business problem. Some very successful data science for business projects involve the collection of new data (e.g., Capital One) and cases where the data is “standardized/normalized” by semantic technologies like Be Informed (Dutch government applications and our recent Healthcare.gov pilot).

    Data Science for Business concludes:
    If you are a business stakeholder rather than a data scientist, don’t let so-called data scientists bamboozle you with jargon: the concepts of this book plus knowledge of your own business and data systems should allow you to understand 80% or more of the data science at a reasonable enough level to be productive for your business. After having read this book, if you don’t understand what a data scientist is talking about, be wary. There are of course many, many more complex concepts in data science, but a good data scientist should be able to describe the fundamentals of the problem and its solution at the level and in the terms of this book.

    If you are a data scientist, take this as our challenge: think deeply about exactly why your work is relevant to helping the business and be able to present it as such.

    If you can’t explain it simply, you don’t understand it well enough.
    —Albert Einstein

  • Richard Ordowich
    February 19, 2014 - 8:23 am Reply

    Data profiling is a revealing process, but it is not sufficient to understand the data. It is also necessary to look at the structure of the data, as well as its semantics. Together, these three processes provide a comprehensive view of the data and frequently reveal many characteristics that are not evident from profiling alone.

    A more comprehensive examination includes examining the taxonomy and ontology of the data and the data syntax.

    This is Data Literacy, a skill I have discovered is lacking in those who work with data, including data scientists (I am not sure what a data scientist is, except a marketing term).

  • Dr. Kirk D. Borne
    February 19, 2014 - 6:41 pm Reply

    Richard: you are exactly right. I couldn’t get into the semantics of data within the 1000-word limit on my published blog length, but I do teach my students all about the importance and applications of semantics, ontologies, taxonomies, folksonomies, relationships within data, etc. The context and meaning of data are far more important than the data types and range of allowed values. Nevertheless, working with good clean data, plus knowing all about the data you have, is still a very essential starting point for any new analytics project, especially in the era of big data, where you may have little knowledge of all that data. Data Literacy for all!

    • Richard Ordowich
      February 20, 2014 - 7:43 am Reply

      Kirk, I look forward to the new breed of data professionals who are data literate. Most of what I see in industry and government are Data Pushers. Society has created a new occupation that reminds me of underground mining but it is a white collar job, mining data. And like natural resource mining, automation and innovation will reduce the need for these data pushers. However, those who are data literate will have a role to play.

      At the senior ranks such as CTO and CIO, I find a similar lack of understanding of the fundamentals of data, yet these folks claim to be information experts.

      I witnessed a similar lack of rigor in software development and decided to take action, working with a colleague whose PhD thesis was on a new curriculum for computer programming focused on problem solving based on the works of George Polya, rather than on learning a programming language.

      Another area that I suggest needs to be taught is philosophy. It is interesting to read the historical debates and arguments about topics such as identity management. I have found the writings of Luciano Floridi to be particularly helpful.

      Many of the data quality issues we face today are the result of a lack of attention to data design – not data modeling, but the design of data elements, applying techniques such as semantics, taxonomy, etc. Most enumerated reference data I see is inconsistent and incomprehensible: a data mashup.

      Data should tell a story and like a good story, the words, the sentences and the context are critical. I think the best that can be said about most of today’s data is that it is at best graffiti and at worst, gibberish. Using this data for analytics or decision making is unwise and risky.

      This was described in Don Marchand’s book Information Orientation, based on his research into the use of data by senior managers when making decisions. Most managers ignored the data because they did not trust it and because much of it was irrelevant. This flies in the face of those who profess that a data warehouse is a meaningful source of business intelligence: mining data without understanding it and coming up with little to show for it, except of course the occasional random discovery, which is then used to justify the significant investments in technology.

      I suggest required reading for those working with data: Raw Data is an Oxymoron. I enjoyed the quote “all data lies” – something all those who work with data should keep in mind.

      • Dr. Kirk D. Borne
        February 20, 2014 - 8:16 am Reply

        Richard, thanks for those insights. Speaking of data literacy, that quote “all data lies” should be “all data lie”. 🙂 Who wrote that book anyway?

        • Richard Ordowich
          February 20, 2014 - 12:29 pm Reply

          Kirk, sorry for the grammatical error. Grammar was never my strong point and it has come back to haunt me in the data world :). I should not be talking about data literacy when I suffer from this limitation 🙂

          The quote was in the book Raw Data is an Oxymoron by Lisa Gitelman.

          • Dr. Kirk D. Borne
            February 20, 2014 - 5:01 pm

            Great! Thanks! 🙂

  • Dr. Kirk D. Borne
    February 19, 2014 - 6:42 pm Reply

    Brand, thanks so much for your detailed comments. The CRISP-DM standard is an excellent reference!

  • Dr. Kirk D. Borne
    February 19, 2014 - 6:45 pm Reply

    Geoffrey, thanks for your comments. It is definitely true that good reporting and documentation of every step of the data preparation phase is essential. When data are being transformed and/or normalized, the values may bear no resemblance to the input data, and so it is imperative to include metadata (and provenance information) in the process chain at all times.

  • Riddhi Ranjan Dutta
    July 15, 2014 - 5:13 am Reply

    Hello Kirk

    This is an excellent article. It covers all the important aspects of data profiling in a very intuitive manner. Just one observation: during data normalization, some of the data attributes tend to lose sensitivity, primarily because of the range in which they vary. This may have a major impact on the mathematical model that follows the profiling exercise. The scaling exercise, in fact, takes care of this loss of sensitivity for certain attributes. The weights are, in most cases, “ratios”.

    Just a few thoughts. Let me know if this is helpful.

    Best regards
    Riddhi

  • Oscar Tong
    July 15, 2015 - 1:57 pm Reply

    The problem these days is that too many new big-data workers don’t know the difference between data profiling and data discovery. I wasn’t actively aware of this issue until I read an article about it, and ever since, I’ve noticed it all over the place. So my one recommendation is to make sure everyone is well aware of the differences.
