“Know thy data” is one of the fundamental principles of sound data science.[1] Another name for this practice is data profiling. The article “Big Data – Naughty or Nice?” listed six foundational concepts of data science.[2] Along with #2, “Know thy data,” the article listed five other data science “commandments”: 1) Begin with the end in mind; 3) Remember that this *is* science; 4) Data is never perfect, but love your data anyway; 5) Overfitting is a sin against data science; and 6) Honor thy data’s first mile and last mile. We expand on data profiling here by elucidating four steps toward knowing your data: data preview and selection; data cleansing and preparation; feature selection; and data typing for normalization and transformation.
Data Preview and Selection
Knowledge of your data begins with a thorough preview of the good, bad and ugly parts of your data collection, and it ultimately leads to a decision about which portions of the data set you will select for your data science analysis. This activity includes checking numeric attributes to see if their values are within the expected range (e.g., whether the minimum and maximum values are sensible for each attribute). It also includes examining the set of unique values for discrete categorical attributes, to see if the values match your expectations. For example, if an attribute is an address field, does the value have the format of an address? If the address includes a postal code (ZIP code), is the code in the right format? If the attribute is a class label, is there only one unique spelling of the label, or are there multiple misspellings?
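To make these checks concrete, here is a minimal Python sketch using pandas. The file and column names (customers.csv, class_label, zip_code) are hypothetical stand-ins, not part of the original article:

```python
import pandas as pd

# Load the data to be profiled (file name is hypothetical).
df = pd.read_csv("customers.csv")

# Range check: are the min and max of each numeric attribute sensible?
print(df.describe())

# Unique-value check on a discrete categorical attribute, e.g., to spot
# multiple misspellings of what should be a single class label.
print(df["class_label"].value_counts(dropna=False))

# Format check on a postal code: five digits (US ZIP code pattern).
bad_zips = df[~df["zip_code"].astype(str).str.fullmatch(r"\d{5}")]
print(f"{len(bad_zips)} rows have a malformed ZIP code")
```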
Invoking the simple SQL query “SELECT DISTINCT *” can uncover surprising results in real-world databases. Summarizations and aggregations of the data can also be informative (e.g., by executing aggregate functions with the SQL “GROUP BY” clause on your database). For example, with continuous numeric data or discrete non-numeric data values, a data distribution (histogram) plot can be quite revealing: Does the data histogram have a reasonable distribution? Does it have a long tail? Is it symmetric? Furthermore, multiple X-Y scatter plots of different attribute pairs present simple visual summarizations of the data that can quickly expose outliers, hot spots, trends or even degenerate attributes, in which either all of the entries are the same for a given attribute or else the values for two different attributes are perfectly correlated (e.g., a list of birthdates for a million customers has a perfect one-to-one correlation with the list of those customers’ ages). In addition, if possible, external checks through cross-database comparisons (where some of the data values may be replicated) can verify the consistency of the data values.
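In the same spirit, a short Python sketch (again assuming a pandas DataFrame named df; the purchase_amount column is a hypothetical example) can surface degenerate attributes and distribution problems:

```python
import pandas as pd

df = pd.read_csv("customers.csv")  # hypothetical file name

# Degenerate attributes: every entry is the same value.
constant_cols = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]
print("Constant attributes:", constant_cols)

# Perfectly correlated numeric attribute pairs (e.g., birthdate vs. age).
corr = df.corr(numeric_only=True).abs()
pairs = [(a, b) for a in corr.columns for b in corr.columns
         if a < b and corr.loc[a, b] > 0.999]
print("Near-perfectly correlated pairs:", pairs)

# Distribution check: is the attribute symmetric, or long-tailed?
print("Skewness:", df["purchase_amount"].skew())
df["purchase_amount"].hist(bins=50)  # histogram plot; requires matplotlib
```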
Data Cleansing and Preparation
After you have selected the attributes for your analytics project, you must prepare and clean those data values for use. Textbook authors frequently inform us that 40-60% of the effort on a data analytics project goes into data cleaning and preparation. Most of us who have done this long enough know that such an estimate can be very wrong; in fact, it is not unusual for 90-95% of the time to be spent on data cleaning and prep. This is not time wasted. On the contrary, it is time well invested: you will develop greater confidence in, and advocacy for, your analytics results by working with the cleanest possible data. (Note: some projects focus specifically on the outliers, the novel and surprising parts of the database, intentionally searching for anomalous behavior. In those cases, you most emphatically want to be certain that anomalous data values are properties of the objects being studied and not artifacts of your data collection process or data pipeline.)
During the data cleaning and preparation stage, you may be surprised to discover that you will do more data profiling than in the data preview stage. You will discover many aspects of your data that need to be “handled” in some way (cleaned, removed or “fixed”), including: NULL values, missing values, errors, noise and unexpected data artifacts. Data prep also includes data normalizations and transformations, which are discussed separately below, since those activities often require subject matter and domain expertise (e.g., converting an attribute into specific physical units, creating a new explanatory variable from the ratio of two attributes, or transforming an IP address into a geolocation, such as a latitude/longitude pair).
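A minimal cleaning sketch in Python (pandas again; the sentinel value, imputation choices and column names shown are illustrative assumptions, not prescriptions):

```python
import numpy as np
import pandas as pd

df = pd.read_csv("customers.csv")  # hypothetical file name

# Profile the damage first: how many missing values per attribute?
print(df.isna().sum())

# Treat sentinel values (e.g., -999 or empty strings) as missing.
df = df.replace({-999: np.nan, "": np.nan})

# Impute a numeric attribute with its median (one of many possible choices)...
df["income"] = df["income"].fillna(df["income"].median())
# ...and drop rows whose class label is missing, since a label cannot be guessed.
df = df.dropna(subset=["class_label"])

# Example domain transformation: a new explanatory variable built from
# the ratio of two existing attributes (names are hypothetical).
df["debt_to_income"] = df["debt"] / df["income"]
```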
Feature Selection

After selecting and then cleaning the data attributes, it is time to choose the subset of attributes that will be used for each question (i.e., each data science hypothesis) that you pose against your data. Different questions require different attributes in order to reach the most accurate answer. At this stage of data profiling, you select the inputs (feature vector attributes) that will be fed into your data science tasks (e.g., predictive analytics, segmentation, recommendations or link analysis). Selecting the most informative and predictive attributes is critical to the success of that activity. The feature vector may contain a very small subset of the total set of data attributes; such parsimonious models are frequently preferred (i.e., if you cannot explain the behavior of your customers with simple, transparent, explainable models, then who will believe a very complex model?).
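One way to rank candidate features is sketched below with scikit-learn’s mutual-information scorer. This is one of many possible selection methods, not one the article prescribes, and the feature and target names are hypothetical:

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

df = pd.read_csv("customers.csv")  # hypothetical file name

# Candidate features and target; assumes numeric, non-missing values
# (see the cleaning step above). All names are hypothetical.
feature_cols = ["income", "age", "debt_to_income", "num_purchases"]
X = df[feature_cols]
y = df["churned"]

# Score each candidate attribute by its mutual information with the target;
# higher scores suggest more informative, more predictive features.
scores = mutual_info_classif(X, y, random_state=0)
for name, score in sorted(zip(feature_cols, scores), key=lambda t: -t[1]):
    print(f"{name}: {score:.3f}")
```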
Some models are necessarily much more complex, such as the recommendation algorithm that won the $1 million Netflix Prize challenge,[3] and in those cases feature selection is even more critical (to avoid excessively bloated models). Nevertheless, it is essential to select different combinations of explanatory variables for different analytics questions (e.g., this is a defining characteristic of the random forests algorithm).
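For instance, a random forest considers a different random subset of features at each split; in scikit-learn this behavior is controlled by the max_features parameter (a brief sketch, reusing the hypothetical X, y and feature_cols defined above):

```python
from sklearn.ensemble import RandomForestClassifier

# Each split in each tree considers only a random subset of the features
# (here, the square root of the feature count), so different trees explore
# different combinations of explanatory variables.
forest = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                                random_state=0)
forest.fit(X, y)  # X, y as defined in the feature-selection sketch above
print(dict(zip(feature_cols, forest.feature_importances_)))
```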
Data Typing for Normalization and Transformation
Practical use of data science algorithms requires further adjustment of data values, particularly for attributes that carry units (e.g., monetary, physical, temporal, spatial). Converting such attributes to dimensionless form (e.g., dividing by a characteristic value of the attribute) or converting a set of similar attributes (e.g., currencies) to a common unit enables the algorithm to discover real trends and patterns in your data instead of spurious correlations caused by simple inconsistencies in units. Similarly, it is often convenient and scientifically sensible to use normalized values, for example: scale all numeric values from their min-to-max range onto a 0-to-1 (or 0-to-100) scale; scale values to zero mean and unit variance;[4] convert discrete categorical data into numeric values or ranked lists (which works particularly well with ordinal [ordered] data values); or discretize continuous data into bins.

When attributes are scaled in these ways, they are weighted democratically in the model, instead of unfair weight being given to attributes that naturally have large numerical values. This is especially important in distance (or similarity) metric calculations, where one attribute can otherwise dominate and skew the metric unfairly. Finally, data typing is important when using algorithms that expect a certain type of data input, such as: continuous numeric data for regression models; discrete data values for association or link analysis; binary data for logistic regression; or sequential data for Markov models.
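A minimal sketch of a few of these normalizations, assuming scikit-learn and the same hypothetical DataFrame and column names used above:

```python
import pandas as pd
from sklearn.preprocessing import (KBinsDiscretizer, MinMaxScaler,
                                   OrdinalEncoder, StandardScaler)

df = pd.read_csv("customers.csv")  # hypothetical file name

# Min-max scaling: map a numeric attribute onto a 0-to-1 scale.
df["income_01"] = MinMaxScaler().fit_transform(df[["income"]]).ravel()

# Standardization: zero mean and unit variance, so no single attribute
# dominates a distance or similarity calculation by sheer magnitude.
df["age_z"] = StandardScaler().fit_transform(df[["age"]]).ravel()

# Ordinal encoding: convert ordered categorical values into ranked numbers.
size_order = [["small", "medium", "large"]]  # explicit rank order
encoder = OrdinalEncoder(categories=size_order)
df["size_rank"] = encoder.fit_transform(df[["size"]]).ravel()

# Discretization: bin a continuous attribute into five equal-width bins.
binner = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="uniform")
df["income_bin"] = binner.fit_transform(df[["income"]]).ravel()
```

Thorough knowledge of your data informs good data science models. Ultimately, data profiling is the best path to “knowing thy data” for your analytics project.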