A common refrain among the data science community is that existing business analysts, who work predominantly with internal data, usually at some level of aggregation, are too burdened by the assumptions and biases of their existing work to be effective with big data. Data scientists, the argument goes, are more “scientific” and use machine learning algorithms to allow the data to speak for itself. Neither of these arguments is particularly valid. No matter how much machine learning is employed, at some point the data scientist has to build a model, and at that point their experience and “bias” come into play. The business analysts either will or won’t adapt to a more data-driven point of view.
Hear Author Neil Raden in this archive of The Briefing Room with John Santaferraro of Actian.
Example: The marketing department has always used regression analysis for forecasting. A “data scientist” would regard this as a potentially misleading bias and would prefer a model-less solution such as machine learning to let the data “speak for itself.” This represents a shift from statistics to algorithmics. But bias and assumptions creep into every model. Machine learning is not a crystal ball and still requires the data scientist to trim the number of variables or to interpret intermediate results before starting another iteration.
Many of the influencers writing and speaking about big data are coming from an industry that is quite different from most. Because much of their work has been done by hand (writing programs for Hadoop or the R framework), they are unfamiliar with the vast amount of technology already in place from the data warehousing and business intelligence providers. In fact, many of those tools are far superior and more appropriate for an enterprise environment. In many cases, it may make more sense to add some capabilities to those tools than to abandon them in favor of a universe of open source and do-it-yourself programs.
One obvious and straightforward way to do that is for analytical platform vendors to add quantitative tools and applets to their products. By making these capabilities accessible through a familiar interface (SQL) and packaging them for use by those with excellent analytical skill and domain knowledge (we call these Type III and Type IV analytic types as shown in Figure 1), organizations can reduce the effort of the rare data scientists and ease the problem of staffing more of them.
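To make the idea of a quantitative function exposed through SQL concrete, here is a minimal sketch in Python using the standard-library sqlite3 module as a stand-in for an analytical platform. The aggregate name `regr_slope`, the `sales` table and its columns are hypothetical and chosen only for illustration; a real platform would ship its own, more capable functions.

```python
import sqlite3


class RegrSlope:
    """Hypothetical SQL aggregate: least-squares slope of y on x."""

    def __init__(self):
        self.n = 0
        self.sum_x = self.sum_y = self.sum_xy = self.sum_xx = 0.0

    def step(self, y, x):
        # Called once per row; skip rows with missing values.
        if x is None or y is None:
            return
        self.n += 1
        self.sum_x += x
        self.sum_y += y
        self.sum_xy += x * y
        self.sum_xx += x * x

    def finalize(self):
        # Return NULL when the slope is undefined.
        denom = self.n * self.sum_xx - self.sum_x ** 2
        if self.n < 2 or denom == 0:
            return None
        return (self.n * self.sum_xy - self.sum_x * self.sum_y) / denom


def slope_by_region(rows):
    """Load (region, month, revenue) rows and query the trend per region in SQL."""
    conn = sqlite3.connect(":memory:")
    # Register the aggregate so analysts can call it from plain SQL.
    conn.create_aggregate("regr_slope", 2, RegrSlope)
    conn.execute("CREATE TABLE sales (region TEXT, month INTEGER, revenue REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    result = conn.execute(
        "SELECT region, regr_slope(revenue, month) "
        "FROM sales GROUP BY region ORDER BY region"
    ).fetchall()
    conn.close()
    return result
```

The point of the sketch is the division of labor: someone writes and vets the function once, and the analyst only needs to know SQL and the business meaning of the result (here, revenue change per month for each region).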
Training the Go-To Guys
People in organizations tasked with providing analysis (or the reports and dashboards for analysis and reporting) work with business intelligence tools. Within the group that learns these tools are specialists and experts, the “go-to guys” with subject matter knowledge in the various functions of the organization. They extract data, work with the data warehouse, build cubes and produce useful information for the rest of the organization. The go-to guy has some skill writing SQL (which is, by the way, rapidly becoming a standard for big data as well) and has a good to excellent grasp of a business domain such as finance, marketing, distribution or pricing.
Go-to guys provide an indispensable service to the organization. These are the Type IIIs in our model. What do they lack to move into a Type II role, defined in Figure 1 as Type II-B: light data scientist? Essentially, they need to learn about data sources such as Twitter feeds and weblogs, and they need to learn how to use (but not program) analytical, statistical and algorithmic models. There is no reason to assume that many of them cannot. Training can be provided in house, through self-study and especially online, where there are many options today.
A lack of a Ph.D. should not be a barrier for a Type III shifting into Type II. Type II data scientists do not need Ph.D.s. They do not (except in rare circumstances) produce original research, publish in academic journals or share and collaborate with their peers in other organizations. A Ph.D. requires a broad range of learning in the discipline, most of which is not applicable to “data science” in a commercial organization. The last few years of a Ph.D. candidate’s life are consumed with the dissertation, which is original research and is pinpoint-focused on a topic that may also have nothing to do with analyzing big data.
There is an excellent opportunity for the BI analyst to move into Type II with on-the-job and/or distance learning. For those with some math/engineering/physics background (or those willing to acquire it in online training), moving into Type II roles over time is also possible, but this movement requires employers to understand that locking the BI analysts into their role will likely result in turnover as they will surely be recruited elsewhere. The organization should support those wishing to make this move with at-work study time, mentoring and providing an environment where people can enhance or even change their careers without danger of dismissal.
For many, learning math is painful. Part of the problem is that math without an application is too abstract for many people to grasp. But learning to use quantitative functions, and where they apply to the business a person is in, is much easier. There is no need for analysts at what we refer to as the Type II-B level to learn to differentiate a moment generating function. Instead, it is necessary to understand what kind of model is appropriate for the problem at hand, how to run it and how to evaluate the results. However, these investigations should rarely be put into production or relied upon without being vetted by more senior staff.
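As an illustration of what "run it and evaluate the results" can look like without any advanced math, the following sketch fits a straight-line trend on the earlier part of a series and compares its error on held-out points against a naive baseline. The function names and the choice of mean absolute error are assumptions made for this example, not a prescribed method.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_abs_error(actual, predicted):
    """Average absolute gap between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def evaluate(xs, ys, holdout=3):
    """Fit on all but the last `holdout` points, then score on those points."""
    train_x, test_x = xs[:-holdout], xs[-holdout:]
    train_y, test_y = ys[:-holdout], ys[-holdout:]
    slope, intercept = fit_line(train_x, train_y)
    model_pred = [slope * x + intercept for x in test_x]
    # Naive baseline: always predict the training average.
    baseline_pred = [sum(train_y) / len(train_y)] * len(test_y)
    return {
        "model_mae": mean_abs_error(test_y, model_pred),
        "baseline_mae": mean_abs_error(test_y, baseline_pred),
    }
```

If the model does not clearly beat the baseline on data it has not seen, that is a signal to reconsider the model choice or to ask more senior staff before relying on it.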
Moving Type IVs to Type III
For the most part Type IV analysts were confined to the structured data culled from various internal systems within the organization. In the best of cases, this data was carefully modeled and extracted from other systems with a high degree of quality and reliability into data warehouses and BI tools. However, in too many organizations, the Type IV analyst had to learn the structure and semantics of various operational systems and perform their own extraction and transformation into various personal databases and spreadsheets. For that reason, their skills were more heavily weighted toward data and structure, and much less so toward analysis. For the purposes of management and external reporting, performance management and certain analytical endeavors, this was sufficient but tedious.
In our experience, the Type IV analyst has potential that has been throttled by technology. Their skills at manual data management can be easily transformed into more useful Type III analytics with the proper training and encouragement. Type IVs who work with BI may either continue, as there is still a need for their work, or learn to work more analytically than their current role requires.
The big data technology market is evolving at an extremely rapid rate. All sorts of tools and products are emerging to ease the burden of managing, analyzing and explaining big data, making it more likely that Type IV analysts can move up to Type III or even Type II.
A New Type: Type V
It is commonly assumed that the majority of people in organizations are not involved in analytics at all as we have described it. This is not true. The most widely used analytical tool, even for data scientists, is Microsoft Excel. The use of Excel by people in professional positions is nearly universal. But if Type IV analysts begin to move into Type III roles (and, unfortunately, this movement may be by position rather than by person as the result of attrition, retirement, etc.), their skills are still needed. No one has sufficiently answered the question of why only 15-20% of knowledge workers use BI, but the answer is likely to lie in the ease of use, relevance and understanding of the tools. This is clearly improving, and despite all of the attention big data gets, the BI industry (which does not include Excel) is about $10-12 billion per year and new vendors are emerging every day.
The Internet connects the world; the Web offers the means to tap into those connections. Search capabilities let people find things; big data gives people the opportunity to understand what they find.
Data scientists in large digital companies like Google have the luxury of being able to explore and experiment, often doing things by hand. This is not a necessary model for other organizations. What your analytics provider should offer is a set of advanced analytical models and tools to eliminate the Not Invented Here (NIH) syndrome. Your analytics provider should offer a high performance, high availability analytics service one level of detail down from raw big data. Eliminating the handwork of preparing and moving data and results makes data scientists more efficient.