Inside Analysis

Data Profiling and the Philosophy of Data

This interview is part of The Bloor Group’s research program, Philosophy of Data, which has been underwritten by IRI, The CoSort Company.

Eric Kavanagh: Ladies and gentlemen, hello and welcome once again to Conversations About Data. My name is Eric Kavanagh. I will be your guide for a discussion today with a real expert in the field of data and data management. We’re taking a journey to understand what data really means to today’s organizations, how it can be used, when, why, where it should be used, and of course sometimes where it should not be used. The whole goal here is to better understand what’s happening in the data movement. Obviously there’s a lot of stuff going on. We hear about big data of course almost daily, even in major newspapers and major media outlets, consumer media outlets we hear stories about big data. This concept has taken off, but let’s face it, big data is just one part of the data set out there. The fact is that small data, if you want to call it that, is really what drives just about every organization today.

We’re going to dig into the details and understand just really what’s going on from a broad perspective to try to map out a picture of the data world and get a better understanding of what the various disciplines are that come into play and frankly how companies should put them to use for their own organizations. I’m very pleased to have Eric Leohner on the line today. He is a research analyst with a very interesting company called IRI. It stands for Innovative Routines International. It was founded back in 1978 and really focused on high performance data management from the get-go. These guys have a lot of experience in data management that dates way back before many of the companies that you hear about today were even conceptualized, quite frankly. First of all, Eric, welcome to the program.

Eric Leohner: Thank you for having me, Eric.

Eric Kavanagh: You bet. I want to just throw out a few quick words here and then we’ll dive in and talk about some of these different topics. The way I think about data, it’s just so important to business. The fact is that you cannot encounter a business process today that does not involve data. That goes for all business software, in fact, and we know about all of these legacy systems that exist. There are many very proprietary systems for industry verticals, for example; there are all kinds of garden variety tools like Microsoft Excel; there are all these ERP, or Enterprise Resource Planning, solutions and Customer Relationship Management solutions; and then all the other very specific, focused enterprise technologies and even small business applications. Guess what? They all use data.

Let’s face it, data, it’s the new oil. A lot of people talk about it, but oil by itself isn’t very useful. It has to be refined, it has to be prepared to get into the different use cases where it makes a lot of sense, like for cars of course and all these different products that are created from petrochemical disciplines. In that same sense, data is the foundation of business. Some of the scary stuff, to kind of start our conversation now, is the fact that data quality is very bad.

One of my favorite consultants will often say, “Your data will never be clean.” He’s being a bit hyperbolic, but the point is that data quality is a real issue and there are lots of reasons for that. Data entry can be bad, data integration can be poorly done, data migration can be poorly done. A lot of times people don’t do data discovery enough or data profiling to understand what the data means. What that results in at the end of the day is unpleasant situations. Everyone has gotten some mail at some point that was sent to the wrong person, people have seen their name misspelled, for example, on documents that they received, there are typos. All kinds of things happen to cloud the data quality picture.

What we’ll talk about today a little bit is data preparation, as some people are calling it, so how can organizations make sure that they at least optimize the quality of their data, even if they don’t wind up with a perfectly pristine data set at the end of the day. The whole conversation today is going to be about understanding different disciplines, how they interact with each other, and why it’s important to do them, and frankly why it’s important to do them in a certain order, at least with respect to this concept of data profiling. I’ll just say a couple other quick things so people can understand why it’s so important.

If you look at some of the biggest success stories in the last few years, let’s face it, they are data-driven organizations. Companies like Uber, companies like Airbnb, from my perspective what these organizations did is they understood a business opportunity and then they nailed down the business processes and the data management practices that underpinned their business model. That’s why they’re so incredibly successful. Of course you look at companies like Facebook and Twitter and LinkedIn. All these companies use data with extreme care and they are very, very innovative in how they use data, in how they can leverage the power of data. What we’ll talk about today is how regular organizations, small to mid-sized businesses or small to mid-sized enterprises as some people call them, even Fortune 2000 companies, can do a better job in taking a step back and understanding the big picture of data management and data preparation. I love that term because it really kind of speaks to the importance of getting your ducks in a row before you try to do some things with your business.

We’ll try to help our listeners and our readers better understand what they can do. First of all, I’ll kind of throw it out to you. I like these categories that we put together here: data discovery, integration, migration, governance, and then at the end of the day analytics. Why don’t we tackle data discovery first and foremost. I know you’ve been doing a fair amount of research in this space lately. What are you discovering and what can you tell us about the importance of data discovery or data profiling?

Eric Leohner: Data discovery and data profiling, it’s one of the first steps you’re going to encounter in that data management infrastructure. Really it refers to the process of obtaining information from, and descriptive statistics about, various sources of data. The whole goal here is to get a better understanding of the content of the data that you have as well as its structure and its relationships. It’s about the quality of the content, how accurate it is, and its integrity. Pretty much it allows you to get an aggregate view of your information or your data. It’s not really information at this point. It’s still raw data. Profiling can tell you how the data can be used, what needs to be fixed, and what elements of it need to be ignored before you can even turn the data into information.

Eric Kavanagh: Yeah, you make a good point here. I’d like to take a second here to drill into this. Data profiling can also help you understand how much work is going to be necessary to get a certain data set ready for integration, migration, governance, analytics. Frankly sometimes if you do data profiling properly you realize that you can’t even use a certain data set, right?

Eric Leohner: You might go through and profile the data and realize something’s completely out and you just can’t use it. It’s really just there as a preemptive measure before you do any of the other operations we’re going to discuss today.

Eric Kavanagh: Yeah, that makes a lot of sense. Garbage in, garbage out is the classic mantra in the data management world. Let’s face it, if you bring a bunch of bad data into your mix, you’re going to have problems down the road too. It really is imperative that organizations understand that when you’re trying to bring in a new data set, because you want to do some analysis let’s say, you must do some data profiling. The fact is no data is going to be completely clean, or very rarely will you get data that’s completely clean, so there is going to be some allowance for data quality issues, but you want to be able to ensure that you have enough of a quality level in your data set to move forward, and data profiling is how you figure that out.

Eric Leohner: Currently hundreds of billions of dollars are lost because of data quality issues. Before you can even move on to any of the other categories, you’ve got to tackle data quality. It’s a given that data quality is going to be in somewhat of a disarray when you’re getting into this because people have been aggregating data for so long and they’ve really not been managing it properly, so they’re going to have data that’s a bit wonky, to put it bluntly.

Eric Kavanagh: There are some different tactics that you can use that data managers know about. Maybe we can kind of give a layman’s definition of what these all boil down to. One thing you can do obviously is just profile the data, so you take a look and see: does this first name correspond to a first name, does the last name look like a name? If you have numbers in the name field, obviously that’s a problem; if you have an email address in the name field, that’s a problem. These are some basic things that you can find right away, but you can also do some sort of cross-column analysis, right? To expose maybe some embedded value dependencies, for example? Can you kind of talk about what that means and why that’s important?

Eric Leohner: Basically when you’re dealing with columns and rows of data you can have something that’s linked up to other data through foreign keys or whatnot. You can have a dependency that exists that you’re not even aware of. If somebody changes that data even haphazardly at some point, the rest of the data can become all askew. Really this is just to bring the dependencies to light. You can just say, “Oh, okay, yeah. This is what relates to what.” Before you even do anything else, you can already know what your data’s doing and how it interacts with other data you have.
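
As a rough illustration of the cross-column analysis discussed above, the following Python sketch checks whether one column appears to determine another and reports the values that break that dependency. The column names (`zip_code`, `city`) and the sample records are hypothetical and are not drawn from the interview; a real profiling tool would do considerably more.

```python
from collections import defaultdict


def find_dependency_violations(rows, determinant, dependent):
    """Return determinant values that map to more than one dependent value."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[determinant]].add(row[dependent])
    # A clean dependency has exactly one dependent value per determinant value.
    return {key: values for key, values in seen.items() if len(values) > 1}


# Hypothetical sample records: zip_code is expected to determine city.
rows = [
    {"zip_code": "32901", "city": "Melbourne"},
    {"zip_code": "32901", "city": "Melbourne"},
    {"zip_code": "10001", "city": "New York"},
    {"zip_code": "10001", "city": "Newark"},  # breaks the expected dependency
]

violations = find_dependency_violations(rows, "zip_code", "city")
for key, values in violations.items():
    print(f"{key} maps to several values: {sorted(values)}")
```

An empty result suggests the dependency holds in the sample; any entries are candidates to raise with whoever owns the data set.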

Eric Kavanagh: I think the real key to understand here is that you have to roll up your sleeves and look deep into any given data set that you’re looking to onboard, because as you suggest, some of these dependencies are going to cause some issues down the road. You need to unravel that stuff before you move it on. It’s also important to understand that this is a classic case in any business, that many times people will do work-arounds to deal with some constraint. Let’s say we’re dealing with data input issues. A lot of times organizations will build a model that will constrain the amount of information that gets put in there or will also mandate that information gets put into certain fields before the record can be recorded basically, before it can be assimilated into the database or persisted in the database.

What happens is sometimes data entry people don’t have information to fill out that extra field and so they’ll just make something up. They’ll put in a null value or they’ll put some object in there. These are the kinds of clues that you look for as a data profiler to understand, “Wait a minute, there’s an issue here,” and then what you do is you ideally reach out to the businessperson or whoever is in charge of that data set to understand what happened here, right? You need to work with the business, work with the people who are on the front line at the point of ingestion, or at the point where this data comes into the organization. You want to have a data profiler have access to these people so he or she can better understand what those discrepancies really mean.

Eric Leohner: In a lot of industries there’s a big problem of data silos where people don’t want to give out this information and yet this is critical information that really needs to be distributed and understood between all the branches of an organization for it to have any real meaning and for it to be used by an enterprise efficiently.

Eric Kavanagh: The real issue in play here is the whole concept of trust. I think this is one of the harder ones to wrap your head around because it’s fairly nebulous, right? There is such a thing as a trust variable, meaning you can have someone who more or less trusts someone or some system, but usually it’s a pretty slippery slope, meaning as soon as someone encounters some problems with the quality of data in a system, their trust in that system goes down rather precipitously.  The objective here is that we’re really trying to help organizations understand the importance of profiling to engender trust with the business such that people may use these systems. Let’s face it, if someone doesn’t trust a system of information, they’re going to go around it. They’re not going to use that system and that causes all kinds of problems downstream, including more data quality problems and definitely some data governance problems because when people go around the protocol, when people circumvent the rules and the regulations, that’s a big governance issue. That’s a violation of governance.

Trust really is just core to the process here. It’s very common for people to take a fairly parochial stance. You mentioned the word “paranoid” and sometimes it is a paranoia, and a lot of times people are just concerned that their data is going to be misused, and maybe they’re concerned that people are going to discover those data quality problems. I think one of the key takeaways for any organization is this: we see now this call for a chief data officer, so in that situation this would be the person you would talk to, and in other situations it’s someone else, but regardless you want a senior person in the organization to be involved and have their skin in the game for your data management practices, and especially for data preparation, because that way this person can help the members of the team better understand the importance of their involvement.

People need to feel like they’re going to be OK, trusting they’re not going to get in too much trouble. I think this is one of the real challenges and why you see such concern in sharing data is people don’t want to get in trouble. They don’t want to get their boss mad at them. They’re worried about sharing that data. I think it just behooves organizations to really take a step back and appreciate the big picture. If you don’t get all these people at the table together, if you don’t spend the time to look into the details of the data, you’re going to have problems sooner or later, so you might as well just roll up your sleeves and get to work.

Eric Leohner: That’s the long and short of it. You have to have that chief data officer who is pretty much the referee between these silos. Frankly you’ve got to have somebody breaking down these preconceived business barriers before you can even get to work on managing your data.

Eric Kavanagh: Let’s kind of move forward a bit here. Data discovery, we know is important. There are a lot of tips and tricks and we’ll try to work through those in further chapters of our book here to help people understand and look for those little clues. There are lots of different things that you can do. You can do queries on null values, for example. That’s something to look for. You can simply sort by different columns and see where you have problems. There are a lot of profiling tools that can be leveraged to expedite that process because the bottom line is you don’t want to have to do all this stuff manually. That’s why we have tools in this space, that’s why we have technology that can expedite the process and really focus attention on the problem areas. Then you look for areas to fix things in bulk is the idea.
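
To make the null-value queries and name-field checks mentioned above a little more concrete, here is a minimal Python sketch. The field names, the sample records, and the list of placeholder spellings treated as missing (`NULL_MARKERS`) are all assumptions made for illustration; real data will need its own rules.

```python
import re

# Assumed spellings that data entry staff might use for a missing value.
NULL_MARKERS = {"", "null", "n/a", "none"}


def count_missing(rows, column):
    """Count values that are absent or look like a placeholder."""
    missing = 0
    for row in rows:
        value = row.get(column)
        if value is None or str(value).strip().lower() in NULL_MARKERS:
            missing += 1
    return missing


def suspicious_names(rows, column):
    """Flag name values that contain digits or look like e-mail addresses."""
    pattern = re.compile(r"\d|@")
    return [row[column] for row in rows
            if row.get(column) and pattern.search(row[column])]


# Hypothetical sample records.
rows = [
    {"first_name": "Ada", "email": "ada@example.com"},
    {"first_name": "", "email": "bob@example.com"},
    {"first_name": "N/A", "email": None},
    {"first_name": "carol@example.com", "email": "carol@example.com"},
    {"first_name": "D4ve", "email": "dave@example.com"},
]

missing_names = count_missing(rows, "first_name")
missing_emails = count_missing(rows, "email")
flagged = suspicious_names(rows, "first_name")
print(missing_names, missing_emails, flagged)
```

Checks like these only point at the problem areas; deciding what a placeholder or a flagged value really means still takes a conversation with the people who enter the data.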

We’ll get into that in further discussions, but let’s move on to the next big piece of the puzzle here, which is integration. To me, data integration is and will remain a huge component of success in any organization because it’s only once you cross-reference or correlate data from different areas of the business that you really start to get somewhere. Marketing data versus sales data versus accounting data, for example, versus market data. What are your competitors doing? What are some standards in the business? How can you use third party data to get a bigger picture, a more clear picture of what your company’s doing and how well it’s doing?

Let’s kind of talk about the data integration challenge and again, I reference all of these legacy systems. A lot of people don’t realize that there are systems thirty, forty, fifty years old that are still running businesses. There are mainframes out there that have been running for forty or fifty years, for goodness sake. That’s a long time. They have some data structures, and the fact is there are new systems every single day. A lot of people are using cloud services or email marketing services or any kind of business process automation service that’s available in the cloud or even on premises, and they all have their own unique data structures. They all have their own unique data formats, for example. These are challenges of the data integration world.

Eric Leohner: I think we should start off with a definition of what data integration really is. Really we’re talking about a combination of technical and business processes used to combine data from different sources and convert it into meaningful, useful, and valuable information. This is a huge discipline. I’m certainly not going to be able to do more than scratch the surface of it today, but it ties in with data profiling and quality as well as big data movement and manipulation, especially in high volume ETL systems, as well as analytics, from the results down to non-persistent or federated views of the results. It’s a massive topic.

Eric Kavanagh: Of course ETL, for those who don’t know it, is the bread and butter of data integration. It stands for extract, transform, and load. What ETL tools have done traditionally is extract data from, let’s say, an Enterprise Resource Planning system or a CRM system like we talked about earlier. They transform it in some way, which is that preparation side, and then they load it into whatever tool is being used for analysis.
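
The extract, transform, and load steps described above can be sketched in a few lines of Python using only the standard library. This is a toy under stated assumptions: the CSV text stands in for an export from a source system, the table and column names are invented for the example, and an in-memory SQLite database stands in for the analysis target. It is not how any particular ETL product works.

```python
import csv
import io
import sqlite3

# Hypothetical export from a source system (stands in for an ERP or CRM extract).
SOURCE_CSV = """customer_id,name,total
1, Acme Corp ,100.50
2,Globex,not available
3,Initech,75
"""


def extract(text):
    """Extract: read the raw records from the source export."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(records):
    """Transform: trim names, convert types, and skip rows with unusable totals."""
    cleaned = []
    for rec in records:
        try:
            total = float(rec["total"])
        except ValueError:
            continue  # a real job would log or quarantine the rejected row
        cleaned.append((int(rec["customer_id"]), rec["name"].strip(), total))
    return cleaned


def load(rows, conn):
    """Load: write the prepared rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customers "
        "(customer_id INTEGER PRIMARY KEY, name TEXT, total REAL)"
    )
    conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(SOURCE_CSV)), conn)
loaded = conn.execute(
    "SELECT customer_id, name, total FROM customers ORDER BY customer_id"
).fetchall()
print(loaded)
```

Keeping the three steps as separate functions makes it easier to see where preparation rules live and to test them before any data reaches the target.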

This actually gets into my whole background in this industry, which was the data warehousing movement. A lot of people don’t know that the data warehouse came into existence largely because companies realized they could not do analysis of data in these source systems. SAP for example, the big German company that really started in manufacturing, is very good at this Enterprise Resource Planning stuff that they do. Trying to do analysis in those systems is not very easy because they weren’t designed to do that. A lot of people I think don’t really fully appreciate the whole concept of the design point. When you write software, there’s a very specific thing you’re trying to accomplish. What you’re trying to accomplish is going to drive the nature of that technology, how it’s built, what it looks like under the covers, what the interface looks like, et cetera.

Data warehousing evolved as a means of allowing companies to do better analysis of their data by pulling data out of all these sources: accounting systems, planning systems, supply chain systems, for example. ETL was the bread and butter for years and years, but one of the problems, and I think this is where the bottlenecks have occurred, is that ETL is a fairly brittle process and it tends to be time-consuming. You wind up with all these batch windows and then you get people fighting over the batch windows. You can only move so much data within a certain period of time.

I think this comes into an area where your company has actually excelled over the years in being able to optimize those kinds of processes. It’s critical that organizations understand this is difficult stuff, it’s time-consuming stuff, and you have to have that big picture view to understand where you can refine your own business processes around data management and what you can do to make the trains run on time, to keep things rolling, and to keep business users satisfied, right? The business is always going to want new data sets, but then you have people in the position of information technology or data management who have to supply that data and you don’t want to rush those people because that’s when data quality problems occur, right?

Eric Leohner: Oh yeah. Today with big data, doing all that faster is of the utmost importance. Data warehousing was and still is a lot of IRI’s bread and butter and it has been since 1998, since Hyperion was the original ETL engine. The transformations we do so well, like sorting, are still a big part of this, from reorganizing the data in the middle of ETL to making data load faster into databases. Really this is something we’ve been dealing with since the very beginning. This is a field where you have to be dynamic. If you’re not dynamic, your only option is pretty much to shutter your business.

Eric Kavanagh: Yeah, that’s not good. That’s not good. There are a lot of case studies out there for how organizations have tackled the data integration problem. We’ve kind of laid the foundation for why it’s important. What are some examples of companies that have figured out how to do this or have had trouble maybe?

Eric Leohner: There was a global imaging firm that wanted to add an e-commerce revenue stream and at the same time wanted to improve customer service overall. The solution required integrating existing customer data with the new online transaction system they were implementing. As a result of this, they created unified views of the customers from the joined data, which enabled the customer service representatives to handle customers faster and more efficiently, which ultimately cut call center costs within just a matter of a couple of weeks from when they first started to implement this system. Even though, like you said, it could take a long time to really get something thoroughly in place, you can actually start making money on your investment within literally a matter of weeks.

Eric Kavanagh: Yeah, you bring up a really good concept here and I want to talk about this for a minute. In the business world of data management, there’s a concept of quick win. The quick win is pretty self-explanatory. It means that once you’ve engaged either a consulting firm or a software firm to solve some problems for you, the quick win is some success story that comes out in fairly rapid fashion. A lot of times a quick win could be, for example, identifying a useful data set or identifying, for example, that a data set is so out of line that it really needs to be pulled out of the equation.

Take an example like the one you were talking about here. Customer experience is such an important issue these days because people have so many options for where to do things, and the fact of the matter is that prices are coming down, in part because of cloud computing and in part because of global competition. If you can, for example, feed a call center or feed an e-commerce system with a new set of data, that can do a whole lot in terms of keeping your customers satisfied, which obviously is an ultimate goal.

Eric Leohner: Recently Gartner released a statistic stating that eighty-nine percent of companies feel that by 2016 customer satisfaction, or ensuring customer satisfaction, will be a massive part of their revenue stream, so really we live in the age of option and opportunity. If you don’t give your customers the options they want, they have the opportunity to go somewhere else and they most certainly will capitalize on that.

Eric Kavanagh: Yeah, the cloud is a wonderful thing, but again, it can be an issue. One of the best practices, it seems to me, of using cloud solutions is to make sure someone in the data team takes a look at whatever the solution might be, let’s pick on SalesForce for example, and take a hard look at what that data model looks like. What are the inputs? What are the outputs? How can you reconcile your existing systems with those kinds of systems? There are lots of devils in the details it seems to me with cloud-based solutions, but to kind of get back to something we talked about a little while ago, the cloud winds up being an area where people, businesspeople in particular, will go if they feel they’re not being given what they need by the data management team.

If you look at, for example, some of these other online solutions for dealing with different aspects of the business, a lot of times people in businesses will employ these solutions because they got tired of waiting for their company to build them or to give them access to those systems. This is another reason why it’s important for the business to constantly reach out and understand what different stakeholders in the organization need and make sure there’s a roadmap so they understand how long it’ll take to do it right. When businesspeople know that they’re being taken care of, they’re less likely to circumvent your processes. They’re less likely to just go off and do it their own way. Let’s talk about some of these other use cases too.

I like this example of improving customer service at the call center. People who are customers, they want to know that their needs are being met. If you can feed these people, and here we go with the data integration challenge, if you can feed your call center with information about different customers, for example, who your high-value customers are or who your major institutional customers are, that’s going to help those people address the needs of the customer and keep the customer happy.

Eric Leohner: If you’re able to better identify your customer and you’re able to identify your key customers, you’re going to be able to make sure that you retain them and in doing so you’re going to be able to ensure that you do better for future customers and new customers. It really comes down to an issue of being able to identify with your customer. That’s the long and short of it is keeping your customer happy. You really have to listen to them.

Eric Kavanagh: Yes indeed. Let’s kind of move on to the whole concept of migration. Since we’re talking about the cloud already, that’s an issue to discuss. That’s an issue to keep our attention focused on. A lot of companies are using cloud-based solutions. We mentioned SalesForce. I do a lot of marketing myself, so I use various email marketing applications. A lot of people are getting into social obviously. When you try to move a system from on premises to the cloud for example, once again you really have to think through what are the key data elements that you need, what does the business process involve, and what are those dependencies? You talked about that at the top of the conversation here. Dependencies are really important to understand because if you don’t weave those into the matrix if you will, if you don’t find a way to flatten them out and load them into these online systems, you’re going to be losing critical points of data. You’re going to be basically creating a disconnect that can cause you some problems. Can you kind of talk about best practices for data migration?

Eric Leohner: Yeah. With data migration, there are several reasons for doing this. Like you mentioned, there’s taking your data to the cloud, but it could also be anything from server or storage equipment replacement or upgrade to website consolidation, server maintenance, or data center relocation. These are really an integral part of data migration because when you’re going to have something out of commission, you’ve got to move something somewhere else. In the case of a legacy platform where you’re implementing an entirely new system, you’re going to have to deal with data migration. It’s a very common data management activity, especially when companies decide to move off of legacy platforms or to move toward a more digital system, or in the event of a merger or acquisition, and it may not be part of an ETL operation.

Eric Kavanagh: Yeah, exactly. You have to really decompose your business processes, right? This kind of gets us back to what I mentioned at the top of the hour with Airbnb and Uber and these other major players. What they did is they decomposed business processes and they were just very diligent in terms of building information management infrastructure to handle these issues and to handle large amounts of data and to do so with blazing speed. A lot of the things that you guys talk about at IRI revolve around speed and the fact that years ago you built some algorithms or you built some technology I should say, to optimize data movement, data sorting, and so forth. These are the fundamentals of data management.

What these companies did is they really understood the business process, and I mean point to point. Call comes in, information goes out to the driver, someone accepts that call, in other words accepts the fare, goes out and gets the person. Because they’ve been doing such a good job in building out that infrastructure and really focusing on data quality, they are now able to deliver service much faster and they’re able to communicate with the customer much more effectively. That’s extremely useful because it gets us back to that wonderful customer experience.

Eric Leohner: You’re going to have all this data which needs to be in accord before any of this can even happen. Another issue is they have so much data to deal with. Sometimes it’s stored in different systems so it has to be merged and migrated into one central place. You’ve got one person who is a returning customer. You can easily say, “Okay, this guy’s a returning customer. We’ve got his data on file. We know exactly where he needs to be.” You might even be able to predict where he wants to go. It really just comes down to having all of your data in a clean, centralized format that can be transformed quickly.

Eric Kavanagh: These are the kinds of things that companies just have to think through as they’re looking to, for example, leverage the cloud. Or a lot of times in mergers and acquisitions, what happens is the company needs to decide, “Are we going to use your system or are we going to use our system?” That’s a case where the company that has done a better job, not just at building their information pipeline, but also documenting the information pipeline so that the new people can understand it, those organizations are going to be the ones, probably they’re going to win the day and it’s their system that’s going to be used going forward.

Eric Leohner: One example is a financial firm that had tens of millions of records living in different systems, ergo it had different data types across those systems. In the process of converting to a single system, it was able to merge all those files into a single data format. It removed the duplicate records and resolved errors and inconsistencies. The data was not only migrated, but it was also cleansed and consolidated, which reduced processing times overall. They had smaller bites of data as a result of it, and this cut the processing operations almost in half. From a price standpoint, the processing operations were cut in half.

Eric Kavanagh: Yeah, that’s really good stuff. That data duplication is such an issue. It’s funny that it comes down to something so simple, but any time you are processing data in redundant fashion, you are costing yourself money and you’re also costing yourself potential customer issues. I can’t tell you how many times I’ve encountered problems just in my own life because an account existed in multiple places within an organization. I’ve heard some amazing stories of how in Customer Relationship Management systems you’ll have twenty-seven different variations on IBM. IBM, I.B.M., International Business Machine, Big Blue, IBM Company, IBM Incorporated. If you look at them closely, you’re like, “Oh, okay. That’s all IBM.” A machine, and this is really important, machines need machine readable information. A lot of times information systems are pretty picky about how they receive that information. Commas can throw things off, periods can throw things off. This gets us back to data profiling and data cleansing and so forth.

I think one of the key takeaways for the business to understand is that when you engage in these kinds of projects and you try to simplify your business process, for example, what you need to do is really take your time and be very thoughtful about the whole process, and understand that the business value at the end of the day is going to be time saved, money saved, customers happier. There are all kinds of reasons to do this. That’s the story that the technical people or the data people need to tell the businesspeople, right?

Eric Leohner: Even something as simple as duplicate removal via an automated process is going to save you tremendous time and money later on, when you’re not wading through tens of thousands, hundreds of thousands of records that might not be correct or are just difficult to reconcile. Not to mention that you’re looking for data that’s out of range or doesn’t meet the matching threshold or the matching algorithm. In the case of IBM, once you’re able to implement a non-Boolean system … Our Sørensen-Dice coefficient, for example, let us look for things like IBM or I-B-M. You’ve got to be able to find these duplications so you can say that I/B/M or I-B-M, what have you, is the same thing as IBM in its given context. You’re going to be able to remove the duplication there and cut down on time and money.
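To make the matching idea above concrete, here is a minimal sketch of a Sørensen-Dice comparison on character bigrams. It is only an illustration and not IRI’s implementation: the function names, the normalization step (lowercasing and stripping punctuation), and the 0.8 threshold are all assumptions chosen for the example.

```python
import re
from collections import Counter


def normalize(name: str) -> str:
    """Lowercase and drop everything except letters and digits (assumed rule)."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def bigrams(text: str) -> Counter:
    """Count the overlapping two-character slices of an already normalized string."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def dice_coefficient(a: str, b: str) -> float:
    """Sorensen-Dice score: 2 * |X & Y| / (|X| + |Y|) over bigram multisets."""
    na, nb = normalize(a), normalize(b)
    x, y = bigrams(na), bigrams(nb)
    total = sum(x.values()) + sum(y.values())
    if total == 0:
        # Strings too short to form bigrams: fall back to exact comparison.
        return 1.0 if na == nb else 0.0
    overlap = sum((x & y).values())  # multiset intersection
    return 2.0 * overlap / total


def is_probable_duplicate(a: str, b: str, threshold: float = 0.8) -> bool:
    """Hypothetical helper: treat names scoring at or above the threshold as duplicates."""
    return dice_coefficient(a, b) >= threshold


if __name__ == "__main__":
    for candidate in ["I.B.M.", "I-B-M", "I/B/M", "IBM Incorporated", "Big Blue"]:
        print(candidate, round(dice_coefficient("IBM", candidate), 2))
```

Punctuation variants such as I.B.M. or I/B/M normalize to the same string and score as exact matches, while a nickname like Big Blue shares no bigrams with IBM at all. A score like this is graded rather than yes-or-no, which is the non-Boolean point, but aliases of that kind would still need a lookup table or a steward’s judgment.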

Eric Kavanagh: Again, you’re going to make the customers happier. Let’s kind of segue from that into one of the bigger issues out there, which is data governance. This is a field that I think frankly has been rather immature for kind of a long time now. It’s probably because the tools weren’t very mature and maybe it’s also because businesses have just been reluctant to embrace data governance, but data governance really gets down to understanding the life cycle of data: the process, the provisioning of data, the access to it, the manipulation. What’s the lineage of the data? Who gets to use the data? There are some simple things to keep in mind, like giving access. Who has access to this information system? That’s a fairly simple control point, right?

In the world of governance, for those who are not too familiar with it, there’s this concept of a control point, which is any point at which some change can be made. Human beings are control points, information technologies are control points. Any kind of system that lets you touch data, manage data, transform it, delete it, and so forth, that’s a control point. Data governance really goes hand in glove, or hand in hand, depending on your favorite expression, with data integration and of course with data profiling. The idea is that you really want to have a thoughtful process by which data is accessed, transformed, moved, and ultimately removed.

Eric Leohner: It’s not all about the software here. You’ve got to have a capable team in play. Really it’s all about the overall management of data so that you’re able to ensure its availability, usability, integrity, and security, so that you can be sure that the data you need is going to work when you need it to work. That gets into the whole topic of information stewardship, which is a topic for another discussion. A lot of companies use data governance software and solution services to manage their data, their metadata, and their master data in a way that improves data quality and application performance and reduces IT compliance risks and data breaches. Really you can’t do anything but gain from employing a data governance initiative.

Eric Kavanagh: Yeah. One of the things to help people understand the value here is that a well-documented and well-executed data governance program is also going to help in troubleshooting. When something goes awry, when there is a data breach … Of course we don’t want to hear about that, but it does happen all the time. It gives us a mechanism by which to understand who did what and when, and thus you can problem solve, you can troubleshoot. You can address issues. I can tell you for sure that auditors, whether it be a financial auditor or a regulatory auditor, love to see companies that have thoughtful processes in place and really pay attention to how their data gets used. They’ll understand. These auditors will be fairly forgiving when breaches occur if an organization was able to identify the problem, address it, and then remediate it fairly quickly, right?

Eric Leohner: Especially if you put an audit trail in place. Not just for financial data, but for the data processes themselves. If I can give you an example of this, a major communications company started using software with data governance capabilities to meet a data privacy law compliance requirement. As a result, it saved several hundred thousand dollars a year in hardware and DBA costs because they were able to discover these redundancies, they were able to plug up the privacy holes, and they were able to automate some of the stewardship activities. Really, by employing this governance strategy they were able to go back and say, “Okay, yeah, we’re able to meet these compliance requirements. We can do all of these things that we were not able to do before.” That’s the whole crux of data governance: making sure your data is secure and that people actually want to use your company. Or rather, that they’ll have enough confidence in your company to put their data in it.

Eric Kavanagh: Also, the beauty of effective governance is that once you find an error, once you find some problem in your data, if you have this audit trail in place, if you have a governance program, you should be able to identify where to fix something. A lot of times you’ll have feeds coming in from either other departments or other organizations. Sometimes it is third-party companies that are feeding you data. If you have good lineage, if you have a good auditing program, and you find data that is funky or is not what it’s supposed to be, now you’re able to troubleshoot it and stop that process and prevent further problems.

One of the biggest issues with data quality, I think, is that companies will employ some big technology to clean their data, taking it maybe from eighty percent to ninety-four percent clean, let’s just pick numbers out of a hat, but the problem is they don’t fix their business processes, and thus it goes from ninety-four down to ninety-three, ninety-two, and all the way back to eighty again after six months or so, because they are once again populating their data systems with that bad data. If you have that auditing process in place, that audit trail, then you can troubleshoot not just for breaches, but for data quality problems. You can solve them and make the situation such that it doesn’t happen again, right?
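The troubleshooting idea here can be sketched in a few lines, under stated assumptions. If every record carries a lineage tag naming the feed that delivered it, a simple tally of failed checks per feed shows where bad data is entering. The record fields, the feed names, and the validation rule below are hypothetical, invented for the illustration rather than taken from any product discussed in this conversation.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Record:
    source_feed: str  # lineage tag: which feed delivered the record (assumed field)
    customer: str
    amount: float


def is_valid(record: Record) -> bool:
    """Hypothetical quality rule: a customer name is present and the amount is not negative."""
    return bool(record.customer.strip()) and record.amount >= 0


def error_rate_by_feed(records: list[Record]) -> dict[str, float]:
    """Return the share of invalid records for each source feed."""
    totals: Counter = Counter()
    failures: Counter = Counter()
    for record in records:
        totals[record.source_feed] += 1
        if not is_valid(record):
            failures[record.source_feed] += 1
    return {feed: failures[feed] / totals[feed] for feed in totals}


if __name__ == "__main__":
    sample = [
        Record("internal_crm", "IBM", 120.0),
        Record("partner_feed", "", 40.0),       # missing customer name
        Record("partner_feed", "Acme", -5.0),   # negative amount
        Record("partner_feed", "Acme", 18.5),
    ]
    for feed, rate in sorted(error_rate_by_feed(sample).items()):
        print(f"{feed}: {rate:.0%} of records failed validation")
```

Run on a schedule, a report like this would show quality slipping from ninety-four percent back toward eighty and, more usefully, which upstream process to fix rather than only which rows to clean.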

Eric Leohner: Stewardship involves not just remediation, but ongoing monitoring of the data quality and the proper transformation and protection. Even after you do that work, you have to be sure that it’s still being done. Rinse and repeat based on what the audit logs suggest. It’s not a simple one-time solution. It’s something that you have to be engaged in over time. Ideally this is something that your company will be doing for as long as it has data.

Eric Kavanagh: Yeah, that’s right. I’m glad you brought up the concept of stewardship too. This is a really powerful idea. To help those who don’t really fully understand what that means, a data steward is really someone who sits between the business and the information systems themselves. A lot of times a data steward will be someone who understands the business processes but also understands the underpinning technology. This person can be a liaison to explain to the businessperson what’s possible, what’s not possible, what’s ideal in a certain situation, and also explain to the IT person what the businessperson needs.

We talk a lot about the IT-business divide in this industry and that’s something that must collapse. It’s really already collapsing, in part because we see this third party coming into play: developers. We have businesspeople working directly with developers, and DevOps is one of the concepts that comes into play here, but nonetheless the data steward is such an important person because they really are the liaison, or the ambassador if you will, that sits between businesspeople who maybe don’t know data technology too well and the IT people who maybe aren’t familiar enough with the business to really understand what they need to do, right?

Eric Leohner: I guess it’s the chief data officer who is the steward of the governance, or the compliance officer, as well as the data architect who understands the business needs for the data. You’ve really got to value your information. I really love this quote. It’s by Mike Smalley. He says that, “People would not treat money in the same way they treat information. If, shall we say, USB sticks or DVDs were made out of solid gold, I think that people would be more careful with them.” You’re dealing with data that’s far more valuable than anything you could ever put it on, and that’s really why you need these stewardship initiatives in play here.

Eric Kavanagh: Yeah, that’s a really good point. I love that quote. I’m going to have to use that one. Then of course we’re leading up to the hot topic of the day and it’s all over the place these days in the data management world and that is the concept of analytics. Let’s face it, if you do not have good processes in place for first of all, identifying valuable data sets, profiling them, integrating them, migrating them, governing them, you’re not going to have good analytics. You simply can’t. You have to do all those processes first in order to even have fun and generate some business insights in the world of data analytics, right?

Eric Leohner: Exactly. This is the pinnacle of everything we’ve just been talking about over the last half hour or so. This is what it’s all about. You’re going to take the process of data discovery, filtering, cleansing, integrating and transforming and protecting it. Now you get to the part when you can see specifically, “Oh, this is how we’re going to make business decisions about this. This is how we’re going to glean new insight. This is how we’re going to get a competitive edge here.” That’s what this whole field of data analytics is all about.

Eric Kavanagh: Yeah, that’s exactly right. There are a bunch of different kinds of analytics. I guess we should just throw out some definitions here and then maybe you’ve got some case studies you can talk about. The general categories are descriptive, we have diagnostic, we have predictive, and we have prescriptive. Descriptive is really what happened. Diagnostic, why did it happen? Predictive, what will happen? Prescriptive is how can we make it happen, right? What organizations are doing is they’re understanding their data and then they’re able to, and this is a big one, forecast what’s going to happen and get a better idea, whether it’s in terms of supplies they need to purchase, personnel they need to hire, processes they need to put in place, whatever the case may be to have that good historical view of what happened and why it happened. That’s when you can start predicting with greater accuracy what’s going to happen and even the be all, end all, if you will, prescriptive of almost ensuring that you can make it happen the way you want, right?

Eric Leohner: Oh, yeah. There was an American multinational corporation that reviewed data from gas and oil fields daily. They described the processes to start with. They were looking at these drilling fields that have way too many wells for them to analyze individually. They adopted a software solution that collects, processes, and exposes the pertinent data in relation to those wells, analyzes the information, and then delivers it in a user-friendly visual report that’s derived from that information itself. This corporation was able to get a tenfold ROI increase because of this, and more astoundingly, it only took them three months to recoup their investment. Once you’re able to analyze your data, you’re able to do phenomenal things with it once you turn it into information.

Eric Kavanagh: Yeah. Again, the key is to have that audit trail of where the data comes in and have some documentation about what it means, who’s responsible for it, et cetera. Just as an example, and I’m not familiar with this solution so I’m just guessing, one of the dimensions you could look at is which engineers are responsible for which wells. Then you could start doing some analysis as you notice data quality problems over a period of time and realize, “Well, let’s see. When Joe coordinates these operations we tend to get pretty good data quality, but when Bill over here does it, it tends to get a bit shoddy.” Now this is the kind of thing that an audit trail and that documentation give you, the kind of thoughtful analysis that the data gives you. It gives the decision maker the ability to understand what’s happening and why it’s happening and then change something, a.k.a., fire Bill, right?

Eric Leohner: If it comes to that, unfortunately. For the good of business I suppose.

Eric Kavanagh: Yeah. This is good stuff though. Again, analytics, that’s the crème de la crème, as they say. That’s what people are trying to understand. That’s what these big companies like Airbnb and Uber and these other organizations have done so well: they gather data, they analyze it, they decompose business processes, and then they roll out these solutions. They’re open to change, right? This is one of the sort of nebulous sides to the equation here, the human side of the equation. I would throw out the importance of being able to change something and make a decision. The thing that is really nice about the data side of the equation is that it’s going to give the businessperson the confidence to make those decisions and to back them up when, for example, the board comes along and asks, “Why did you do this? Why did you fire Bill?” You’ve got some ammunition you can use to explain.

Eric Leohner: You’re going to have to have your information to back up anything you do in a business context. Even with these companies, like Uber and Airbnb, even though they’re collecting all of this data, it’s not just sitting around doing nothing. They have to be able to manipulate it. They have to discover it, they have to mash it up, they have to integrate it from different sources, they have to clean it, they have to migrate it. There are all these numerous processes that they have to go through before it even gets to the point where, “Okay, now we can analyze it. Now we can actually take this data somewhere and we can take our company to newer heights.” That’s really what analytics is all about, taking the information you have, taking the data you have, and using it to propel your company. You want to be able to project your company forward.

Eric Kavanagh: Yep. I think the key takeaway really is to remind people that we’re talking about a process. It’s an ongoing thing. Market dynamics change all the time. Successful companies, it seems to me, are the ones that are going to stay on top by effectively incorporating new data sets into their operations. It’s really important to remember that each step of the process is critical, right? If companies want to succeed, they’re going to need to embrace a very thoughtful approach to identifying, profiling, integrating, migrating, and governing data and that’s when the real analysis can begin, right?

Eric Leohner: Exactly. All that feeds into infonomics, which is the economics of information, and especially digital business, which is a blending of people, businesses, and things in new ways. We’re already seeing that manifest in smartphones and smartwatches that are actively being linked with televisions and vehicles. The data that’s being generated in device logs and the Internet of Things should become just as valuable as the traditional transactional data you mentioned from decades-old mainframe applications. Discovering what’s available, structured and unstructured, old and new, and mashing it up is where data integration and migration come in, along with the need for hunting data and producing federated views, plus protecting it so it complies with privacy laws and reporting requirements. In other words, that’s just governance.

Finally, once the data is newly processed and subsetted for visualization, it’s ready for those analytic processes that we talked about for learning, fixing, forecasting, and preventing all these other problems. Again, it’s about driving insight by turning data into information efficiently and accurately, and having a data life cycle management system in place that makes use of the activities we talked about today, so the information becomes the insights you can really capitalize on, because that’s what it’s all about.

Eric Kavanagh: Yeah, that’s exactly right. Well folks, this has been a great conversation. It’s just the beginning. We’re going to have several more conversations over the next few months and we’re writing a book. Stay tuned, folks. We’ll catch up with you next time. Take care, bye bye.
