This interview is part of The Bloor Group’s research program, Philosophy of Data, which has been underwritten by IRI, The CoSort Company.
Eric Kavanagh: Ladies and gentlemen, hello and welcome back once again to A Philosophy of Data. My name is Eric Kavanagh. I’ll be your moderator for today’s conversation with one of the leading luminaries in our field, a guy who has a lot of experience out there in the real world and also teaches and writes and does a bunch of other stuff helping people understand what to do with data. His name is Rick Sherman from Athena IT Solutions. Rick, welcome to the show.
Rick Sherman: Hi Eric.
Eric: I’ll throw out this general question to start, which is what is your philosophy of data?
Rick: Well that’s a pretty wide open one. As far as data goes, I think the philosophy that I have is that most people have a misconception of data. They think data means something absolute. The term “single version of the truth” is often used. People do think that a data point, some data, means something. Really, data is generally in the context of something. When you have a conversation, what you say can be interpreted in different ways based on the context in which you say it or the context of the conversation. Likewise, with data that enterprises use all the time, a finance person, a marketing person, a sales person, a supply chain person will take the same data and interpret it in different ways.
That’s something that I think that a lot of folks have trouble with. They think data means something concrete. First thing I like to talk to people about is the fact that data is something in context and it means something different to different people, depending on the context around it. In business, and when I first started warehousing, the idea was if it’s interpreted differently it means that somebody is trying to manipulate the data, the reports, the numbers to make themselves look better, but that’s sort of a negative view of it. Data is in context, and that’s how you need to be able to interpret the data. It’s not something bad but the reality of what data is.
Eric: That’s a really good point and you’re actually teeing us off here with a comment about semantics, it seems to me. Semantics has been one of these areas in data management that has been fuzzy for a long time and I think you just explained why that is, right?
Rick: Yes, and I know we’re going to talk a little bit about big data later on, but one of the things that I think people have an issue with is how long it takes for business intelligence or data warehousing projects to happen. Generally, they’re sort of enterprise in scope. That means more context, more people, more viewpoints of the data. Generally, the amount of time it takes to set those things up has nothing to do with the fact that there are schemas involved, semantics involved. If it was just the semantics of the business transaction or the conversation, it’d be trivial. The issue that you get with warehousing and business intelligence, any analytics project, is the fact that many different interpretations have to be taken into account in order to analyze the data.
That’s what takes time: talking to people to understand what that context is. The programming, all that stuff, is actually trivial. Well, not trivial, because that’s what I do for a living, but in the scope of how long it takes to understand something, it’s talking to people and getting the context that takes the time, not the physical creation of the schema or the physical coding of the dashboard or the data integration process.
Eric: I think part of the challenge, too, is that ultimately you have to make that context and all that data machine readable for the stuff to be leveraged and used?
Rick: Yes, absolutely. At least with structured data — the things that are coming out of operational systems, ERP systems, CRM systems, healthcare systems, at least you have an initial context that you can capture the data in and store it. You might have to then reconfigure it, create new schemas based on that in order to interpret differently for different business contexts or whether it’s a doctor or a patient or a specialist who might have different interpretations of the data. With a lot of the new areas that we’re getting into with social media and unstructured data, the problem that people get into is what’s the initial context that we stored it in, that we captured it in?
Then you have to interpret it so many different times. That makes the problem even worse, because you have the intake interpretation of context or schema, not just the various output contexts or schemas that you use later on for analysis.
Eric: All this feeds into something that our good friend, Mark Madsen, refers to as truthiness, which I always found kind of an amusing term. He’s pointing to the fact that this quest for a so-called single version of the truth really is a bit misguided or ambitious. At first I didn’t fully understand what he meant but then I thought about it and realized this is a very good point, because again, depending on the context and depending on the perspective that someone takes looking at a piece of data, there can be many interpretations of what that data is, first and foremost, and also of course what that means, right?
Rick: As we’ve talked about, Mark’s very smart. I think that the initial context of people wanting to go for the single version of the truth was the fact that way back when — when data warehousing was in its infancy and people were doing data integration and combining systems, there was a lot of effort to get the data to be correct. A lot of the intake systems, ERP systems, CRM systems were all primitive, and didn’t have as much editing or checking upfront, there was a lot of manual interpretation. A lot of people were dealing with making sure if the number 10,000 was input, the number 10,000 was output. That was a little bit what people I think initially were talking about: correct data.
Not that we don’t have input errors anymore, we certainly do, and we have gaps in data, particularly unstructured data, but a lot of the checking to make sure the number is what was input is handled by systems pretty routinely, and now we’re really dealing with the next wave which is what we’ve been talking about, the truthiness that Mark mentioned: we don’t really ever have a final definition of the truth because different people have different contexts. An exciting part of this industry has been how information and data have spread: how pervasive and how expansive the use of data and analytics has become, from the smallest to the largest company, across all industries, and with people in their daily lives being able to interpret data on their iPhones and Android smartphones.
The Don Quixote never getting to the windmill, the truth, isn’t a bad thing. It’s that more and more truth or contexts are being applied today, which just means more and more expansive use of data.
Eric: That’s a really good point. We want to talk about a number of different topics today. One of them is related to something you said to me many years ago on one of our programs. You talked about hand coding, and this was in the context of ETL. You said a lot of times organizations will use hand coding because it’s fairly quick and dirty, so to speak, it gets the job done. The trouble is that over time it becomes a bit of a problem, a challenge, because as workloads expand, as data volumes expand, sometimes that hand coding is just not robust enough. Besides which, when you do hand coding of this ETL (extract, transform, load) or other code, which has been the bread and butter of integration for many years, at least for data warehousing, well, what happens is that you don’t really, from a strategic perspective, have good visibility into what’s happening. You need a programmer to tell you what’s happening.
One of the jokes I’ve heard over the years is that a lot of ETL programmers don’t read policy and a lot of times fail to be aware of policy. The challenge of course is that over time, you get these increasing issues and problems and lack of visibility, and one of the arguments over the years, which I think is a sound one, has been that’s why you want to use some kind of a platform to handle at least most of the data that’s going into these systems so that you get that handle on things. What do you think about that?
Rick: Yes, I agree with you. We definitely were talking in that context. Certainly I’ve talked about, preached about, the evils of hand coding and why we get there, in teaching both corporations and students, in my book, etc. Hand coding isn’t just ETL anymore. We have a large amount of application integration, which really should be under the umbrella of data integration, which is also hand coded: people doing data or web services, SOA coding, JSON, XML, Python, Java, whatever they’re doing. We have the ETL way of hand coding, which is still going on. We have tons of application code with data being moved back and forth between cloud applications and on-premise applications, between B-to-B entities, and with the consumer or customer as well as the business.
As data has expanded and data sources have expanded, the problem has been exacerbated even more. It’s not just in ETL; it’s also the application integration. I’ll say there’s a third wave which we’ve talked about in different podcasts, which has been spreadsheets, the data shadow systems. I mean we have tons and tons of data manipulation, extraction and business rules that are manually coded in everybody’s favorite BI tool, Excel. We have huge waves of things and all of them have the same problems: they need to get done, and it’s easier to do initially. It’s not documented. Different people have different contexts. The application programmers and the ETL people don’t have the business context or don’t understand the business policies, as you said.
They also probably don’t document anything because they’re moving on to the next custom-coded job. The business person who probably does understand policy and business processes doesn’t understand the technology and the data integration and the consistency of data and how to create that. They’ve got the business context but not the technical context, and the other folks have the technical but not the business context. It’s a problem for all of us in the industry, and it continues despite the large base of data integration and business intelligence technology, tools and applications that are out there, and certainly we have a massive amount of data under those systems.
We still have a huge amount, an ever-increasing amount, still sort of in custom code, which gets back to your question of truthiness. It moves us away from that even if we get things integrated and consistent in one space like a data warehouse or a data lake. If it’s been pulled out multiple times, it’s probably been pulled out of context, and we lose the context of what it is, or the truthiness of that.
Eric: And you’re also speaking to this whole issue of complexity. I agree with you that the battlefield is expanding and it’s expanding quite rapidly. Organizations, if they want to maintain a significant grasp on their data, really do need to think through this complexity, especially from of course a decision-making perspective, but certainly anyone in the field of financial services or healthcare or other heavily regulated industries, they have got to really embrace the seriousness of that complexity upfront and maintain an awareness about how things are changing and what those data flows look like if they’re going to stay on top of things, right?
Rick: Absolutely. In both healthcare and financial services, and I’ve been working in both those industries, they do have the operational side, the medical side and the financial side, very tight. Not so much on the BI and the analytics side, on the use of the data for non-medical purposes or non-financial reporting purposes. They do have an extensive investment on the regulated side but not so much on the BI and the analytics side. I used to be at PricewaterhouseCoopers Consulting before it got bought by IBM. I was around the accountants, and I’m quite surprised how many of these custom-coded application integration, ETL, and data shadow systems pass muster, even checking off the statutes and regulations. It’s a little discouraging, but unfortunately the auditors and folks can’t keep up with the data or the data complexity either.
Eric: You know, it’s funny, I’ll just give you a quick side note. I remember attending a conference in the risk management space about two years ago, and the main speaker was a special counsel advisor to the White House who was there to speak on behalf of the Dodd-Frank Act. He made a very curious comment. He just came right out and said we just can’t keep up with you guys and the innovations that you come up with and the different products that you invent. He was basically admitting that from a regulatory perspective the folks in charge of those operations literally cannot stay on top of the innovation that’s happening, which is kind of a remarkable statement, don’t you think?
Rick: And that one is a truth statement. Unfortunately that’s a very truthful statement. I mean if the people implementing it can’t keep a handle on it, it’s tough for the people that are trying to diligently follow the regulations to be able to do, too. Data has exploded. The good news is data has exploded, and there’s a myriad of uses that we do personally and professionally every day, probably all the time throughout the day. On the other hand, it is complex and has gotten out of control in some places.
Eric: Well, and you are alluding to or teeing up our next topic of conversation here, which is how can leadership in an organization foster a more data-driven culture and a culture of awareness around some of these issues. It seems to me that’s a fairly significant challenge, and I’ve always been a big fan of incentives over punishment, carrot over stick essentially, but you do need some balance. How do you advise your clients when you talk to them about how they can foster a more data-driven culture in their organization?
Rick: Well, first off, we’ve talked before about the silent disruption. Not all that many years ago, the effort to sort of get data consistent, data governance, analytic or reporting governance, was all driven by IT, which tried to drag the business along. In the last 5, 10 years it’s been the business, particularly the more analytic functions within an enterprise. Of course that varies a little bit by industry, whether it’s marketing or sales or finance groups, or it could be others, too, like engineering, depending on the enterprise itself. Business folks, even if they don’t get the technology per se, I mean, you know, they can’t code, absolutely do get data and the fact that data needs to be managed and governed more.
Now whether they can actually get a handle on it and know how to do it, that’s a different issue. It’s not tough to get the business on board for trying to manage the data. Not that it’s an easy thing to do. The issue sometimes gets to be more the technology folks who see it as so complex they often don’t know where to start, and also the folks, whether it’s business or IT, who have been involved in all this custom-coded stuff; they’re kind of dragging because they’re tied to the legacy systems. Even at the smallest fronts, we’re dealing with more mid-market and SMB firms, not just large firms. They all get the fact that a lot of times data drives their business, not instead of their product, but as an integral part of their product offering, or just in how to drive sales, marketing, better customer service, more responsive products. I mean they get what data is and what it does.
The idea of the value of data and data as an asset is much easier. The issue is, how do we implement that, and how do we create a data-driven culture where trade-offs are made to improve the truthiness, the data quality, the accuracy or the return on investment of the data assets or the analytics, versus just moving fast? We still have a culture where we try to get things done quickly.
There are a lot of things going on. Going from the high level, “we believe in data management or data governance, we’re a data-driven company,” to actually walking the walk and implementing it and doing the trade-offs, that’s where we still have a ways to go. Some companies are starting to get it, and certainly different executives and other folks in a lot of enterprises do it, but that’s the one we still need to put more effort into.
Also going back to the complexity issue, I think that in order to get to truthiness, in order to go for a data-driven culture, you need to enable people to be more productive and to be able to get the data faster, better, cheaper, etc. Now in order to do that, well, systems and businesses are complex, people are complex, data is complex. It doesn’t get easier; it’s only getting more complex, as we’ve been talking about. The circuitry of a chip is complex and has been getting more complex. Not convoluted, just more complex. There’s nothing wrong with that; it’s a great engineering feat. I think people need to understand that data architecture also is complex and they should embrace complexity. We need it, and we need to orchestrate systems to support that truthiness, and that means complex systems.
We keep running toward how do we make things simplistic, almost. I mean we keep having vendors promise with each wave of technology that you’re not going to have to do any of that schema work, you’re not going to have to do any of that data cleansing, it’s going to be easy now. We keep fooling ourselves. We keep trying to run after solutions that are going to solve world hunger, as opposed to accepting the fact that data’s hard, data’s complex, and working on that. We keep going for these silver bullets like, say, a data lake. There’s a lot of value in data lakes. But the data lake doesn’t eliminate all this other work; it’s just a way to handle a certain set of data which then enables other complex data operations on the other side of the lake.
Eric: Well, it’s a good point and we’ve been referencing big data here and there throughout the conversation, and one of the things I really like about big data is that machines don’t lie. I think that in many ways big data represents a second chance for data management because although we did a lot of things well over the years, there were some mistakes along the way and in traditional BI environments you do have a lot of tinkering with things and you can get some pretty inaccurate looks at information overall. I see big data as almost a tonic to traditional data management for a variety of reasons, one of which is because machines just give relatively accurate information. Sometimes they don’t, only because a sensor is off or because there’s some other malfunction in the chain, if you will. But the other side of the equation is that a lot of the metadata management and information about when data was loaded, where it came from and so forth, that’s all baked into systems these days, into these cloud-based systems, for example. What do you think in general about the impact of so-called big data on the data management world with respect to more mature technology like data warehousing and BI?
Rick: All right, and by big data you’re implying the Internet of Things sensors and devices? Okay, as opposed to social media.
Eric: Right, exactly. Social media, you’re right, is a separate animal.
Rick: Unfortunately a lot of the focus of big data has been just social media. I’ve been telling people that the Internet of Things and devices and sensors are the huge wave following that, one that’s going to have much more value. As you mentioned, they’re honest. The sensor device, unless there’s a malfunction, like you said, gives you the number. All that schema and the context of whatever those measurements are, even if the sensor does hundreds of thousands of different individual types of measurements, those are all programmed in beforehand. There’s a lot of volume of data. So certainly I agree with you. I think it does give us the opportunity to have that whole pool of data managed much better, just because it is a machine, it’s a defined process, the results are known. I mean, the contexts are known beforehand.
Now as we talked about before, the ingestion can be managed much better and it does give us a second chance there. I love that, that’s absolutely terrific. On the other side, though, with the interpretation of those numbers, we’ll get a little bit into the same issues we’re in now, which is that we have different contexts for those numbers coming out. Hopefully we’ll take advantage of the fact that we’re not getting tied up on the ingestion side, because on the ingestion side we had all these ERP systems, CRM systems, people bringing them in, people bringing them out, replacing them, and that took up a lot of time. A lot of cycles, and it caused a lot of data issues.
If we’re not going to have that with the Internet of Things that’ll take that wave out of there and people can concentrate more on the output from the data, and analyzing the data. That should make it better. I also think there’s a wave of analytical tools and processes that are coming along at the same time. Everything sort of gets laid up with this big data. I’m not so sure that they’re necessarily big data specific, but under that umbrella I think there’s a lot of analytical tools that are giving us an opportunity to do things right maybe this time. We also have things like data preparation tools, which people can use, too, not just ETL tools or large-scale data integration tools. I think that all bodes well too for this generation of analytics.
Eric: Yes, I think that’s exactly correct. Let’s touch on the more unpleasant side of the conversation then as maybe a final topic to cover, which is when, where, how, and why do things go awry. I suppose we could talk about that all day, but maybe just take two or three issues that you’ve noticed from your many years in working with companies. What are some of the more common pitfalls that companies fall into when trying to really leverage data as an asset?
Rick: Well, I always say the three Ps, people, process, and politics, are the biggest hurdles that you have to overcome. The first one definitely is people and politics and the different interpretations of data, the different domains that people guard. One of the biggest hurdles is just sort of the politics and people of the data, the interpretation of the data. If you don’t believe there are different contexts of the data, then you’re going to argue about what the word profit means, as opposed to accepting that profit has different business rules or different contexts within the enterprise. We have a lot of that debate.
The second thing that sort of causes a lot of issues is just clinging to legacy, and we have that both on the technology side and on the business side. On the technology side, it’s the applications that have been implemented: people getting stuck with the way it was done, or not understanding how it used to be done and sort of jury-rigging what the current system’s doing. On the business side, with all these data shadow systems and spreadmarts out there, despite the fact that I always get the people that created those absolutely wanting to move forward, the day that it’s time to shut down their data shadow system they become the enemy of the new, if they hadn’t been before that. We have a number of forces sort of working against us in moving forward.
Eric: You brought up another great point, which allows me to throw in one of my philosophies around data management, which is that the carrots are better than the sticks. The reason I say that is because the more constraints you apply to information management, the more likely it is that someone in your organization will simply circumvent the rules. I mean this happens all the time, where a process becomes too constrained and so it just simply goes out the window. I happen to know someone who works at one of the VA hospitals in this country and she said that she will just enter null values into certain fields as she is dealing with a patient because it is such a pain to deal with trying to enter that much information while you’re on the phone with someone.
What she will do is just enter null values and then when they’re off the phone she’ll go back in and fix it because she’s not even allowed to get to the next screen until she does all that. Here is such an interesting case where someone was thoughtful, they didn’t want bad data going in, so they put constraints in place that said, no, you must have this data in order to get to the next screen. We’ve all been there, I think. What happens is, if it causes too much of a delay or is too tight a constraint, people will find workarounds. To me that speaks to the value of being careful about where you implement those constraints and having a policy that encourages people to give you feedback, not just once a system is in production but throughout the design process, and keeping an open mind about how any given constraint is going to impact the overall flow and accuracy of data. What do you think about that?
Rick: Absolutely. And I’ll take us through a couple of contexts. One is you mentioned getting feedback during the development cycle. That’s critical. One of the better techniques for developing analytics, dashboards, et cetera, is to sort of storyboard, going through a day in people’s jobs or how they actually go through and make decisions. What data do they look at, what process do they go through? If you do development that way you get those kinds of things that you’re talking about. People can work with it, look at the process, give you feedback: “Well, I can’t get anything done on the phone because,” or “this doesn’t make any sense.” What happens is, if it doesn’t make any sense to complete the form, like you said, they’ll do it some other way. That’s sort of in the short term, building out the systems.
I do absolutely agree with you on the carrots and sticks side. Too many times we’ve had IT groups, and even business groups, that have wanted to regulate things or control things. Certainly in the early days of business intelligence that was easy for the IT group to do because they were the only ones who possibly could write reports. With the state of technology that’s certainly not the case anymore. IT needs to be set up as an enabling technology so that business people can get more productive, rather than, for example, having the IT group develop every dashboard or report for them. The idea is to get the integrated data, and then to get the data into whatever context a business group is going to use it in.
We used to call them data marts or OLAP cubes; now it’s in-memory columnar databases, whatever they are now, whatever technology you use now. We’re getting the data in context so then the business person can do the last bit of work, which is actually the one that’s most creative and the one they’re the experts in and able to do: once the data’s been put in context for them, they can analyze the data, whether they do a self-service dashboard, use a data discovery tool or whatever tool they use, or a spreadsheet. If the data is in the right context, then what difference does it make what tool they use; they’ll arrive at the answer.
If you have a rigid, predefined report or dashboard and they don’t have any variation to it, then if it’s not good for them they’ll just pull the data into their spreadsheet and do whatever they need to do in order to get their job done. I mean, like the person you mentioned who was on the phone with the patient, you’ve got to get your job done, and whatever it takes to do it, you’ll do. I think too many times people worry about building the most elegant system that they can control and forget that there is no real control out there.
Eric: Yes, perfect is the enemy of good, right?
Eric: Well, folks, we’ve been talking to Rick Sherman, Athena IT Solutions, a real leader and a visionary in the field. Thank you so much for your time today and for offering your philosophy of data.
Rick: Thanks Eric.
Eric: All right, take care. Bye-bye.