Inside Analysis

WhereScape and the Modern Data Warehouse

This is an excerpt from an interview Bloor Group CEO Eric Kavanagh conducted with Mark Budzinski, President of WhereScape USA, on December 15, 2015, to discuss the findings of the research report, The Modern Data Warehouse: Agile, Automated and Adaptive.

Eric Kavanagh: Mark, welcome to the show.

Mark Budzinski: Thanks for having me, Eric.


Eric: Sure thing. WhereScape got involved with this research report of ours on the Modern Data Warehouse. It was a very comprehensive report that our friends put together at DecisionWorx and we collaborated with them on that. There are lots of very interesting details, and it certainly kind of struck home for you in the data warehouse automation space so I’ll start the conversation by asking you for your general impressions. What did you find interesting about the report or what stood out from your perspective?

Mark: I loved, as you say, the comprehensive nature of it. I think it’s a great piece for those that are challenged in their modern data warehouse to certainly read and embrace. The idea that there is a modern data warehouse I think is really where I’d like to start in this reflection because that modern data warehouse isn’t necessarily all modern with all new technology. It very much still has to incorporate traditional technologies. I think the respondents said 65% or some big number are still wrestling with traditional SQL environments along with the new big data sort of technologies.

I found that to be interesting in and of itself. If the listener to the podcast is walking the halls going, “Okay what are the systems that we’re dependent on to run our business here and ultimately to bring analytics to the contemporary business user that’s looking for any and all tricks to get an accelerated view of that data?” Then they’re going to find that there’s a long list of these technologies that are coming together. Complexity, maybe we can call it that, is at an all time raging high right now.

Eric: Yes, I think that’s exactly right. We’ve joked in some of our promotions of this report that the rumors of data warehousing’s demise have been greatly exaggerated. You know, I’m in the media and in the analyst space and we’re all fairly forward-looking, but it does pay to realize that much of what we talk about is at the bleeding edge, if you will, or the cutting edge or the leading edge or whatever you want to call it and the fact is, that data warehousing is still going strong.

It’s in many, many large organizations today and not only are the technologies still there and still somewhat similar as they have been for years, but the methods and the practices are still going strong. And those all still matter even as we move to this new sort of hybrid data warehouse or there are other terms that people use to describe it. Still, all of those practices and technologies and processes are central to data warehousing and to generating value from data.

Mark: The notion of generating value from a data warehouse is very acute at this point. I mean there’s no misunderstanding that we have to suck value from our data infrastructure at a speed-of-light pace right now. If you think about data warehouses over the last 20 years, the reputation is not stellar. How many people raise their hand and say, “Yeah, that enterprise data warehouse in the back room, that’s good for me. That’s adding a lot of value to me as a business user or as an analytic practitioner.”

But the reality is, there are prolific enterprise data warehouses everywhere, so the question becomes how do I, right now, squeeze value quickly from that data infrastructure to the business user that’s otherwise attracted to and chasing all kinds of alternative technologies, whether they be self service or otherwise. Strategies like data warehouse automation are extremely important here, to say, “Look, the data’s there in the enterprise data warehouse, but only in part. That data has to be integrated, likely with other sources, whether they come from sensors or contemporary web-based technologies, whatever the case may be, and time is of the essence.”

Doing the value-based data warehouse development in the same way that we did it 20 years ago is sort of archaic, right? That would be stupid but you’d be surprised how some of the larger organizations around the world still carry those older methodologies with them.

I think that’s the segue, and what I loved about this report is that there’s a lot of strategies for advancing your infrastructure so you can keep up with the pace of the business. Data warehouse automation is just one technology but you know, 18% of the people are, in fact, declaring that they’re using that technology today along with data lakes and Hadoop and NoSQL and these other things that are more front page news, if you will, when it comes to media exposure.

Eric: It’s interesting because I think you make a great point about the importance of data warehouse automation and that’s obviously your sweet spot at WhereScape. As I sit back and think about what software does, the whole idea is automation. You want to automate whatever you can. You obviously have to leave some areas open, some control points for manual modification and so forth, but any time you can automate something, you should automate something. That’s the whole purpose of software in general and so with a project as complex and time consuming and usually expensive as data warehousing, data warehouse automation to me is the slam dunk in terms of the different kinds of additional technologies you can weave into the mix, right?

Mark: For sure, and particularly as you look at the nature of data projects which have to be inherently agile to achieve the ultimate objectives of the project, right? This is not like we’re making an iPhone or a Tesla car where we understand exactly what the spec is and we can go build it at spec and ultimately test it at spec and deliver it at spec. Data projects are organic. Business users have a rough idea of the kind of data that they need to support their business questions and the analytics that are required and, ultimately, the data that’s going to support that set of requirements is also a little bit of a moving target.

To understand the two ends, you’ve got to be able to quickly get to an answer where you can show a business user exactly what the data will afford so that they can say, “Oh, now that I see it, actually I don’t want this red thing. I want this pink thing. I want this purple thing.” They evolve their requirements, and anybody who’s ever done data projects could wax eloquent for hours about stories where business users declare on day one, “This is what I need. Would you guys in data land please provide me the data?”

Then when you provide the data, it’s “Oh, actually I need this other thing instead.” It’s such a moving target and if handled inappropriately through sort of old waterfall methodologies, you frustrate a lot of people. You downright make people angry, Eric, if you talk to people about this, right? But if you can take an agile approach where you’re building solutions quickly based on real data, prototypes if you will, and that gets shown to the business user so that they can iterate through their red, pink, purple declarations of what they require, automation is a beautiful way to keep up with the business and I would declare it’s mandatory.

What other major business, whether it be automobile manufacturing, agriculture, just pick anything that has a mainstream market, isn’t automated to the largest extent? Your point that says why wouldn’t we automate is rhetorical at one level but it’s absolutely true. If you think about data projects, there’s a lot under the covers that’s repeatable. Error handling is repeatable. Modeling is repeatable. Documentation of metadata is repeatable. Those kinds of notions, why do I really want a small army of consultants in the room that I’m paying uppity umpteen dollars per hour to crank all that out by hand when I can automate it? Taken at that level, it’s sort of a no-brainer.

Why wouldn’t you automate? Look, if you do this report in another couple years, I would expect that 18% figure in your report for data warehouse automation to go higher. I think you have another slide in there somewhere that says if you’re experienced at this, if you’ve had greater than N years of experience, the number goes up to 24%, which is not surprising to me. Those that have done data projects before understand that automation is critical, otherwise you’re going to be left holding the bag a year from now when the business is blaming the data guys, “Why haven’t you delivered?” “We haven’t delivered because you haven’t given us the requirements and the data is not right.” It’s this big argument, right? Automation is really the way out of that sort of chaos in my view.

Eric: There’s an old adage, “When you know better, you do better,” so I think that explains why the more experienced people in the data warehousing world are using data warehouse automation. They’ve learned and there’s so much documentation that really should be done and I think anyone in the software world knows that if you don’t automate the documentation side of the equation, you’re going to have big gaps and when you have big gaps in your software documentation, it’s like missing a couple days in calculus class, good luck.

Mark: For sure and you know what? With the modern data warehouse, where we’ve got sensor data coming in, we’ve got self service, we’ve got Tableau and Qlik users doing their own thing in the department, we have data marts and SQL Server, we have data lakes and Hadoop, we’ve got the old enterprise data warehouse in Teradata or some IBM technology, guess what? This documentation issue is even bigger. You tell me that the guy using self service at the business desk is sending metadata about the project and about the business rules back to the IT department; I mean no way.

The idea of managing metadata and ultimately documenting the environment, what table comes from what, what piece of analytics is driven from what source, is just becoming more and more challenging, and automation is one technology certainly but I would just say this metadata management documentation thing you mentioned, it’s just getting bigger, it’s just getting harder.

Eric: You bring up agile and I think that’s obviously a key concept in this report and it is a term that has now been used in various contexts so someone could say it has been bastardized at times by some people, but it is a fairly well understood concept and it comes from agile software development and just to have it in front of me, I looked it up on Wikipedia. It’s “a set of software development methods in which requirements and solutions evolve through collaboration between self-organizing, cross-functional teams.”

You mentioned earlier in this process of trying to gather requirements if you do so in the old waterfall method, first of all it takes so much time. So long compared to what we’re doing these days. Second of all, you don’t benefit from that immediacy of interaction between teams. Like you said, the businessperson needs to see some iterations of what the stuff is going to look like, whether it’s from a sample of the data or some other small view of what we think it’s going to turn out to be, they need to see and roll their sleeves up and get their hands dirty a bit with what the data looks like in order to better understand what it is that they’re going to want. That’s why that collaboration and that iteration is so critical to get to a happy place, right?

Mark: There’s no question about that. Whoever thought it was a good idea to build a data project like we were building an iPhone has obviously never done this before. Why was this so prolific for so many years is a real head scratcher to me, but if you look at the modern data warehouse and the methodologies thereof, I’m certainly pleased to report among our customer base, and I would dare say for those that are not in my customer base, as a general comment, that agile development is now accepted as the have-to-have for these reasons, the collaboration and iterative nature.

Now, 10 years ago, and even five years ago, if the data architects or the data delivery guys failed to deliver through a waterfall methodology, you were left with frustration and angry business users but you weren’t left with business users that just said, “You know what? I’m out of here. There are alternative technologies that we’re going to go pursue.” You think about it now, self-service, data virtualization and a lot of ways for the business to just basically say, “If you guys aren’t going to keep up with me and collaborate with me then I’m out of here. I’m going to go do it myself,” which, you know, is the short-term fix to any kind of a time-based data need but not the long-term strategy that any data manager IT department is interested in.

We want to manage data for governance. I mean, somebody’s going to ask us, “Are these numbers right?” Ever been in a meeting, Eric, or been with a customer where you got two people arguing about the data? “No, no, no, this is what my report says.” “This is what my report says.” One was cooked up in Tableau using that technology, another one’s cooked up in another department using a different technology, so you’ve got to get it right.

Anybody that’s public that has governed data that has to be audited, you have to get this right. So how do you get it right? You have to collaborate with the business and you do that through this iterative notion that essentially gets you to the point where you can actually deliver something that is not only of value but it’s right, it’s correct.

Eric: I’m glad you brought up Tableau, you must have been reading my mind there again because I was going to mention what you had just talked about a moment ago, namely the alternatives. Competition breeds excellence, right? Or at least it usually does and what happened seven or eight years ago when data warehouse appliances really took off was you had a case of what you just described.

Namely people were tired of trying to go through the political process of getting information into the warehouse or out of the warehouse and they said, “You know what? We’re just going to build our own supercharged data mart and that will take care of our needs.” Then Tableau comes along and it’s even easier to use and it’s a nice visual tool so suddenly you’ve got everyone in the organization thinking that they’ve got their view of the world and that view is correct.

You had this diaspora almost of analytical mechanisms being used and what does that ultimately lead to? Well, sooner or later you realize, okay guys, we do need one trusted governed source of critical data for our sales numbers, for our profit numbers, maybe for compliance reasons or so forth and that’s when I think this sort of resurgence of technologies around data warehousing really took flight.

Because people recognized that we went from having one big silo that was hard to get data into or out of to having five or six or seven different silos. We’ve got to wrangle those cats now and get them into one location and get some trusted version of what’s going on, and I think that kind of led to this renaissance in data warehouse technologies, what do you think?

Mark: I would say for sure. The balance of power, I mean, if you think of it as an organizational development kind of an issue, has changed dramatically and it is a bit of a pendulum that goes back and forth. I think we’re starting to settle nicely into the more optimal place. It used to be all IT and business was at their mercy. That’s not working, but when the pendulum goes the other way and the business says, “Yep, we’re gonna be cowboys and do whatever we want,” that has its own challenges, as I just said, when it comes to audited governance or just having the same answer to the same business questions among different departments around the building so we’re not having arguments about the data when we show up at the management meeting.

In that regard, I think it’s really important to recognize that that balance of power has finally played out with agile. IT can keep up with the needs of the business. Business is now starting to understand and reflect, it says, “Oh, that quick and dirty thing I bought didn’t contemplate historical data. It didn’t contemplate other data sources. It really was a quick and dirty thing,” so I think we can all agree now that if there’s a way for the data architects to keep up with the pace of the business through agile and through these technologies that are documented in your report, then that is the right answer.

That ultimately is a sustained answer, is probably a better way to put it. I mean, what is sustainable over the next five years so we’re not just thrashing back and forth? I think we’re finding an equilibrium now when it comes to that balance of power, the methodology of agile and the technologies, whether they be NoSQL, Hadoop or data warehouse automation, that bridges the gap between those two.

Eric: That’s a really good point. As I’m looking at this report, obviously one of the other big developments that it discusses is this concept of the hybrid environment. As you alluded to, big data is very topical these days. Everybody talks about it; there are a lot of good reasons why we talk about big data, because it is everywhere.

It’s everywhere, and we now have the capacity for capturing it and working with it and wrangling it and getting some value from it but the key, ideally, is that you don’t want to do so in a vacuum. You want to be able to leverage big data assets within the context of the lens through which you view your organization and that being data to some extent at least and the warehouse as this conceptual, centralized area where your most important data sets are managed and governed. I really feel that this hybrid view going forward is going to be the answer and the key is just how do you integrate your warehouse with these auxiliary forms of data or types of data. I’m curious to hear your thoughts on that.

Mark: Look, with the big data movement and machine learning technologies, we’re way past the “what is this and should I consider it and maybe I’ll look at it some day” stage. I can say pretty definitively, it’s mainstream. There are big data technologies, there’s sensor data, there’s big volumes of data, there’s Hadoop, there’s Hive and Splice and all these SQL notions. I mean, it is everywhere but that doesn’t mean that it’s exclusive in any way.

Why is it still front page news that big data is the leading technology when in fact, your report says it’s just one of many strategies? I think it goes back to the beginning of the big data movement would be the way I’d reflect on it. It was positioned by vendors and I think media picked up on it quickly as a very high drama, almost competition that said big data will threaten and perhaps even take over the traditional SQL world. If you remember three or four years ago, I mean, that was the rhetoric. There were reports that were produced that said in what timeframe will this battle be reconciled? It was an either-or sort of an argument.

Eric: Right.

Mark: If you’re an American football fan, you understand the polarizing effect of somebody like Tim Tebow. To me, there’s a Tebow effect of this big data thing. It is really dramatic and it was polarizing for a long time. Now, as we enter into 2016, it’s not that polarizing. It’s not that dramatic, so everybody’s calming down and viewing it much like your authors have found in the report as they talked to respondents.

Though it is mainstream, it is just one of a series of things that have to be considered. The modern data warehouse is not either-or. Can we just say that in plain English? I think maybe the start of the argument that says we have to look at not just one but many contemporary technologies, so data warehouse automation, it’s not typically on page one of the New York Times in this metaphor. It’s on page 37, it’s in the back but it’s there.

Eighteen to 24% of the folks are using it and that number is going to continue to grow. It’s not as sexy, it’s not polarizing, it’s not Tim Tebow-like in the way that the media captures, “Oh right, so you can just automate acquiring data. You can automate doing modeling.” It’s just not sexy, but the reality is, there is a lot of time and dollars that are squeezed out of a project and, ultimately, you can realize the dream if you will, which is the objective of a true agile implementation.

Eric: You know, I was thinking before we did this call about an interesting analogy between data warehouse automation and what we have as a reality in cloud computing. One of the things I love about the cloud is that the documentation is all baked in, and it’s partially because now the infrastructure’s there to handle that kind of thing, the technologies have been centralized in the cloud instead of having to update everyone’s latest version of Microsoft Excel or Microsoft Word or that whole client-server conundrum if you will.

Because when you load data into the cloud, all that information is captured with it. When you loaded it, who loaded it, even in Google Docs for example, you can go back and trace every change that was made in a particular document. I’m old enough to remember when Photoshop, as a strange example to throw out here, built in history. In the old days of Photoshop, if you did a bunch of changes, you could not go back and undo those.

It was too late; you had better have paid attention to what you had done because otherwise you were not going to be able to get there again, and science requires repeatability. I kind of see WhereScape as having been on the forefront of this automated documentation movement and automating the data warehouse and we now see those principles kind of baked in all over the place, especially in the cloud. What do you think?

Mark: Yes. Cloud is another strategy and with it comes some advantages as you just articulated but you know again, there’s no perfect technology or perfect strategy. It’s not like cloud-based solutions are now all the rage at the expense of these other strategies. When it comes to automation, you still have to deal with historical data, you still have to deal with multiple sources. How does that get managed?

So you’re absolutely right, a lot of organizations are taking advantage of cloud solutions but we’re finding that in Microsoft Azure, the EDW technologies that are in the cloud now or Oracle or any of these other environments that are going to the cloud, automation is still a play.

There’s no doubt about it, and if the listeners to the podcast are wondering who really uses this technology: Is this mom and pop shops? Is this mid markets? It is, because they don’t have a lot of money and time lying around that they can waste. We also have customers such as Costco, Tesco, Nordstrom, large retailers like that, manufacturers like GE Aviation. The market is absorbing these technologies and figuring out how to use them in a complementary way with each other. Hadoop is one, cloud as you mentioned is another.

Automation is the third one as we get into the major themes of how do we improve our situation. The collaborative, iterative notion of agile development is squarely a play for automation. If you’re serious about agile, if you’re serious about collaboration, if you’re serious about iteration, I don’t care if it’s going to be in the cloud or on-prem, if you’re a big company or a small company, that theme is best realized through automation.

Now there are other themes that are realized through Hadoop, new data sources, cost of storage. Every technology that’s in your report certainly has a wheelhouse or a center of gravity where it adds the most value but I think what I love about this report is this notion of the modern data warehouse is in fact a collection of these things. The greater aggregate of the seven or eight technologies in and of themselves are interesting but together form the basis of responding with data infrastructure to the business. At the end of the day, that’s what this is all about, and it is very, very profound; it’s terrific.

Eric: I love that. I think you’re exactly right. Let me throw one last forward-looking question over to you. As you and I look at this report, I think we’re seeing the same thing which is that there is this range of technologies that can be woven into the mix and that the modern data warehouse will use some or all of these technologies to be more adaptable, to be more agile, to provide value quicker to the business than it has in the past.

I view the data warehouse as the grounding for an information strategy going forward, meaning there are going to be all these new data sources and I love big data. I love machine-generated data, for example, because machines don’t lie.

Despite the storyline in the media every day that machines are going to take over and all this nonsense, which comes from remarkably intelligent people, I’m still pretty sure that machines are very honest. They do what they’re told, they don’t give you a line of BS and let’s face it, some people do just speak untruths. Sometimes they do so intentionally, sometimes they do so unintentionally but machines, if you ask them the question correctly, you’re going to get the answer that they have. I view the warehouse as being the center of gravity really for a good solid analytical strategy and if you have the right mix of technologies in place, and I think automation again is a slam dunk amongst them, you’re going to be able to use new data sources without losing your moorings. What do you think about that?

Mark: For sure. If you’re arguing that machine data and sensor data and these other nonhuman sources are going to become a larger and larger part of the modern data warehouse phenomenon, I couldn’t agree with you more. There’s no question that these new sources are driving new kinds of analytics. It’s not just a matter of what a human can deduce, it’s what machines are offering. What has prevented these machines from talking to us so prolifically in the past, well, the technologies prevented it.

We’ve broken through that barrier now so there’s no excuse not to collect data from every single corner of the business that’s measured. Customer behavior, product attributes, things related to service and mean time between failures, I know that’s a big movement right now, fraud and other things in the commercial banking world. There are so many themes that we’ve struggled with in the past and now you just say, “Oh no, let’s just open the book and see what the machines have to say about this.”

I think you make an extremely good point. That’s going to get more and more and more prolific and I think that the foundation of the technology to deal with it is there. Will they evolve? Sure, they will but I don’t think it’s going to be a revolution. That’s already happened. The evolution of these technologies is probably all that’s required to keep up with it, and it will be a fun market to be part of that’s for sure.

Eric: That’s great. Folks, we’ve been talking to Mark Budzinski, President of WhereScape. Great conversation today, thank you for your time, Mark. I think WhereScape is right there and has been right there on the leading edge for some time now, so good for you.

Mark: Thank you very much.

Eric: Okay folks, take care. We’ll talk to you next time. Bye bye.
