Inside Analysis

Waterline Data and Enterprise Hadoop

Oliver Claude, CMO of Waterline Data, sat down and chatted with Bloor Group CEO Eric Kavanagh on March 18, 2016.

Eric Kavanagh: Ladies and gentlemen, hello and welcome back once again to Enterprise Hadoop in Production. That’s right, a research report we are doing together with our friend David Loshin here at The Bloor Group and Inside Analysis. We are talking about real world Hadoop solutions in production, not just the sandbox stuff. Who is really using this stuff and who is getting value from it is what we are going to talk about today with Oliver Claude. He is CMO for a company called Waterline Data, so Oliver, welcome to the show.

Oliver Claude: Thank you Eric.

 

Eric: Before we go into exactly what Waterline does, and just to preface this, I think you guys are really onto something that I consider a critical component of a Hadoop strategy, but we’ll get to that in a little bit. Let’s first talk about this whole concept of the Chief Data Officer, because it is kind of funny: when I first heard about it I was like, “Really, another CXO? Do we really need one more person at the table? Isn’t that the CIO’s role, technically?” It was on a DM Radio show at the beginning of last year that Krish Krishnan of Sixth Sense Consulting and a couple of other people really changed my mind about that and convinced me that a CDO is a good idea for a lot of different reasons. It is a voice at the table for the value of data and someone who can hopefully align people, bring people into the process, and make data more of a strategic asset.

So, I’ll ask you from your perspective, I’m very curious to know, what do you see as the CDO’s meaning for big data and Hadoop in general? Is it good, is it bad, a mix? What do you think?

Oliver Claude: I think it is very good. You know it’s interesting we have been talking about data as an asset for many years and it feels like now is the first time where there is actually an enterprise champion for data. It is interesting there has been this trend of data democratization that people have been talking about for a couple of years. And, again it kind of feels like there now is one person who has a seat at the C-level table, who can really focus on data, on helping the business unlock the value of data in a way that really wasn’t possible before. And certainly technologies like Hadoop are going to help the CDO actually put this into production and make it become a reality.

Eric Kavanagh: I think you are right about that, and obviously it depends on the person and the culture of the organization. You always end up with political struggles and, let’s face it, data has often been viewed from the perspective of ownership. By that I mean people are protective of it, which of course is good to an extent, but it is not good to the extent that people do not share data that should be shared. It seems to me that is probably a key role of the CDO, right? To serve as liaison/mediator/mitigator, to get people to actively and even proactively collaborate on data sharing and on helping other parts of the organization understand what their data sets mean and how they fit into the big picture, right?

Oliver: You are right. It is really interesting because IT has been trying to help the business be successful with leveraging data, but I think the processes historically were sometimes challenging because getting access to data was more difficult. I think the business now wants much more self-service. And so, there is a new relationship I think that is being formed between IT and the business, where IT is being more of a facilitator and the business really wants to be empowered. I think you’re right, the CDO can play a key role to help bridge that relationship all the way up to an executive level.

Eric: Yes, that’s really important stuff it seems to me, because if you get a good leader who works with people and who is approachable, has a vision and a strategy, then all sorts of things just kind of fall into line and it really moves these projects forward. How many times have we heard of a data warehouse that just becomes a silo where only certain people have access? Sooner or later you get off track. Slowly changing dimensions, for example: if the right people aren’t tracking things, it just slowly veers off the tracks. The CDO, it seems to me, is going to be the person responsible for maintaining that strategic view of data assets and data systems, and slowly but surely herding the cats and aligning people to all work toward the common goal, right?

Oliver: Absolutely.

Eric: Let’s talk about this whole data lake concept. Of course Hadoop is not exactly new; as I recall, it just celebrated its tenth year of being in the open source community at Apache. There were some early adopters, obviously Cloudera, later on came Hortonworks, and a lot of other companies now, probably hundreds of companies, are jumping on different parts of this equation. I realize that here in the analyst world we talk to a lot of people like yourself and these other companies, and they’re all knee-deep in Hadoop.

What is your take on the uptake across the Fortune 2000 companies? Is it 10%, 30%, 50%? Can you give us some sense of how many or what percentage are really taking it seriously and moving it forward?

Oliver: It’s really interesting. We’ve been working with companies around this concept of a data lake for a couple of years. There has been a little bit of a bad reputation; there has been a lot of talk about the data lake being a data swamp. When I look at all the companies we have been working with, and they are all in the Fortune 1000 range I would say, in the last 12 months we have seen an incredible uptick in the number of large enterprises implementing data lakes. Is it 10%, 20%, 30%? Any number I give would be anecdotal, based on what we see, but I would say it is at least half. It is really incredible the number of large projects we are seeing getting off the ground.

Eric: That is good stuff. I think you are right, the tipping point is starting to occur. Enough companies have heard about this and are investigating it, frankly moving beyond the sandbox stage. Because if you are a responsible CDO, you are not going to grab a Hadoop distro and throw it in production inside of six months. You are going to want to really play around with this stuff, understand what you are doing, understand security risks for example, which are not insignificant, and understand the whole metadata side of the equation. If you think about the mistakes that we as data managers made many years ago, even in pre-data warehousing days, there could have been so much more benefit had people really just thought through the long-term process of using these information systems for analysis, for discovery, for exploration.

What I noticed early on, and I asked quite a few questions about this on different shows, is: are we making the same mistakes again in the Hadoop world, not fully appreciating the importance of metadata, the importance of semantics, the importance of aligning concepts, things like ontologies? From my perspective, from what I have learned about Waterline Data, that is where you guys are going. Part of your vision was to build a managed process around helping analysts explore, discover and really rationalize and make sense of data that is in a data lake, right?

Oliver: Yes, you know it is really interesting to see the transformation that is taking place in the market. When people started to view the data lake as potentially becoming a data swamp, I think they realized that the issues you just mentioned around metadata, and data governance more broadly, became really critical to being able to open up the data lake to the business. It’s really interesting from our perspective; we took the view of the data lake as a great place to store data. That is why large enterprises are leveraging the data lake approach, whether it is on premises or in the cloud, whether it is one cluster or multiple clusters. There are different ways to architect it.

At a high level it really offers some great benefits as far as putting all of the data in one place. The expectation was, it’s all in one place, now people can come and really take advantage of a self-service approach to work with the data. There has been a lot of talk about the data lake being deployed as a service to the business if you will.

What’s really missing is the layer between where the data is stored and actually allowing the business to work with the data directly. What we took to market a couple of years ago is what we are calling a data marketplace for Hadoop, for your data lake. I’ll use a metaphor because I think that will help crystallize what these things really mean.

When we started the company two years ago, we thought about a metaphor to describe what people really wanted. Think about the data lake in the context of online shopping; I’ll use Amazon.com as an example. Think about how easy Amazon.com makes online shopping: as a consumer I can go to Amazon and search for a product, and I can look at different characteristics. If I’m buying a toaster I can buy a two-slice or a four-slice toaster. I can look at different brands. I can look at complementary products people bought. I can look at reviews. There is a wealth of information about the product that helps me decide if this is what I want to buy. Then, I can buy it and it gets shipped to me.

What’s so interesting about Amazon.com is that on the back end you have processes to catalog the products, to make sure that when a consumer goes to the website they can actually get the right information and trust the information they are getting about the product. Apply this metaphor to the data lake. That is what we have done with the Waterline Data product. So if you think about Amazon.com as a marketplace for products, we are a marketplace for data. We fit over the data lake and we do all the things Amazon.com does on the back end. We catalog everything you have in the lake, and we have some cool technology to do it in a unique way through automation. We have a UI for the consumers of data, in this case business analysts and data scientists, to let them find the best data, the right data, the most trusted data. Then we can get it to them through a process of provisioning the data that is secure.

Eric: That makes a lot of sense, and I think people sometimes underappreciate the critical importance of having that discovery layer, which makes sense and which is intuitive, because as time goes by and you add more and more data into the system, unless you have some lens through which you can view these data sets, it’s just going to become a mess in there. It’s going to become incomprehensible. I think you really focused on a critical component of the architecture, and to me, if that is done right, then over time your data lake becomes more and more valuable. Whereas, if you do not have some system in place, then you are going to slowly but surely suffer from entropy, right?

Oliver: For sure, and you know what’s interesting, if you dig a little deeper into what you were mentioning about metadata and so forth: if you take a top-down approach and try to define metadata about all the data in Hadoop manually, it is impractical because there is so much of it. Even though there is a recognition that the metadata needs to be there, I think we are pioneering a unique approach through automation that allows us to discover the semantic meaning of data in Hadoop and tag it automatically. That aligns with the way the business wants to work, which is that they want speed to value. They don’t want to wait 6 months or 12 months for somebody to build a business glossary and then try to map that to the physical data. You want things to be fast.

Now, that being said, we do allow you to import an existing business glossary if you have invested in one and you have a process around it. We can import it into our product and then leverage that as a starting point. What we add on top of that are the rules to automatically recognize patterns in the data and automatically tag it, and that is where the real time-to-value savings come into play.
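To make the idea of rule-based, automated tagging a bit more concrete, here is a minimal sketch of how pattern rules might suggest glossary tags for a column of data. This is purely illustrative; the tag names, regular expressions, and threshold below are assumptions chosen for the example, not Waterline Data's actual rules or API.

```python
import re

# Illustrative only: a toy version of rule-based semantic tagging.
# Each rule maps a glossary tag to a regex that field values should match.
# The tag names and patterns are assumptions, not Waterline Data's rules.
TAG_RULES = {
    "Email Address": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "US Phone Number": re.compile(r"^\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}$"),
    "US ZIP Code": re.compile(r"^\d{5}(-\d{4})?$"),
}

def suggest_tag(sample_values, min_match_ratio=0.8):
    """Suggest a tag for a column based on a sample of its values.

    A tag is suggested only when at least min_match_ratio of the
    non-empty sampled values match that tag's pattern; otherwise
    the column is left untagged for a data steward to review.
    """
    values = [v for v in sample_values if v]  # ignore nulls/empties
    if not values:
        return None
    best_tag, best_ratio = None, 0.0
    for tag, pattern in TAG_RULES.items():
        ratio = sum(bool(pattern.match(v)) for v in values) / len(values)
        if ratio >= min_match_ratio and ratio > best_ratio:
            best_tag, best_ratio = tag, ratio
    return best_tag

# Example: suggestions like this would then be curated by data stewards.
column_sample = ["jane@example.com", "bob@example.org", "", "ann@example.net"]
print(suggest_tag(column_sample))  # -> Email Address
```

The point of the sketch is the workflow Oliver describes: automation proposes a tag from patterns in the data itself, and humans curate the result, rather than building the entire glossary mapping by hand.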

Eric: There is another really important part to this equation, and it is one thing I really love about this whole new cloud paradigm that is evolving as we speak: automatic tagging solves a huge problem, which is that humans are, frankly, unreliable. What I mean by that is people just forget to do stuff, people get busy, some people simply choose to not do things, and people have different ideas about how to characterize things. Think about a Microsoft Word document, for example. Years and years ago they included fields where you could put in those meta tags: who the author is, what it’s about, various tags and things. Did even one percent of documents have that information filled in? One percent, maybe, and I think that is overshooting the target.

My point is that by automatically tagging, even if something is perhaps tagged inappropriately, I’m sure someone can go in and untag it or correct it or align it. The fact is, by automating that process you are baking in key metadata which can then be refined and leveraged over time.

Oliver: Yes, absolutely. When we look at the problem with metadata, we’re actually trying to solve it both with automation and with a human element. We want to automate what can be automated and save time, leveraging the algorithms and the rules we have baked into the product. When there is a need for human intervention, we let somebody go in, like a data steward, and they can curate tags. In fact, there is a lot of control we give to data stewards to make sure that if there is a need to lock down the business glossary and the tags, someone has control to do that.

At the same time, what we are trying to do is also enable collaboration between subject matter experts. So multiple people who own different pieces of the data can go in and tag what they know. That can also be expanded to subject matter experts in the business.

One of the interesting challenges we have seen is that it is hard for one central team, if not impossible really, to tag all of this data, even with the automation. There is a need to empower more people on the business side who are subject matter experts to go in and add additional tags, add additional information. That is what we call crowdsourcing. There is an element of crowdsourcing that we are combining with the automation. That is extremely powerful because we are trying to capture not only the insight we can gain through automation, but also combine that with the insight we can get from the tribal data knowledge, if you will, and bring it all together into one data marketplace where the business can really get tremendous insight into all of the data assets and find and use the best data possible.

Eric: That makes a lot of sense. Crowdsourcing, if it is managed properly, can be an incredibly powerful mechanism. Think about the old days of e-mailing documents around, with a workflow where you needed approval from seven people: you get it from the first person and the second, then the third, then the fourth goes on vacation and the whole process collapses.

Well, when you have crowdsourcing, it does take some management, because people can overwrite stuff. I’ve seen this, and not just in a metaphorical way: in Google Docs you can give someone access and they can accidentally overwrite something you put in. So some of that stuff does have to be monitored. The bottom line is, you can have teams working together on the same document, in the same space, at the same time, chatting to each other on the side and really coming to consensus. That is when you get the value. When you’ve got consensus, you have aligned your metadata in a way that provides a clear view into a particular part of the business, and that is when the proper analysis can be done, right?

Oliver: Absolutely. There is always a certain level of control that needs to be there and that was an interesting thing I was thinking about the other day. We’re trying to, on the one hand, provide data governance, but we’re trying to do it in a data democracy, not a dictatorship. On one hand the business wants agility, on the other hand there needs to be a way to make sure the data is trusted, but not interfere with the work the business needs to do. These new approaches, these new technologies are going to revolutionize the way this is done in large companies.

Eric: I think so too. Now onto the difficult topic: data governance. This has been a real issue for quite some time at a lot of organizations, primarily because these information landscapes are so heterogeneous. People have tried, for example, to control data governance at the ingest layer, where data comes into an application. They have tried to control it at the database level. I have seen some interesting patterns recently where some of the data modeling vendors are baking governance into their tools; that is an interesting approach. How do you guys tackle this problem, and how can organizations effectively blend big data programs with traditional data governance?

Oliver: This is really a huge topic. We have focused on tackling the key aspects of data governance that we felt needed to be addressed to enable the data lake to be opened up properly to the business. It is not a question of whether there is data governance; it is a question of what the right amount of data governance is, and of how you blend more top-down data governance with more agile data governance, if you want to put it that way.

Especially around the metadata, which we started to touch on a little bit, we are trying to blend what we do with what has already been done in the organization. As I said, if you have already invested in a business glossary, we can import that into our product, and you could have multiple glossaries. You could have a business glossary tied to your ETL tool; you could have other repositories throughout the company. Fundamentally, we can leverage that; we certainly don’t want to reinvent the wheel, so if there is information we can leverage going forward then that is terrific.

We are not trying to replace those tools; they play a role. We are trying to pick up where they left off, if that makes sense.

Eric: It makes complete sense. Final question I’ll throw at you: every company in the software space has a roadmap, and there are always things your customers are asking for. Depending upon development resources you have to choose this and not that. Obviously, we would all love to serve every client’s needs, but you have to find the centers of gravity and do what most of your customers and hot prospects want.

What’s coming down the pipe from you guys? What are you focused on for the next six months or so delivering through Waterline Data?

Oliver: We are evolving the product roadmap in a number of ways. We hit an important milestone from a data platform support perspective: we added support for Amazon EMR and S3 this quarter. Until then we were supporting the main Hadoop distributions, like Cloudera, Hortonworks, MapR and Pivotal, which we still do, but we added Amazon as well because we are seeing customers implementing some kind of hybrid between those.

Going forward, in addition to being Hadoop-centric, we are also planning to go beyond Hadoop and start to reach further out into the rest of the enterprise. Stay tuned for more detail on that. We are also deepening our integration with the Hadoop distributions. For example, we just became certified with Apache Atlas, which is a data governance framework Hortonworks is taking to market. We also became certified with Cloudera Navigator. Because there is a data governance aspect to what we do, we want to make sure we are well aligned and well integrated with the platforms we run on. We are going to continue to invest in that vein. Then we are continuing to improve the product to make it even richer for the business to leverage the data marketplace, and to do more from the point of view of having this rich business data catalog that mirrors the experience you would have at Amazon.com.

That’s really critical to being able to ultimately extract the value out of the data lake.

Eric Kavanagh: Real quick, on all of our webcasts we do get some regulars who attend probably every other webcast at least. There’s a guy from the UK who is on a more or less constant screed about data glossaries and the decades of failures in making data glossaries work. The time has finally come around, and from my perspective I’m starting to see enough traction around this space, thanks in part to companies like yours. There are a few others doing similar things, to finally get the business to wrap their heads around the importance and value of these ontologies, of semantics, of business glossaries, so new people can come along and fairly quickly ramp up to understand what a particular business is all about.

Even if you have ten businesses in the same industry, insurance for example, every business has its own unique culture, its own unique model, its own unique processes; obviously human beings are all unique. Having this kind of glossary that someone can use, especially a data steward or a data analyst or even a data manager, even a senior executive, is so valuable for onboarding people and getting them ramped up quickly, to understand the nuances of that business and be able to jump in and start adding value. I have to say, I think you guys are on the right track, and I just ask for any closing comments you have.

Oliver: This is great Eric, it’s been a great discussion. We look forward to participating again.

Eric: We’re looking forward to the report, too. David Loshin is working on it now: Enterprise Hadoop in Production. It’s a pretty tall task. We’re out there scouring the globe to learn from people who are really doing this stuff. Anyone who listens to this podcast, if you have enterprise Hadoop in production, please reach out; we have a survey going out pretty soon.

With that we’ll bid you farewell folks.

Thank you so much Oliver from Waterline Data. Oliver Claude is @waterlinedata.com. Check him out, and you’ll be hearing more from them in the future.

You have been listening to our latest podcast for Enterprise Hadoop In Production. Take care folks. Good bye.
