Inside Analysis

Cloudera and the Modern Data Warehouse

Eric Kavanagh, CEO of The Bloor Group, chatted with Sean Anderson of Cloudera to discuss the results of The Modern Data Warehouse: Agile, Automated and Adaptive research report. The conversation occurred on December 18, 2015.

Eric Kavanagh: Ladies and gentlemen, hello and welcome back once again to the Modern Data Warehouse podcast series. My name is Eric Kavanagh. I’ll be your host for today’s conversation with Sean Anderson of Cloudera, which is a very cool company. A lot of you have heard about Cloudera and what they’re doing in the big data space. We’ll talk today about how they can augment the modern data warehouse. With that Sean, welcome to the show.

Sean Anderson: Thanks Eric, glad to be here.


Eric: Sure thing. Let’s talk about what you at Cloudera see in the data warehousing space, because obviously Cloudera is very well known for enabling analytics of all different kinds of data. Not exactly like the data warehouse, but you’re supplementing or augmenting what the data warehouse does. First of all, let’s talk about where you see Cloudera in the spectrum of the data warehouse world.

Sean: Thanks, Eric. We see Apache Hadoop and Cloudera’s capabilities as a way to really extend a traditional warehouse environment to handle a lot of new and more complex types of data, and also to fill in gaps around the current limitations of modern architectures. The timing’s actually great, I think, when you look at Hadoop’s maturation. This is one of the more exciting times as far as Hadoop’s capabilities to extend analysis outside of the data warehouse and make that a very real prospect that addresses a myriad of pain points.

We are very excited about Cloudera’s platform and its ability to become a modern analytic database. A lot of that is done through developmental efforts, like I said, to really fill those gaps, and for us specifically it means taking some leadership in that arena to make sure that we are complementing the data warehouse in a way that creates a seamless environment around it. Then also taking into consideration things like security and management over the entire stack, so people really have a seamless experience, and it’s not disruptive for them to bring in other components to complement their data warehouse.

Eric: Yes, and there are lots of different use cases for how Hadoop can supplement what a data warehouse does. I’ve been studying data warehousing now for, I’m almost afraid to admit, about 18 years, and it really has been the lifeblood of the analytics world for decades. What we see now with the Hadoop environment is a whole different way of persisting data and then of analyzing data. You talked about constraints. I’m glad you mentioned that, because if you think about how data warehouses were designed 25 or 30 years ago, there were constraints in place that really drove the design point.

You had processors that were a lot slower than they are today, you had pipes for moving data around that were a lot thinner than they are today, and the cost component was so different. Back then if you weren’t a Fortune 500 company, you were going to have a hard time affording the technologies and the services to do a data warehouse. All of that has changed fundamentally. Do you see that as having opened up this new green field for different kinds of analysis to be done on data?

Sean: Absolutely, and I think the great work that your team has done in this recent survey really highlights a lot of those things that you talked about and compares those things back to some of the strong points of Cloudera technology. I think we’ve progressed and matured over the years to really grow out of this notion that Hadoop still is a great way to store and process large amounts of data. Now how can we think about maturing that to make sure that we have analytic capabilities that meet the modern needs of all this new data that we’re ingesting and trying to utilize.

Some of those kinds of traditional limitations to a data warehouse environment, you touched on. The cost of deploying one of those solutions with Apache Hadoop being able to be launched on industry-standard hardware on some of these newer hardware profiles without a large amount of development effort has really brought forth some amazing cost savings in the implementation phase and also when people scale that out.

I think as well the ability to scale. Data warehouse environments are absolutely valuable, but if I start to enable things like real-time data pipelines, or bringing in social or streaming data, my company’s ability to scale out its data warehouse environment to meet that is specifically hindered, and it may not be the right approach. You may be able to complement it with a technology like Hadoop in a much more cost-effective and seamless manner.

Thirdly, the inability to handle some of the more complex data that we’re seeing in the landscape today: streaming data, real-time data. Just the capabilities of Hadoop around landing data on the system and transforming it into a format that performs properly for analysis. Most recently too, we had some great releases that addressed time series data, so we’re constantly evolving to try to address these more complex data types that companies need in order to really create a full picture of either a customer or a product or a support situation or a threat based on a multitude of data points.

The last one is around fragmented systems. A data warehouse environment is great. I can couple that with a lot of tangential technologies for ingestion or for analysis, but when I do that I essentially leave some opportunities for gaps in the way that I secure that, maybe the way that I can manage that and get a single pane of view across the data life cycle. That hinders things like doing true data governance and lineage across the system. The last point there is once we have this analysis, once we’ve done this discovery and got these great insights, the next real evolution that we see in the market is the ability to deploy applications, real-time applications, operational analytics, that utilize those great insights and can serve them up to either stakeholders inside the business, or externally directly to customers and creating products and services based on that.

Those are the most common limitations and where we’re really seeing customers successful in augmenting that with Hadoop.

Eric: You’ve brought up several great points here, so let’s dig into each of them: Ingestion, analysis, security and then applications. We’ll start with the ingestion. If you think about how data gets into a data warehouse, it obviously needs to be a fairly governed process, a tightly managed process. The same holds true for how you load data into Hadoop, but they’re two very, very different worlds. Typically a data warehouse is going to be a relational model. A lot of work has to be done in terms of designing the model, and really one of the shortcomings over the years has been that in order to deliver the kinds of analysis that your users will want at the end of the day, you really need to think about what those questions are going to be before you design the model, before you choose the data sets, before you go through that whole process.

You had to be very thoughtful about what you were going to do. Whereas with Hadoop, we hear a lot about this whole concept of schema-on-read, meaning you can just load whatever data you want into it and then worry later about how you’re going to pull it out and analyze it. I think one of the key differentiators between the data warehousing model and the Hadoop-based analytics model is that with Hadoop, you still do have to think about how you load, you have to choose your file formats and your partitions, you have to think about what you’re going to do with it later to a certain extent, but it’s not like the warehousing world where you had to be so careful about how you structured that environment.

To me, it opens up a lot of possibilities for the Hadoop environment to fuel analytics for people who might change their minds, or might get new ideas, or might take things in new directions, right?
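The schema-on-read idea Eric describes can be sketched in a few lines. This is an illustrative stand-in, not anything from the conversation: raw records land untouched, and each analysis projects only the fields it needs at read time, so nobody has to design the model before loading.

```python
import json

# Raw events land as-is -- no upfront model required (schema-on-read).
landed = [
    '{"user": "alice", "action": "click", "ts": 1450000000}',
    '{"user": "bob", "action": "purchase", "amount": 19.99}',
    '{"user": "alice", "page": "/home"}',
]

def read_with_schema(raw_lines, fields):
    """Project only the fields an analysis needs, at read time."""
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        # Missing fields become None instead of failing the load.
        rows.append({f: record.get(f) for f in fields})
    return rows

# Two different "schemas" applied to the same landed data.
clicks = read_with_schema(landed, ["user", "action"])
revenue = read_with_schema(landed, ["user", "amount"])
```

The point of the sketch is that `clicks` and `revenue` impose different structures on identical landed data, which is exactly the flexibility a schema-on-write warehouse load forecloses.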

Sean: Absolutely. I think there’s a technology component there. We focus a lot on developments around components of the Apache Hadoop ecosystem, like Kafka, that really provide the capabilities to bring in lots of data, to land it and make it very easily accessible for analysis, really streamlining that ingestion component. Another thing we’ve really focused on is just general best practices, and industry awareness of some of these newer models that you’re talking about. Very recently, we partnered with Ralph Kimball, the father of conventional dimensional modeling, and he’s been very excited to work with us. We’ve put forth some great webinars and materials around his thoughts on Hadoop’s position to really come in and solve some of these modern problems. We recently had a great conversation with both Ralph Kimball and the team over at Kaiser Permanente about where they see the future of ETL and ingestion going, and what that looks like in the new world of Apache Hadoop.

Eric: You brought up Kafka, so I think this is a good topic to discuss very quickly to help people understand what a vendor like Cloudera does with these open source tools. One of the analysts I know in the field gave me some interesting insights on Kafka, noting there are a couple of limitations. One of which is that you cannot be assured of when a message gets through, or even if it gets through at all, which in a sort of stochastic environment is fine, I suppose, for exploratory purposes. But if you want to make sure that you are generating trusted data for decision-making or for compliance purposes, well, you’ve got to harden that environment. One of the things that companies like Cloudera do is really focus on making these kinds of open source technologies enterprise hardened, right?

Sean: That’s actually a pretty good explanation. Whenever you think about growing and contributing to open source initiatives, many of the contributors are solving very specific problems, and they have specific needs based on the type of company they are and the type of data they’re bringing in. Cloudera’s role is to really take that and make it an enterprise reality for people running Hadoop in production, filling in some of those gaps. Then also, on the backside, feeding that back into the open source community, whether it’s contributions like our recent contribution of Impala to the Apache foundation or just making sure that all the great work we’re doing inside the walls of Cloudera makes it out into the communities, so we can continue to really see this adoption of Hadoop mature and make this EDW augmentation a reality for a majority of the people. Which I think definitely links back to a lot of the survey results we’re seeing, and the recent work that you guys did.
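The delivery-guarantee concern Eric raises is largely a matter of producer configuration in Kafka itself. A minimal sketch of the trade-off, expressed as plain Python dicts of Apache Kafka producer options (the specific values are illustrative assumptions, not recommendations from the interview):

```python
# Sketch of producer settings that trade throughput for delivery guarantees.
# Option names follow the Apache Kafka producer configuration; the values
# here are illustrative, not tuning advice.

fire_and_forget = {
    "acks": 0,       # don't wait for any broker acknowledgment
    "retries": 0,    # a dropped message is simply lost
}

hardened = {
    "acks": "all",   # wait until all in-sync replicas have the message
    "retries": 5,    # retry transient broker failures
    "enable.idempotence": True,  # retries don't create duplicates
}

def is_durable(config):
    """Rough check: does this config wait for full replication and retry?"""
    return config.get("acks") == "all" and config.get("retries", 0) > 0
```

The "hardening" Sean describes goes well beyond these knobs, of course, but the contrast between the two dicts is the essence of trusted versus best-effort delivery.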

Eric: I have to say, I’m just such a huge fan of the open source movement, and I’m just absolutely fascinated with how it continues to evolve. Let’s face it, in large part due to companies like Cloudera and folks like yourself, who are involved in the community. Really it’s kind of mind-blowing how different the environment is now versus what we can now call the old way, which is what the big closed source vendors did for decades. Does it kind of blow your mind, how much things have changed in terms of how software is conceived, designed, implemented, polished, hardened and so forth versus what it was just 15 years ago?

Sean: It does, it does. Personally, I feel for our users and our customers, because the velocity at which things like open source initiatives move can be a little bit cumbersome. It’s a lot to keep up with. There’s a lot of innovation happening in the space. Even the ability to sit down and consume that information and apply it is a challenge, specifically for a traditional organization that may need a significant ramp to make changes to their architecture or their software stack. I think the velocity of an open source initiative definitely adds to those pains, but I think the advantages far outweigh it.

What we see for companies that view data as a strategic asset is that they’re willing to consume that amount of education and that velocity of information, and that’s where we really try to help with professional services and with the training that we do, both for the ecosystem as a whole and for our users, to really get them there faster. I think on the one hand, open source technologies have to run at such a feverish pace because we’re constantly trying to figure out even more complex types of data. We saw the initial incarnation of Hadoop focused on batch processing, and now it’s moving into streaming data.

These are questions that our users weren’t asking of the technology, maybe even two to three years previous. This kind of feverish pace is necessary in order to address all of the differentiated complexities of bringing in new types of data, including that in a model that also mixes with trusted, more conventional data to make sure that we have the best view of the data so our analysts can build the best models.

Eric: Yes, you made a couple of really good points there, so let’s dive in on this analysis side. If you think about the kind of analysis that we’ve done traditionally with the enterprise data warehouse, it was largely focused on profitability, on where we’re making money, where we’re not making money, fairly traditional questions. Then if you look at this new data-driven world and all of these sources, whether it be from cloud infrastructure or from social media or you’re talking about streams and stream processing, the variety and the interesting different types of data that you can now leverage, it’s really a significant game-changer.

To me, that’s where the Hadoop environment can be a very powerful couple, if you will, or can be a very powerful adjunct to the data warehousing world, because you can take that historical trusted data and use it to understand something about your customers or your business, but then you can use all these new big data sources to really complete that picture and better understand what your opportunities are, who your customers are, what the threats are, whatever the case may be, right?

Sean: Yes, and I think that’s also driving this tangential conversation around agile analytics. The ability to be iterative with your analytics, and to have a very two-way communication when thinking about creating models, based on that feedback. As I bring in more data sources, maybe data sources my organization wasn’t traditionally familiar with, and incorporate them into an analytic model, we really have to have the ability to test a wide variety of assumptions on that data very quickly. So it’s nice that while the technology is progressing to handle more types of data, the leadership around advanced analytics is also addressing some of the ways that we can optimize on that very quickly and build better analytical models with all the information.

Eric: That’s another excellent point, too. The speed of iteration is so valuable these days. Once again, if you look at the disparity between the waterfall development approach for designing software in the past to the agile approach, which is much more collaborative and immediate, where you’ve got Dev Ops guys working directly with the business, making changes every single day to mission-critical systems, again, the difference is like night and day, right?

Sean: It really is, and I think we’re running alongside the great work that people are doing in things like infrastructure in the cloud. We see a lot of people taking back-end software construction and moving it to the cloud because they have the ability to iterate on the infrastructure and the software stack very quickly. Now I think the analytics community has put forth the effort and the thought to understand how they can take very similar approaches and apply them to analytical modeling, and readjust the expectations that business stakeholders have, from receiving a report or a dashboard to something way more interactive, where they’re able to make strategic pivots in their business based on the types of questions they want to ask that day, and very quickly look forward and understand the types of business questions they want to ask in the future.

Eric: Yes, that’s another excellent point that you bring up, because it’s not just agility in being able to do analysis. The ultimate goal is to have agility in your business, to be able to turn on a dime, to be able to change your product offerings. Even if it’s something as simple as pricing, or who you’re marketing to, or how you’re marketing to them. There’s a term I came across about two years ago, which made me smile because it described me, and the term was “iterative marketing.” These days you can do that because if you have the data in place, if you’ve captured it, if you’re using it appropriately, you can understand very quickly if a program is working or not working, whether it’s marketing or sales or some other kind of business program.

Whereas again, five years ago it would take you weeks, if not months, to even understand what’s going on, let alone change what you’re doing about it. To me, that agility in your business is absolutely crucial, and I have to think that companies that adopt these mindsets around using data and analyzing it very quickly in an iterative fashion are going to be the winners. I don’t think there’s any doubt about that, right?

Sean: And I would take that a step further and say that Cloudera as a company believes in that very strongly. We run an enterprise data hub where all, or a majority, of the data inside the business is incorporated to understand how we move the business forward. A great example of that is that we support a large number of customers running Hadoop in production, and in supporting those customers we get tickets and their logs, and we get a lot of information back from the customer based on the challenges they’re seeing in incorporating our software with a larger software and infrastructure stack.

For us, when we go to make a strategic decision inside our product organization or our support organization about what we will focus on next, what problems we’re going to solve toward the end of this year, and what problems we’re really going to focus on next year, we’re really starting to leverage data to make those decisions. So based on ticket volume, or based on feedback via social media or other vectors, we can make very smart decisions on how we prioritize our business. What we hope is that many other organizations are starting to use data to really refine the strategic direction of their company.

Eric: Yes, and I loved that you used that word “vectors.” That’s a great word, because it implies motion, it implies trajectory, it implies something other than static existence. If you think about reporting in the old days, and even today, traditional reporting gives you a static view of the world. Whereas a more multi-dimensional environment, whether it’s a dashboard or some kind of multi-dimensional perspective on your business, is going to give you a lot more value. When you start talking about things like vectors, now you’re implying time series, you’re implying understanding where things are moving in time, and that can really help you identify opportunities or prevent bad things from happening, right?

Sean: Yes, I think time is becoming increasingly crucial. A good example of that is if I’m an organization that is trying to use data to understand my sales cycle, bringing in a lot of different vectors to determine the health of the sales coming into my business, and also how to better facilitate customers. One of the things that we’ve seen is that when perfunctory reporting is paced as a quarterly or even yearly report, we see a lot of archaic expectations on that side. Once you start bringing in the capabilities that make that more real time, so instead of reporting quarterly, maybe I report monthly, or multiple times within a month, then I can strategically pivot my business to address the actual health and current state of my business.

Maybe that helps me create a better pipeline coming into my business, maybe it helps me understand the adoption of my products or services and whether they’re really resonating with customers. Hopefully I can make that pivot a lot sooner, and that’s going to not only save my organization a ton of money, but it’s also going to make me more nimble to address all of the concerns around my specific industry in a very actionable fashion.

Eric: Yes, that’s great. Let’s hit one last nice meaty topic because you mentioned this: applications, the new applications. This to me is such a fascinating area for discourse, because if you think about how we got here versus where we are and where we’re going, it’s a very interesting dynamic. We saw over a period of 30 or so years, this evolution toward the monolithic enterprise resource planning solution. I often use Microsoft Word as a whipping boy because it started off as this simple word processing program and then it became so much more. It became a layout program, it became much more of an interactive environment with 87,000 different features baked into it.

To me, that was one end of the pendulum swing, and now we’re swinging back in the other direction. What’s really interesting is that with the iPhone, and let’s be honest, the iPhone and Steve Jobs fundamentally changed how people look at not just phones, but applications and software in general. What you have, all of a sudden, is all these very small, very purpose-built, fine-grained applications that do lots of useful things.

What I think is going to be a lot of fun to watch is how the business world adapts to this new paradigm of application development and design and use, and really starts leveraging the infrastructure of analytics underneath of cloud computing on these mobile devices, or iPads or whatever you may be using, to design very purpose-built functional applications that help their business. I think there’s going to be an absolute renaissance in enterprise applications that leverage analysis from environments like Hadoop. I’m guessing you probably agree, right?

Sean: Absolutely. It’s interesting you used the term around an iPhone. We very much equate this concept of an enterprise data hub, which is not a Cloudera specific concept, but this idea of a single structure to do data ingestion, to do advanced data discovery, and advanced analytics and also serve contents to real-time applications. Just like with your iPhone, you don’t normally take a picture with an SLR camera and load it to your computer and upload it anymore. It’s much easier to take that snapshot directly from your iPhone and then serve it out to social media or whatever resource very easily.

A lot of times when we think of an enterprise data hub, we think about people using it in a very similar fashion. When I think about creating these applications, I think the thing that is fueling them is more data. Traditionally, we would have a very specific database, or a very specific dataset, that would be fueling an application. But as we bring in more data, and say we launch that from an enterprise data hub, we’re really giving that analytic application the ability to leverage all the full-fidelity data. Hopefully, the analytics that application provides will fit better.

Then there’s this idea of self-service. It’s great if we have data scientists or analysts inside the data warehouse performing all these exercises, but they are a people bottleneck. As we start to launch applications that allow more people inside the business to self-service using data to create insights, then we’re really expanding those capabilities outside the limitations of what those single people can provide. It also comes with creating better types of access. MapReduce is a great analytical coding framework. SQL is probably the most prevalent out there. But how can we also extend that to pretty much every type of analytic access that somebody would need in order to gain insights, while also creating that third-party extensibility to make sure that they can use the applications, or the visualizations, or dashboard building tools that they’re used to.

Hopefully, that just means that by utilizing all those opportunities, we’re creating better and more performant analytic apps that have better access to data, and then really trying to focus on making sure that those applications can scale to fit modern needs. If I’m developing a dashboard that my customers are immediately logging in and trying to look at energy usage or maybe the fitness of the program, if that analytic application is slow or it takes a ton of time loading, then you’re going to see pretty widespread abandonment of that tool. We see some pretty frightening numbers around a percentage of people using analytics applications. I think a lot of that has to do with the fact that they’re just not getting the performance or the access to data that they need. The more that we can empower that, the more I think we’re making that a reality.
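Sean's point about SQL being the most prevalent analytic access path is easy to make concrete. As a stand-in for an engine like Impala, which is queried through its own connectors, this sketch uses the standard library's sqlite3 with a made-up `tickets` table to show the shape of self-service SQL over landed data; everything here is illustrative, not from the interview.

```python
import sqlite3

# In-memory database stands in for a SQL-on-Hadoop engine for the sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tickets (product TEXT, severity INTEGER)")
conn.executemany(
    "INSERT INTO tickets VALUES (?, ?)",
    [("hdfs", 3), ("impala", 1), ("hdfs", 2), ("kafka", 3)],
)

# The kind of aggregate an analyst self-serves without writing MapReduce:
# ticket counts per product, busiest first.
rows = conn.execute(
    "SELECT product, COUNT(*) AS n FROM tickets "
    "GROUP BY product ORDER BY n DESC, product"
).fetchall()
```

The query itself is the point: a familiar GROUP BY, not a custom MapReduce job, which is what removes the people bottleneck Sean describes.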

Eric: I agree. I think those are really good points. Folks, we’ve been talking to Sean Anderson from Cloudera. I think it’s safe to say that the Hadoop environment and this open source community are going to play a significant role over the next number of years in fleshing out what the data warehouse has traditionally done, adding new capabilities and doing new fun things. Thanks so much for your time today.

Okay folks, you’ve been listening to the Modern Data Warehouse. Thanks again, and we’ll catch up to you next time. Take care, bye-bye.
