Inside Analysis

Navigating Hadoop with MapR

Eric Kavanagh, Bloor Group CEO, chatted with Tomer Shiran, VP of Product Management at MapR, on a wide range of subjects on May 18, 2015. Here is the interview.

 

Eric Kavanagh: Ladies and gentleman, hello and welcome back once again to Inside Analysis. My name is Eric Kavanagh. I will be your moderator for today’s discussion with Tomer Shiran. He is VP of Product Management at MapR Technologies. We’re very pleased to have him in the studio today because MapR, as many of you may know, is one of the big Hadoop vendors. There are typically three, when people talk about the big ones — the most successful ones. Tomer is, of course, at MapR, which is a very interesting company. First of all, Tomer, welcome to the show.

Tomer Shiran: Hi, thanks for having me.

Eric: You bet. If you don’t mind, let’s dive right into the technology and the vision of MapR, which I’ve always found rather compelling compared with the other players in this space. MapR really focused on aligning HDFS, the Hadoop Distributed File System, with traditional enterprise data storage. That’s not something you hear from the other vendors. I thought so when I first heard about it and then looked into it. It’s a very clever strategy, because at the end of the day, all that other data, that corporate small data, let’s call it, or traditional data, is a big part of the enterprise.

If you want to have long-term success with Hadoop, it seems to me you want to find ways to integrate big data with traditional data. If you don’t mind, just talk about what it is that you did along those lines and what the results have been.

Tomer: Sure, absolutely. Like you said, integrating a new platform into the existing environment is very important for our customers. To take a step back, what MapR is all about is providing an enterprise-grade big data platform that enables real-time use cases. The enterprise-grade part matters because, at the end of the day, people are putting important data into these platforms, inside Hadoop. When you’re looking at putting your important data in a platform, you want it to be highly available. You want to have data protection. You want to have disaster recovery. You need multi-tenancy. You need all these capabilities that maybe you take for granted in enterprise storage or in traditional relational databases.

What MapR has done is bring that into the world of Hadoop and the world of big data. We’re the only ones who have done that. Allowing you to store your data in a safe and secure way is really important. Then the other thing is being able to do real-time data processing and analytics. What I mean by that is being able to analyze or process the current data: not the data that’s an hour old, but responding to events as they’re happening. That’s critical, and it enables some really, really important use cases for our customers.

Eric: It’s interesting, because when you talk about MapReduce and Hadoop in the traditional sense, at least Hadoop 1.0 and even the early stages of Hadoop 2.0, meaning YARN-enabled, with that interesting new component whose name stands for Yet Another Resource Negotiator, you don’t hear too much with respect to real time. Hadoop, at least initially, was designed and really geared much more around batch as opposed to real-time data processing. Can you talk about what you’ve done to enable that?

Tomer: Absolutely. You’re right. What’s exciting about Hadoop 2.0 is that it enables a variety of different execution engines to run on the cluster and share the same resources. That’s why we have YARN as part of our distribution. At the same time, you also have to make the data platform itself support these real-time use cases, right? With other Hadoop distributions, that really hasn’t changed since the beginning of Hadoop; it’s still kind of a batch data platform underneath. You have to batch-upload data into the platform, and only then can you use that data.

With MapR, you can take your applications, let’s say an application server that’s generating log files, or maybe an operational application that needs a large-scale database, and run them directly on the platform, having them create their data directly inside the Hadoop environment, as opposed to batch-uploading that data once an hour or once a day.

That really enables you to implement these real-time use cases where data is coming in in real time and also being processed in real time or analyzed in real time. Those are things we’ve really focused on enabling.

Eric: It’s very interesting. As you were just describing that, my wheels were turning. I was thinking to myself that it sounds like what you’re doing is slowly but surely bridging the gap between what HDFS was at the outset, which is just a file system, and what most businesspeople think of as a database, which is, let’s face it, a very different thing from HDFS per se. Because of the architecture that you started out with and have been refining over time, you’re now increasingly able to deliver some of the kinds of functionality that a database would traditionally provide. Is that a fair assessment?

Tomer: That’s absolutely true. In fact, it’s even more than that, I think. If you look at the roots of HDFS, it was built as a system to index the Web, right? It’s more like an FTP server, a repository of data. It’s not a read/write file system. What we’ve done at MapR is build MapR-FS, our distributed file system that, of course, supports the Hadoop API and the entire Hadoop stack. At the same time, it also exposes a POSIX NFS interface, so you can mount your Hadoop cluster just like you would mount a NetApp or EMC network-attached storage device and have applications talk to it directly.
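To make the NFS point concrete, here is a minimal sketch of an application appending a log line straight into the cluster through a POSIX mount. The mount point /mapr/my.cluster.com and the directory layout below are assumptions chosen purely for illustration, not fixed MapR paths; the point is simply that ordinary file I/O lands the data in the Hadoop environment with no separate upload step.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class DirectLogWriter {
    public static void main(String[] args) throws IOException {
        // Hypothetical POSIX path exposed by NFS-mounting the cluster file system.
        Path logDir = Paths.get("/mapr/my.cluster.com/apps/web/logs");
        Files.createDirectories(logDir);

        // Append a line exactly as any local application would; no batch upload,
        // no hourly job moving files into the cluster later.
        String line = System.currentTimeMillis() + " GET /checkout 200\n";
        Files.write(logDir.resolve("access.log"),
                line.getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }
}
```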

That was the first phase. Then, once we had that very scalable read/write data platform underneath, we were able to build MapR-DB, an in-Hadoop NoSQL database that runs as part of our platform and enables you to run real database applications on Hadoop. The alternative, really, is to run a separate system for your operational application, right? You have the database on one side and then Hadoop on the other side. You have to do this really complex and large ETL process, and that’s becoming increasingly difficult, and sometimes impossible, just because the volumes of data have grown so much. Even in the old world, when data wasn’t that big, ETL was a nightmare.

Eric: That’s right.

Tomer: That’s even more so now.
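To picture the kind of operational application Tomer describes running on the cluster, here is a rough sketch using the HBase 1.x-style client API, which MapR-DB’s binary tables are designed to support. The table name, column family, row key and values are made-up illustrations (on MapR-DB a table is typically addressed by a file-system path such as /apps/user_profiles); this is a sketch of the idea, not MapR’s documented quick-start.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ProfileStore {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             // "user_profiles" is a hypothetical table name for this sketch.
             Table table = conn.getTable(TableName.valueOf("user_profiles"))) {

            // Write one row for a user: the operational app talks to the cluster
            // directly, so there is no nightly ETL from a separate database into Hadoop.
            Put put = new Put(Bytes.toBytes("user-42"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("email"), Bytes.toBytes("u42@example.com"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("plan"), Bytes.toBytes("premium"));
            table.put(put);

            // Read it back with a simple key lookup.
            Result row = table.get(new Get(Bytes.toBytes("user-42")));
            System.out.println(Bytes.toString(
                    row.getValue(Bytes.toBytes("info"), Bytes.toBytes("plan"))));
        }
    }
}
```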

Eric: Yes, this is the point that I kind of wanted to get to from the outset, which is to say, if you go with a Hadoop implementation that is not designed from the outset to align with traditional enterprise data storage as we’ve all come to know it for the last 20-30 years, you are going to have to do some kind of surgery down the road to stitch everything back together.

One of the movements I’ve been cheering for over the last 10 years is to slow down on the ETL. Ideally, you want to leave data where it rests. You want to do calculations closer to the data because data does have gravity, as some people have pointed out.

By taking this approach, you have this long-term vision of where this is going to fit in the enterprise. I think people who appreciate that approach are going to save themselves time and headaches, and let’s face it, a fair amount of pain down the road because they’re not going to have to do that kind of surgery, right?

Tomer: Exactly. You can’t have both ETL and agility at the same time, right? By definition, ETL means you’re doing all this pre-processing before you can actually do something. Data is growing and businesses are moving faster and faster. A lot of our customers now release a new version of their application every day, and they want to explore new data sets and iterate really quickly. You can’t do that and at the same time have a requirement to pre-process all your data.

It’s just impossible to do those things, so you have to build technology that eliminates the need for ETL. We’ve done that by developing a data platform that supports random reads and writes and has a standard storage interface, so you can integrate it directly into your environment and not have to ETL the data into it. We’ve also done it with Apache Drill, a new project that has just reached its 1.0 release, which enables ad hoc data exploration and analytics in a much faster way. You can iterate on your data in a self-service way without having to get IT to pre-process and prepare the data before you can actually query it.

Eric: This Apache Drill is fascinating. I first got my feet wet, if you will, in the Apache Drill world thanks to a good colleague of mine over at Simba Technologies, George Chow. I gather those folks are working on the ODBC drivers for Drill, and I want to get into that in just a second because I think it’s absolutely fascinating what Drill is up to. First, let’s close the loop on the database side for the database cynics out there who want their data to be ACID compliant.

For those who don’t know, those are the characteristics of transactional databases: atomicity, consistency, isolation and durability. In the NoSQL world, we hear all about this eventual consistency stuff, which essentially means sacrificing consistency to get speed. Then, after the big moment or the peak of activity, we’ll go ahead and dot the i’s and cross the t’s down the road. That’s eventual consistency. What MapR has built, is that already ACID compliant?

Tomer: Yes, we don’t really believe in the eventual consistency model. Our customers tell us that eventual consistency really means eventual inconsistency down the road. All our technology is 100% fully consistent within the cluster. All copies of the data are always consistent. We support asynchronous replication across data centers, and that gives you the best of both worlds.

The interface to NoSQL databases is different from the interface you have with a relational database. It’s not exactly the same, but you certainly have the ability to do atomic operations. You can update multiple values in a record at the same time, transactionally. Everything’s 100% consistent. There’s never any inconsistent state that you have to repair.
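A small sketch of what “updating multiple values in a record at the same time” can look like through an HBase 1.x-style client API: every column in a single Put to one row is applied as one atomic operation, and check-and-put gives a conditional update, so readers never see a half-written row. The table, column family, keys and values below are illustrative assumptions, not a documented MapR schema.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class AtomicRowUpdate {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table orders = conn.getTable(TableName.valueOf("orders"))) {

            byte[] row = Bytes.toBytes("order-1001");
            byte[] cf = Bytes.toBytes("o");

            // Both columns change together or not at all; there is no window in which
            // another reader sees the new status alongside the old timestamp.
            Put ship = new Put(row);
            ship.addColumn(cf, Bytes.toBytes("status"), Bytes.toBytes("shipped"));
            ship.addColumn(cf, Bytes.toBytes("shipped_at"),
                    Bytes.toBytes(Long.toString(System.currentTimeMillis())));
            orders.put(ship);

            // Conditional update: only mark the order delivered if its status is still "shipped".
            Put deliver = new Put(row).addColumn(cf, Bytes.toBytes("status"), Bytes.toBytes("delivered"));
            boolean applied = orders.checkAndPut(row, cf, Bytes.toBytes("status"),
                    Bytes.toBytes("shipped"), deliver);
            System.out.println("delivered update applied: " + applied);
        }
    }
}
```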

Eric: That’s really good stuff. Again, with database technologies, granted, these are the old players who’ve been working on these issues for decades now, so they’ve had a lot of time. Of course, we’re only in year eight, I suppose, of the Hadoop-enabled world in terms of the broader market. I guess it was around 2007 when Yahoo, building on Google’s papers, turned it over to the open source community. In those seven or eight years, it seems like we’ve made a good deal of progress.

Truth be told, what you’re talking about, at least from my perspective, is the larger market. It’s the ability to enable companies to do lots of different things, not just with big data, but with big data combined with their traditional data. I’m guessing that’s the goal, right? You’re moving further and further toward enabling organizations to do as much as they can, and really whatever they want, from a data perspective, using the underpinnings of Hadoop as the engine, right?

Tomer: Yes. When you think about databases, if you have a database that can occasionally lose data, I think it really limits the kinds of use cases you can target with that platform. Customers aren’t looking to deploy 10 different types of databases. They’d rather use the same technology for a variety of use cases. You really want a database that stores your data safely, where you don’t run the risk of looking up a specific key and getting back two different possible values. How do you even deal with that in your application? It becomes really hard.

It’s okay for some use cases, maybe looking at some log records from log files. Who cares if you lost one percent of the log records? Maybe that’s okay in one use case. For the broader market and for most of the applications that people are building, that’s just not acceptable.

Eric: I think what you’re talking about, just so people understand the gritty details underneath, you’re referring to when a node fails in Hadoop, right? When a node fails, then a separate process has to kick in to repopulate that node. If that’s the node where a lot of the data was coming in, that’s on the ingest side. That’s where you could possibly lose some data in a traditional, if I can call it that, Hadoop environment. Is that correct?

Tomer: I was referring more to the kind of eventually consistent NoSQL databases.

Eric: Oh, okay.

Tomer: Where data is not synchronously replicated and kept in sync across the replicas, right? Then you could have a situation where a node goes down and two different nodes have different views of what the actual value is supposed to be.

Eric: I got you. Good.

Tomer: Different results based on how the request is routed and things like that.

Eric: This ties back to what you were talking about at the top of the hour, which is those real-time use cases. If you’re talking real time, that just does not work with eventual consistency, right?

Tomer: Right.

Eric: Okay, good. I know you have some announcements, so we can get to those in a second. There is this very curious development of the Open Data Platform. It was announced at Hadoop World a couple of months ago, where Hortonworks came on stage with IBM, Teradata, SAS and Pivotal, and a bunch of other companies, and they talked about this Open Data Platform.

I was kind of scratching my head as I read about it. I did some research into it, read some pieces by Merv Adrian over at Gartner, and thought about it a good deal myself. It’s interesting. I’m not sure what to make of it as of yet. Quite frankly, I think that from a sales and marketing perspective, it’s a very savvy strategy that they’ve pulled off. In terms of what’s actually happening, I wonder exactly what’s going on.

I’ll throw out my personal theory, and then you can let me know what your thoughts are. A lot of these projects are spinning out of the Apache Foundation. There’s a lot of tension between Hortonworks and Cloudera, and there are a lot of projects that have committers from both of them. I kept thinking to myself, “This is going to fracture sooner or later.” Part of me is thinking that the ODP, as it’s called, is almost like an interesting counterpart to Apache. Hortonworks is going to try to have their own version of the Apache Foundation while still being involved with Apache.

The strategy from their perspective, I’m just guessing here, is to rally their troops around it, and also to make some money from big guys like IBM and Pivotal, who probably don’t want to have to worry about maintaining their own distributions. What are your thoughts about this, and where do you see it going?

Tomer: I think a lot of what you said is really how I view the Open Data Platform. To start with, it’s absolutely redundant with the Apache Software Foundation. The purpose of Apache is to enable different companies and different organizations to come together and work on an open source technology together. Apache has a variety of governance rules and processes by which somebody becomes a committer, which means they can contribute code to a project; they have to prove themselves first. That is exactly the purpose Apache serves.

Judging by the last seven or eight years and the pace of innovation in this industry, I think it’s doing a very good job at that. It’s hard to see how you would do it in a better way. If you compare this to the traditional standards committees and all those kinds of things that move much, much slower, Apache is doing great. I also think the ODP is solving a problem that doesn’t really exist. One of the claims is that it will help address vendor lock-in and things like that.

Well, we’ve been in this market, with a product in the market, for almost five years. There have been many, many customers who have switched between distributions, and at this point it’s proven that the switching costs in the Hadoop space are much lower than they’ve been in any other market. Even if you look at networking, storage or databases, Hadoop has a much lower switching cost. There isn’t a problem there that needs to be solved.

Finally, right now if you look at the participants in ODP, the two biggest Hadoop vendors in terms of paying customers are MapR and Cloudera. It’s like if you had the three big airlines in the U.S., which I think are United, American and Delta, and one of them said, “We’re creating the United Airline Group,” or something like that, and the other two weren’t participating. It’s laughable, right? I think it’s similar to that. You just don’t have the participation here of the actual players in this space. Pivotal, really, was just looking to get out of the Hadoop market. They’ve had no success in terms of getting new customers or getting any market share. That’s how I view the ODP.

Eric: It was a very interesting development. I’ll certainly be curious to see how it pans out over time, and what the long-term effects will be. Only time will tell. Obviously you guys are doing some cool stuff, and I guess you have some announcements. I’m curious to know, what’s coming down the pipe?

Tomer: We’re announcing the general availability of Apache Drill 1.0 in the MapR Distribution. The Apache Software Foundation is releasing Apache Drill 1.0, and we’re a major contributor to that project. MapR, the company, is also including Apache Drill in its distribution and offering support to our customers. This is a project that we’ve been working on for almost three years, and it’s something we’re extremely excited about. It’s all about enabling self-service data exploration and analytics.

Really, when you look at Hadoop, how do you expand Hadoop so that it’s accessible to a broader audience, not just people who are Java developers, right? How do you enable the entire community to benefit from it? What Drill does is enable business users and analysts and data scientists to really take advantage of Hadoop by providing them with a familiar SQL interface. They can use standard BI tools.

We also got rid of all the traditional overhead of an analytics platform, so there’s no need to load data. There’s no need to create and manage schemas. There’s no need to pre-process the data or ETL the data. We’ve kept the good things about query engine technologies, which is being able to query a lot of data fast, but we’ve gotten rid of everything that really never provided value to anybody — all of that, what we call overhead. Companies are really in need of much more agility, and they have to move a lot faster.

Eric: That’s really good stuff, and I’m very impressed with what I’ve heard about Drill. Once I heard about what it does, I thought, “Wow, this is good stuff.” Basically, if I understand it correctly, and feel free to correct me otherwise, but it’s a sort of schema-on-demand-type query where you don’t need to come up with a schema before you do the query. Instead, the engine itself is making some determination as to what the schema might be and thus allowing people to do all kinds of interesting ad hoc queries, right?

Tomer: Right, exactly. You can, for example, take a directory of JSON-structured log files and just run a query, a select star or a select timestamp, on that directory, and that query runs. This required a completely different architecture from anything that’s ever been done, because what Drill is doing under the hood is recompiling your query on the fly as it sees the data. It can do that multiple times as the query is running. Typically, a query engine has to know what the data looks like and compile the query at planning time, based on knowledge of the schema.

This is a very, very different architecture, but the beauty of it is that it’s a natural fit for modern data sources. All these non-relational data stores that have come up in the last five or more years, like Hadoop and MongoDB, and the cloud data sources, like Amazon S3, Google Cloud Storage or Azure Blob storage, are not the typical relational data stores where you have a schema and a SQL interface. They don’t have a SQL interface, and they don’t have schemas. Drill really is what finally enables the end users, the business users, the analysts, to query that data and use the BI tools that they love with these platforms.
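To make the workflow Tomer describes concrete, here is a minimal sketch of issuing that kind of ad hoc SQL over Drill’s JDBC driver against a directory of JSON files. The connection string, the dfs workspace, the /data/logs path and the field names (ts, status) are assumptions for illustration; the point is that no table definition, load step or schema registration happens before the query runs, and the same SQL interface reaches other configured storage plugins (a MongoDB or S3 source, for example) once they are set up.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DrillAdHocQuery {
    public static void main(String[] args) throws Exception {
        // Drill's JDBC driver; "jdbc:drill:zk=<hosts>" would be used for a
        // ZooKeeper-coordinated cluster instead of a single Drillbit.
        Class.forName("org.apache.drill.jdbc.Driver");
        try (Connection conn = DriverManager.getConnection("jdbc:drill:drillbit=localhost");
             Statement stmt = conn.createStatement();
             // Query every JSON file under the (hypothetical) directory directly:
             // no CREATE TABLE, no schema, no ETL. Drill infers the fields from the data,
             // and nested fields can be addressed with dotted paths if they exist.
             ResultSet rs = stmt.executeQuery(
                     "SELECT t.ts, t.status " +
                     "FROM dfs.`/data/logs` t " +
                     "WHERE t.status >= 500 " +
                     "LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getString("ts") + "  " + rs.getLong("status"));
            }
        }
    }
}
```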

Eric: This is very interesting stuff because the latencies are what kill projects, it seems to me. The human brain only has so much patience for latency, for hurdles and challenges and difficulties, and so forth. Some people love to sink their teeth into a challenge, but when you talk about the process of building a schema to satisfy some described business use — well, guess what, that stuff is pretty difficult to get just right. I think that’s been one of the holdups in delivering the value of, first of all, business intelligence, but now even analytics, which stretches into business processes. To me, this is very compelling.

I’d like to understand, if we could get into a bit of detail, what’s going on under the covers. In other words, this engine that Drill provides, is it making some presumptions or is it trying to make some determination as to what the end user is trying to figure out? That, I’m sure, is based upon some rules that look at the data types that are coming in and maybe the number of data sets and some other basic parameters of the data. Then, as you suggest, you’re able to iterate.

That’s the key, right? To throw something out, to get something back, and then to be able to iterate over a very short period of time — maybe two, three, four different sessions — to where you start getting a picture of what’s going on. It’s in the correlation of these different fields and different data sets that you usually find the real value. Can you explain how Drill is actually doing what it does?

Tomer: Sure. A lot of data today, in fact most of the data that people interact with, is self-describing in nature. If you look at a CSV file, there’s a bunch of fields in each row. If you look at a JSON file, each of those JSON documents has a bunch of fields defined in that structure. The same is true for Parquet or MongoDB or HBase or all these data sources; they are self-describing. If you were to look at the data, you could kind of make sense of that data.

What we had to do is build a query engine that can do what that human being was able to do, but in a way that works for trillions of records and works really fast. Drill is actually the first distributed query engine that, instead of having a relational data model where you assume the data has a fixed set of pre-defined columns, has a JSON data model. The JSON data model is suitable for representing any kind of data, whether that’s flat, fixed data like CSV, schema-free data like HBase or JSON records, or complex data like Parquet. All those types of data can be represented internally, conceptually and logically, as JSON, whereas they cannot all be represented as relational records.

We built the first query engine that features a JSON data model, and that’s the first step in terms of making it easy for the user, but then under the hood we had to make it really, really fast. How do you make something that goes really fast while providing all this flexibility? We actually built the first columnar query engine that supports complex data. That had never been done before.

There have been columnar databases like Vertica, for example, that are very good at what they do. Then, there have been systems like MongoDB that have a JSON document model. We bring that together, combined with the scalability of Hadoop. You have this massively parallel, very scalable query engine that supports a JSON document model, but also runs at columnar speeds.

Eric: Wow.

Tomer: Then we do all sorts of other things, like automatically recompiling queries, all sorts of optimizations around bytecode rewriting on the fly, using multiple bytecode compilers, and our own specialized memory management, so there’s no garbage collection and the footprint is smaller than what you would normally have. There are all sorts of other things, obviously, that come in here.

Eric: Yes, and I think one of the benefits of JSON is that it really has become the de facto standard now for data. There was a guy who spoke at a conference I went to last year who said JSON is the JPEG of data, which I thought was a very clever way of characterizing its popularity. The thing about JSON is that it does have descriptors, right? It comes with its own packaging such that you know what the data is. I’m guessing that’s one reason why you were able to achieve as much as you have with Drill, is because JSON is the de facto standard, right?

Tomer: It’s true. JSON has emerged as what XML tried to be, but XML was too complex and too difficult to deal with. JSON has actually emerged as the de facto standard way to represent data. If you look at how people are creating log files or other data and exchanging data between companies, the vast majority is JSON-based.

Then there are other formats like Parquet, which is a very, very high-performance columnar representation. It’s also self-describing, so it has the labels for the data as part of the data itself. There are different formats, but they have a lot in common in terms of the richness of the data model. You could think of JSON as the superset: everything can be logically described as a JSON document.

Eric: Well, that is really cool stuff. I would encourage anyone out there who doesn’t know much about MapR to hop online, it’s MapR.com, right?

Tomer: Yep.

Eric: You have this very interesting approach. I think it was very wise of you to focus on enterprise-grade storage alongside HDFS and to enable these kinds of real-time use cases. I say congratulations, you are doing a great job. I can’t wait to learn even more about what you’re doing, because I honestly think this Drill project is a watershed event.

Over the last 20-30 years we keep hearing about the wishes and the hopes of business intelligence and analytics. There are lots of good success stories, but I think the single biggest hurdle next to data quality has been this need to develop schemas which are very complex, and can be very difficult to build, at least in a manner that satisfies the business use, for all kinds of reasons.

One being that businesspeople, a lot of the time, don’t know how to describe exactly what they want. By shortening that process and really kicking that hurdle out of the way, it seems to me that Drill has a great deal of promise, if nothing else, for changing how people look at data and changing the process by which they get value from the analysis of their data, right?

Tomer: Exactly.

Eric: Wow, that’s good stuff. Well, folks, thank you so much for your time and attention. We’ve been talking to MapR. This is wonderful stuff. It’s Tomer Shiran. Thank you for your time.

Tomer: Yes, thanks a lot. Thanks for having me.

Eric: Okay, folks, you’ve been listening to Inside Analysis. Take care. Bye-bye.
