Inside Analysis

Enterprise Hadoop in Production

Legendary data warehousing and analytics guru Dan Graham of Teradata was recently interviewed by Eric Kavanagh, The Bloor Group CEO, about the evolving Hadoop ecosystem and how it impacts the world of data-driven business.

Eric Kavanagh: Ladies and gentlemen, hello, and welcome back once again to Enterprise Hadoop in Production. My name is Eric Kavanagh and I’ll be your host for today’s conversation, folks, and we’re quite fortunate this afternoon to have Dan Graham on the line from Teradata. He, I would argue, is the man of a thousand stories; he’s been in the industry for a number of years and really knows what’s happening on the front lines, and he also has perspective. He can see where we’ve come from, where we are today, and where we are going. We are very pleased to have Dan Graham of Teradata. Welcome to the show.

Dan Graham: Thank you.

Eric Kavanagh: Let’s just walk through some of the changes in the marketplace these days. Obviously, you’ve been following this industry for quite some time now, and you’ve got a lot of perspective that frankly I think a lot of younger people just don’t have, for no other reason than they weren’t around. They didn’t experience a lot of this stuff, the dot com bust, for example. We were just talking about that before we hit the record button … I look at Big Data, and how much it’s changing things, not just on the technology side, but also on the business side, and the consulting side of the equation, and so forth. What’s your thought on how big of a change Big Data has brought already on the information landscape, and where are things going in your opinion?

Dan Graham: I think the number one thing that Big Data did was it raised the awareness of data analysis all over the world and in every location inside the company. It used to be that, as Teradata Corporation, we’d go in, we’d have to persuade people, and push them, and show them all these use cases and ROIs so that they would be interested. Now that’s not the case anymore. Everyone from the top to the bottom of these organizations is concerned about Big Data. They have some idea of what it is, when before they wouldn’t have paid it any attention. What does that mean? That means that there is funding. That means in the executive ranks there is somebody there that’s interested in Python programming. He doesn’t know what it is, but he’s interested, because he knows that he’s got to get a hold of the people that do that. The biggest change was really not technical – it was cultural.

Eric Kavanagh: Yeah. That’s certainly a good point.

Dan Graham: Does that make sense?

Eric Kavanagh: That’s a really good point. It’s great news for people in the business, now you don’t have to spend so much time trying to persuade someone, now you can kind of get down to the nitty gritty of how can we use this to your advantage.

Dan Graham: Right. The second big change that it really brought was that now we have a world full of Java and Python coders who care about deep dive analytics. Before, that was something that those data warehouse guys did and we didn’t care about them, but now that it’s within reach of the Java programmer, they are out doing regression analysis and all this mathematics. That’s a pretty big change, from “I don’t care at all” to “I’m all in on this analytic thing” – which basically quadrupled the number of people who could actually do advanced analytics.

Eric Kavanagh: That’s a really big deal, too, because if you’re for want of resources, mainly human resources, that’s a huge hurdle to overcome. Whereas, these days, like you say, there are developers everywhere. I really think that the rise of the developer has had a very positive impact on breaking down what was that whole business IT divide of many years. What do you think about that?

Dan Graham: I think that those developers in particular, the Java and Python programmers, are really a little bit young in terms of dealing with these issues, the cultural issues of working with the business at this level on these topics. I think the good news is that there is a bigger army out there trying to do these things. I think the downside is that they are not as skilled at it as maybe the business intelligence and data warehouse people have been over the last decade. That’s okay. As soon as you get an army moving in a direction with this many people, things change, things get improved, and it’s a learning process on both sides of the fence.

Eric Kavanagh: You bet. On the data management side, obviously, the rules have changed certainly with Big Data. Ideally you want to leave it where it is. You want to try to do this analytics on the edge type stuff that everybody is talking about. The movement side of the management side has changed a fair amount, too.

Dan Graham: I think there have been some very interesting ways Big Data has affected data management, and I’m going to, again, not go to technology so much as say we’ve redefined what it means to have data management. That wasn’t the intention, but the result is the same, which is that data management used to mean you put everything in the database, and you were done. Now it means I can manage my files. I’ve got Spark and Hadoop managing flat files, doing queries in parallel, and that’s really the secret. The real secret is, we can run queries in parallel on flat files, and this is truly amazing. This is a true innovation. Nobody even thought of this ten years ago. It wasn’t even in the dreams of data management people. We all had to put it in the database in order to use it; now we’ve got the big files, if there are big files, that we can do parallel queries on, and that is the big change that Big Data brought us.

Eric Kavanagh: Yeah. Parallel computing, that’s one of the biggest changes brought by this whole open source movement, really, and of course how Hadoop came out of Yahoo; it’s the mechanism by which they index the web. That’s a very particular use case, but now what you are seeing, since it has been open-sourced, and really embraced by many of the major players in traditional enterprise software, is a greater awareness, it seems to me, with every passing week around how to use this technology platform. It’s really an ecosystem, because there is Hadoop and now Spark has frankly taken the industry by storm as this execution engine for all kinds of Big Data. I think we’re still kind of at the beginning of understanding just how useful this whole new tool set is.

Dan Graham: Right. Absolutely. What’s it doing? If you have this parallelism on your flat files, you can change your service level agreements. You can perform analytics, you can perform batch processing, X times faster. If you have ten nodes, it’s ten times faster. If you have a hundred, it’s a hundred times faster. That means that your ability to perform is radically changed, and then you have to find workloads that actually fit that. Now, up until five or six years ago, I don’t think customers really realized that they had workloads that were that big. They had to explore. They had to learn. There are some workloads in the data center that fit that mold very nicely. There are a lot of interesting things about Spark, and there are certainly a lot of interesting things about Hadoop, but it’s the parallelism that’s making a difference. Otherwise, Big Data would have never happened.
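The idea of running one query in parallel over flat files can be sketched roughly as follows. This is only a minimal illustration, not how Hadoop or Spark is implemented: the partitions, the "region,amount" record layout, and the function names are all hypothetical, and threads on one machine stand in for nodes in a cluster. The "ten nodes, ten times faster" figure is the ideal case; in practice the merge step, data skew, and coordination take some of that away.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical flat-file partitions: each inner list stands in for one file
# (or one block of a large file) holding "region,amount" records.
PARTITIONS = [
    ["east,10", "west,5", "east,7"],
    ["west,3", "north,8"],
    ["east,1", "north,2", "west,4"],
]


def scan_partition(lines):
    """Aggregate one partition independently of all the others."""
    totals = Counter()
    for line in lines:
        region, amount = line.split(",")
        totals[region] += int(amount)
    return totals


def parallel_query(partitions, workers=3):
    """Scan every partition concurrently, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(scan_partition, partitions))
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # Counter.update adds the counts together
    return dict(merged)


# Equivalent in spirit to: SELECT region, SUM(amount) ... GROUP BY region
RESULT = parallel_query(PARTITIONS)
```

Because each partition is scanned without reference to any other, adding workers (or nodes) shortens the scan phase; only the final merge has to see all the partial results.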

Eric Kavanagh: Yeah. That is a really good point. The parallelism, it’s a game changer. There are lots of game changers out there. Spark, for example, really electrified, pardon the pun, the entire Big Data landscape, because of how quickly you can crank through this number crunching. That’s a very seriously big change in the way that we look at Big Data and the possibilities that we even consider, and I think one of the hurdles right now is people wrapping their heads around the new kinds of questions that they can ask, and the new things they can do with data.

Dan Graham: Absolutely. Even trivial, mundane things, like backing up a hundred terabytes of data; it’s something simple like that, but when you don’t have parallelism, it’s not simple, it’s very painful. Spark comes along, Hadoop comes along, and I think you mentioned Spark is obviously the killer. It’s definitely stepped in and said, “you are not going to do that anymore, are you? That’s a bad idea.” We knew at Teradata a long time ago that the product was fundamentally a flawed algorithm, but we didn’t have anything else to substitute, until Spark came along, and said, okay, we’re going to do the same thing it did, we’re going to do more of it in memory than not, and we’re going to add relational operators to Spark, so that it can be easier to program. Wow. What a concept.

Eric Kavanagh: Yeah. This is just amazing stuff, because it really is opening new doors, and it’s changing the way people view data management and view analytics, and I just love this stuff. There is this lingering question around the impact of this whole movement on data warehousing, per se. I am curious to hear your thoughts, and I’ll not color the question too much, just to get a clean shot at what you think about the impact of Hadoop, and Big Data, and this ecosystem on data warehousing as a discipline itself. Where do you see things going in the future?

Dan Graham: Of course, as part of the Teradata company, you would expect me to say data warehouses are going to live forever. I would say that, because that’s what Gartner is saying, that’s what Forrester is saying, that’s what all of the major analyst houses are saying: that about eighty percent of your workload in any shop is going to go through the data warehouse, for a variety of reasons. For the fact that the data is integrated, and it’s clean, and schema on write. For the fact that there is high concurrency. For the fact that there is a huge ecosystem of beautiful tools that are mature, working with the data warehouse. We’re going to be here for a good ten, twenty years before anything can really affect us in the long term. That doesn’t mean that Hadoop doesn’t have a useful future. Spark has a useful future.

I’ll go on record saying, it would be a disaster if Hadoop and Spark were used to replace the data warehouse. We already have a good data warehouse technology, we have it at Teradata, we have it at IBM, and we have it at Oracle, so why do we want to replace that with the open source? Is it purely a financial thing? Which doesn’t actually turn out to be that credible sometimes. I think Hadoop and Spark have a much better future doing the data lake stuff.

Doing things that the data warehouse cannot do, and shouldn’t do. This is kind of like saying, should I try to make my Ford pickup truck be a passenger car, or a bus? That’s a bad idea. It’s very similar to, should I try to make Hadoop into a data warehouse? Hadoop has got a big future. It needs to expand into it. The data warehouse has a big future, and it needs to keep doing what it’s been doing.

Eric Kavanagh: Yeah. I think that there is a tremendous amount of engineering work that was done in the data warehousing space. Which by no means should be lost. Right? Because if you look at what is happening now, with the Big Data space, no doubt, lots of interesting stuff is taking place, but I also kind of see us making some of the same mistakes, it seems to me that metadata was once again the red headed step child, they are now trying to work on that, there are various projects that are focused on that, but nonetheless, you kind of see some of the same errors being made in this new movement that were made, twenty, thirty, forty years ago. That’s got to be a little frustrating.

Dan Graham: There’s a little joke going around that maybe Google should have Googled Teradata at some point. That in fact, if they would have read up a little bit on parallelism, they would have had better algorithms, and eventually they started evolving towards that, but we don’t want them to Google us too much, because we are a certain distance ahead. We don’t need anyone to catch up to us, and it certainly is possible, if they pay attention. I don’t know, the market in this area, there is a lot of immaturity, there are a lot of areas. Let’s give an example: one of the things that Spark is struggling with is memory management.

All the Java programs struggle with it, as well. If you’re just running a small server, and a small Java application, no problem, but Spark, by its nature, is working with Big Data that blows out the memory, and then boom, it crashes. This whole notion of spill to disk, and caching, and all this stuff was perfected in the eighties and nineties by the database vendors; Oracle probably did it first. Teradata was the first to deal with flooding memory with huge table scans, and coping with that. Which is what Spark is coping with. What’s interesting is yes, they are repeating the mistakes of the past, and yes, in some cases coming up with some new ideas, too.

Eric Kavanagh: Yeah. That’s a really, really good point. I think you’re articulating some wisdom that would best be absorbed by many people in the industry as we go forward, because a lot of these things have already been learned. Right? There is no need to learn things the hard way. Learn the easy way. Like you say, do some research, and figure out who has already cracked the code in a particular space, and you can certainly help yourself out significantly, and not really lose so much time, because time is so valuable, it seems to me, these days more than ever. If you look at how quickly some of the new analytics driven organizations have taken off, and taken flight, and then just dominated their industries, you can see that time is very valuable. It’s even more valuable than it was ten, twenty years ago, it seems to me. What do you think?

Dan Graham: We definitely live in internet years, so it doesn’t take very long before there’s a rocket coming out of Silicon Valley, or some other area, that nobody can catch up with. We’ve seen it with MongoDB, for example. Mongo, and their strengths with dealing with data, they’ve just been a phenomenon, and nobody saw it coming. I am not even sure that the people at Mongo could have dreamed this big when they built this product. You are right. There’s this acceleration and time compression that has to be dealt with.

Eric Kavanagh: You bet. Let’s talk about some other topics, here. This is one of my favorites to discuss. The whole concept of machine learning, which I would argue has nothing to do with machines learning much at all. It is largely misconstrued, certainly in the mass media, but even within our own ranks of people in the data management and analytics space. Pardon my stuttering there. It seems to me that machine learning is really powerful; it’s invaluable. I kind of view it as high powered A/B testing, really, and what you’re doing is you’re using these various algorithms, like Random Forest or segmentation, to really fine tune a particular lens. Isn’t that what it boils down to, that machine learning is great for fine tuning things?

Dan Graham: Yeah. I think that’s true. Basically machine learning isn’t really new. This whole data mining and machine learning, I got my first lessons back in ’96 and ’97, from some very brilliant people. It’s nothing new, but what is new is that now we’re dealing with larger amounts of data. One of the things you discover is that the more data you have, the more accurate the models become. We have a little bumper sticker inside Teradata, it’s called “mo-data, mo-better”, which means the more data, the better your accuracy becomes.

In fact, there’s a belief system building in some of the data science communities that you don’t, it doesn’t really matter which algorithm you use, if you just use more data, you’ll still come to the same conclusions. You’ll still find the same outliers. If you have four algorithms you can play with, it doesn’t matter, just keep throwing data at it. One of them will come out ahead, but generally speaking they will all be good enough with enough data. There is a lot of interesting things happening with the analytics environment. I don’t know. I think the algorithm economy is starting to emerge, that’s one of the ones that sort of excite me.

Eric Kavanagh: Yeah. It’s very useful stuff, but it has to be used within a particular context. You really need to know where, when, how, and why to apply these algorithms, and then what to do with the end results. I’ve seen, just doing some work with the South by Southwest Interactive Group, all these companies throwing out machine learning as if it’s a panacea, and of course it’s not. It’s, again, another great tool to have in your toolkit, but it’s not going to solve the world’s problems unless you really focus it precisely on a particular business problem. Right?

Dan Graham: Right. What is machine learning really doing? It’s doing pattern detection. It’s looking for patterns, very much the way the human eye looks for patterns. We look for faces in a crowd; we detect a pattern, so that I know it’s Eric, that kind of a thing. That’s really where we need to go, and then I think you hit a good point: if you don’t turn that into some business value, if you don’t connect it to a very specific vertical use case, it’s just interesting technology, and it doesn’t really do anything. It sort of falls to the floor. The difference from maybe what we were doing ten or twenty years ago, to now, is that the data scientist is absolutely required to get their fingers dirty with the vertical business problem. Here we have this graduate level mathematics word problem, that’s really what we are dealing with, is that word problem, and we have to express it in terms of the business needs, then map it into, “How come this pattern detection actually makes a difference?” Very interesting pattern detection, so my gosh.

Eric Kavanagh: You can figure out things that you would have never figured out before. I think that is one of the keys. Especially, as you suggest, when you add lots, and lots, and lots of data, you can start to see patterns emerge that I would argue would not have been visible before. If you keep your mind open, that can be very powerful stuff.

Dan Graham: Let me give you an example. We’ve actually had multiple customers in different walks that have run into this situation. Most of it is sensor, internet of things based, and so it happens in helicopters. It happens in earth movers, like Caterpillars and Kubota machines. You’ll have a circumstance where you have this three or four hundred thousand, or five million, dollar machine. You’ll put anywhere between fifty and two hundred sensors on the machine, taking all the data on those sensors, and feeding it back. One of the things that might happen, and has happened quite a bit, in fact, is that you’ll find like twenty of the sensors are all pushing up towards threshold, which basically says, if you cross the threshold, you’re in the danger zone.

This vehicle is about to have some serious crash, whether the engine seizes, or the helicopter starts heading towards the earth too quickly, whatever it is you are reaching up towards that critical threshold.

Problem is the twenty sensors are all bumping their head on the threshold, but never crossing the line. What that means is, the idiot light never goes on, and Eric, you’re up there in a helicopter, and you never see an indication that you are about to die, but twenty of the sensors all at the same time are flickering at microsecond speed, just close to that threshold. With the machine learning you can detect patterns, and you can see that these things are happening when a human cannot. You can also see that this is happening when the programmer couldn’t possibly anticipate all these combinations of sensors having this behavior. You cannot use a rules engine, you cannot use a programmer to detect this kind of thing, you need a pattern detector, which is machine learning that can say, you know what, Dan, Eric, you got this situation, and you might want to land the helicopter, or turn off the Caterpillar, and get out of there.
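To make the helicopter example concrete, here is a small sketch of the signal involved. It is not machine learning itself: it is a hand-written stand-in, assuming a hypothetical snapshot of fifty sensors with a shared limit of 100, that contrasts a per-sensor alarm with a joint score of the kind a trained pattern detector would be expected to pick up across combinations nobody anticipated. The sensor names, values, and the 5% margin are all illustrative assumptions.

```python
def per_sensor_alarms(readings, thresholds):
    """Classic warning-light rule: fire only when a sensor crosses its limit."""
    return [name for name, value in readings.items()
            if value >= thresholds[name]]


def near_threshold_fraction(readings, thresholds, margin=0.05):
    """Fraction of sensors sitting within `margin` (relative) of their limit."""
    near = [
        name for name, value in readings.items()
        if value >= thresholds[name] * (1.0 - margin)
    ]
    return len(near) / len(readings)


# Hypothetical snapshot: 20 of 50 sensors hover at 97% of a limit of 100,
# while the other 30 sit comfortably at 60.
THRESHOLDS = {f"s{i}": 100.0 for i in range(50)}
READINGS = {f"s{i}": (97.0 if i < 20 else 60.0) for i in range(50)}

ALARMS = per_sensor_alarms(READINGS, THRESHOLDS)              # nothing fires
JOINT_SCORE = near_threshold_fraction(READINGS, THRESHOLDS)   # many are close
```

No individual alarm fires, yet a large share of the sensors are crowding their limits at once. A learned model earns its keep when the interesting combinations are not known in advance, which is exactly where a fixed rule like this one stops scaling.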

Eric Kavanagh: Right.

Dan Graham: Does that make sense?

Eric Kavanagh: Yeah. That’s a great example. It’s a great explanation. This is good stuff, and you know, as I think about everything that’s happening, really what we have is this nexus of innovations. We are talking about machine learning, we are talking about open source, we are talking about Big Data, and of course Big Data is fueled by things like mobile, and social, and that other big thing called the Cloud. Our new data scientist, Dez Blanchfield, was saying the other day that there was a famous quote by one of the heads of IBM, probably fifty years or so ago, who said, “I cannot imagine there’s going to be a need for five, or six computers in the world.” Of course, that changed dramatically with the PC, but Dez’s point is that we are kind of going back in that direction now.

If you look at Amazon, for example, or Rackspace, or Oracle, or Microsoft, or these other big vendors, and their massive Cloud plays, those are kind of like gigantic computers. The Cloud is just the network array of computers with all kinds of functionality baked into it. How big a deal is Cloud for analytics? How much of a game changer is it, and how, and why?

Dan Graham: I think what we’ve seen is that customers are still hesitant to put their crown jewels data into the Cloud. I think they are finally getting over that. The first roadblock they would throw up was, oh, gosh, the security issues. What happens if I put all my data warehouse, my fine refined data that’s so valuable, what if I put it in the Cloud, and somebody at Amazon steals it? They don’t realize that the Cloud vendors have better procedures for security than many of the customers who are concerned about it. It turned out that Cloud security isn’t the big boogeyman, or problem, that it’s been portrayed to be.

I think, in terms of getting the data from the analytic environment into the Cloud, that’s starting to become more and more possible. I think the Cloud vendors still have the problem that the internet is not fast enough to move a lot of data up there, so we’re still calling Federal Express, and handing them all these disk drives, these terabyte disk drives, and saying, here, send it in, so we can load it into the S3 storage bins. That problem still needs to be solved.

One of the things that I am seeing, is that, first of all, a lot of customers, and Teradata, and other vendors, as well, it’s a Cloud first world, we built it for the Cloud, and then it works anywhere else. We kind of solved multiple problems at once. With that as the backdrop, I think, I’ve seen a lot of internet of things processes moving towards the Cloud, as well. That’s kind of a discussion of I don’t have a worldwide network so I will just let the Cloud vendors deal with it. That’s not necessarily solving the problem, there are still performance and latency issues, and other issues, but it’s not a bad plan in some respects.

Eric Kavanagh: Yeah. I am glad you brought up IoT, because there’s another massive sea change, which again, I think we are just seeing the very beginning of right now, because when you start talking about these sensors, speaking back to the mother-ship, going back and forth, to the example you gave a few minutes ago, that’s very compelling stuff, and those are things that could not have been done before. We’re talking about a whole new threatscape, and challengescape, all at once.

Dan Graham: Absolutely. There’s a growing trend, or feeling that the internet of things is going to be a lot bigger than Big Data. By that, we mean the size of the data, the proliferation of applications, the number of customers who are going to do it, is going to be much bigger. While, Hadoop, and Big Data, and all that stuff has been rather popular, it’s going to be dwarfed by what happens with the internet of things. One thing that people have to think about is, does my company participate? Am I going to get disrupted? If I think I am not in the internet of things business, I’m probably about to be disrupted.

Eric Kavanagh: Yeah.

Dan Graham: We are seeing companies such as Verizon getting involved in connected, of course, we are already using their internet, we are already using their communications line, but AT&T has two hundred, two hundred and fifty contracts for applications software to be deployed with our company. General Electric wants to be a software company. Now, wait a second, these were not my traditional competitors. So in fact, in the next few years, if you’re not thinking about building a Silicon Valley outpost to be a software company, you’re probably going to be disrupted, and you are not going to see it coming. It’s just not going to be obvious to your board of directors, and your existing management.

Eric Kavanagh: Yeah. That’s a really good point. That’s really got me kind of scratching my head, too, because one of the keys is you have to stay on top of what’s happening, you have to be inside those conversations, or have people in your company who are, and there you go with the value of being in the Valley. You’re surrounded by all of this. Now, you and I have talked about this before, and nonetheless, you need to be grounded. My buddy Mark once jokingly referred to the reality distortion field that surrounds Silicon Valley, and that is not to be ignored, quite frankly, because there are some disconnects between what is thought about in that space, and what is really happening in the real world. The real world is pretty big. It’s a lot bigger than Silicon Valley.

Nonetheless, the epicenter of innovation, especially in software, is there, and it’s also largely fueled by this whole open source movement. We started with Linux, now it’s the de facto standard for enterprise software. Now, with all these different projects coming out, whether it’s Hadoop, or Spark, or any number of other projects, there’s a tremendous amount of innovation that’s spinning out of that area, and you really have to have someone, if you’re an enterprise software company, who’s dialed into that world to make sure that you don’t get disrupted.

Dan Graham: I think you’ve seen it, Eric, you get around quite a bit. I live in Silicon Valley, and there are outposts for major corporations from all over the world here now, and what they are doing is they’re saying, we don’t know how to be an innovative software company, so we are going to form an organization, we are going to hire a bunch of people in Silicon Valley. BMW is here. Walmart is here. General Electric opened up a big outpost to do their internet of things. You go down the list, there are fifty, sixty companies who have built these outposts, and it’s basically a whole valley of what you could call skunk works. It’s a skunk works in the sense that it’s do something, do something innovative; if you fail, okay, that startup goes out of business, all those people scramble, they get a job in about a week, and we go to someplace else, and try to innovate.

This particular behavior is being fueled by an enormous amount of venture capital money. The Chinese money, and money from the other countries that have been booming over the last years; as I walk around, I see the amount of money that’s coming in, and these people are willing to bet that out of twenty projects, one of them will be an absolute over-the-top winner. The point still is that there is innovation capability here, and in the area of internet of things, if you’re not getting involved, you’re definitely getting pushed aside in the next five years.

Most manufacturers know they have to be involved, the utility companies understand it, transportation companies understand it, but what’s interesting is when you move to insurance, and banking, and healthcare, organizations don’t immediately grasp the importance. They, too, are starting to invest in the internet of things, because that is where the data is.

Eric Kavanagh: Exactly, right. We’re covering a tremendous amount of territory here, thank you. Folks, very few people in the business could speak as articulately about these different issues as Dan, so we are glad to have him on the show. One of the other rising tides, if you will, of technology, which has also been around for a long time, and of course graph theory itself has been around a really long time, but now we have a lot of software built around graph technologies. All this graph analytics stuff. Can you kind of speak to why it is that graph is becoming so popular, and what it does so well that other database technologies don’t?

Dan Graham: Wow. That’s a big question. There’s this t-shirt that we’ve seen that everything is a graph. In fact graph is an epitome of relational databases. It’s the epitome of relationships, because instead of looking at the objects, it looks only at the relationships between the objects, so instead of emphasizing the core record, it emphasizes the connectors. The graph analytics, I think is where streaming data was about ten years ago. In other words, we’ve been watching streaming, and streaming analytics for, I’ve been watching it since 2002, tell you how old I’m going to be. The point is that it is still very young, and it is too complicated for the average organization to grasp.

The emphasis here at Teradata, and I am sure at some other vendors, is to try to make it easy to understand. The way to do that is through use cases where we can make money. When there are reference stories showing that you can make money, then the bosses perk up, and they go, okay, I want some of that, I am willing to fund that project. But as you get into this graph analytics, again, it’s the relationships, and there are different ways of doing graph computing, which I think baffles people. There are really two kinds of graph analytics. One is what I would call a graph database, and a graph database is triples, or RDF, if you will; it is based on having relationships between the different kinds of data objects. That’s very different from the bulk parallel version, which is what you do with things like Giraph, and Titan, and some of the open source projects, and we have one called analytics. It basically does the same thing, it has bulk synchronous parallel, and it’s not a database; it basically takes a SQL database, throws it in memory, and then goes through all the graph linkages in a completely interesting way. Again, BSP is not a new algorithm, but it’s now starting to cook, and starting to be very valuable. We’re seeing the use cases, we are seeing the things that people can do, but keep in mind, there are really two different ways, and they actually don’t compete with each other. You have to learn enough to be able to have that conversation.
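For readers unfamiliar with the bulk synchronous parallel (BSP) style mentioned here, the sketch below shows its general shape on a toy graph. It is an illustration only, not the code of any product named above: the function name, the example graph, and the choice of connected-component labeling as the task are all assumptions made for the example. The point is the rhythm: every vertex computes and sends messages, everyone waits at a barrier, and only then are the messages read in the next superstep.

```python
def bsp_min_label(vertices, edges):
    """Label connected components using synchronous supersteps.

    Each vertex starts with its own id as its label and repeatedly adopts
    the smallest label it hears from a neighbor.
    """
    neighbors = {v: set() for v in vertices}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    label = {v: v for v in vertices}
    active = set(vertices)  # vertices whose label changed in the last superstep
    supersteps = 0
    while active:
        # Compute phase: every active vertex sends its label to its neighbors.
        inbox = {v: [] for v in vertices}
        for v in active:
            for n in neighbors[v]:
                inbox[n].append(label[v])
        # Barrier: messages are applied only after all sends have finished.
        active = set()
        for v, messages in inbox.items():
            if messages and min(messages) < label[v]:
                label[v] = min(messages)
                active.add(v)
        supersteps += 1
    return label, supersteps


# Hypothetical graph with two components: {1, 2, 3} and {4, 5}.
LABELS, STEPS = bsp_min_label([1, 2, 3, 4, 5], [(1, 2), (2, 3), (4, 5)])
```

Within a superstep the per-vertex work is independent, which is what lets a real engine spread the vertices across many machines and synchronize only at the barrier.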

Eric Kavanagh: Yeah. Right. You point out the key aspect here, which is understanding relationships between entities, that can be useful in all kinds of ways. Certainly we see them in social media, more and more we are seeing organizations and individuals realize the value of figuring out who those key influencers are. Because if you convince this one person who has a large following and a lot of other people will listen to that person, well, you just solved a pretty big problem, fairly quickly, as opposed to trying to persuade everyone across the spectrum of your worldview. Find out who those influencers are, and see who you can convert that will kind of do a lot of the heavy lifting for you.

Dan Graham: Absolutely. You’ve got the key influencers. You’ve got the wallflowers. You’ve got the sinks. You’ve got people that hang on, and pay attention. The one I like is the algorithm, because basically that’s a circumstance where you are finding an influencer by checking their emails, and their phone calls, and all the connections that go into them, but this one individual doesn’t have any of that. We all know a boss who kind of loves his cellphone, but he doesn’t do his email, he doesn’t do a lot of callbacks on his answering machine, he doesn’t leave a lot of trace, but if you notice that all the people around him are incredibly important, then this guy in the middle is even more important, because they are all connected.

This guy is really talented, or it’s often a CEO, by the way, a CEO who doesn’t do the email, has somebody else do it, and just relies on his phone. By inferring this information you can find people that you would otherwise miss. You just cannot count things up all the time; you have to use algorithms to find the people who are hidden that way. The same thing occurs with hidden fraud. There is something called, I love, you’ll love this word, it’s the loopy belief propagation. With loopy belief propagation, yeah, I know it sounds crazy, doesn’t it? It’s basically, again, you’re inferring that some website may be fraudulent because there are a number of other websites that are connecting to it, and they’re fraudulent, and so you can do enough calculations on these relationships and go, okay, I’m guessing that website X is probably fraudulent, it’s got a forty percent chance of being so, we will investigate it with the special investigations people. This kind of loopy belief fills in missing variables. It fills in things that weren’t actually measured.
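
As a rough illustration of the inference described here, the sketch below runs sum-product loopy belief propagation on a tiny graph of websites, each of which is either "ok" or "fraud". Everything specific in it is an assumption made for the example: the function name `loopy_bp`, the site names, the prior probabilities, and the `same` parameter (how strongly two linked sites are assumed to share a state). It is a toy, not a production fraud model.

```python
def loopy_bp(priors, edges, same=0.8, iterations=20):
    """Estimate P(fraud) per node; priors maps node -> prior P(fraud)."""
    neighbors = {v: set() for v in priors}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    # Unary potential per node: (weight for "ok", weight for "fraud").
    phi = {v: (1.0 - p, p) for v, p in priors.items()}
    # Pairwise potential: linked nodes tend to be in the same state.
    psi = ((same, 1.0 - same), (1.0 - same, same))
    # One message per directed edge, initialised to "no opinion".
    msgs = {(a, b): (0.5, 0.5) for a in neighbors for b in neighbors[a]}

    for _ in range(iterations):
        new = {}
        for (i, j) in msgs:
            # Combine i's prior with what i hears from everyone except j.
            pre = list(phi[i])
            for k in neighbors[i]:
                if k != j:
                    pre[0] *= msgs[(k, i)][0]
                    pre[1] *= msgs[(k, i)][1]
            out = [sum(pre[xi] * psi[xi][xj] for xi in (0, 1)) for xj in (0, 1)]
            total = out[0] + out[1]
            new[(i, j)] = (out[0] / total, out[1] / total)
        msgs = new

    beliefs = {}
    for v in priors:
        b = list(phi[v])
        for k in neighbors[v]:
            b[0] *= msgs[(k, v)][0]
            b[1] *= msgs[(k, v)][1]
        beliefs[v] = b[1] / (b[0] + b[1])
    return beliefs

# Hypothetical data: A, B, C are strongly suspected; X was never measured (0.5).
priors = {"A": 0.9, "B": 0.9, "C": 0.9, "X": 0.5}
links = [("A", "B"), ("A", "X"), ("B", "X"), ("C", "X")]
print(loopy_bp(priors, links))
```

Site X starts with no evidence of its own, yet its estimated fraud probability rises above its prior because of the company it keeps. On graphs with loops the result is an approximation, which fits the spirit of "I'm guessing, go investigate".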

Eric Kavanagh: You made a really good point, right there, in what you describe, because at the end of the day, a lot of times, these algorithms they are not going to solve some problem for you, but they are going to show you where to look. Right? You are going to need people who will look under the covers, and try to piece things together, and make sense of all this stuff, because as good as these computers become, and analytics gets, nonetheless, I believe, you are always going to need a human being to stitch together the bigger picture, and figure out what is really happening, and then come up with a solution of what to change, and how to change it. You need people to adjust the dials, pull the levers, and otherwise, make decisions based upon what they are getting from the algorithms.

Dan Graham: Absolutely. You cannot turn this over to machines, yet. We’re probably ten years away from letting the machines do this, too. Everything that you look at with a graph is kind of a ripple effect. If you think of dropping some water in the ocean, or in a big bowl, and it ripples out from there, there is influence all throughout the graph, and as you pointed out, can we find the kingpins, can we find the most important functions, or people, or events, in the graph. An example that everybody understands is home prices: if somebody in your neighborhood forecloses it ripples out across the entire neighborhood. Everybody’s value goes down. If somebody sells their house for an exorbitant amount, everybody’s house in the neighborhood goes up a little bit. This is the ripple effect that you can find. So now, here’s a pricing optimization tool called graph.

Eric Kavanagh: That’s a really interesting point. It’s the kind of thing that you cannot really accurately predict, you have to kind of stay on top of things, and that may be a good segue to talk about something you mentioned a minute ago, which is streaming data, and of course streaming analytics. I’ve been following this space for a while, and it really is fascinating. Talk about turning things inside out: as opposed to running a SQL query on static data, instead you’ve got standing queries on streaming data, again, looking for patterns, trying to understand what’s changing. You can kind of think of some really good use cases around transportation, for example. Networks are a really good example, too. But streaming analytics, how big a deal is that going to be in the near future?
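
The "standing query" idea can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `standing_query`, the window and threshold values, and the router names are invented, and events are assumed to arrive in timestamp order. Instead of the data sitting still while a query runs once, the query sits still and every arriving event is checked against it.

```python
from collections import deque

def standing_query(events, window_seconds=60, threshold=3):
    """Yield (timestamp, key) whenever a key occurs `threshold` times within the window."""
    recent = {}  # key -> timestamps still inside the sliding window
    for ts, key in events:
        window = recent.setdefault(key, deque())
        window.append(ts)
        # Expire anything that has slid out of the window.
        while window and window[0] <= ts - window_seconds:
            window.popleft()
        if len(window) >= threshold:
            yield ts, key

# Hypothetical stream of (seconds, router) error events from a network.
stream = [(0, "r1"), (10, "r2"), (20, "r1"), (30, "r1"), (200, "r1")]
for alert in standing_query(stream):
    print("alert:", alert)
```

Because the function is a generator, it works the same way on an endless feed as on this short list; only the small per-key windows are kept in memory.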

Dan Graham: I think you know as well as I do that it’s about as hot as it can be in the press right now. Which means we are at the tip of the hype cycle. I started working on streaming in my prior company, the little blue people up in Armonk, back in 2002, 2003; this is a market that has been hanging around for fourteen years to become an overnight sensation. Now it’s really, really hot, and I think this is all driven by Kafka, by the excitement of the open source community saying, I think that is really cool, I can start thinking of use cases for it. That’s a good thing. It’s really hot, but it’s also immature. It’s not an area where we have a lot of powerfully developed use cases in place. It’s growing, because as everyone pushes the envelope, we all learn from each other. It’s good overall; there are a lot of products coming on the market.

We have something at Teradata called Listener, and it’s built on Kafka and a bunch of other open source, and a whole bunch of Teradata effort. Its purpose is to be easy to use, so that, Eric, I could spend fifteen minutes with you, I could teach you how it works, and in fifteen minutes after that you could hook up a streaming system and send data to a data warehouse. Ease of use has some value to data scientists and to programmers, and so that is a very effective way to do some of this. What you referenced, as well, a good friend of mine at Gartner has been preaching for fourteen years, and he’s finally come into his own. People are finally listening. You can do filtering and you can do some analytics on the data in flight.

That’s still an exploratory, experimental area for most people. There are some very good uses for it. I don’t think there are as many use cases as the press would like us to believe, but some of the ones that exist are actually quite entertaining. I like the whole idea of putting ETL into the stream, why not?
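
One way to picture ETL in the stream is a chain of Python generators, where each record is parsed, filtered, and converted as it passes through rather than after it lands. This is a sketch under assumptions, not a description of Listener or Kafka: the field names `sensor` and `temp_f`, the batch size, and the `sink` callback standing in for a warehouse loader are all hypothetical.

```python
import json

def extract(lines):
    """Parse raw JSON lines, dropping anything malformed."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(record, dict):
            yield record

def transform(records):
    """Filter out incomplete readings and convert Fahrenheit to Celsius in flight."""
    for r in records:
        if "sensor" not in r or "temp_f" not in r:
            continue
        yield {"sensor": r["sensor"], "temp_c": round((r["temp_f"] - 32) * 5 / 9, 1)}

def load(records, sink, batch_size=2):
    """Hand records to `sink` in small batches, as a warehouse loader might expect."""
    batch = []
    for r in records:
        batch.append(r)
        if len(batch) == batch_size:
            sink(batch)
            batch = []
    if batch:
        sink(batch)

# Hypothetical incoming messages, including one bad line and one incomplete record.
incoming = [
    '{"sensor": "a", "temp_f": 212}',
    'not json',
    '{"sensor": "b"}',
    '{"sensor": "c", "temp_f": 32}',
    '{"sensor": "d", "temp_f": 50}',
]
batches = []
load(transform(extract(incoming)), batches.append)
print(batches)
```

Nothing is buffered beyond the current batch, so the bad line and the incomplete record are discarded before they ever reach storage.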

Eric Kavanagh: That’s exactly right. You know, you brought up another good example that I think is a good segue here for a final topic of discussion, which is really around staying competitive and staying on top of things. I think it’s just been fascinating to watch as this open source movement has evolved, and you’ve seen new players, with Kafka, which comes from LinkedIn. My partner Dr. Robin always has creative ways of saying things; in a webcast, he said, “Kafka is like Hercules.” It was born fully formed. They unleashed this thing in the world, and it was already done. Now Confluent, the company that is hardening it, so to speak, is doing new stuff to make it even more enterprise ready. Here you’ve got Cassandra, of course, which as I recall came out of Facebook. Hadoop comes out of Yahoo. You’ve got Kafka coming out of LinkedIn.

These are huge, highly durable, proven enterprise systems that are now being sold to other organizations to fill in gaps and to change the way business is done, and frankly I’ve been really impressed watching companies like Teradata, and several others in the traditional closed source enterprise software space, stay ahead of the game and figure out how to leverage this stuff. Just look at, for example, Spark and Databricks. Now Cloudera claims to be the number one sales company for Spark; they came out and said it’s going to be fifty percent of their business. That was a bit of a PR move, I think, on their behalf.

Nonetheless, staying on top of what’s happening and leveraging all of that, and building upon the relationships that a company like Teradata already has with big customers, with real people in large organizations, I think you guys have done a pretty good job. But can you kind of speak to me about your overall strategy for staying competitive amidst all of this sea change?

Dan Graham: Teradata has jumped into the open source business. Not everyone knows that we’ve taken a major stake in Presto, which is a SQL-on-Hadoop product built by Facebook. Facebook still controls the core components of Presto, which they are developing for their production system. They don’t really care about the open source community; they donated it, but they’re not investing as heavily in that. They’re just trying to keep their production running, and running smoothly. With Presto, in our case, we are adding all the stuff around it that Facebook didn’t do, like installation, and some additional security, and other things.

Teradata is trying to become part of the open source community. We’ve seen a couple of other players, like Microsoft, reverse themselves in the last few years and start being involved in open source. There are still a few big players that haven’t really dived in, but we’re definitely following that same path. We are using a lot of open source in our products, so Listener probably has about eight open source subsystems inside it. We use Elasticsearch. We use Cassandra. We use Kafka. That’s important to us. Other competitive things, I’ll tell you one of the things we are doing. We are working on a new design for massively parallel, and we will see this next year. I cannot tell you a whole lot about it, except that I personally call it MPP 2.0.

Eric Kavanagh: Wow.

Dan Graham: It’s a significantly radical change from what we’ve all been doing for quite a while. That should come out, and that will be very competitive. That will be a very interesting struggle in the marketplace. Number one, we are trying to solve data drift. Data drift is when you have a lot of Hadoop data marts, and they start to be out of sync with one another and with the mother ship, and the consequence is you start to have a lot of reconciliation problems. You have bad data. We are working a lot on that. We are working on an app center.

Imagine for a moment that, like Apple’s, the app center is a place where you go to get your business analytics. You go up there, and you say, “I need one of these, and one of these, and one of these to build an analytic process that I need to accomplish in my vertical business in retail, market basket analysis, or maybe in oil seismic surveys.” Think in terms of an algorithm economy where you can buy apps that you need and put them in your own framework, so that your users can use them. Those are the kinds of things that Teradata is working on, and I hope that it is competitive. I sure think it is.
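
The data drift problem mentioned a moment ago, data marts falling out of sync with the central warehouse, can be detected with a simple reconciliation check. The sketch below is one possible approach and not Teradata's; the function names `fingerprint` and `find_drift`, the mart names, and the sample rows are all hypothetical. Each copy of a table is reduced to a row count plus a hash, and any mart whose fingerprint differs from the reference is flagged.

```python
import hashlib

def fingerprint(rows):
    """Return (row count, SHA-256 over the sorted rows) for a table snapshot."""
    digest = hashlib.sha256()
    for row in sorted(rows):
        digest.update(repr(row).encode("utf-8"))
    return len(rows), digest.hexdigest()

def find_drift(reference, marts):
    """List the names of marts whose contents differ from the reference copy."""
    expected = fingerprint(reference)
    return sorted(name for name, rows in marts.items() if fingerprint(rows) != expected)

# Hypothetical snapshots of the same table in the warehouse and two marts.
warehouse = [(1, "a", 10.0), (2, "b", 20.0)]
marts = {
    "east": [(2, "b", 20.0), (1, "a", 10.0)],  # same rows, different order
    "west": [(1, "a", 10.0), (2, "b", 25.0)],  # one value has drifted
}
print(find_drift(warehouse, marts))
```

Sorting before hashing means row order does not count as drift. A check like this only says that a mart differs, not which rows differ, so it is a trigger for reconciliation rather than the reconciliation itself.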

Eric Kavanagh: Yeah. That is all great stuff. The app store concept matters, because if you think about how long it takes to get some of these solutions up and running at large organizations, that’s a serious concern. But if you can make these things as plug and play as possible, hey, that’s the magic kingdom, it seems to me. That’s when things really change at an even greater pace, kind of what you see in the world these days. That’s all good stuff. One last question: the reported shortage of data scientists. I think that’s an accurate prediction, or accurate assessment, by some of these research firms, and of course there’s that joke that a data scientist is just a business analyst who lives in Silicon Valley. But it seems to me that there is going to be a whole movement toward, and it’s already happening, educating people, young people perhaps, even some people who are in their middle age, to better understand how to use these technologies, because there is going to be so much opportunity to do that.

Dan Graham: I think the important part is that people are willing to pay now. A data scientist was an invisible person ten years ago. They were the geeky kid that we used to beat up in high school, the guy that knew mathematics while the rest of us copied from his paper. Nowadays, data scientists have been deified; they’ve been made extremely important because of what they can do for the value of the corporation. What’s happened is the boardroom is now talking about data science, and frankly, most of those people don’t understand most of the conversation we just had, but they are trying to figure out how they get their hands on something that produces this digital value.

On data scientists, obviously, I don’t think we are going to have more than thirty to fifty percent of what we need over the next ten years. What that means is a lot of them coming out of India, and China, and Russia, where a lot of these talented mathematicians live today, or are being schooled today. We are certainly not going to grow them here in the United States. We have sort of a flat-to-down birth rate, and these other countries do have a tremendous number of talented people. The consequence is that we certainly need a lot more universities in the world doing this. We happen to have relationships with a number of universities in my company. I happen to know someone up at Northwestern who runs a full-time data science curriculum. We need a lot more of that.

Eric Kavanagh: Good point. Folks, I’ve been talking with Dan Graham about just about everything there is to talk about in the field of data science and Big Data. Talking about Enterprise Hadoop in Production, the folks at Teradata have a lot to say about that. A big thanks to you Dan, keep up the good work, and I’ll catch up to you at a conference, or another show, sometime soon. Thanks for your time.

Dan Graham: Eric, it’s great to talk to you, again. Thanks.

Eric Kavanagh: All right. Take care, folks. You’ve been listening to Enterprise Hadoop in Production with Teradata.

One Response to "Enterprise Hadoop in Production"

  • Thulasiraman Sriramulu
    April 18, 2016 - 10:55 pm

    Thanks for the excellent interview. It really helped to understand how the data warehousing is evolving.
