Inside Analysis

Infobright and Data Lake Survival

The following is a transcript of a conversation between Dr. Robin Bloor, Chief Analyst of The Bloor Group, and Rick Glick, CTO of Infobright. 


Eric: Ladies and gentlemen, hello and welcome to the Data Lake Survival Guide podcast series. My name is Eric Kavanagh. I will be your moderator for today’s discussion with two of the smartest people I know in the business. We’re very pleased to have Rick Glick, CTO of Infobright, and our very own Dr. Robin Bloor on the call today. We’re going to talk about a whole number of interesting issues and how they impact the world at large, and of course, the world of enterprise technology. Lots of things are changing, lots of things are happening. We’re arguably at an inflection point right now. Of course, the data lake itself represents a significant and transformative change in how we all view data as an asset, how we store it, how we leverage it, how we access it, so on and so forth.

Like I said, we’re very pleased to have Rick Glick on the line. He’s a real architectural whiz. He’s certainly taught me quite a few things over the years, and he’ll be interviewed by Dr. Robin Bloor. First of all, gentlemen, thank you for your time today. Rick, thanks for your time today, and Robin, as well. Let me hand it over to Robin Bloor to do a smart interview. Take it away, Robin, the floor is yours.

Robin: Thanks for that intro, Eric. Rick, let’s just start with a general picture of the industry. The way that I see it is that the IT industry has moved forward as it has done decade after decade, but in the last five years there’s been incredible enthusiasm for big data and IT analytics. My question really, to start with, is how compelling do you think this application area actually is, and for what kinds of businesses do you think it is compelling? The question behind that is, is it all businesses?

Rick: We’ll start with the question behind it first. Of course, it’s all businesses. Who doesn’t want to learn from the past, learn from data, and learn from it, be it social media or whatever is out there, and positively impact what it is you do every day? I think irrespective of industry, I think everybody can be smarter using the data that’s in their industry to do a better job and to be more competitive and all the things that businesses do. The last five years have absolutely been amazing because there’s been this huge explosion of approaches and techniques and just novel ways of looking at some of the old problems and thinking about new problems. It’s a great time to work in data, isn’t it?

Robin: Yeah, it is. What I’ve noticed is the hardware has already sped up, so that’s been speeding up, but all of a sudden we had the introduction of parallelism and that made things go dramatically faster. Analytics had been a bit of a niche area, you know, there were a number of organizations, insurance companies, pharmaceuticals, and so on that were using analytics, but it wasn’t widespread, and all of a sudden it seemed to become widespread. It seemed that people could do things way faster than they could do them before. Do you think there’s a need for greater speeds than we can achieve right now?

Rick: There’s always a need for greater speeds and there’s always a need for fewer resources to achieve those greater speeds, because you mentioned parallelism and that’s fueled a lot of the interest in things that are going on right now, but it comes at a huge cost. If you’re doing exploratory analytics, perusing the data before you make hypotheses to test and prove, you want that to work at the kind of speed with which you can come up with questions and thoughts and answers, and it’s highly iterative. We can accomplish that if we overpower things with resources, but it’s hard to accomplish that on a consistent basis everywhere. Does that make sense?

Speed is, for me, it’s incredibly important because I’m not a patient person. I want an answer and want it now because I know that that just formulates the next question and the next idea until I come to a point where I want to use, formulate a hypothesis, create a model, test a model. Those things could be done at a regular pace, but coming up with those hypotheses takes a great deal of agility and just flat out speed.

Robin: Yeah, I would agree with you. I’ve kind of done this job but I did it so long ago, I was an actuarial guy at one point in time, but it’s so long ago that the technology is so far, far different. Really, the point for this job is it is iterative. Once you’ve done your thinking, you would like the answer just to appear in a fraction of a second in actual fact, so you can continue with your thinking. I don’t think there’s a limit to the speed that people will demand. Let’s begin with what Infobright does because I understand that you have some fairly unique capabilities in this area. Could you explain this?

Rick: Sure. We’re developing the ability to ask larger questions off of massive amounts of data or massive amounts of data for the size of your organization within your resource limits and get incredibly fast answers. We call it approximate queries, but that’s not the right emphasis. We relax the exactness of a response, and we do that in order to get incredible speeds and be able to do that exploratory analysis, be able to come up with your hypothesis at fractions of the time, at fractions of the resources. Really it’s kind of breathtaking to actually watch this work. It’s incredibly fast, and it just changes your whole pattern of working.

When you have that kind of power, you can ask ever bigger questions, and you can really have that conversation with the data that’s kind of important. We do it all by actually not doing a lot of the things traditional databases do. We don’t actually store the atomic data, all of the atomic data. We analyze it all. We create models of it all, but we don’t actually store it. We do operations against the models which transform the models, and then we simulate the answers, we generate the answers then. It’s a radically different approach that puts the emphasis on software, not hardware. Again, a fraction of the resources, and incredible performance.

Robin: If I were a data analyst and if I wasn’t getting the thing that I need in order to be able to create some model of the area that I wanted to dig into, I would just take a sample. It would depend upon the data and how variable I thought it was, but I’d probably be very happy with a 10% random sample or a 5% random sample and very large amounts of data, even smaller. How is what you’re doing different from that?

Rick: Sampling is an absolutely, incredibly useful technique. If you take a 10% sample and you store just the 10%, you’re going to use roughly 10% of the resources. If you throw the same amount of resources and hardware at it, you’ll go 10 times faster. That’s a nice incremental improvement. Samples suffer from a couple of things. If I use just a randomized sample, I’m perhaps going to miss some of the outliers and the very things you’re looking for. There are other sampling techniques to make sure that you get the outliers as you analyze the data, but then you miss other patterns in the data and then you can’t do stratification. There are all sorts of good techniques to work with sampling, and sampling actually is a really useful technique.
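
To put a rough number on the outlier problem described here, the following is a minimal Python sketch. It assumes Bernoulli sampling (each row is kept independently with probability equal to the sample fraction), which is a simplification and not anything stated in the conversation; the function name is hypothetical.

```python
def prob_sample_misses_all(sample_fraction: float, rare_rows: int) -> float:
    """Chance that a random sample keeps none of `rare_rows` rare rows.

    Assumes each row is kept independently with probability
    `sample_fraction` (Bernoulli sampling), an illustrative simplification.
    """
    if not 0.0 <= sample_fraction <= 1.0:
        raise ValueError("sample_fraction must be between 0 and 1")
    if rare_rows < 0:
        raise ValueError("rare_rows must be non-negative")
    return (1.0 - sample_fraction) ** rare_rows


if __name__ == "__main__":
    # How often does a 10% or 5% sample contain no trace of a rare pattern?
    for fraction in (0.10, 0.05):
        for rare in (1, 5, 20):
            p = prob_sample_misses_all(fraction, rare)
            print(f"{fraction:.0%} sample, {rare} rare rows: "
                  f"missed entirely with probability {p:.3f}")
```

Under that assumption, a 10% sample misses a single outlier row 90% of the time, which is the weakness of plain random sampling that the discussion points to.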

What we’re doing, however, is we actually make a model of every single piece of data. We model the data completely, so we tend to get all of the outliers, tend to be able to use it in ways that you haven’t thought of when you were thinking about how you want to sample this particular set. We do operations, not against individual rows of data from a sample, but against what the model comes up with. The way we do it is we’ll model a block of data at a time, and we’ll go through all the data and model all of the data one block at a time. Let’s say the block is 100,000 rows. We’ll do a database operation like a filter or an aggregation or a join. We’ll do it on those 100,000 rows, and then we’ll do it on the next 100,000 and the next 100,000 rows and then we put that together.

We don’t do operations on individual rows. We do them on these blocks of rows. The speedup is 10 times, 100 times or more compared with sampling, for example. Right now, and we’re improving the technology, our footprint in terms of the size of our models is roughly 1-2% of the actual data volume. If you’re using a 10% sample, that’s about a 10% footprint. We’re doing a 1% footprint, driving better accuracy than you get with samples, recognizing more patterns and having the ability to ask more ad hoc queries than you can with samples, which you take for specific reasons typically. Better performance, fewer resources, and better accuracy than what you’re getting from samples. Actually, I wouldn’t have believed it before we went down this path and saw what we’re doing. It’s pretty amazing.
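
The block-at-a-time idea can be sketched in Python as follows. This is only an illustration under stated assumptions, not Infobright's actual algorithm: the summary structure (count, exact minimum and maximum, and an equal-width histogram) and every name in the code are hypothetical. It shows a filter-and-count operation answered from per-block summaries rather than from individual rows.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BlockSummary:
    """Compact stand-in for one block of rows (hypothetical structure)."""
    count: int
    minimum: float
    maximum: float
    bucket_counts: List[int]  # equal-width histogram over [minimum, maximum]


def summarize_block(values: List[float], buckets: int = 10) -> BlockSummary:
    """Summarize one non-empty block; the rows themselves need not be kept."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / buckets
    counts = [0] * buckets
    for v in values:
        # A constant block (width == 0) puts every row in bucket 0.
        idx = 0 if width == 0 else min(int((v - lo) / width), buckets - 1)
        counts[idx] += 1
    return BlockSummary(len(values), lo, hi, counts)


def estimate_count_above(s: BlockSummary, threshold: float) -> float:
    """Estimate rows with value > threshold, assuming a uniform spread per bucket."""
    if threshold >= s.maximum:
        return 0.0
    if threshold < s.minimum:
        return float(s.count)
    width = (s.maximum - s.minimum) / len(s.bucket_counts)
    estimate = 0.0
    for i, n in enumerate(s.bucket_counts):
        bucket_lo = s.minimum + i * width
        bucket_hi = bucket_lo + width
        if threshold <= bucket_lo:
            estimate += n  # whole bucket lies above the threshold
        elif threshold < bucket_hi:
            estimate += n * (bucket_hi - threshold) / width  # partial bucket
    return estimate


def blocks_of(rows: List[float], block_size: int) -> List[List[float]]:
    """Split rows into consecutive blocks of at most block_size rows."""
    return [rows[i:i + block_size] for i in range(0, len(rows), block_size)]


if __name__ == "__main__":
    rows = [float(i) for i in range(1000)]
    summaries = [summarize_block(b) for b in blocks_of(rows, block_size=100)]
    approx = sum(estimate_count_above(s, 250.0) for s in summaries)
    exact = sum(1 for r in rows if r > 250.0)
    print(f"approximate: {approx:.1f}, exact: {exact}")
```

Because each summary keeps the exact minimum and maximum of its block, an extreme value is never lost the way it can be in a random sample, and most blocks are answered from those two numbers alone without touching the histogram.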

Robin: I’ve always known, of course, having done it as a job a long time ago, that if you take a random sample and there’s a skew in the data, you won’t necessarily catch it. The problem with sampling is that having done a sample, if you go to the whole of the data you might find something different or so different that you need to redo it or rethink what you’re doing. Are you saying that with the techniques that you’re employing and the kind of modeling of the data you do, even though you’re calling it approximate, it would follow pretty much any pattern that was within the data, not absolutely precisely, but within a small margin? Is that how it is?

Rick: The algorithms we use bear a strong resemblance to many of the machine learning techniques. What we do is we build a model, we then use relational operations against the model, we test the quality of the answers we get. If we find patterns or find things that we don’t find acceptable, we go refine the model creation process, our data ingest process, we go refine that. While it bears familial roots to the machine learning languages, we couldn’t put a set of names next to it and say we’re doing this, because none of them would be accurate.

Robin: Okay, I get you. In terms of machine learning, you have things like clustering, but it’s actually quite a lot of different ones that are heading on the same target. You’ve got regression. You’ve got the categorization techniques. Before you actually choose your particular machine learning technique of choice, you’re actually going to take a view as an analyst on what exactly you are attempting to discover in the data in terms of its fundamental patterns. I get what you’re saying. Would you then say that you had, at a broad level in the way that I’ve described, some kind of equivalence to every one of these techniques that I mentioned and the other ones that I couldn’t bring to mind?

Rick: Yeah, we put them together in unique ways but we use all of those ideas anyway, and then we model not just data distribution, but we spend a lot of time modeling the database. We model the domain of the column, the attribute. We model where things probabilistically are different than what you would expect using the model, and we model those things separately, so there’s a bunch of techniques that go into it, lots of notions of clustering, little bits of regression. All of those things kind of fit in, but it’s kind of our own weird wonderful thing using a lot of those individual techniques.

Robin: Let’s discuss some various use cases. What I’m interested in here is whether the data scientist picks it up and runs with it, or whether you need to have a period of education so they properly understand what you’re doing. How does all of that work out?

Rick: I think they pick up and run with it like they do a standard database. I suspect that people will want to get confident that the results are what I’m saying they are, that they choose the accuracy, that they can follow the trends, that they can actually build models based on the simulated output that we give, that they can actually build a model, and then they’ll want to test those models against real data I suspect. I think they use it the way they’ve traditionally used a standard relational product just asking bigger more impressive questions over larger amounts of data, and being able to interact with it in a much more natural way.

Robin: I can kind of see that. People talk about analytic databases like it was a thing, like we had databases at one point in time, it stored data, you could get it back through queries and things like that. Then people start talking about analytic databases, oh, it’s a different kind of thing. What would you personally think an analytical database was and would you say that Infobright in its current incarnation was exactly that?

Rick: First question, what is an analytic database? It’s one that has a different design point than the standard OLTP let’s just do database management things. It’s focused on joins and aggregation, typical database things, but at large volumes and at greater performance. That’s exactly what Infobright does. We have a different design point than something which does transactional processing. Our design point is around large aggregations, large joins, filtering but working with large sets of data. I’ve never really liked the term analytic databases. It’s just a relational database.

There are things which have different performance characteristics and different design points. I tend to think things which are working with large amounts of data to synthesize and bring down a nice answer set that’s useful is analytics. To be fair, I think there’s also bits and pieces around extensibility that fold into some analytics databases. I think extensibility kind of folds into the definition, but that’s a level down from the overall what’s our design point, what are we trying to accomplish with this product.

Robin: Let’s try and give the listeners some idea as to when you talk about very large amounts of data what you’re talking about. Let me just preface it by saying I looked at a survey recently in terms of the size of data lakes that people are building and it was way smaller than I expected them to be. There were five terabytes, 10 terabytes, some actually even only one terabyte, which doesn’t sound like a data lake to me at all. It sounds like a data puddle in the way people talk nowadays. The very large ones are in the petabyte area, very few and far between. When you say that you deal with large amounts of data, where would you say that your sweet spot is in terms of data volume and where would you say that your limit is?

Rick: First let me step back and talk about the comment you just made by what is a large amount of data. I actually think that’s context sensitive. I think what is a large amount of data for a particular organization, and that’s where you get the five terabyte and ten terabyte. That’s one part of the question. I think that’s contextual. What’s big for me might not be big for you. Then the other thing that I always find interesting is how much data do you put behind a computer? What are those nodes for parallel processing? If you’re wanting to work with petabytes, you tend to put lots of data behind a node because you need to. That’s what you can afford to do. If you’re working with 10 terabytes, you might put far less data behind an individual computer for your parallel processes.

I’m comfortable – and that question “what’s your sweet spot?” is always really difficult – what I would say is I’m comfortable with a single node doing processing for about 200 terabytes of data. Instead of putting 500 gig or terabyte or two behind a node, I think that we could probably do a really good job with 200, 250 terabytes behind a node.

Robin: That’s impressive to me. I don’t think that I’ve ever heard anyone suggest that that would be a reasonable data distribution because that’s very large.

Rick: That’s a large amount of work that a single node is doing, but remember, we’re only keeping that 1-2% footprint and while we’re analyzing all the data, another set of nodes that a company needs to have, we have to see all the data to build our model. We do that in a highly distributed fashion out everywhere. A given node, we can do a ton of really good work on a single node for the engine side of things.
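
The arithmetic behind that answer is simple enough to write down. The short Python sketch below uses only the figures quoted in the conversation (200 to 250 terabytes behind a node, a 1-2% model footprint); the function name is hypothetical.

```python
def model_footprint_tb(raw_tb: float, footprint_fraction: float) -> float:
    """Size of the stored models, in terabytes, for a given raw data volume."""
    return raw_tb * footprint_fraction


if __name__ == "__main__":
    for raw_tb in (200, 250):
        low = model_footprint_tb(raw_tb, 0.01)   # 1% footprint
        high = model_footprint_tb(raw_tb, 0.02)  # 2% footprint
        print(f"{raw_tb} TB raw -> {low:.1f} to {high:.1f} TB of models")
```

At those figures, a node responsible for 200 terabytes of source data is holding roughly 2 to 4 terabytes of models, which is in the same range as the "terabyte or two" per node mentioned for conventional systems.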

Robin: I’m going to talk in terms of the old kind of way of thinking about the world as CPUs, as memory, as disk. Where would you say you’re likely to bottleneck in terms of if I were … I’m a data scientist, maybe there are several of us throwing various queries at the data. What do you saturate? Do you saturate CPU, do you saturate memory, do you saturate I/O?

Rick: It’s purely CPU because we do a lot of calculations when we’re working with a database operation in comparison to the amount of I/O we do. Every other database I worked on, we are optimizing the heck out of I/O. This is the first time where we have to give serious thought how to optimize CPU because this application is CPU intensive. One of those roadmap items is I really want to do the SIMD stuff and the vectorization of this stuff because we’ll get another order of magnitude kind of performance gain when we get to that level. We’re not doing that yet though.

Robin: That’s on the roadmap because that was actually the question I was going to ask you. I would agree you’ll get 10X. I can’t know that but those who have done it tend to get that kind of lift out of SIMD on the CPUs.

Rick: Most of those projects you’re talking about, they get that kind of lift on the CPU and they’re still I/O bound. I/O isn’t our issue at all.

Robin: I’m going to ask you a future question which I think you can only comment on if you’ve actually really messed around with it already, as most of it is actually not really available yet. Intel has got this new 3D XPoint memory and we’re now looking at much, much larger capabilities of putting SSDs on individual nodes. About every couple of months I run into another little hardware nuance that somebody is exploiting. It seems to me that until about 2005 we could think of computers just as there’s a CPU, there’s a bit of memory and there’s a disk, but now it’s actually become a complicated thing. The nice thing about it is everything is going a little faster. Do you see all of that playing into your architecture as favoring the way that you do things?

Rick: Absolutely. If you’re thinking about things like the SIMD sort of stuff, what you really have to think about is how quickly can you deliver from memory and from large amounts of memory to that process? That memory-to-processor channel becomes really important. Those optimizations you’re talking about tend to be let’s get rid of the bottleneck of controllers, the way you traditionally think about controllers and things. Absolutely that’s going to favor what it is we do and improve it even more.

Robin: I figured that that was the case but I thought I’d ask just in case I was wrong. Let’s go onto just use cases because we’ve had a lot of technical talk and it might be nice to actually just talk about various kinds of applications where Infobright has played a part in providing a solution and certainly any colorful ones I’d be interested in hearing. I’d also be interested in hearing ones that are hitting very large amounts of data.

Rick: I think one area that people talk about a lot right now is the whole IoT kind of architecture. Vast amounts of machine generated data coming from all over the place from lots of different instrumented things coming together and being analyzed together. Everybody is looking at the individual devices. Not a whole lot of people are bringing them all together and bringing all that data together and using it very effectively. I think that’s an application area where we sit beautifully for a couple of reasons. One is just the performance characteristics in doing that. The other is you can compute these models out on the edge and only ship that fraction of the data and relax some of those network constraints, as well. You don’t have to move as much stuff if you just move the 1-2%. I think IoT is a future that makes a ton of sense to me. I don’t know if that’s exotic but that’s kind of at one end.

Then data mining and all of that exploratory stuff we’ve talked about. I think a plain old BI tool or something, if you’re going to get insights by looking at visualizations and graphs. The visualizations and graphs you can get from something like this are for the most part indistinguishable from what you would get with an actual answer; you only have so many pixels on the screen. I think that old help-me-visualize-data need, in whatever form it takes, is helped tremendously by what it is we’re doing. Beyond that, we could talk about very specific use cases. I’m not always the best person to talk about that. I tend to look at a different level.

Robin: Let’s go into the IoT situation because the IoT fascinates me. When you read various things, when I talk to various vendors that are one way or another involved in it, you get completely different pictures. You get the idea, for instance, of jet aircraft. We’re rolling up terabytes of information in just one flight, and I would have thought that you were in a good situation to actually deal with that stuff because you don’t need to move that much around. You just pull off a small proportion of it. That’s one kind of thing. There’s going to be a similar kind of situation in automobiles, probably all over transport, wherever you think about it.

The interesting thing to me about that data is when you actually look at it, I mean, most cars have about a couple of hundred sensors so it’s actually a lot of different information. It’s not like a nice log file that you can tear through in one way or another simply because all of the structure is the same. In that circumstance, the Infobright database, is it built for that kind of thing? Does it perform well in that kind of circumstance?

Rick: It does because we have two real key advantages. One, the process of building the models. A lot of sensor data turns out to be rather homogeneous except for those things that you’re really interested in, which are outliers. When we build the models, they are even more compressed in that kind of data than in, say, retail data, point of sale data, just because of the nature of the data. Then you have, as you said, a bunch of different sensors that have to relate to each other, and we’re designed around the notion of ad hoc joining and things. It’s not a single table, it’s not a single log file. It’s a whole set of things which come together to paint a picture of the operation of the jet or the car or whatever device it is we’re talking about, even the city, because we deal really well with lots of different kinds of data.

The sorts of things that come off of sensors are really very amenable to the modeling techniques we’re using. It’s kind of an ideal use case, sweet spot, whatever you want to call it, for what it is we’re developing. Then if you think about that jet coming up with terabytes of information, if you want to compare it to other jets that are all still in the air at the same time, it’s hard to transmit a terabyte of information. It’s easy to transmit a percent or half a percent footprint of that and get all that same power of analysis comparing jets that are currently in flight.

Robin: That’s really interesting isn’t it? Because you’re now talking about an application that I’m not sure any other technical approach could actually do. Am I right? I might be wrong. There may be some stuff out there in this area that can actually do that, but this sounds unique to me.

Rick: I hate to say that nobody else could do it, but it is unique because there are really clever people and you can think of how you create a particular set of triggers or sketches or something about what’s going on in a jet. If you want to bring together all the information across multiple jets, I don’t see another way of doing something like that. I think there will be other places in time where we see new categories of things that you can do that this approach enables. Here we’re talking about one with lots of data, low bandwidth connecting all of the data, but yet a need to look across it all. There’s a category there somewhere. I’m not sure what it’s called. I think what we have is powerful and unique in that sort of situation.

Robin: I’m inclined to agree there’s a category there. I think it’s kind of too early to say because most of what’s actually happening now, actually even in analytics I think this is true, never mind in this subset of analytics we’re talking about. I think that the full power of what will be achievable. We have had this massive computer power parallelism only for a short period of time, and obviously the first thing that you do is you do what you were doing before but you do it faster and that changes things. You’ve also got things like most of the power … It wasn’t possible to do graph analytics until we had that kind of computer power because it graphs the data all over the place and now it’s becoming possible and now people are finding applications within graph data structures that they didn’t previously know, you know, patterns that are useful that no one had ever looked at.

I think that we will probably discover, the world being what it is, that there will be a lot of similarities once we’ve done a lot of this. I think it’s almost certain that that will be the case. Let me ask you another question because that’s one thing. That’s the idea that you have really a large amount of information but you’re able to broadcast the information back to a central point where you can analyze it all as a whole. How does Infobright work in a distributed way? For instance, let’s say I didn’t care to do that but I wanted to have an instance of the same kind of algorithm applied in lots of different places and I don’t want any kind of centralization of the data. I just want to know a general very small result coming back, let’s say, a graph or something of a situation. Are you fully distributable in that way?

Rick: The part that creates the model is fully distributable. We’re not totally there on the engine side. Think of the product in two parts: model creation and then engine execution against the models. The harder part, the part that’s really the most time consuming, is the model creation. That’s highly distributable. If you’re on Hadoop, it could run in the context of MapReduce jobs all over the place. It hooks up to Kafka. We can skin it a bunch of different ways. We can listen on a wire and analyze windows of data as they go by. The model creation part is hugely distributable. What comes out of the models … The power is bringing those models together to operate on them, and that’s what we do. We bring the models together to operate on.
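
The two-part split described here, distributed model creation followed by central execution against the combined models, can be illustrated with a small Python sketch. It is a generic example of mergeable summaries and not Infobright's agent or engine: the `PartialModel` structure and all function names are hypothetical, and the "models" are deliberately trivial (count, total, minimum, maximum).

```python
from dataclasses import dataclass
from functools import reduce
from typing import Iterable, List


@dataclass(frozen=True)
class PartialModel:
    """Tiny summary built where the data lives (hypothetical structure)."""
    count: int
    total: float
    minimum: float
    maximum: float


def build_model(values: Iterable[float]) -> PartialModel:
    """Runs next to the data, e.g. on an edge device or a worker node."""
    vals = list(values)
    return PartialModel(len(vals), sum(vals), min(vals), max(vals))


def merge(a: PartialModel, b: PartialModel) -> PartialModel:
    """Combine two summaries; the order of merging does not matter."""
    return PartialModel(
        count=a.count + b.count,
        total=a.total + b.total,
        minimum=min(a.minimum, b.minimum),
        maximum=max(a.maximum, b.maximum),
    )


def combine(models: List[PartialModel]) -> PartialModel:
    """Runs centrally, on the small summaries rather than the raw rows."""
    return reduce(merge, models)


if __name__ == "__main__":
    # Three partitions that never leave their own node in this sketch.
    partitions = [[3.0, 5.0, 4.0], [10.0, 2.0], [7.0]]
    shipped = [build_model(p) for p in partitions]  # only these are moved
    overall = combine(shipped)
    print(f"rows: {overall.count}, mean: {overall.total / overall.count:.2f}, "
          f"range: {overall.minimum} to {overall.maximum}")
```

The design point is that the expensive pass over the raw data happens in parallel wherever the data sits, while the central step only touches summaries that are a small fraction of the original volume.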

Robin: That’s interesting. As you mentioned it, as you threw in the words Kafka and Hadoop, I just thought I’d ask the question in case anybody is listening who doesn’t even know basically the technologies that you integrate with. What is your relationship to Hadoop and what is your relationship to Spark and Kafka and what is your relationship to anything else that anyone might talk about? Are you able to work in all of those environments? Are you compatible with all of those technologies?

Rick: Yeah, we’re compatible with all of that. In Hadoop, they have a component which controls resources, YARN, and we fully work with YARN. Like I say, for Hadoop, we work within the context of a MapReduce job. We actually understand HCatalog and we work with Parquet files and all of that stuff, so we’re a first class citizen in Hadoop. Kafka is a very similar situation in how we interact and work with it. Spark, we actually do through Kafka. We use Kafka to interact and work with Spark jobs today. What we do have is the thing that builds the model, which comes in the form of a library that we put things around to deploy in the different environments. We skin it different ways. We call it an agent, our Infobright agent, but it’s really the model builder thing, the knowledge builder. We can skin that very quickly to work in any environment natively.

Robin: Really, if I understand that correctly, that qualifies as a database that can be adjusted to any file system underneath it, is that right?

Rick: This itself isn’t the database. It’s the part that builds the model, and then we centralize the model to an engine. We don’t have the traditional … The models, sure, they get stored in a file system but we don’t have the really traditional storage system underneath because we’re just one thing and one thing only. We store models of the data. Full disclosure, we intend to store actual data so we can have hybrid operations where we join actual data to approximate tables. That’s, again, a roadmap item.

Robin: Let’s just go back to the data scientist. Let’s just try and take the perspective of the data scientist. Data scientists have sets of tools basically. That’s the way I think of it. If they were using Infobright, would they be able to use it exclusively for what they do or would you expect to be part of a collection of tools?

Rick: I always expect Infobright to be part of an ecosystem and a collection of tools. We do one piece of the job, which is relational operations against these models that we build. Over time, we’ll probably open it up so that we can do some different things but that’s what we do today. There will always be modeling tools people will use, whether they’re R programmers or Scala or Python or whatever it is they use, they’ll always use that in conjunction with something else like us. They’ll probably want exact answers on something, too, so they’ll maybe have another database there to do the exact work, and we can fit into that ecosystem.

Our goal isn’t to be the only tool around. It’s to make every tool better and to fit into an ecosystem and let you pick and choose and build best of class in every place, and we’re going to do the thing we do really, really well which is give you those exploratory capabilities, the ability to iterate through the data, the ability to simulate answer sets because that’s what we do, and to simulate data and things. Over time, we might expand from that base, but we will never take the position that we’re the only tool that a data scientist needs. There’s way too much innovation and really good interesting things going on right now. I think we’re part of that and that’s what we want to be.

Robin: Yeah, I would say so too. Are you ever used as just a BI database or is the focus entirely in the area of the data scientist?

Rick: I think we’re outstanding as a BI database. Like I said earlier, any time you use a BI tool or a MicroStrategy or a pick your poison, Business Objects, something more modern – it does nothing but lower your cost because of the resource and speed up your performance, and the results will normally be indistinguishable from working with the traditional databases.

Robin: That’s interesting. You are kind of general purpose in that sense. I know you’re a read only operation, by any stretch of the imagination, but you are actually general purpose in the sense that … In lots of organizations, BI is all they do. They may call it other things in one way or another, they may think of it as being analytical but they’re just using the traditional BI tools.

What I think is happening, and you can tell me if you disagree, but what I think has happened is we’ve got a whole series of these companies that now have become interested in genuine analytics and now realize that they can get way better information about their organization than they could before. There’s a lot of people that I think are expanding out of BI into this field, and then there’s the web businesses where if they didn’t have data science, the business wouldn’t exist because basically all it is is data science. I think of Twitter and Facebook that way to be honest, even though they do a bit more than that. If they didn’t have the data science aspect of it, there’d be nothing.

Rick: Yeah, and any of the ad tech firms who are trying to serve up ads, that’s all data science, all analytics. You’re absolutely right. BI is a place where traditionally people start, and now they’re getting excited about capabilities that they see the newer-age folks have and they want to bring those into their organization. I don’t think BI goes anywhere. I think that’s a useful technique for gathering insights or visualizations. I think it’s great. We do a really good job of being the engine underneath the BI tool. You get a lot more excited and passionate about the data science side of it and the newer analytics.

Robin: We’ve only got a few minutes left. Eric, were there any questions that you’d like to ask?

Eric: Yeah, I’d like to throw maybe just one or two over at you, Rick, one of which is geared around this whole concept of the data lake and why it is so transformative. It seems to me one of the reasons is that when we think of the old world, if we can call it that, of data management and enterprise data warehouses, for example, so much of the innovation was built around the constraints. Meaning, high storage costs, thin pipes to move data through, for example, slower processor speeds. These days, many of those constraints are not nearly as restrictive as they used to be. I think that’s one of the enabling aspects of the data lake. Besides that, we have this concept of store the data once and only use it as needed, which is a significant change. How significant do you see the data lake as a transformational force in the world of data management?

Rick: The other thing about the old school was it tended to be these kind of monolithic but not all-inclusive processes. You would have a database or a data engine which was the data warehouse, and it did some things really well, but there was a whole bunch of things when you talk about data science and algorithms and analytics that it just wasn’t designed to do. The current architecture opens that up for all kinds of analytic innovation. You get the notion of let’s store the data and not use it one or two ways but a hundred different ways out of there. I think we’re still kind of growing up in the data lake on what’s the best way to manage all of that data. I think we could learn some things from the past there still.

What I think is transformative is all the different ways that we’re opening up to use the different data, the same piece of data to be used in a bunch of different ways. I think that’s really, really exciting actually.

Eric: I’ll throw one last question at you because Robin did a pretty good job of mining all the material that we had to discuss today. You mentioned near the top of this conversation the concept of a conversation with data. I’ve been hosting literally thousands of these conversations over the past 12 years or so with various companies and individuals who are obviously interested in this subject area, and I have always felt that the speed of interaction with the data as you analyze is so critical because without a rapid fire response from your data, from your information assets, you cannot have a conversation because the latency truncates that conversation.

What I mean is if you send a query in to your data and you have to go to lunch and come back to get the response, the thought process that you had going is now gone. You might be able to recollect it and rekindle the excitement in your brain if you’re a really good analyst and you really care about the subject area, but my point is that the speed of response is so important in terms of enabling what we can honestly call a conversation with data. As I see it, that’s what you guys really excel in providing, right, is that speed of response to enable a really thoughtful, deep, and meaningful, as you say, conversation with data. Would you agree?

Rick: Absolutely. When you have a conversation with a person, the fun ones you never know quite where they’re going to go. It’s the exact same with data. When you have a good conversation with data, you’re never quite sure where you’re going to go. You have some general target and if you can’t have it at a conversational pace, then it’s not a conversation. It’s “let me think for 20 minutes because I know it’s going to be really hard to ask the question because it’s going to take me an hour to come back.” That’s not a conversation. That’s being good at your job but it’s not as fun. You don’t end up with the innovation. You go where you’re targeting.

Eric: That’s right. I guess I’ll throw one last question at you, which is this whole concept of schema on read. If we go again back to the traditional way of storing data in an enterprise data warehouse, well, the IT engineers and the data architects really had to think through the kinds of questions that analysts would be asking in order to enable an environment that would deliver the performance required. Whereas now, to your point about good conversations going places you don’t expect, that’s the whole beauty, it seems to me, of schema on read. Meaning, you don’t need to constrain your information environment with a model unnecessarily and you can spin up new models as needed in order to facilitate that perhaps jagged or unpredictable direction of the conversation you’ll have with your data, right?

Rick: Yeah, that’s right. We were talking about what was transformative with the data lake; the other one is the whole notion of being able to define the model at read time, because you can’t be wrong with the data model. A data model always, no matter how good, limited the domain of questions you can ask. When people started doing the religious wars of third normal form against dimensional modeling and all that, really those were about how much freedom you have with the model versus understanding ahead of time what you’re going to ask.

Schema on read is actually a really important notion, but it’s also hard to have a conversation with a person if (I’m working in Poland, so I’ll say this) they’re speaking a different language than you. It’s important to actually have some sort of definition and schema and things as you work with it and you go down the conversation, but being able to define that in a really simple, easy way when you need to define it really empowers people a lot.

Eric: Yeah, that’s good stuff. Well folks, we’ve burned through just about an hour here on a wonderful conversation about data, so big thanks to Rick Glick, CTO of Infobright. Very interesting company with some fascinating technology I must say, and of course, our very own Robin Bloor. This has been a conversation about the Data Lake Survival Guide. Thanks again guys for your time and attention. We’ll catch up to you next time. Bye-bye.


