Inside Analysis

Legacy Systems and The Philosophy of Data

This interview is part of The Bloor Group’s research program, Philosophy of Data, which has been underwritten by IRI, The CoSort Company.

 

Eric: Ladies and gentlemen, hello and welcome back, once again, to the Philosophy of Data. This is Eric Kavanagh, your host, and today I’m speaking with my very good friend, and business partner, Doctor Robin Bloor, co-founder of The Bloor Group. Doctor Bloor, welcome to the show.

Robin: Thank you, it’s good to be here.

Eric: Sure thing. We're talking about the Philosophy of Data. Of course, this whole thing is sponsored by IRI, The CoSort Company. Big thanks to those folks for giving us this endowment to talk about such an interesting topic as the Philosophy of Data. It was just the other day I got a message from Gwen Thomas, who did an interview with us. Very, very smart lady. Very deep and philosophical herself. She asked me if I was going to weave Plato's cave into this discussion. I thought, "Well, of course, what an excellent idea." Plato's cave is one of my favorite metaphors in the whole world of philosophy.

For those who don't know what it means, basically, Plato said that life is kind of like this cave where the common people are at the very back of the cave, shackled so that they face the back wall. Then, up above and behind them are a group of people who are making hand puppets in front of a burning flame, in order to essentially show the common people what's really happening, but all the common people see are the images on the wall, almost like a TV screen, essentially. The real world is actually outside the cave itself.

His point is that there are people who are in charge of manipulating what you, as a common person, see. As I recall, he said the media, the government, and the church were the three groups responsible for making the hand puppets, and that the common people really were just subjected to a bunch of nonsense, because that's not the real world at all. The real world is outside of the cave.

I thought to myself, "What an interesting metaphor to describe our industry." If we take that metaphor literally, we could say that the people making the hand puppets are the people delivering the BI reports, business intelligence reports, or the people in the corporation who are responsible for reporting on what's actually happening in the organization. Whereas now, with big data in particular, big data analysis, it seems to me that the folks in the back of this cave have found some way to release the shackles, and are sort of crawling towards the actual light. With that as sort of an interesting opening salvo, I'll just throw it over to you to offer your perspective.

Robin: Well, I mean, there's a certain kind of truth in that perspective, I suppose. You know, you can argue about exactly why the situation the people experience is as it is, but a lot of it actually has to do with the failure of technology. Or, if you like, the limitations of technology. If you think of BI as providing an indistinct picture, that's reasonably true. It's very limited in terms of dimensions, you can only see certain things, and those are the things that the BI people have prepared for you. It's not that they're acting in bad faith and trying to deceive you; it's more that that's what they thought it would be a good idea for you to see.

The way that I look at BI is that it's a primitive feedback mechanism. A more accurate feedback mechanism is provided by analytics, because there is no constraint on how you can talk to the data to get it to confess to things.

Eric: Yeah, that's a good point. I should point out that you raise an interesting observation there. I don't mean to imply that people who are responsible for issuing reports are doing so out of a desire to deceive anyone. It's just a metaphor to describe the limitations of what these people are actually being shown, versus what's actually happening. You raise a really good observation in talking about the different dimensions that you can look at. Think about the history of data management, and the whole industry of data warehousing: many of the tools were designed under constraints that existed at the time.

That has changed significantly, meaning processor speeds have increased, storage costs have dropped, and all of these things come into play to alter, fundamentally, how this whole job can get done. When we look at data warehousing, we are looking at an old world, although we should own up to the fact that it's not going to go away. Meaning very few companies are going to rip and replace a data warehouse with some analytics solution. I think you are going to see a slow blending, if you will, of the old way into the new way, or maybe I got that backwards. The idea is that we have this whole new set of capabilities, but they need to be somehow aligned with the old way of doing things in order to get the complete picture of what's happening, right?

Robin: Yeah, I mean that’s the case. The data warehouse architecture … I mean, one of the characteristics of data warehouse, in terms of the way it was implemented, is that it fed off the OLTP systems, which is the data of the organization, and it became a point from which to distribute it. Whether that was done well, or whether that was done badly depended really on how good an architecture you built, but, you know, there was no doubting the fact that you really needed a large engine to deal with the query traffic that a data warehouse generated. What’s happened, really, is the … A number of things have happened, and it’s very, very difficult to properly sum up exactly what’s happened, because so many things have changed that have made a difference to it.

The cloud is one thing that impacts this, because you can deploy in the cloud far, far faster. Hardware, software, everything, far, far faster than you can ever do in the data center. That's one point. The second point is that processor speeds have increased, and continue to increase, but the real gain was in parallelism. That made the old data warehouse architecture, let's say, a little bit passé. What is really required is a scale-out architecture if you're going to have a big engine. In actual fact, you needed a big engine to do the processing of the data that was coming in, because the data wasn't confined anymore to the OLTP systems that the organization ran.

Suddenly, you could get data from the outside world. Therefore, the population of data that you could process increased, but also the variability of the structure increased. An awful lot of what was called unstructured data is better thought of as data where the metadata hasn't been captured, or hasn't been captured accurately. Those all added new dimensions to what was going on. The reality is, all this is going into … I mean, I don't think of analytics …

I tend to draw a division within analytics. I think of streaming analytics as real-time analytics, providing instant feedback to the organization in situations where speed is required. I regard all other analytics as cognitive analytics, which is basically pondering on the nature of the data that you've gathered, trying to work out exactly what's going on in a way that helps you run the business in a better way. I see a big division, and I see streaming analytics as much closer to old BI, because streaming analytics is always targeted at the situation, and it's always trying to tell you something that you already understand why you need to know.

So, some individual is sitting there, and they are getting a real-time feed of data that helps them make better decisions in the context they are in. That's not that different from BI, except it's much more varied in terms of what it can do, whereas cognitive analytics is really about discovering things that are useful to the business. Those projects can, well, they can go on for a long time, so I see the difference between those two things. It's the difference between, you know, a human being thinking, and the autonomic nervous system. Those are reasonable parallels.

You can spend days thinking about which is the next car you are going to buy, for example, and you can take in an awful lot of different data in order to do it. You don't know, the next time you buy a car, exactly what information you are going to go out and seek. There is a certain activity that's exploratory in all of those things, and it is permanently updating itself, because the world out there is permanently changing. I see that as being, you know, a fundamental aspect of what's actually happening with data, because with all of this data, we've got new architectures, new capabilities, and very specifically, we've got the speed that parallelism can give you. All of those things are actually being applied to feedback systems.

Eric: Yep. No, that's a really good point. You kind of bring up this whole area that I wanted to delve into, which is how do you align existing information architectures with all this new interesting stuff that's coming along. I think, in many cases, the answer is going to be you don't. That's just a personal opinion. I think you are going to see a fairly disruptive period of time as people and organizations move through a bumpy transition from the old way of doing things to the new way of doing things. With the understanding being you want, from a historical perspective, to know what happened in the past, and you want to get the most out of the systems that you already invested in, but going forward, we are going to see this rather turbulent, fast-moving world that you just described with streaming analytics. I guess my question is, what's the best path forward for the most forward-thinking kind of organization? How do you change horses midstream, and prepare yourself for the future?

Robin: Well, at the end of the day, in any particular computing situation, for a question like that, you actually have to take a look from the top down, and look from the bottom up. The top-down view is always … Imagine we're starting today, and we are going to build something that satisfies our requirements. What would the ideal architecture be? Quite clearly, the way that one would think now, in terms of principles, is that you really actually have to have a strategy for data in motion, and how you are going to work with data in motion. You know, the world has changed, and we're now down to processing events, and the events are actually streaming in all of the time. How are you going to process data in motion? Where and when are you actually going to store data?

The problem is that, in certain situations, it's a better idea to take the data to the processing, and in certain situations, it's better to take the processing to the data. The scale-out architectures that come with Hadoop are better at resolving that situation. The assumption with the old data warehouse was that you put all the data together in one place, and you brought the processing to that place. That's not happening anymore. You know, and there are things like the Internet of Things, which, at this point in time, is only in its infancy. It's quite clear that the data is way too large to actually spend a lot of time moving it about. It's going to be necessary, in that situation, to take the processing to the data.
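
To make that concrete, here is a minimal sketch in PySpark, with hypothetical paths and column names, of what "taking the processing to the data" looks like: the heavy lifting runs on the cluster nodes that hold the data, and only the small aggregate result travels over the network.

```python
# A sketch only: Spark ships this logic to the nodes that hold the data,
# so the raw events never leave the cluster. Paths and columns are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("process-at-the-data").getOrCreate()

# The full event data stays on the cluster where it lives.
events = spark.read.parquet("hdfs:///lake/raw/events")  # hypothetical path

# Only the small aggregate result travels back over the network.
summary = (events
           .where(F.col("event_date") >= "2016-01-01")
           .groupBy("device_type")
           .agg(F.count("*").alias("event_count")))

summary.show()
# By contrast, events.toPandas() would drag the entire raw dataset
# to a single machine: taking the data to the processing.
```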

Hadoop isn't that simple a situation. I mean, you know, the initial story of a scale-out file system that you can throw everything into hasn't turned out to be the reality. People are building multiple Hadoop clusters, because you simply can't put all of the stuff in one place and throw processing at it in that way. This starts to become a data distribution issue, even for the stuff that you thought wasn't particularly in motion. You know, because you've got to distribute it across the computing assets that you've got, or into the cloud, the ones that you rent.

Eric: Yeah, and this is a really good point of discussion here to kind of help companies plan their strategy going forward. There are several different issues we want to dig into, but let's just kind of tackle this one for starters, which is the path forward, and how can you, from an organizational perspective, leverage the cool, new stuff that's available, including all this open source technology, which is all over the place these days, while still maintaining continuity with your existing systems. Because again, to rip and replace, that's pretty scary stuff. You are not going to find too many CIOs, or CTOs, or even CEOs who are going to authorize that kind of wholesale disruption to their organization, because data is the lifeblood of any company. I mean, one way or another, it just is, and it always will be.

The big question in my mind is how can organizations get enough of a handle on what they've got, make sure that they retain the critical systems that are running their business, while also embracing what's new. You know, one of the good pieces of advice I've heard over the years, and this is very common practice in the real world, is that you build up your new-world system, if you will, let's call it the Hadoop environment. You can use Spark, you can use Kafka, you can use various other things, to achieve a new vision of where you're trying to go. You spin that up alongside your existing system, and you run it like that for one, or even two years.

Throughout that process, if you have the resources and the time, you really make sure to vet this new system: make sure it's touching all of the appropriate systems, make sure it's giving you as clear a view as you want, that the view maps to the view that you have of the old world, and so forth. Then, after a couple years maybe, you slowly start to turn off those legacy systems, and replace them with some new cloud-based systems, which are all over the place. To me, that's a fairly sensible approach. It's very practical, but what do you think?

Robin: Well, you know, the bones of that … I mean, that's the other side of what I was talking about just now, the bottom-up approach. It's like, okay, you've already got these capabilities, and you've already got, in one way or another, certain kinds of latencies that you are trying to beat with the systems that you've got in place. Those systems are not going to go without being upgraded. There isn't some kind of magic here, where everything new that happens is only going to happen in the new Hadoop clusters. There is also going to be activity that goes on in the existing systems.

The intelligent strategy that you would implement would be, at the smallest level of granularity, to think of it this way, and then just jump up: for any given transaction, any given activity, where is the best place for it to execute? In other words, what's the least cost to meet the service level that's required? So, you might be moving processing to the data, and you might be moving data to the processing. Well, if you could, and I should think very few companies, if any, could actually do this, if you can map everything that's happening to a certain level of granularity within your organization, you can gradually work out which things can be migrated, and when.
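
Robin's placement question can be sketched as a crude cost model. This is only an illustration, with made-up numbers: compare the time to ship the data against the time to ship the processing, subject to the required service level.

```python
# A deliberately crude cost model of the placement question;
# all numbers are hypothetical. Times are in seconds, data in gigabytes.
def best_placement(data_gb: float, network_gbps: float,
                   local_runtime_s: float, remote_runtime_s: float,
                   sla_s: float) -> str:
    """Compare shipping the data against shipping the processing."""
    transfer_s = data_gb * 8 / network_gbps    # time to move the data
    move_data = transfer_s + remote_runtime_s  # run where the engine is
    move_processing = local_runtime_s          # run where the data is

    options = {"move the data": move_data,
               "move the processing": move_processing}
    feasible = {name: t for name, t in options.items() if t <= sla_s}
    if not feasible:
        return "neither placement meets the service level"
    return min(feasible, key=feasible.get)

# 10 TB over a 10 Gb/s link is ~8,000 seconds of transfer alone,
# so the processing goes to the data.
print(best_placement(10_000, 10, 1_200, 600, 3_600))  # "move the processing"
```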

You know, I mean, the first thing that happened with the advent of Hadoop was the low-hanging fruit application of using it as an archive, because it's an incredibly inexpensive archive. By doing that, what you are actually doing is moving data away from the big expensive data warehouses, and giving them room for growth, to put new data in, but you are just moving cold data that rarely gets accessed to a place where it could be accessed if you really want to. It's very inexpensive, and that was the first major victory of Hadoop.
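
As an illustration of that archive pattern, here is a hedged sketch, assuming PySpark and hypothetical warehouse connection details: cold rows are copied out of the warehouse into cheap Parquet files on HDFS, where they remain queryable.

```python
# A sketch of the archive pattern; the JDBC URL, credentials, and table
# are hypothetical. Cold rows become cheap Parquet on HDFS.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cold-data-archive").getOrCreate()

cold_orders = (spark.read.format("jdbc")
               .option("url", "jdbc:postgresql://warehouse:5432/dw")
               .option("dbtable",
                       "(SELECT * FROM orders "
                       "WHERE order_date < '2010-01-01') AS cold")
               .option("user", "etl")
               .option("password", "...")
               .load())

# Compressed, inexpensive, and still queryable if anyone ever asks.
cold_orders.write.mode("append").parquet("hdfs:///archive/orders")
# Once the copy is verified, the warehouse rows can be deleted,
# giving the expensive engine room for growth.
```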

You know, and the second major activity was just doing ETL, which was to create a capability of extracting data from relatively unstructured data that you just dropped into what they now call the data lake. That would then be fed to the data warehouse. What will happen in time is that Hadoop will gradually become more and more capable of data warehouse kinds of workloads. It will start to absorb some of the functions of the data warehouse, until eventually it eats them all.
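
A minimal sketch of that second pattern, again with hypothetical field names and endpoints: raw JSON dropped into the lake has structure imposed on it, and the structured result is fed on to the warehouse.

```python
# A sketch of lake-side ETL; field names and the warehouse URL are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lake-etl").getOrCreate()

# Raw, loosely structured JSON events dropped into the lake.
raw = spark.read.json("hdfs:///lake/raw/clickstream")

# Impose structure: select fields, fix types, drop records with no user.
structured = (raw
              .select(F.col("user_id").cast("long"),
                      F.to_timestamp("ts").alias("event_time"),
                      F.col("page"))
              .dropna(subset=["user_id"]))

# Feed the structured result on to the warehouse.
(structured.write.format("jdbc")
 .option("url", "jdbc:postgresql://warehouse:5432/dw")
 .option("dbtable", "clickstream_fact")
 .option("user", "etl")
 .option("password", "...")
 .mode("append")
 .save())
```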

Now, the thing to understand here is that this isn’t … You know, there are systems that you don’t rip and replace, because they’ve got very specific service levels attached to them, and just maintaining the service levels … It’s like taking an application off the mainframe, and putting it on UNIX, or something like that. The reason that you don’t do it, or that you very rarely do something like that, is you can’t necessarily get the same service level in terms of reliability, and everything like that. You know, as well as performance.

With a data warehouse, you've got more leeway, because that data is never being updated the way OLTP data is updated constantly. It's only being added to, and therefore you've got more leeway in migrating stuff out of the data warehouse. If at some point in time some kind of Hadoop-based system can provide the service level that you get from the data warehouse, you can get rid of it altogether. I mean, because the other thing is that the data warehouse is just a distribution point for an awful lot of data. Well, you know, Hadoop can do all of that. That's not particularly difficult. That's how the migration will occur. When it will occur, and why … I mean, you've got some people out there that have invested very heavily in an extremely scalable data warehouse that Hadoop isn't going to get close to for five years. If it ever does.

Eric: Yeah, that's a really good point. These are all really good points, and I like the fact that you've now referenced the top-down versus the bottom-up, and we've also kind of dug into process as well. You have all these different dimensions in any organization that actually drive what the company does. There are going to be balls dropped as we shift to new systems. This happens all the time when companies quote, unquote, upgrade their systems. Functionality gets left behind, users get left behind. There are all kinds of crazy things that can happen, but when you look at the drivers, it seems to me that the next economic downturn will be the definitive driver for adopting open source, and adopting this new way of moving forward.

If you think about all that, all roads lead back to Rome, in my opinion, and Rome would be having some kind of relatively holistic, or far-reaching platform that you use to manage your data, and your processes, and so forth. This is arguably the dream of Informatica over these last many years, to serve as that management layer, if you will, for the movement of data, the transformation of data, the delivery of data. For various reasons, I think that they have struggled, mostly because they are such an ETL-driven company; financially, and power-wise, politically if you will, ETL really drove much of their revenue. I think it's difficult to move away from your cash cow, quite frankly. It's a hard thing to do for an organization.

Nonetheless, they are a major player in that space. There are not too many players, I mean you can argue IBM, certainly. Other organizations have a strong play in the data management platform space, but to me, if you are going to be very serious about transitioning from the old way to the new way, you really should have some kind of far-reaching platform that allows you to get visibility into what's happening, and allows you to fine-tune what's happening. What do you think?

Robin: Well, I mean, first of all, I think that saying Informatica is behind the curve is probably the best thing to say. But I think all of the old companies are behind the curve. What I think is going on right now is that we are moving toward a kind of end-to-end data ingestion capability. The word ingest kind of implies it's new data, but it isn't necessarily new data. Once you implement any kind of data lake strategy, you have a number of problems. For the data that's moving into the data lake, you need to capture its metadata if it doesn't exist.

You can capture some of it using machine learning capabilities, but there are always some exceptions where you're going to actually have to have human input to do that. Then, you get the question about data cleanliness. Data cleanliness is something … Data is clean according to the application that needs to use it, so it's not a simple situation. If an analytics application crawls through a heap of data and doesn't mind that certain things are inaccurate, because all it's doing is counting stuff, then it doesn't necessarily care that much about unclean data. Whereas a different analytics application might care very, very much about unclean data.
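
A toy example makes the point that cleanliness is relative to the application. The records and fields here are invented: a counting job tolerates the dirty rows that a billing job must reject.

```python
# Invented records: the same data is "clean enough" for one application
# and too dirty for another.
records = [
    {"user_id": 1, "amount": 19.99},
    {"user_id": 2, "amount": None},     # dirty: missing amount
    {"user_id": None, "amount": 5.00},  # dirty: missing user
]

# Application 1: counting events. Dirty fields don't matter.
event_count = len(records)

# Application 2: billing. The same records face stricter checks.
billable = [r for r in records
            if r["user_id"] is not None and r["amount"] is not None]

print(event_count)    # 3: every record counts
print(len(billable))  # 1: only one record is clean enough to bill
```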

You've got the reality that the data lifecycle didn't go away. If you are taking data into this kind of data lake area, there is a question as to when you actually throw it out of the lake. You know, maybe you keep some data forever, and you need to keep it forever, but there is also a lot of data where, having brought it in there, cleansed it, and let analysts loose on it, they discover that the data hasn't got any value whatsoever, and you can throw it away.

You've got the whole of the security issue … And, as you actually build all of this up, you suddenly realize that, actually, the function of the data lake, above and beyond anything else, is that it becomes a governance hub for governing data. That's why all of the older vendors are in trouble, because most of that needs to actually be sitting on Hadoop. With that, of course, goes ETL, which is just another aspect of data governance, to be honest. We didn't think of it that way; we thought of taking data from one environment to another that it wasn't structured for, but if we had proper data governance, we would also govern the structure of data. Let's call it the ultimate data governance.

That's what I see happening. I see all of the old vendors as in trouble in relation to that. They can take what they've got, and marry it to a number of the things that are going on in this Hadoop environment, but … Actually, I'm going to say Hadoop and Kafka from here on in, because …

Eric: You and me both.

Robin: Yeah. The reality is Hadoop doesn't scale up infinitely, and you can't put everything on it, so let's call it Hadoop and Kafka. It makes no difference. There are ingest points where the data has to be governed. There may be reasons to use data before it's properly governed, simply because there is a latency consideration, but most of the time, if there is no specific latency limit, then you want to govern it before it gets anywhere else. That includes the data that you are dragging into the data lake from your own systems. From the things that, you know, Splunk was processing, all of those log files. Well, that data should really go through the data lake, and be properly governed first. Obviously, if you need to go direct to those log files in order to get the right latency on reporting stuff, well, then you continue to use Splunk.
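
As a sketch of governing data at an ingest point, assuming the kafka-python client and hypothetical topic names: events are checked and stamped with metadata on their way into the lake, and anything ungoverned is parked for review rather than silently passed through.

```python
# A sketch of a governed ingest point; topics, broker address, and the
# required fields are all hypothetical, and the kafka-python client is assumed.
import json
from datetime import datetime, timezone

from kafka import KafkaConsumer, KafkaProducer

REQUIRED_FIELDS = {"source", "event_type", "payload"}

consumer = KafkaConsumer("raw-ingest", bootstrap_servers="broker:9092")
producer = KafkaProducer(bootstrap_servers="broker:9092")

for message in consumer:
    event = json.loads(message.value)
    if not REQUIRED_FIELDS <= event.keys():
        # Ungoverned data is parked for human review, not dropped silently.
        producer.send("ingest-quarantine", message.value)
        continue
    # Stamp the metadata the lake will need later.
    event["ingested_at"] = datetime.now(timezone.utc).isoformat()
    event["schema_version"] = 1
    producer.send("lake-ingest", json.dumps(event).encode("utf-8"))
```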

I think all of the data goes into one kind of, let’s call it a data fabric. It goes into a data fabric. You know, the data fabric consists also of the streaming data, which itself has difficulties attached to it, as to how you’re actually going to manage it.

Eric: That's exactly right. Folks, we'll talk about Kafka in some other interviews in the near future. You can rest assured of that; their summit is coming up. Very, very interesting stuff going on there. Let's talk about the people side of this equation. It's really interesting when you sit back and think about the old way versus the new way. I talk about this all the time in terms of media, and how the media has changed.

The old way in the media was that a handful of people, namely the editors at major publications like the New York Times and the Washington Post, and major organizations like NBC, ABC, and CBS, and so forth, were the ones making the hand puppets, period. They were in charge, they told you what the news was, and you either accepted it, or you didn't. You could either go by word-of-mouth, what you heard from your friend, or you could go by what these people told you. Let's face it, they wielded a tremendous amount of power. Frankly, in my personal opinion, I think they kind of squandered a lot of that power by focusing too much on opinions, and on political agendas, but that's another topic.

Nonetheless, what has changed now, at least to a large degree, is that people search for what they want. They go to Google, and they look something up. Google is trying to shepherd them, I think, in effective ways with this whole Project Hummingbird, I think it's called, where you can see now, as you type in a Google search, it'll throw suggestions at you, and say, "Hey, are you looking for this, are you looking for that?" These are all common searches. That's an attempt to shepherd the search process, but the point is, in the media, we've gone from a push model largely to a pull model, where people are now pulling what they want, as opposed to just accepting what they're given.

This, of course, in the world of BI, and analytics, refers to self-service BI. We’ve had some really good conversations lately with some companies that are talking about how you really need to be careful as you move into this self-service world. Certainly with data preparation, but also with data analysis, and delivery, and so forth. Nonetheless, there is this really strong force out there, and it is the force that amounts to the appetite of individual business people for analysis, for insight, for data. All these people out there in the business world are pulling on systems. I think they’ve been somewhat spoiled by online services, and they are wondering why their own internal IT department can’t keep up with that.

The answer, of course, is because we have a very complex world these days. The IT people are working around the clock just to keep the trains running on time. They don't want some new system to throw that all into disruption. You are going to see, again, a rather turbulent period over the next few years of dealing with that. What do you think about this whole reality of self-service demand, for analytics, for BI, for … I don't think it's really there so much for data prep, because who really wants to do data prep? Not many people I know. Most people want to jump right to the fun stuff. What do you think about self-service as a driver for changing the way we work with data?

Robin: Well, you know, within an organization, I mean, there are different situations. I think if you're going to talk about the consumer situation, it's genuinely different from the organizational situation, because in an organization, people have roles. It's the role that needs self-service. When you actually look at self-service, and you say, "Well, what weren't they getting before?" you often discover that what they weren't getting before were things that were impossible to provide them with. The complaints that people had, that they couldn't get at particular data, may well have been because it was way too expensive to provide them with access to it.

There are things … Like, I remember situations, back in the early data warehouse days, when somebody, often somebody fairly important, would be let loose on the data warehouse, and then put in a query that would consume two days' processing and destroy all the other queries, because they were so important, and so on, and so forth. It's like, you know, the idea of self-service isn't, you know, an open-ended thing. The idea of self-service is to provide people with a certain level of capability that has constraints on it.

The thing that can happen now, because of the whole data lake thing, is the taking in of lots of external data, and the access to data that was in the organization before but was very difficult to get access to. Now you can certainly get at it a lot more, you know, and now you can do it without a lot of plumbing required on behalf of the IT department. I mean, it kind of depends, but you've still got the same situation. If somebody who has got a self-service capability decides to put a query over a heap of data using MapReduce on a Hadoop system, then yeah, it's going to knock the Hadoop system out for a very long time. It just is. That's what happens.

You know, the desire for self-service isn't going to suddenly speed that up. The question is, what do we mean by self-service? If you actually had a coherent organization of the whole data environment within a company, you would give people, or roles, the right to access certain data, but you still wouldn't make all data available to them. That would be an absurdity, and actually very dangerous, because, you know, they could steal it and sell it to their friends, or whatever.

Eric: Right.

Robin: You would only make the data for which they had authority available. You would make it available as fast as it was serviceable to them. You would constrain the amount of resources they could use to play about with it. If they weren't happy with that, then let them upload it to their PC, and buy a Hadoop cluster to plug into the back of their PC, so that they can waste their own resources accordingly. At the end of the day, it's like: a user has a role, that role can profit from looking at various pieces of data, so you authorize them for that data, have them request other data to be brought into the data lake, which might be external data, if they need that, but all of it governed by reasonable resource usage, and reasonable service level expectations. It doesn't change.
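
That model of constrained self-service can be sketched in a few lines. Everything here (roles, datasets, budgets) is hypothetical: a role grants access to specific data and a resource budget, and a query outside either is refused up front.

```python
# Hypothetical roles, datasets, and budgets: a role grants access to
# specific data and a resource ceiling, and anything outside is refused.
ROLE_POLICY = {
    "marketing_analyst": {
        "datasets": {"clickstream_fact", "campaigns"},
        "max_query_hours": 2,
    },
    "data_scientist": {
        "datasets": {"clickstream_fact", "orders", "lake_raw"},
        "max_query_hours": 12,
    },
}

def authorize(role: str, dataset: str, estimated_hours: float) -> bool:
    """Allow a query only if the role has authority over the data and
    the estimated cost fits inside the role's resource budget."""
    policy = ROLE_POLICY.get(role)
    if policy is None or dataset not in policy["datasets"]:
        return False  # no authority over this data
    return estimated_hours <= policy["max_query_hours"]

# A two-day query is refused up front instead of tying up the warehouse.
assert not authorize("marketing_analyst", "clickstream_fact", 48)
assert authorize("marketing_analyst", "clickstream_fact", 1.5)
```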

That's how it was before. The problem from before, that everybody bitches about, was that it just took way too long for IT to make stuff available to them, and therefore, they never bothered to ask, and therefore, they thought they were being, you know, deprived of it. In actual fact, in reality, they were being deprived of it, because it wasn't being made available to them. With Hadoop, a lot of that problem goes away. That doesn't mean that the self-service capabilities an awful lot of people yearn for are actually sensible to provide people with, because it might just … You know, two days of tying up a very expensive data warehouse runs into tens of thousands of dollars. That's a neat little mistake to make as a user.

Eric: Yeah, that's not a good mistake at all. Let's focus on the last part of the equation here for today, which is the product-centric approach. You were just kind of talking about that; you kind of bled into that, which is quite nice. A very lovely segue. As we go forward, let's face it, we've seen this movie before, I suppose I should say, meaning I remember when the big vendors were pushing this concept called standardization. Well, don't you want to standardize on IBM, or Oracle, or SAP? That way, all of your tools will play together nicely, which, of course, was half true, and half a big fat lie, quite frankly. A lot of these tools, the smaller, sharper tools, get absorbed by the big vendors, and sometimes they get woven in very tightly to the code base, and sometimes they just don't. I mean, the story plays out year after year, product after product, and so forth. From a product-centric point of view, I think you were already kind of going down that road, namely that, hey, best of breed is good.

When you want a very sharp knife to do a very specific thing … Think about the whole world of manufacturing cars, and the whole world of trying to repair those cars. Well, it's kind of hard for the average one-size-fits-all mechanic to go around and work on cars these days, because they've become so specialized. The engines have become so specialized, you have to buy the tools from the company that manufactured the car to be able to fix the darn things. It's kind of like your iPhone. You can't just pop your iPhone apart and fix it yourself. You would have to take it to someone; you can't do that stuff yourself anymore. We are kind of moving in that direction, I think, culturally, overall with technology. With products, it seems like the same thing.

We have these new platforms like Hadoop. Distributions like, of course, Cloudera, and Hortonworks, and MapR, but still, again, that’s forward thinking, whereas the rest of the world is still working with all these legacy systems. From a product centric perspective, what’s some advice you can give to companies who are trying to manage the transition well, and set themselves on a path of data prosperity?

Robin: Well, okay. I mean, let's just take the fact that, you know, irrespective of what environment any organization has got, they will have, or should have, a data strategy, and a data architecture they're trying to implement that goes with it. Because now we are in a world of data movement, and data flow, the architecture probably needs to come from one vendor, or a small number of vendors who work together perfectly, because the architecture is about the data movement, and it's about data arriving in the right place when it's needed, or never leaving the place it started from, because the processing is going to be taken there.

Now, if you've got something that's involved in the data flow, that takes data in, and the data that it puts out is part of the data flow, then if you go to some idiosyncratic product, you may end up violating your architecture. That would be a bad thing to do if your architecture is well designed. If you're actually going to have a point product … Let's talk about point products. A good point product would be a specific implementation of, for example, a machine learning capability, you know. Perhaps it would be, you know, a support vector thing. It's going to take the data, and it's going to give you a result.

The data is going to be furnished up to it. It's going to give you a result. Once it's given you a result, you may want to plumb that into something, but you are probably just going to plumb that in by broadcasting that result to something else. Maybe you are going to repeatedly run that algorithm on, let's say, a 10-minute basis, and then pass the data out. You haven't violated anything, because you haven't fundamentally interrupted the architecture.
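
A sketch of that point-product pattern, with a placeholder model and hypothetical topic name: the algorithm runs on a 10-minute schedule and only broadcasts its result back into the data flow, so nothing in the architecture is interrupted.

```python
# A sketch with a placeholder model and hypothetical topic: score on a
# 10-minute schedule and broadcast only the result into the data flow.
import json
import time

from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="broker:9092")

def fetch_features():
    # Placeholder: in practice, read the latest features from the lake.
    return [[0.3, 1.2], [0.7, 0.1]]

def score(features):
    # Placeholder for the point product, e.g. a trained SVM's predict().
    return [sum(row) for row in features]

while True:
    result = score(fetch_features())
    # Only the result enters the flow; the architecture is untouched.
    producer.send("model-scores", json.dumps(result).encode("utf-8"))
    time.sleep(600)  # 10 minutes
```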

You know, and there are a lot of point situations where you might want to do that. You know, the architecture itself, I would suspect … I'm pretty much convinced right now, but time will prove it, that the whole of the data governance thing that has to happen on the ingest of data, including data streams, and the use of data streams, all of that, needs to sit on a particular architecture. If there's a component that's best of breed, that isn't in alignment, let's say it doesn't come from the same vendor, or the same alliance of vendors as the others, but fits perfectly, then it's probably okay. You just need some kind of reassurance from the vendor that, in the place you are going to put it, the vendor isn't suddenly going to change tack and do something unexpected. That's probably okay.

It's much less likely that you will find a point product in that situation that isn't already part of some kind of natural stack. For all of the other point products, it just doesn't matter, you know, because they are not violating the computing architecture, so it just doesn't matter. Now, if you are absolutely stuck, because this happens as well, some company comes along, and they've got a specific problem, and they absolutely have to have the point product, then that is an extra cost that you are going to pay for actually having chosen the point product.

By the way, I personally do not think that there is going to be a vendor that will give you a really good architecture for the whole of that data governance piece anytime soon. I might be wrong, but at the moment, it all looks very early to me. Whether it's one of the older vendors that's making big plays, or whether it's a new vendor that's gotten themselves a big piece of that action, I don't think it matters. I don't think there's an end-to-end thing that you can buy right now that satisfies what I would say are strategic guidelines.

Eric: Yeah, I think that's about right. I'll close with one last comment, I suppose, just to wrap up on a philosophical note. I've been thinking about all these different quotes. I guess I can just roll a couple out, and you can comment quickly on these. One is Kant's categorical imperative, which I think is perfect for data governance, where he says, "Act only according to that maxim whereby you can, at the same time, will that it should become a universal law." I think that should be the regulators' mantra. Maybe the more important one, or deeper one, is Descartes: "I think, therefore I am." That's the analyst, right?

Robin: Actually, it’s called the Cartesian fallacy, but actually, as an analyst, you know, “I think, therefore I am” is exactly “I’m cognitive, and therefore I deserve the huge salary that you’re paying me as a data scientist. Yes, I think, therefore I am.” You know, but in terms of existential philosophy, that’s regarded as, let’s say, wrong.

Eric: That's pretty funny. We wanted to wax philosophical there at the end, folks, so we threw a couple of curve balls. Hey, this has been a lot of fun. Big thanks to IRI, The CoSort Company, for sponsoring this philosophical exploration. I will say, you know, that Gwen Thomas loved the idea, Rick Sherman loved the idea, lots of people liked it. I was talking to Andrew Brust, I think, and I mentioned how many analysts just love this whole concept of the Philosophy of Data. He said, "Well, philosophy and data analytics are both cognitive sciences." I was like, "Wow. Good point." Right? I mean, he just nailed it right there, right?

Robin: Oh, yeah. Yeah, they are cognitive, and that is extremely important to the … I mean, you know, the thing that we didn't discuss, but there's actually no point in discussing it in any great depth anyway, is the idea of data standards. You know, if you actually think of the way cognition works within the human being, you actually impose data standards yourself.

The IT world should have much more in the way of data standardization than it has now. I’m sure that will come, because the bigger you make that heap of data, you know, the heap of all the data in the world, the more chaos is going to happen without standardization. At a certain point in time people will be screaming for standardization. Right now, it’s not even a whimper.

Eric: We are going to end with a whimper, not a bang, just like T.S. Eliot. Well done. Okay folks, we’ve been talking to Dr. Robin Bloor, for the Philosophy Of Data. A big thanks to you, big thanks to all of you out there for your interest, and we will catch up to you next time. Take care. Bye-bye.
