The Integration Dilemma

“When I was small, and Christmas trees were tall”1 and data warehouses were gigabyte-sized, integration was something you prepared months for, broke your budget on and performed in the dead of night. Today, when big data is measured in petabytes and is loosely structured and of dubious cleanliness, you are expected to integrate on demand, preferably smiling sweetly at your business users while you do so. How can you be expected to do this?

Integration, be it data or information, is a term widely used in business intelligence and now big data. Unfortunately, its meaning has morphed with marketing overuse and is often a cause of confusion. This article clears up that confusion by describing three functional aspects of integration that are often conflated or partially overlooked, depending on the context. We start at the beginning.


A key characteristic of the first data warehouse architecture2 back in the mid-1980s was a recognition that data — even within the walls of the enterprise — was most often created in disparate systems that were never designed to work together, even though the data they contained overlapped to greater or lesser extents. In order to combine such data into an enterprise data warehouse, it had to be made consistent. This requirement, called reconciliation or integration, together with a set of functions to cleanse, aggregate, enrich and more, came to be called transformation, the T of ETL (extract, transform and load) tools, by the late 1990s. This was data integration at its simplest. Although some enterprise modeling might have been done to figure out what the business meant when they said “customer,” most of the real work was carried out at the database and even system levels to work out whether the six-digit customer number in one system could be matched to the eight-digit ID in the next and what processing and conversions would be required to make it happen. And given that the business was happy to see the results after month-end or occasionally the following morning, and the computers were overloaded during business hours (as well as a few other constraints), this data integration was run as an overnight batch job. We can call this first functional aspect prior integration, given that it is carried out to create a consolidated data store prior to any business use of the data involved.
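
As a rough illustration of what such a nightly job did, the sketch below consolidates orders and customers from two imaginary systems; the system names, key formats and zero-padding rule are invented for the example rather than drawn from any real implementation.

```python
# A minimal sketch of prior (batch) integration, assuming two hypothetical
# sources: an order-entry system with a six-digit customer number and a
# customer master with an eight-digit ID. The padding rule is illustrative.

def reconcile_customer_key(six_digit_key: str) -> str:
    """Map a six-digit customer number onto the assumed eight-digit scheme
    by zero-padding; real reconciliation rules are usually looked up."""
    return six_digit_key.zfill(8)

def nightly_batch(orders_system_a, customers_system_b):
    """Join rows from the two disparate sources into one consolidated set,
    as an overnight ETL job would, prior to any business use of the data."""
    customers_by_id = {c["customer_id"]: c for c in customers_system_b}
    consolidated = []
    for order in orders_system_a:
        key = reconcile_customer_key(order["cust_no"])
        customer = customers_by_id.get(key)
        if customer is None:
            continue  # unmatched keys would normally go to an exception report
        consolidated.append({**order, "customer_name": customer["name"]})
    return consolidated  # loaded into the warehouse before anyone queries it
```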

By the early 2000s, we had a new kid on the integration block, enterprise information integration (EII). Evolving from the concept of federated query, across both homogeneous and heterogeneous sources, EII is very different to prior integration. In this case, integration occurs in real time, accessing the relevant data sources, performing queries against them and combining the results into a single answer to the original request. Although the phrase enterprise information integration is still used, in recent years it has been largely superseded by data virtualization. This change in terminology is largely positive; such processing is seldom enterprise-wide and much more about data than information, although the loss of the integration word is a pity. Let’s call this functional aspect of integration immediate integration.
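
To make the contrast concrete, here is a minimal sketch of immediate integration: the request is decomposed, pushed to each source when it arrives, and the partial results are combined on the way back. The two SQLite connections merely stand in for heterogeneous sources; the schemas and names are assumptions for illustration only.

```python
import sqlite3

def customer_view_on_demand(customer_key: str):
    """Answer one request by querying two assumed sources at call time and
    combining the results; no consolidated store is ever built."""
    crm = sqlite3.connect("crm.db")          # hypothetical source 1
    billing = sqlite3.connect("billing.db")  # hypothetical source 2
    try:
        profile = crm.execute(
            "SELECT name, segment FROM customers WHERE customer_id = ?",
            (customer_key,),
        ).fetchone()
        balance = billing.execute(
            "SELECT COALESCE(SUM(amount), 0) FROM invoices WHERE customer_id = ?",
            (customer_key,),
        ).fetchone()[0]
    finally:
        crm.close()
        billing.close()
    if profile is None:
        return None
    # Combine the partial results into a single answer for the caller.
    return {"name": profile[0], "segment": profile[1], "balance": balance}
```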

Both aspects have their pros and cons, of course. Prior integration provides a better opportunity to tackle more difficult problems and may be the only option when data sources are not synchronized in time. This latter problem, caused by end-of-period reconciliation activities, was very common in financial institutions in the past, but is becoming less prevalent as commerce is increasingly globalized and electronic. Immediate integration provides far more timely responses to business needs, although individual queries may perform poorly due to network speed or overloaded servers. Furthermore, the tools designed primarily for one aspect or the other — ETL and data virtualization tools — have tried to mitigate their shortcomings by moving toward the opposite aspect. ETL tools address timeliness issues by the use of incremental or micro-batch approaches; meanwhile, data virtualization tools implement semi-permanent cache stores to improve performance.
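
The caching mitigation is easy to picture: a virtual view keeps its last answer for a while instead of going back to the sources on every request. The time-to-live and the fetch function below are, of course, illustrative assumptions rather than any particular vendor's design.

```python
import time

_CACHE = {}  # key -> (expiry_timestamp, result)

def cached_virtual_view(key, fetch_from_sources, ttl_seconds=300):
    """Return a cached result while it is still fresh; otherwise run the
    federated query again and remember the answer for ttl_seconds."""
    now = time.time()
    entry = _CACHE.get(key)
    if entry is not None and entry[0] > now:
        return entry[1]                       # served from the semi-permanent cache
    result = fetch_from_sources(key)          # e.g., customer_view_on_demand
    _CACHE[key] = (now + ttl_seconds, result)
    return result
```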

The observant reader may notice two points. First, although I’ve delineated two disparate aspects of integration, they really exist more on a continuum, as evidenced by the ability of vendors to cross from one to the other. At least one vendor refers to the entire spectrum as data integration, which is actually valid. Second, we have focused on data to the exclusion of information. These points are addressed by the third functional aspect of integration, concept integration. This raises our view from data to information and enables the integration of integration itself!

Data, as explained in my forthcoming book “Business unIntelligence,”3 is simply computer-optimized information, with information being the recorded human perception of reality. Modeling is the primary means of transforming information into data. (Business intelligence allegedly transforms data into information.) And concept integration is a key part of modeling when — as is usual today — we have to work with existing data sources, be they internal or, increasingly, externally sourced. Concept integration involves knowledgeable professionals understanding and interpreting business meaning, how it has been represented in data stores and the processing that went into their population — on the fly and using text analytic tools, for example. When done properly, we end up with true information integration and ensure that prior and immediate integration can work together to provide consistent and correct data integration irrespective of timeliness and performance constraints or the use of different tools.
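
What might the output of concept integration look like? One possibility is metadata along the following lines: a business concept, its definition, and the rules that map each physical representation back to it, captured once and usable by both batch and virtual tools. Every name and rule here is an invented illustration, not a proposed standard.

```python
# Illustrative concept-integration metadata for a single business concept.
# All source, table and column names are assumptions for the example.
CUSTOMER_CONCEPT = {
    "concept": "Customer",
    "definition": "A party with whom the enterprise has a commercial relationship",
    "representations": [
        {"source": "order_entry", "table": "ORDERS", "key_column": "CUST_NO",
         "key_rule": "six-digit numeric; zero-pad to eight digits"},
        {"source": "customer_master", "table": "CUSTOMERS", "key_column": "CUSTOMER_ID",
         "key_rule": "eight-digit numeric; canonical form"},
    ],
}

def canonical_key(source: str, raw_key: str) -> str:
    """Apply the concept-level rule so that prior (batch) and immediate
    (virtual) integration derive the same key for the same customer."""
    if source == "order_entry":
        return raw_key.zfill(8)
    return raw_key
```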

“Now we are tall, and Christmas trees are small…”  There you have it. The integration dilemma solved — conceptually, at least. I leave the practical challenges to the vendors: to devise viable tools and methods for concept integration and to define and store the resultant metadata in an open, standard store that can be used freely by all tools!

References

1 Gibb, Barry, Robin and Maurice (Bee Gees), “First of May”, 1969.
2 Devlin, B. A. and Murphy, P. T., “An architecture for a business and information system”, IBM Systems Journal, Volume 27, Number 1, Page 60 (1988). http://bit.ly/EBIS88
3 Devlin, B., “Business unIntelligence — Via Analytics, Big Data and Collaboration to Innovative Business Insight”, to be published by Technics Publications in 3rd quarter 2013.

About Barry Devlin

Dr. Barry Devlin is among the foremost authorities on business insight and one of the founders of data warehousing, having published the first architectural paper on the topic in 1988. With over 30 years of IT experience, including 20 years with IBM as a Distinguished Engineer, he is a widely respected analyst, consultant, lecturer and author of the seminal book, “Data Warehouse—from Architecture to Implementation” and numerous White Papers. His 2013 book, “Business unIntelligence—Insight and Innovation beyond Analytics and Big Data” is available as hardcopy and e-book. Barry is founder and principal of 9sight Consulting. He specializes in the human, organizational and IT implications of deep business insight solutions that combine operational, informational and collaborative environments. A regular contributor to BeyeNETWORK and TDWI, Barry is based in Cape Town, South Africa and operates worldwide.

5 Responses to "The Integration Dilemma"

  • Geoffrey Malafsky
    May 20, 2013 - 1:21 pm Reply

    This is a well-written and insightful article. I especially like your take on BuI. We are addressing this exact space with Data Rationalization and Virtualization methodology and enabler tools. Yet, I wonder what your perspective is on the major hurdle to instantiating your ideas, namely the pervasive lack of serious problem solving in data management, which is exacerbated by the consultant class feeding a legacy narrative of “try this, oops, it failed, so now try this…”.

  • Barry Devlin
    June 5, 2013 - 4:30 am Reply

    Thanks for your comment, Geoffrey.
    I do believe that the majority of project failures are organizational rather than technology-related. And that the consulting utilization-driven ethos is not helping either. The solution will be for business to shift its attitude from “IT as a necessary evil” to “IT as business equal” to build a biz-tech ecosystem. But, it may take a while…

    Barry.

  • John O'Gorman
    August 12, 2013 - 12:01 pm Reply

    First of all, I want to congratulate you for the absolute *best yet* definition (and distinction) of data and information. Brilliant.

    I also like the description of the three aspects of integration; my pet peeve in the area is that neither the word nor the process seems to take semantics or temporality into account.

    My comment comes in the form of a question: Is it possible to imagine a growing layer of persistence to the concept integration process? In other words, if we can establish that “the recorded human perception of reality” has a fixed set of common layers and slices, we should be able to make some of the data layers consistent across any application or set of applications.

    I see this in a similar light to the creation of the periodic table to manage persistent chemical properties and rules in the potential creation of a limitless number of combinations.

    • Barry Devlin
      August 13, 2013 - 3:17 am Reply

      Thanks John. Glad you liked the definition!
      I can certainly *imagine* such a persistence layer to take account of semantics and temporality, and agree it is needed. However, I suspect it is orders of magnitude more complex than the periodic table. Do you have thoughts on the type of content and how it could be collected and managed?
      Regards, Barry.

      • John O'Gorman
        August 16, 2013 - 11:08 am Reply

        Hi Barry

        If you can send me an email address where I can send you an overview, that would be great! And yes, I have developed such a model and recently found an engine for collecting and managing the data in the way I want to have it done!

        Best regards and looking forward to talking some more.

        John O’
