“When I was small, and Christmas trees were tall”1 and data warehouses were gigabyte-sized, integration was something you prepared months for, broke your budget on and performed in the dead of night. Today, when big data is measured in petabytes and is loosely structured and of dubious cleanliness, you are expected to integrate on demand, preferably smiling sweetly at your business users while you do so. How can you be expected to do this?
Integration, be it data or information, is a term widely used in business intelligence and now big data. Unfortunately, its meaning has morphed with marketing overuse and is often a cause of confusion. This article clarifies the confusion by describing three functional aspects of integration that are often mixed together or partially missed, depending on the context. We start at the beginning.
A key characteristic of the first data warehouse architecture2 back in the mid-1980s was a recognition that data — even within the walls of the enterprise — was most often created in disparate systems that were never designed to work together, even though the data they contained overlapped to greater or lesser extents. In order to combine such data into an enterprise data warehouse, it had to be made consistent. This requirement, called reconciliation or integration, was bundled by the late 1990s with functions to cleanse, aggregate, enrich and more under the label transformation, the T of ETL (extract, transform and load) tools. This was data integration at its simplest. Although some enterprise modeling might have been done to figure out what the business meant when it said “customer,” most of the real work was carried out at the database and even system levels: figuring out whether the six-digit customer number in one system could be matched to the eight-digit ID in the next and what processing and conversions would be required to make it happen. And given that the business was happy to see the results after month-end or occasionally the following morning, and that the computers were overloaded during business hours (as well as a few other constraints), this data integration was run as an overnight batch job. We can call this first functional aspect prior integration, given that it is carried out to create a consolidated data store prior to any business use of the data involved.
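The mechanics of that matching are easy to picture in code. The sketch below is purely illustrative: the column names, the zero-padding rule and the sample row are my assumptions, not details of any particular system or tool.

```python
# A minimal sketch of the kind of key reconciliation a nightly ETL transform
# performs. The source column names, the zero-padding rule and the sample row
# are hypothetical illustrations.

def reconcile_customer_key(legacy_cust_no: str) -> str:
    """Map a six-digit legacy customer number onto the warehouse's
    eight-digit customer ID, here by simple zero-padding."""
    cleaned = legacy_cust_no.strip()
    if not (cleaned.isdigit() and len(cleaned) == 6):
        raise ValueError(f"unexpected legacy customer number: {legacy_cust_no!r}")
    return cleaned.zfill(8)


def transform(rows):
    """One 'T' step in the overnight batch: conform keys before loading."""
    for row in rows:
        row["customer_id"] = reconcile_customer_key(row.pop("cust_no"))
        yield row


# Rows extracted from a hypothetical order-entry system during the nightly run
extracted = [{"cust_no": "004217", "order_total": 125.50}]
loaded = list(transform(extracted))
# loaded == [{"order_total": 125.5, "customer_id": "00004217"}]
```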
By the early 2000s, we had a new kid on the integration block: enterprise information integration (EII). Evolving from the concept of federated query across both homogeneous and heterogeneous sources, EII is very different from prior integration. Here, integration occurs in real time: the relevant data sources are accessed, queries are run against each, and the results are combined into a single answer to the original request. Although the phrase enterprise information integration is still used, in recent years it has been largely superseded by data virtualization. The change in terminology is largely positive, since such processing is seldom enterprise-wide and is much more about data than information, although the loss of the integration word is a pity. Let’s call this functional aspect of integration immediate integration.
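To make the contrast concrete, here is a minimal sketch of immediate integration, with two in-memory SQLite databases standing in for heterogeneous sources. The tables, columns and the business question are invented for illustration; a real data virtualization tool would optimize and push down such queries rather than stitch results together by hand.

```python
# A minimal sketch of immediate integration: query two live sources at request
# time and combine the results into one answer. In-memory SQLite databases stand
# in for two heterogeneous systems; all table and column names are invented.
import sqlite3

# Hypothetical CRM source
crm = sqlite3.connect(":memory:")
crm.execute("CREATE TABLE customer (customer_id TEXT, name TEXT)")
crm.execute("INSERT INTO customer VALUES ('00004217', 'Acme Ltd')")

# Hypothetical billing source
billing = sqlite3.connect(":memory:")
billing.execute("CREATE TABLE invoice (customer_id TEXT, amount REAL)")
billing.execute("INSERT INTO invoice VALUES ('00004217', 125.5)")


def customer_spend(customer_id: str) -> dict:
    """Answer one business question by federating the two sources in real time."""
    name = crm.execute(
        "SELECT name FROM customer WHERE customer_id = ?", (customer_id,)
    ).fetchone()[0]
    total = billing.execute(
        "SELECT SUM(amount) FROM invoice WHERE customer_id = ?", (customer_id,)
    ).fetchone()[0]
    return {"customer_id": customer_id, "name": name, "total_spend": total}


print(customer_spend("00004217"))
# {'customer_id': '00004217', 'name': 'Acme Ltd', 'total_spend': 125.5}
```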
Both aspects have their pros and cons, of course. Prior integration provides a better opportunity to tackle more difficult problems and may be the only option when data sources are not synchronized in time. This latter problem, caused by end-of-period reconciliation activities, was very common in financial institutions in the past, but is becoming less prevalent as commerce is increasingly globalized and electronic. Immediate integration provides far more timely responses to business needs, although individual queries may perform poorly due to network speed or overloaded servers. Furthermore, the tools designed primarily for one aspect or the other — ETL and data virtualization tools — have tried to mitigate their shortcomings by moving toward the opposite aspect. ETL tools address timeliness issues by the use of incremental or micro-batch approaches; meanwhile, data virtualization tools implement semi-permanent cache stores to improve performance.
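As a rough illustration of the caching side of that convergence, the sketch below wraps a stubbed federated query in a time-limited cache. The five-minute lifetime and the function being cached are assumptions made for illustration, not a description of any vendor's implementation.

```python
# Sketch of a semi-permanent cache in front of a virtual (federated) query:
# repeated requests within the TTL are served from memory instead of re-querying
# slow or overloaded sources. The TTL and the stubbed query are illustrative only.
import time
from functools import wraps


def cached(ttl_seconds: float):
    def decorator(fn):
        store = {}  # args -> (timestamp, result)

        @wraps(fn)
        def wrapper(*args):
            hit = store.get(args)
            if hit and time.time() - hit[0] < ttl_seconds:
                return hit[1]              # serve the cached copy
            result = fn(*args)             # fall back to immediate integration
            store[args] = (time.time(), result)
            return result
        return wrapper
    return decorator


@cached(ttl_seconds=300)
def customer_spend(customer_id: str) -> dict:
    time.sleep(1)  # stand-in for a federated query against live sources
    return {"customer_id": customer_id, "total_spend": 125.5}


customer_spend("00004217")  # slow: hits the "sources"
customer_spend("00004217")  # fast: served from the cache for five minutes
```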
The observant reader may notice two points. First, although I’ve delineated two disparate aspects of integration, they really exist more on a continuum, as evidenced by the ability of vendors to cross from one to the other. At least one vendor refers to the entire spectrum as data integration, which is actually valid. Second, we have focused on data to the exclusion of information. These points are addressed by the third functional aspect of integration, concept integration. This raises our view from data to information and enables the integration of integration itself!
Data, as explained in my forthcoming book “Business unIntelligence,”3 is simply computer-optimized information, with information being the recorded human perception of reality. Modeling is the primary means of transforming information into data. (Business intelligence allegedly transforms data into information.) And concept integration is a key part of modeling when — as is usual today — we have to work with existing data sources, be they internal or, increasingly, externally sourced. Concept integration involves knowledgeable professionals understanding and interpreting business meaning, how it has been represented in data stores and the processing that went into their population, perhaps on the fly and with the help of text analytic tools. When done properly, the result is true information integration, ensuring that prior and immediate integration can work together to provide consistent and correct data integration, irrespective of timeliness and performance constraints or the use of different tools.
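One way to picture the output of concept integration is as shared metadata that maps a business concept onto its various source representations, in a form both prior and immediate integration tools could read. The sketch below is just such a picture: the concept, systems, encodings and conversion rules are invented for illustration.

```python
# Sketch of the metadata concept integration might yield: one business concept
# mapped to its representations in different stores, in a form both ETL and
# data virtualization tools could share. All names and rules here are invented.
from dataclasses import dataclass


@dataclass
class SourceRepresentation:
    system: str       # where the data lives
    table: str
    column: str
    encoding: str     # how the concept is represented there
    conversion: str   # rule to conform it to the canonical form


customer_identifier = {
    "concept": "Customer identifier",
    "canonical_form": "eight-digit string, zero-padded",
    "representations": [
        SourceRepresentation("order_entry", "customer", "cust_no",
                             "six-digit string", "zero-pad to eight digits"),
        SourceRepresentation("crm", "customer", "customer_id",
                             "eight-digit string", "none"),
    ],
}
```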
“Now we are tall, and Christmas trees are small…” There you have it. The integration dilemma solved — conceptually, at least. I leave the practical challenges to the vendors: to devise viable tools and methods for concept integration and to define and store the resultant metadata in an open, standard store that can be used freely by all tools!
1 Barry, Robin and Maurice Gibb (Bee Gees), “First of May”, 1969
2 Devlin, B. A. and Murphy, P. T., “An architecture for a business and information system”, IBM Systems Journal, Volume 27, Number 1, Page 60 (1988) http://bit.ly/EBIS88
3 Barry Devlin, “Business unIntelligence — Via Analytics, Big Data and Collaboration to Innovative Business Insight”, to be published by Technics Publications in 3rd quarter 2013.