Within the world of business intelligence (BI) there is much philosophical debate over what information architects refer to as “context.” Yet while the debates over “one version of the truth” versus “multiple perspectives” continue, we often overlook the influence of context already embedded within the data.
Context can be the invisible lineage within data and not necessarily the data itself. Even the simple act of profiling or analyzing data in database tables inherits context. When you look through rows in a customer table in the data warehouse, or rows of metrics in a fact table in a data mart, what do you see? That customer table has already organized data elements within the context of a provided definition of a customer. Perhaps that definition was derived from another operational system’s definition of the context needed to process transactions associated with customers. And perhaps the ancestors of that context were requirements for business processes from various business functions.
Often the data we work with every day is created, captured, transformed, and stored through this kind of evolution, carrying with it a predefined definition of “context.” The best data modelers can see beyond this inherited context, rediscover the truth of what is being modeled, and provide access to data within a well-defined, consistent context. This article discusses the value of structured and unstructured data, and how we can bridge the gap between the two with metadata and modern BI architectures, uniting data and context to realize the full potential of our BI.

With over 25 years of experience delivering value through data warehousing and BI programs, John O’Brien’s unique perspective comes from his combined roles as a practitioner, consultant, and vendor CTO in the BI industry. A recognized thought leader in BI, John has been publishing articles and presenting at conferences in North America and Europe for the past 10 years. His knowledge in designing, building, and growing enterprise BI systems and teams brings real-world insight to each role and phase of a BI program. Today, John provides research, strategic advisory services, and mentoring that guide companies in meeting the demands of next-generation information management, architecture, and emerging technologies.
Moving Between Structured and Unstructured Data
Analysts and information consumers working with structured databases benefit from data that has already been organized for them. One of the greatest strengths of structured databases is the broad availability of tools that connect to them and work against a relational or dimensional structure using standardized, mature languages such as Structured Query Language (SQL) or Multidimensional Expressions (MDX).
Unfortunately, structured data does not handle changes to its context or defined rules very easily. In today’s applications, users generate new data or change existing data in massive volumes. Because of this, unstructured data stores such as the Apache Software Foundation’s open source Hadoop project have emerged and gained enormous traction. Storing data as name/value pairs in Hadoop strips away the obstacles of a fixed context that inhibit constant change. This flexibility allows analysts and data scientists to discover new patterns and clusters that were previously hidden by old definitions of context.
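The flexibility described above can be sketched in a few lines of Python. The records and field names here are purely illustrative; the point is that a schemaless name/value record absorbs a new attribute without any schema migration, whereas a relational table would require an ALTER TABLE first.

```python
# Sketch: schemaless name/value records absorb new attributes without a
# schema change. All record and field names below are illustrative.
import json

# Yesterday's customer events carried three fields...
events = [
    {"customer_id": 101, "event": "purchase", "amount": 42.50},
]

# ...today the application starts emitting a new "channel" attribute.
# No ALTER TABLE, no migration -- the new key simply appears in the data.
events.append(
    {"customer_id": 102, "event": "purchase", "amount": 19.99, "channel": "mobile"}
)

# Downstream analysis discovers the new attribute at read time,
# defaulting gracefully for older records that never carried it.
channels = sorted({e.get("channel", "unknown") for e in events})
print(channels)  # ['mobile', 'unknown']

# Such records often land in Hadoop as one JSON document per line.
lines = [json.dumps(e) for e in events]
```

The cost of this flexibility is exactly the article's point: nothing in the data itself says what a "purchase" or a "channel" means, so context must be supplied somewhere else.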
To gain the flexibility that unstructured data stores deliver, we had to sacrifice the connectivity benefits of structured databases that users depend on to access data. However, context can live in the access layer, physically defining the lens through which you work with data. We can define context between the user and the data through a data management layer referred to as data virtualization, a semantic layer, data abstraction, a business information layer, or sometimes simply database views. These objects are forms of metadata that “map” physical data to a virtual structure and are transparent to the user.
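A minimal sketch of such a mapping layer, using a plain database view: raw name/value rows are presented to users through a business-defined "customer" structure. SQLite is used only for brevity, and the table, view, and column names are assumptions for illustration.

```python
# Sketch: a database view as a "mapping" layer that presents raw
# name/value-style rows within a business-defined context.
# Table, view, and column names are illustrative; SQLite keeps it compact.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_events (customer_id INTEGER, attr TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO raw_events VALUES (?, ?, ?)",
    [
        (101, "name", "Acme Corp"),
        (101, "segment", "enterprise"),
        (102, "name", "Globex"),
        (102, "segment", "smb"),
    ],
)

# The view pivots the name/value rows into the context of a "customer".
# Users query v_customer without knowing the physical layout beneath it.
conn.execute("""
    CREATE VIEW v_customer AS
    SELECT customer_id,
           MAX(CASE WHEN attr = 'name' THEN value END)    AS customer_name,
           MAX(CASE WHEN attr = 'segment' THEN value END) AS segment
    FROM raw_events
    GROUP BY customer_id
""")

rows = conn.execute("SELECT * FROM v_customer ORDER BY customer_id").fetchall()
print(rows)  # [(101, 'Acme Corp', 'enterprise'), (102, 'Globex', 'smb')]
```

If the business definition of a customer changes, only the view is redefined; the physical data and the tools querying it are untouched, which is the agility the article attributes to these abstraction layers.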
This new intersection of the structured and unstructured worlds, which we can refer to as a “mapping” layer, will bring the unstructured data world into the BI community at large. It should not, however, be considered a structured database, because it lacks many mature database features, such as security and transaction integrity. This does not mean Hadoop is not a mission-critical platform at many companies; it simply means you cannot assume all relational database features are available just because it looks like you are accessing a table in Hadoop.
Bridging the Gap with Metadata
Modern BI architecture has the ability to work with and deliver value from data stored in both well-defined structures and data stored in unstructured forms. These two halves of a data warehouse can be bridged through understanding where and how context lives in the architecture.
For BI processes that involve data discovery, profiling, and analysis, BI analysts can iterate on a business definition easily by redefining its corresponding metadata. This agility is one of the benefits typically touted for data virtualization. Data stored in Hadoop is no longer confined to the few who possess MapReduce skills. HCatalog, from the Hadoop developers, defines a context for the unstructured data in the Hadoop environment so it can be accessed by other Hadoop interfaces, databases, and BI tools. With definitions captured in HCatalog, a much larger portion of the BI community can explore the data through traditional tools. HCatalog is a key milestone in the maturing of Hadoop for the industry and helps bridge user access through the use of metadata.
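The idea behind a shared catalog can be sketched in Python without HCatalog itself: one registered schema definition lets every consumer interpret the same raw files consistently. The schema, file contents, and function here are hypothetical illustrations, not HCatalog's actual API.

```python
# Sketch of the idea behind a shared metadata catalog: a single registered
# schema definition lets different consumers read the same raw files
# consistently. Names and contents are illustrative, not HCatalog's API.
import csv
import io

# One shared definition of a "web_clicks" data set, analogous to a
# table registered in a catalog rather than hard-coded in each tool.
web_clicks_schema = ["ts", "customer_id", "url"]

# Raw delimited data as it might sit in a distributed file system.
raw_file = "2024-01-05T10:00,101,/home\n2024-01-05T10:01,102,/pricing\n"

def read_with_schema(raw, schema):
    """Any consumer -- BI tool, ETL job, ad hoc script -- applies the
    same shared schema instead of inventing its own interpretation."""
    return [dict(zip(schema, row)) for row in csv.reader(io.StringIO(raw))]

rows = read_with_schema(raw_file, web_clicks_schema)
print(rows[0]["url"])  # /home
```

Without the shared definition, each tool would re-derive the context of those files independently, which is precisely the fragmentation a catalog layer prevents.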
The value of the unstructured data typically stored within a Hadoop environment has often not yet been determined, and the best place to store data of unknown value should, of course, be the place with the lowest cost structure. Because Hadoop was designed to run on low-cost commodity hardware and to scale economically to massive data sets, it is often the ideal place to store big data volumes of undiscovered value. This raises the question: should the entire data warehouse staging area be in Hadoop?
Modern BI Architectures that are Integrating Hadoop
Modern BI architectures are rediscovering how to architect the data warehouses of the future. Understanding the many ways context and data are related is just as critical today as it will be in the future, and enabling the integration of the unstructured and structured data worlds is a promising and critical step for BI.
Today, BI architects are exploring three leading modern BI architectures in which abstraction layers extend to Hadoop, enabling access for the many more users who require context. In the first generation of integrated architecture, the goal is to provide context to the database or BI tool for access or for extraction into its environment. This integration presumes the data will be better served in a context-rich, structured technology, where decades of mature database access tools make it available to the most users.
The second integration architecture being considered leverages Hadoop as the data warehouse staging area, allowing all data to enter the environment while remaining largely insulated from change. Data integration tools and structured databases can now extract from the Hadoop staging area much more easily via projects like HCatalog, rather than through Pig or Hive interfaces. When SQL is too limiting and big data requires performance, sophisticated MapReduce programs can also be written to extract data to files, which are then loaded into the data warehouse or specialized data marts.
Finally, the third modern BI architecture, and the furthest out for most companies and data warehouses, is the concept that the entire data warehouse can live in a Hadoop environment. Here, context is defined only in the abstraction layer, and satellite data marts are used for specialized applications and BI workloads such as OLAP.
Bridging the unstructured world requires a thorough understanding of how to discover and define context, where it should exist, and whether it should be stable, flexible, or exist at all. Once we answer these questions, we can continue to analyze the BI processes and constructs that are best enabled by the different components of our data warehouse platform.