Dr. Geoffrey Malafsky
CEO, Technik Interlytics LLC,
Chief Data Scientist, The Bloor Group
The Data Warehouse (DW) has been a core component of data analytics for decades, which in the aging process of data management components entitles it to be venerated as an old, trusted, and loved war horse. Urban legend ascribes the genesis of the DW to early developers and companies starting in the 1980s; a reasonable and quick summary is published at DataVersity.
One of the early descriptions is from IBM, with their definition of a Business Data Warehouse that brought together data from multiple groups, systems, and time periods to enable more effective and widespread analyses. Another is W. H. Inmon’s book and his many subsequent publications on data warehousing. Inmon also viewed the DW as a source of integrated and validated data for decision support analysts.
One important aspect of this DW origin is the natural time progression of stated business need versus readily available technology capabilities. What is often perceived as a new business requirement is really the obvious evolution of awareness of options and the desire to use newly possible computer functions for different parts of the typical lifecycle of information, analysis, and decision making. Many human endeavors follow a natural sequence of work activities and tasks that does not change. Technology enables greater efficiency, accuracy, and scale, but considering that the Egyptian and Mayan pyramids were made entirely with human labor and simple machines, we should not be so audacious as to presume that computers create new business needs instead of enhancing existing ones and making them doable. This is why Da Vinci’s technology inventions, like the helicopter, which were not realized for hundreds of years, demonstrate human brilliance and not ethereal alien communication. Similarly, the 17th century predictions of another brilliant scientist, Robert Boyle, included light but hard armor, powered ships, and finely shaped glass.
We are fortunate to have multiple new technologies available that offer significant capabilities to enhance work activities and automation. Before we discuss these technologies, we will revisit the basic characteristics of the DW to understand where opportunities may exist for advanced technology assistance. First, why the name Data Warehouse? The name is purposefully intuitive, which is very much appreciated (kudos to the early experts), so the emphasis is on the nature of a warehouse. A warehouse is “a structure or room for the storage of merchandise or commodities”.
Of course, this simple definition is not the full semantic meaning in common usage. Typically, when we talk about a warehouse, we include the implied characteristics of its workflow processes as well, such as organizing storage, maintaining inventory, etc. Thus, there is a direct analogy to managing data for business analysis as opposed to transactional data systems. The basic nature of a warehouse transcends use cases from physical goods to non-physical data. Indeed, the current concerns in physical goods warehousing can be used as-is to describe DW issues:
- Inventory accuracy
- Inventory location
- Space utilization and warehouse layout
- Redundant processes
- Picking optimization
The primary DW interest is to have a full set of validated data about the business that can be used for many use cases. While this is a simple need statement, it spans many difficult fields in technology, organizational dynamics, and work activities. The adjective “validated” is itself a major and long-term challenge to solve. How do you know the data is valid? Is it valid for all reports and queries? Will it be valid when new data is added? How can you find and correct errors? Can the technology store the breadth of data desired, parse it all, cyclically interrogate it for model-based computing or Machine Learning, and support ad hoc querying of multiple levels of data over multiple time frames?
These are the activities needed, and they could be done by a large number of trained people, as actuarial work was before the widespread use of computers. Until recent years, the major technology advances were logarithm tables, the slide rule, and mechanical calculators, starting in the 1600s. The first wave of data warehousing arose when computing technology shrank in size and cost from the early behemoths into feasible business tools. It became apparent that computers could be used for more than just a small number of absolutely critical operations and could support mainstream business. This included differentiating between computers for transactional data processing and those for more human-interactive analytical activities. The DW moved ahead again into mainstream business operations when computer cost, size, and complexity dropped dramatically, allowing it to spread into more departments to support multiple use cases. This spread itself accentuated the need for centralized warehouses to make it easier to generate integrated analytics at the corporate level.
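As a rough illustration of the validation questions raised above (how to know data is valid, whether it stays valid when new data is added, and how to find errors), the following minimal Python sketch applies a set of rules to each incoming record and reports which records fail and why. The record layout (`SaleRecord`), the rule functions, and `validate` are all hypothetical names chosen for illustration; they are not taken from any particular DW product, and real validation would involve far more rules and context.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, Dict, List, Optional, Sequence


# Hypothetical record layout, invented for illustration only.
@dataclass
class SaleRecord:
    order_id: str
    amount: float
    order_date: date


# A rule returns an error message, or None when the record passes.
Rule = Callable[[SaleRecord], Optional[str]]


def has_order_id(rec: SaleRecord) -> Optional[str]:
    return None if rec.order_id.strip() else "missing order_id"


def amount_is_non_negative(rec: SaleRecord) -> Optional[str]:
    return None if rec.amount >= 0 else f"negative amount: {rec.amount}"


def date_not_in_future(rec: SaleRecord) -> Optional[str]:
    return None if rec.order_date <= date.today() else "order_date is in the future"


# Assumed rule set; new rules can be appended as new use cases appear.
RULES: List[Rule] = [has_order_id, amount_is_non_negative, date_not_in_future]


def validate(records: Sequence[SaleRecord],
             rules: Sequence[Rule] = RULES) -> Dict[int, List[str]]:
    """Return {record position: [error messages]} for records failing any rule."""
    failures: Dict[int, List[str]] = {}
    for position, rec in enumerate(records):
        errors = [msg for rule in rules if (msg := rule(rec)) is not None]
        if errors:
            failures[position] = errors
    return failures
```

Running the same rule set on every new batch is one simple way to approach the question of whether data remains valid as it is added; the returned positions and messages point at where errors need to be corrected.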
Now, we have the confluence of several dramatic shifts in both technology and the business model of commercial service providers.
- Big Data: this is a group of technologies focused on handling extremely large quantities of data in storage, processing, and query at speeds that prior generation data technology achieved only with moderate data set sizes. This includes Hadoop and its composite group of tools, in-memory computing, parallel computing, and Machine Learning.
- Cloud environments: these are hosted and managed computers provided in cluster mode with virtualized levels of storage and functionality. This is essentially the next step in outsourced computing, taking advantage of Big Data and other cluster computing technology, as well as service support for technology refresh, development and operations (DevOps), and cybersecurity. It specifically takes advantage of the tremendous growth of widespread high-speed internet networking and high Quality of Service (QoS).
- Low cost business models: service providers do not have to offer very low prices, often free, for the use of advanced computing resources, but this has become the most common approach to online supplied services. This makes it very cost effective to shift to internet supplied capabilities instead of on-premise only.
The key is having these three new capabilities available at the same time. While each is important and valuable by itself, the real opening for a new generation of DW capabilities is being able to exploit the combination. Therefore, we can consider the current upgrade in DW as more than a minor enhancement and instead a significant modernization. Scale, flexibility, cost, access, maintenance, and security are all being improved at the same time and in a coordinated manner.
This modernized DW supports:
- Hybrid Cloud architecture where data can reside in a combination of outsourced and on-premise data environments
- Big Data scale in exabytes
- Cluster computing for parallelism during calculations, queries, and interactive dashboards
- Aggregation of many disparate sources and formats
- Internet of Things and other high-volume data feeds
- Machine Learning and other sophisticated algorithms in the Artificial Intelligence family
- Flexible use of myriad data models, structures, and other granular design options
- Flexible pricing including scaling as needed
- Service Level Agreements for security, technology updates, and management
- Multiple levels of user roles, access, audit
We are at the forefront of this new DW capability expansion. It merits close attention in the near future to see how the price-capability-function trade-offs will settle. There is no better way to do so than to tune in to Inside Analysis weekly.
 Merriam-Webster definition is “something (such as a work of art or musical composition) that has become overly familiar or hackneyed due to much repetition in the standard repertoire”, https://www.merriam-webster.com/dictionary/warhorse, accessed 20181125
 Paul Williams, A Short History of Data Warehousing, 20120823, http://www.dataversity.net/a-short-history-of-data-warehousing/, accessed 20181125
 B. A. Devlin and P. T. Murphy, An architecture for a business and information system, IBM Systems Journal, Vol 27(1), 1988
 W. H. Inmon, Building the Data Warehouse, John Wiley & Sons, Inc. New York, NY, 1992
 F. Henderson, What scientists want: Robert Boyle’s to-do list, The Royal Society, 20100827, at https://blogs.royalsociety.org/history-of-science/2010/08/27/robert-boyle-list/, accessed 20181125
 https://www.merriam-webster.com/dictionary/warehouse, accessed 20181125
 C. G. Lewin, et al., Calculating devices and actuarial work, Journal of the Institute of Actuaries, Vol 116, 1989, pg 215