Inside Analysis

Weaving a Data Fabric

Dr. Geoffrey Malafsky

CEO, Technik Interlytics LLC,

Chief Data Scientist, The Bloor Group

 

A Data Fabric is a framework of networking, security, parsing techniques, management tools, and other components that work together to allow seamless operation and management of data systems across boundaries. These boundaries include vendor tools, geographic locations, network enclaves, and logical designs like database schemas[1]. As such, the Data Fabric is akin to the notions of the Enterprise Service Bus (ESB) and semantic abstraction layers, in the sense that a single human-interactive environment handles the plethora of technical details required to make everything work.

With any serious integration of distributed components of differing types, a new problem arises: managing the result, and developing new applications within it, efficiently. Brute force always works, but it is labor- and time-intensive and prone to error. This need for more facile development and operations (DevOps) extends throughout the stack of logical design, software, and hardware, leading to new market offerings even at the data storage layer[2].

However, the Data Fabric would be just another faddish term for extant tools and methods (a rampant issue in information technology) if it weren't based on real growth in data processing and analytics, in both technological and business capabilities. This growth is driven by the combination of several important marketplace offerings:

  1. Big Data: a group of technologies focused on handling extremely large quantities of data in storage, processing, and query, at speeds that prior-generation data technology achieved only on moderately sized data sets. This includes Hadoop and its composite group of tools, in-memory computing, parallel computing, and Machine Learning.
  2. Cloud environments: hosted and managed computers provided in cluster mode, with virtualized levels of storage and functionality. This is essentially the next step in outsourced computing, taking advantage of Big Data and other cluster-computing technology, as well as service support for technology refresh, development and operations (DevOps), and cybersecurity. It specifically exploits the tremendous growth of widespread high-speed internet networking with high Quality of Service (QoS).
  3. Low-cost business models: service providers do not have to offer very low prices, often free tiers, for the use of advanced computing resources, but doing so has become the most common approach to online services. This makes it very cost-effective to shift to internet-supplied capabilities instead of on-premises-only ones.

The concept of a data fabric is wonderful, but between concept and production-grade reality lie many complicated technical concerns that must work reliably, both alone and in concert with one another. Achieving high service levels is complicated enough that computer simulations are used to understand the influence of various factors[3]. Indeed, the data fabric is partly based on the older concept of a data grid, which was particularly suited to large-scale inter-organization science research[4]. The key characteristics of the data grid hold for the data fabric as well: massive datasets; shared data collections; a unified namespace; and access restrictions[5]. Other operational requirements for the data fabric, followed by a short code sketch of the unified-namespace idea, are:

  • Hybrid architecture with data in a combination of outsourced and on-premise environments
  • Scale in terabytes at least, and possibly up to petabytes or even exabytes
  • Parallel computing and ad hoc queries
  • Aggregation of disparate data schemas, formats, and hidden semantics
  • Processing high-volume feeds
  • AI, Machine Learning and other sophisticated algorithms
  • Multi-level cybersecurity
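
To make the unified-namespace and access-restriction requirements concrete, here is a minimal sketch in Python of a fabric-style catalog that resolves one logical dataset name to a physical location in either an on-premises or cloud store. The FabricCatalog class, its methods, and the example URIs are hypothetical illustrations of the pattern, not any vendor's actual API.

    from dataclasses import dataclass

    @dataclass
    class Location:
        """Physical home of a dataset: which environment, and a storage URI."""
        environment: str   # e.g. "on-premises" or "cloud"
        uri: str           # storage-specific address (HDFS path, S3 key, ...)

    class FabricCatalog:
        """Hypothetical unified namespace: one logical name per dataset,
        regardless of where the bytes actually live."""

        def __init__(self):
            self._catalog: dict[str, Location] = {}
            self._acl: dict[str, set] = {}   # dataset name -> allowed users

        def register(self, name: str, location: Location, users: set) -> None:
            self._catalog[name] = location
            self._acl[name] = users

        def resolve(self, name: str, user: str) -> Location:
            """Enforce access restrictions, then return the physical location."""
            if user not in self._acl.get(name, set()):
                raise PermissionError(f"{user} may not access {name}")
            return self._catalog[name]

    # Usage: the same logical names work on-premises and in the cloud.
    catalog = FabricCatalog()
    catalog.register("sales/2018/q4",
                     Location("cloud", "s3://example-bucket/sales/2018/q4/"),
                     users={"analyst1"})
    catalog.register("sales/archive",
                     Location("on-premises", "hdfs://namenode:8020/data/sales/archive"),
                     users={"analyst1", "auditor"})
    print(catalog.resolve("sales/2018/q4", "analyst1").uri)

The point of the sketch is the indirection: applications name data logically, while the fabric decides, per policy, where the physical bytes are and who may touch them.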

One salient aspect fueling the dramatic increase in powerful, low-cost data technology is the open-source nature of Big Data. Allowing its core methods to be exposed to a wide developer community, without the risk of violating intellectual property, has fostered extensive experimentation and product offerings. Apache hosts many projects for the needed components of Big Data and, more recently, the Data Fabric. One such project is Apache Ranger for data security[6]. Another is Apache Atlas, which serves the governance and associated metadata-management requirement[7].
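
As an illustration of how such components are typically driven programmatically, the following sketch uses Python's requests library to create an HDFS path policy through Apache Ranger's public REST API. It assumes a running Ranger instance; the host, credentials, service name, and group below are placeholder assumptions, and the exact payload fields should be verified against the Ranger documentation for your version.

    import requests

    # Placeholder endpoint and credentials; substitute your deployment's values.
    RANGER_URL = "http://ranger.example.com:6080"
    AUTH = ("admin", "admin-password")   # assumed admin account

    # A minimal HDFS path policy: let the 'analysts' group read /data/sales.
    # Field names follow Ranger's public v2 policy API.
    policy = {
        "service": "cluster1_hadoop",    # assumed Ranger service name for HDFS
        "name": "sales-read-only",
        "resources": {"path": {"values": ["/data/sales"], "isRecursive": True}},
        "policyItems": [{
            "groups": ["analysts"],
            "accesses": [{"type": "read", "isAllowed": True},
                         {"type": "execute", "isAllowed": True}],
        }],
    }

    resp = requests.post(f"{RANGER_URL}/service/public/v2/api/policy",
                         json=policy, auth=AUTH)
    resp.raise_for_status()
    print("Created policy id:", resp.json().get("id"))

Apache Atlas exposes a comparable REST interface for registering and tagging metadata entities, so governance actions can be scripted in the same style.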

While the individual Apache projects can be collected, installed, and configured by hand, making them work seamlessly and reliably as a single environment requires a very high level of expertise and patience. Thus, the real power of the data fabric will come from suppliers who do this work and provide an integrated, tested distribution. Hortonworks has done so with its Hortonworks DataPlane Service, which, importantly, is part of a broader global data-management strategy[8].

Hortonworks DataPlane is a portfolio of data solutions that enables enterprises to consistently manage, secure, govern, and optimize their data, stored both in on-premises data centers and in the cloud, to institute an effective hybrid data strategy.

Their growing impact on data fabric technology was described in a recent Forrester report[9]. This will be accentuated by the Cloudera-Hortonworks merger, which allows both sets of Big Data tools to blend into a more complete and reliable solution[10]. There are many use cases for the data fabric, and the market is expected to grow substantially, especially in fraud detection, marketing, and compliance[11].

 

[1] D. Kusnetzky, What is a data fabric and why should you care?, NetworkWorld, 20170919, https://www.networkworld.com/article/3226393/data-management/what-is-a-data-fabric-and-why-should-you-care.html, accessed 20181201

[2] J. Webster, Big Data meets data fabric and multi-cloud, Forbes, 20180112, https://www.forbes.com/sites/johnwebster/2018/01/12/big-data-meets-data-fabric-and-multi-cloud/#1337638812b6, accessed 20181201

[3] S. Kounev, K. Bender, F. Brosig, N. Huber, R. Okamoto, Automated simulation-based capacity planning for enterprise data fabrics, SIMUTools '11: Proc. 4th Intl. ICST Conf. on Simulation Tools and Techniques, pp. 27-36, 2011

[4] A. Chervenak, I. Foster, C. Kesselman, C. Salisbury, S. Tuecke, The data grid: towards an architecture for the distributed management and analysis of large scientific datasets, J. Network and Computer Applications, 23(3), pp. 187-200, 2000

[5] S. Venugopal, R. Buyya, K. Ramamohanarao, A taxonomy of data grids for distributed data sharing, management, and processing, ACM Computing Surveys, 38(1), 2006

[6] Apache Ranger, https://ranger.apache.org/, accessed 20181202

[7] Apache Atlas, https://atlas.apache.org/, accessed 20181202

[8] Hortonworks DataPlane, https://hortonworks.com/products/data-platforms/dataplane/, accessed 20181202

[9] The Forrester Wave: Big Data Fabric, Q2 2018, https://www.forrester.com/report/The+Forrester+Wave+Big+Data+Fabric+Q2+2018/-/E-RES141570, accessed 20181202

[10] N. Yuhanna, Cloudera and Hortonworks merge: A win-win for all, Forrester, 20181004, accessed 20181202

[11] Data Fabric Market, MarketsAndMarkets, May 2017, https://www.marketsandmarkets.com/Market-Reports/data-fabric-market-237314899.html, accessed 20181202
