Inside Analysis

Hadoop: Is There a Metadata Mess Waiting in the Wings?

Why is Hadoop so popular? There are many reasons. First of all it is not so much a product as an ecosystem, with many components: MapReduce, HBase, HCatalog, Pig, Hive, Sqoop, Mahout and quite a few more. That makes it versatile, and all these components are open source, so most of them improve with each release cycle.

But, as far as I can tell, the most important feature of Hadoop is its file system, HDFS. This has two notable characteristics: it is, in effect, a key-value store, and it is built for scale-out use. The IT industry seemed to have forgotten about key-value stores. They used to be called ISAM files and came with every operating system, until Unix, and later Windows and Linux, took over. Those operating systems didn’t provide general-purpose key-value stores, and nobody seemed to care much, because there was a plethora of databases you could use to store data, including inexpensive open source ones. So that seemed to take care of the data layer.

But it didn’t. The convenience of a key-value store is that you can put anything you want into it, as long as you can think of a suitable key to index it by, and that is usually a simple choice. With a database you have to create a catalog or schema to identify what’s in every table. And, if you are going to use the data coherently, you have to model the data, decide which tables to create, and determine which attributes belong in each table. This imposes a delay on importing data from new sources into the database.
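To make the contrast concrete, here is a minimal sketch in Python. Plain dictionaries stand in for any key-value store, and the orders table is purely hypothetical; nothing here is specific to HDFS or to any particular database.

```python
# A minimal sketch: a key-value store accepts heterogeneous records with
# nothing declared up front except the key you choose to index by.
kv_store = {}

kv_store["order:1001"] = {"customer": "acme", "total": 129.95}
kv_store["tweet:8675309"] = {"text": "Hadoop scales!", "lang": "en"}
kv_store["sensor:42:2013-10-01T1025"] = b"\x00\x17\x9a"   # raw bytes, no schema at all

# Retrieval needs only the key chosen at write time.
print(kv_store["order:1001"])

# The relational equivalent forces a modelling step before the first row
# arrives: every column must be named and typed in advance.
create_orders = """
CREATE TABLE orders (
    order_id  INTEGER PRIMARY KEY,
    customer  VARCHAR(64),
    total     DECIMAL(10,2)
);
"""
```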

Now you can, if you want, treat a database table as a key-value store and define only the index. But that is regarded as bad practice, and it usually is. Add to this the fact that legacy databases were never built to scale out, and you quickly conclude that Hadoop can do something that a database cannot. It can become a data lake – a vast and very scalable data staging area that will accommodate any data you want, no matter how “unstructured” it is.

So if someone in the company wants some external data or even internal data captured for later use, Hadoop can just sit there and drink it up. And that’s fine as long as you don’t lose track of what the data in the lake actually is. But this is where the devil crawls into the detail. You can scale Hadoop out so it becomes just one very large data lake and sits there gulping down all the data it can drink. You can also instantiate multiple instances of Hadoop, each devoted to a specific kind of usage, but we do not often hear about IT sites doing that – after all Hadoop scales out to the edge of the solar system, does it not?

Then we encounter the fact that quite a few IT sites use Hadoop as a natural storage location for intermediate files, and that can be convenient too. This happens frequently with data analytics, because analytics activity often generates intermediate files, and if you’ve decided to “go round the database” you may be happy to use Hadoop as the foundation of a general data analytics capability and forswear databases forever – or at least until you discover that you need them.
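As a small illustration of that pattern, here is a sketch of pushing an intermediate analytics result into HDFS with the standard hdfs dfs CLI. The directory layout and file name are hypothetical conventions, not anything Hadoop prescribes.

```python
import subprocess
from datetime import date

# Hypothetical intermediate output from an analytics job.
local_file = "churn_scores_stage2.csv"

# A made-up "scratch space" convention: /analytics/intermediate/<job>/<date>
hdfs_dir = "/analytics/intermediate/churn/" + date.today().isoformat()

# Create the target directory in HDFS, then copy the file in.
subprocess.run(["hdfs", "dfs", "-mkdir", "-p", hdfs_dir], check=True)
subprocess.run(["hdfs", "dfs", "-put", local_file, hdfs_dir], check=True)
```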

Revelytix and the Potential Metadata Mess

So there is a potential metadata mess that may be gradually building up due to hasty Hadoop exploitation. And like most IT problems, you may only discover that it’s out of control when it decides to sit firmly on your knee. I think it is a good idea to forestall such a possibility, and if you are going to do so, then you need to keep track of the data. You can do this manually, of course. You can declare and implement (with an iron fist) some Hadoop best practices. The problem with that, however, is that you may bring back some of the data latency that you hoped to diminish with Hadoop, and manual procedures have a terrible habit of generating errors.
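In practice, a manual regime of that kind usually amounts to something like the sketch below: a sidecar manifest recorded for every dataset that lands in the lake. The field names and paths are illustrative only, not any standard.

```python
import json
import getpass
from datetime import datetime

def write_manifest(dataset_path, source_system, description, schema_hint=None):
    """Record what a dataset is, where it came from, and who loaded it.

    For simplicity the manifest is written to the local working directory;
    in practice it would be pushed into the lake alongside the data itself.
    """
    manifest = {
        "dataset_path": dataset_path,
        "source_system": source_system,
        "description": description,
        "schema_hint": schema_hint,          # e.g. column names, if known
        "loaded_by": getpass.getuser(),
        "loaded_at": datetime.utcnow().isoformat() + "Z",
    }
    manifest_name = dataset_path.rstrip("/").split("/")[-1] + ".manifest.json"
    with open(manifest_name, "w") as f:
        json.dump(manifest, f, indent=2)
    return manifest

# Hypothetical example: raw clickstream logs dropped into the lake.
write_manifest(
    "/data/lake/raw/clickstream/2013-10-01",
    source_system="web-frontend",
    description="Raw clickstream logs, one JSON record per line",
    schema_hint=["timestamp", "session_id", "url", "referrer"],
)
```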

There is another possibility: to gather the metadata automatically, directly from Hadoop. At the moment I know of only one technology that does this in a comprehensive way, and it comes from Revelytix. Revelytix employs semantic technology, which gives it most of its power, especially when dealing with seriously unstructured data. In most cases it can work out what data is in a record without any help. Additionally – I found this quite surprising, but incredibly sensible – it tracks the provenance of data, which means it knows what data came from where. For professional data analytics this is more than just a time-saver; it’s a necessity.
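To give a flavour of what provenance looks like when it is expressed semantically, here is a minimal sketch using the W3C PROV vocabulary and the rdflib library. This is not Revelytix’s actual model; the datasets and job names are entirely hypothetical.

```python
from rdflib import Graph, Namespace

# Standard W3C PROV-O vocabulary plus a made-up namespace for lake objects.
PROV = Namespace("http://www.w3.org/ns/prov#")
LAKE = Namespace("http://example.com/lake/")

g = Graph()
g.bind("prov", PROV)

raw = LAKE["raw/clickstream/2013-10-01"]
sessions = LAKE["derived/sessionized/2013-10-01"]
job = LAKE["jobs/sessionize-run-42"]

# The derived dataset was produced by a job that read the raw dataset.
g.add((sessions, PROV.wasDerivedFrom, raw))
g.add((sessions, PROV.wasGeneratedBy, job))
g.add((job, PROV.used, raw))

print(g.serialize(format="turtle"))
```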

So all is not lost. In fact, maybe nothing is lost. In fact, the cavalry may have arrived a little ahead of time. But if you don’t recognize this problem before it proliferates, I suspect it’s going to cost you real dollar$.

 

Robin Bloor

About Robin Bloor

Robin is co-founder and Chief Analyst of The Bloor Group. He has more than 30 years of experience in the world of data and information management. He is the creator of the Information-Oriented Architecture, which is to data what SOA is to services. He is the author of several books, including The Electronic B@zaar: From the Silk Road to the eRoad, a book on e-commerce, and three IT books in the Dummies series on SOA, Service Management and The Cloud. He is an international speaker on information management topics. As an analyst for Bloor Research and The Bloor Group, Robin has written scores of white papers, research reports and columns on a wide range of topics, from database evaluation to networking options and comparisons to the enterprise in transition.

7 Responses to "Hadoop: Is There a Metadata Mess Waiting in the Wings?"

  • Geoffrey Malafsky
    October 1, 2013 - 10:25 am

    As always, a well thought out and well written commentary. However, I will add a few relevant issues. First, regardless of the storage structure, the planned coordination, correlation, and accuracy of the data is much more important. The issue of what structure is best is really an outdated problem. In the days of costly and limited storage, CPU, and network capacity it was imperative to carefully plan and manage those fragile resources to meet current and anticipated needs. Hadoop and its extended family are really a very well engineered free version of parallel processing and storage ideas that used to be part of NSF grants. Thanks to Google, Yahoo, and the others!

    With it come several important use cases: 1) massive data loads able to use the low-cost and reliably engineered distributed processing framework; 2) sandboxes for playing with the notion of Big Data, with or without juice boxes and Twinkies; 3) using the easy parallelism for standard computationally intensive functions like integration and analytics to lower cost, improve speed, and enable greater flexibility to business changes. #1 is the core use case of Hadoop. #2 is the source of the current fad and interest. #3 is a viable, justifiable way to embed the technology into existing corporate environments. Neither 2 nor 3 will be of much use without connection to already “problematic” governance and rationalization efforts.

    Thus, the data structure should be chosen less for technical reasons, since storage and CPU and network are cheap and getting cheaper, and more for a combination of adaptability (low cycle time), visible linkage to business issues and terminology, and good HW and SW performance. Tracking provenance and pedigree and activity in metadata has been a core precept of highly secure data processing (e.g. intelligence) for quite some time and is a fixture of standard metadata schemas. The notion that enough accurate metadata can be affixed to a very large amount of data (whether unstructured or structured) has been tried and shown to be too difficult for quite a while. Semantic enthusiasts always trot out the same prototype (“give me 1B more and it will…”) story, and there is little to nothing to show for these 15 years of projects, besides some giddy folks in love with the notion of automated machine semantic analysis going beyond well studied and calibrated data sets.

  • David Eddy
    October 7, 2013 - 7:39 am

    Interest in systems metadata arose in the mid 1960s as database engines were being developed & deployed. Someone observed that these coming newfangled database things would be so complex & have so many components that it would be necessary to have a separate Bill-of-Materials database to keep track of the actual database application.

    Thus was born the Data Dictionary—products like Software AG’s Predict, Cullinet’s IDMS IDD, IBM’s IMS DD, Manager Software Products’ DATAMANAGER. [Note: the Data Dictionary label morphed to Metadata Repository with IBM’s short-lived & abortive AD/Cycle effort. They’re the same thing. Whether the Dictionary’s storage mechanism is Inverted List or DB/2 is totally irrelevant.]

    Initially there was a great debate over “active” vs “passive.” Active (e.g. IDMS’s IDD) meant you automatically documented your work by going through the Dictionary. You could not bypass the Dictionary. Therefore the Dictionary & the technical documentation (what are the several definitions for a piece of data, what are the connections between components… e.g. basic Bill-of-Materials knowledge) would always be in sync.

    A Passive Dictionary meant documentation in the Dictionary was a separate task. People being people, getting around to documenting a running application in the Dictionary became secondary & the Dictionary was invariably out-of-date & therefore seen as a waste of time.

    A few organizations did automate the process of keeping a Dictionary accurate, but emphasis on the few. We’re talking a couple of dozen in North America.

    The gold standard of Data Dictionaries sold some 900 licenses in North America from 1973 onwards. By 2003 that number was down to perhaps 50. Systems, in the meantime, have clearly exploded in number & complexity.

    It would appear that organizations are willing to take the hit of constantly rediscovering basic documentation… “Where is Component X used?” “What are the many synonyms for this piece of data?”

    Certainly doesn’t sound as though the HDFS/Hadoop toolkit addresses any of this “trivia.”

  • Alec Sharp
    October 7, 2013 - 3:07 pm

    Another excellent column from Robin that demonstrates the clarity of his thinking. The observation that “key-value stores … used to be called ISAM files” was so perfect I didn’t know whether to laugh or cry. It reminded me of sitting in a presentation (an excellent presentation, BTW) a few years back about semantic web structures and suddenly realising “Hey, these are the LISP structures I was coding almost 40 years ago.” CARs and CDRs. That in turn reminds me of the classic Gary Larson cartoon in which a cow munching grass suddenly blurts out “Hey, wait a minute! This is grass! We’ve been eating grass!”

    There have clearly been terrific benefits from Big Data initiatives, Hadoop included, but a bit of historical perspective is helpful. See David Eddy’s comments above on data dictionaries (whoops – “metadata repositories”) for another illumination of lessons lost. Per his note that “organizations are willing to take the hit of constantly rediscovering basic documentation,” I’ve earned a bunch of money over the last decade or so reverse-engineering business meaning out of undocumented data structures.

    Unlike the cows, maybe I’ll just shut up and be a contented cow knowing there will always be work for me to do.

    Thanks, Robin.

    Cheers,
    Alec

  • Ian Oliver
    October 8, 2013 - 9:01 am

    ISAM??!! You’ll be telling me that this is all 1970s style batch processing with COBOL next! Oh, wait…

    Working in privacy, the issue I see when auditing systems for compliance is actually understanding what data they are holding. Often it is the case that the project teams themselves don’t understand their databases and log files sufficiently. Worse, misinterpretation of the NoSQL and Big Data approaches has left us in a situation where schemata can simply be forgotten. As the original poster stated, “and that’s fine as long as you don’t lose track of what the data in the lake actually is” … most of the time the lake is shrouded in fog.

    http://ijosblog.blogspot.fi/2013/07/big-metadata.html

  • Fware28
    October 8, 2013 - 10:06 am

    Hi

    Nice article and comments, but a lot of journalism glossing over real issues…

    For starters, Hadoop isn’t “free”.
    50 PB of server-disk-network infrastructure costs multi-millions in CAPEX, and OPEX can be considerable given data & project PMO governance and the tech/admin problems associated with non-industrial technologies… and that’s if the production data center people have facilities with the MWh & cooling required!
    Let’s not even add the salaries of FTEs and specialist consultants (biz & tech) to the operational overhead…
    Or the multi-year roadmaps.

    It ain’t throwaway either 😉

    So many initial Big Data sites are run outside of corporate production by outsourced IM specialists. And you provide the Hadoop & data scientists in Build & Run mode…
    Keeping that talent, given current market conditions for these skill groups, is also costly.

    Moral of the story: where there’s gold in those nuggets, there’s a darn good business case and deep pockets somewhere. No free lunch. Innovation is investment driven.

    So Open Source ain’t a free lunch for tech or business 😉

    cheers

  • Robin Bloor
    October 8, 2013 - 11:30 am

    This blog post never claimed that Hadoop was free, only that it was open-source.

    In my experience (at the coal face), the reality of most IT investment (at the full project level) is that the software is really a small part of the whole bill. This is often the case even with “big ticket” software – especially if you calculate the bill over the life of the system being built (and you ought to if you are trying to set costs against benefits in any meaningful way). It rarely gets to be more than a few percent of the full system cost.

    So at that level, I can only agree with you. The idea that Open Source significantly reduces the cost of building a “big” system is unlikely to be verified in practice. Nevertheless there are advantages to rich software ecosystems in such projects.

  • David Eddy
    October 10, 2013 - 6:41 pm

    There was something that Michael—Mr Column Store—Stonebraker said in a presentation a couple of years ago that has been rattling around in my brain.

    One other person “confirmed” that there is significant similarity between inverted list DBMSs (DATAMANAGER, Adabas, M204?, S2000?) and the more “modern” column stores.

    Can someone with more hands-on knowledge of such plumbing confirm or deny?
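For what it is worth, the structural similarity in question can be sketched in a few lines. This is an illustration only, not a claim about how any particular product works; the rows and column names are made up.

```python
# Three toy rows.
rows = [
    {"city": "Boston", "status": "open"},
    {"city": "Austin", "status": "closed"},
    {"city": "Boston", "status": "closed"},
]

# Column store: each column held as its own array; positions act as row ids.
column_store = {
    "city":   [r["city"] for r in rows],      # ["Boston", "Austin", "Boston"]
    "status": [r["status"] for r in rows],    # ["open", "closed", "closed"]
}

# Inverted list: each distinct (column, value) maps to the row ids holding it.
inverted = {}
for col, values in column_store.items():
    for row_id, value in enumerate(values):
        inverted.setdefault((col, value), []).append(row_id)

# Both answer "which rows have city = Boston?" without touching other columns.
print(inverted[("city", "Boston")])   # [0, 2]
```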
