Why is Hadoop so popular? There are many reasons. First of all, it is not so much a product as an ecosystem with many components: MapReduce, HBase, HCatalog, Pig, Hive, Sqoop, Mahout and quite a few more. That makes it versatile, and because all these components are open source, most of them improve with each release cycle.
But, as far as I can tell, the most important feature of Hadoop is its file system, HDFS. This has two notable features: it behaves, in effect, like a vast key-value store (the file path is the key, the file contents are the value), and it is built for scale-out use. The IT industry seemed to have forgotten about key-value stores. They used to be called ISAM files and came with every operating system until Unix, and later Windows and Linux, took over. These operating systems didn’t provide general-purpose key-value stores, and nobody seemed to care much because there was a plethora of databases you could use to store data, including inexpensive open source ones. So that seemed to take care of the data layer.
But it didn’t. The convenience of a key-value store is that you can put anything you want into it as long as you can think of a suitable key for it, and that is usually a simple choice. With a database you have to create a catalog or schema to identify what’s in every table. And if you are going to use the data coherently, you have to model the data: decide which tables to create and which attributes each table holds. This puts a delay into importing data from new sources into the database.
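To make that contrast concrete, here is a minimal sketch in Python. A local directory stands in for HDFS (in practice you would land the file with something like `hdfs dfs -put` or an HDFS client library), and the dataset name, fields and file names are invented for illustration.

```python
import json
import sqlite3
from pathlib import Path

# Key-value style: land the raw record under a key and move on.
# A local directory stands in for HDFS; paths and fields are invented.
lake = Path("lake/clickstream")
lake.mkdir(parents=True, exist_ok=True)

record = {"user": "u42", "page": "/pricing", "ts": "2014-07-01T12:00:00Z"}
(lake / "u42-2014-07-01.json").write_text(json.dumps(record))  # the path is the key

# Relational style: a schema has to exist before the first row can land.
db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE clickstream (
        user TEXT NOT NULL,
        page TEXT NOT NULL,
        ts   TEXT NOT NULL
    )
""")
db.execute("INSERT INTO clickstream VALUES (?, ?, ?)",
           (record["user"], record["page"], record["ts"]))
db.commit()
```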
Now you can, if you want, treat a database table as a key-value store and define only the index. But that is regarded as bad practice, and it usually is. Add to this the fact that legacy databases were never built to scale out, and you quickly conclude that Hadoop can do something that a database cannot. It can become a data lake – a vast and very scalable data staging area that will accommodate any data you want, no matter how “unstructured” it is.
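For what it’s worth, the key-value-table anti-pattern looks something like the following sketch. The table and key names are my own invention; the point is that the database is reduced to lookup by key, and everything inside the value is invisible to it.

```python
import sqlite3

# A relational table pressed into service as a key-value store: only the key
# is declared, and everything else is an opaque blob the database cannot query.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v BLOB)")
db.execute("INSERT INTO kv VALUES (?, ?)",
           ("order:1001", b'{"item": "widget", "qty": 3}'))

# The only efficient access path is by key; filtering on anything inside the
# value means fetching and parsing every blob in application code.
value = db.execute("SELECT v FROM kv WHERE k = ?", ("order:1001",)).fetchone()[0]
print(value)
```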
So if someone in the company wants some external data, or even internal data, captured for later use, Hadoop can just sit there and drink it up. And that’s fine as long as you don’t lose track of what the data in the lake actually is. But this is where the devil crawls into the detail. You can scale Hadoop out so that it becomes one very large data lake, sitting there gulping down all the data it can drink. You can also run multiple instances of Hadoop, each devoted to a specific kind of usage, but we do not often hear of IT sites doing that – after all, Hadoop scales out to the edge of the solar system, does it not?
Then we encounter the fact that quite a few IT sites use Hadoop as a natural storage location for intermediate files, and that can be convenient too. This happens frequently with data analytics, because analytics work generates intermediate files, and if you’ve decided to “go round the database” you may be happy to use Hadoop as the foundation of a general data analytics capability and forswear databases forever, or at least until you discover that you need them.
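In miniature, and with invented paths and fields, the habit looks something like this: each stage of an analysis writes its intermediate output into the lake, and nothing records what the file contains or which job produced it.

```python
import json
from datetime import date
from pathlib import Path

# Illustrative only: a local directory stands in for an HDFS directory.
lake = Path("lake/tmp")
lake.mkdir(parents=True, exist_ok=True)
run_id = date.today().isoformat()

raw = [{"user": "u42", "spend": 19.0}, {"user": "u7", "spend": 230.0}]

filtered = [r for r in raw if r["spend"] > 100]                  # stage 1: filter
(lake / f"filtered_{run_id}.json").write_text(json.dumps(filtered))

total = sum(r["spend"] for r in filtered)                        # stage 2: aggregate
(lake / f"totals_{run_id}.json").write_text(json.dumps({"total_spend": total}))
```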
Revelytix and the Potential Metadata Mess
So there is a potential metadata mess that may be gradually building up due to hasty Hadoop exploitation. And like most IT problems, you may only discover that it’s out of control when it decides to sit firmly on your knee. I think it is a good idea to forestall that possibility, and doing so means keeping track of the data. You can do this manually, of course: you can declare and implement (with an iron fist) some Hadoop best practices. The problem with that, however, is that you may bring back some of the data latency you hoped Hadoop would eliminate, and manual procedures have a terrible habit of generating errors.
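If you do go the manual route, the best practices usually amount to a convention like the following sketch: every dataset lands with a hand-written manifest. The function and manifest fields here are my own invention, not an established standard, and the weakness is exactly the one just described: it only works if every loader obeys the rule.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

# Purely illustrative "iron fist" convention: every dataset directory carries
# a manifest saying what the data is and where it came from, written at load
# time by whoever lands the data.
def land_dataset(lake_dir: Path, name: str, payload: bytes,
                 source: str, fields: list) -> None:
    dataset = lake_dir / name
    dataset.mkdir(parents=True, exist_ok=True)
    (dataset / "data.json").write_bytes(payload)
    manifest = {
        "dataset": name,
        "source": source,            # provenance, recorded by hand
        "fields": fields,            # what a reader should expect to find
        "loaded_at": datetime.now(timezone.utc).isoformat(),
    }
    (dataset / "_manifest.json").write_text(json.dumps(manifest, indent=2))

land_dataset(Path("lake"), "clickstream",
             b'{"user": "u42", "page": "/pricing"}',
             source="web analytics feed", fields=["user", "page"])
```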
There is another possibility: to gather the metadata automatically, directly from Hadoop. At the moment I know of only one technology that does this in a comprehensive way, and it comes from Revelytix. Revelytix employs semantic technology, which gives it most of its power, especially when dealing with seriously unstructured data. In most cases it can work out what data is in a record without any help. Additionally – I found this quite surprising, but incredibly sensible – it tracks the provenance of data, which means it knows what data came from where. For professional data analytics this is more than just a time-saver; it is a necessity.
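I have no visibility into how Revelytix actually does this, so the following is only a toy illustration of the general idea: inferring what is in a record from the values themselves, and keeping a note of where the file came from. The sample data, field names and provenance record are all invented.

```python
import csv
import io

# Toy sketch of record-level inference (not Revelytix's technology): guess a
# type for each field from the values, without being told a schema up front.
def infer_field_types(raw_csv: str) -> dict:
    reader = csv.DictReader(io.StringIO(raw_csv))
    types = {}
    for row in reader:
        for field, value in row.items():
            try:
                float(value)
                guess = "number"
            except ValueError:
                guess = "string"
            # If rows disagree about a field, fall back to "string".
            types[field] = guess if types.get(field, guess) == guess else "string"
    return types

sample = "user,spend\nu42,19.0\nu7,230.0\n"
print(infer_field_types(sample))            # {'user': 'string', 'spend': 'number'}

# Provenance is the other half of the record: where this data came from.
provenance = {"dataset": "spend.csv", "derived_from": "crm_export_june"}
```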
So all is not lost. In fact, maybe nothing is lost; the cavalry may have arrived a little ahead of time. But if you don’t recognize this problem before it proliferates, I suspect it’s going to cost you real dollar$.