The Hadoop stack is a data processing platform. It combines elements of databases, data integration tools and parallel coding environments into a new and interesting mix. The problem with the IT market today is that it distorts the view of Hadoop by treating it as a replacement for one of those technologies. Database vendors see it as a database and challenge it on those grounds. Data integration vendors see it as an ETL tool and challenge it on those grounds. Analytics vendors see it as a replacement for their engines and challenge it through that lens. In doing so, each vendor community overestimates Hadoop's potential to displace its product, while simultaneously underestimating the impact Hadoop will have on the environment and architecture that product operates in.
As Hadoop adoption grows, some vendors think they can subsume or contain it. This is as unlikely as Hadoop replacing databases and data warehouses, the bet other vendors are making. Some wish it would just go away so they can keep doing what they've always done. This is a natural state of affairs for any technology that offers both new capabilities and capabilities that overlap with existing technologies and products. The reality of the market is that the technology needs to settle in the areas where it offers new capabilities, or a more effective or efficient replacement of the old. Vendors with products in areas of significant overlap need to integrate in new ways, extend their own tools or risk oblivion.
Database vendors, whose products sit at the center of most data architectures, felt the early brunt of the Hadoop market. With the arrival of Hive (a SQL interpreter that compiles SQL into Hadoop jobs), the data warehouse seemed to be under direct assault. Hive offers a SQL interface to a freely available storage and processing platform. What it doesn't provide is a database catalog, strong schema support, robust SQL, interactive response times or reasonable levels of interactive concurrency: all things needed in a data warehouse environment that delivers traditional BI functions. In this type of workload, Hadoop doesn't come close to what a parallel analytic database can achieve, including scaling this workload into the petabyte range.
Yet Hadoop offers features the database can't: extremely low-cost storage and retrieval, albeit through a limited SQL interface; easy compatibility with parallel programming models; extreme scalability for storing and retrieving data, provided the use isn't interactive, concurrent, complex querying; a flexible concept of schema (as in, there is no schema other than what you impose after the fact); processing over the stored data free of the limitations of SQL, constrained only by the MapReduce model; compatibility with public or private cloud infrastructures; and a free or support-only license, hence a price point far below that of databases.
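The MapReduce model mentioned above can be sketched in miniature. What follows is a plain-Python simulation of the map, shuffle and reduce phases, using word counting as the canonical example; the function names are illustrative, not Hadoop's actual API, and a real job would run each phase in parallel across a cluster.

```python
from collections import defaultdict

def map_phase(records):
    # Map: emit a (word, 1) pair for every word in every input line.
    for line in records:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all emitted values by key, as Hadoop does
    # between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: collapse each key's list of counts into a total.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["the quick brown fox", "the lazy dog", "The fox"]
result = reduce_phase(shuffle(map_phase(lines)))
```

The point of the model is that map and reduce are the only two places user code runs; everything between them (partitioning, sorting, grouping) is the framework's job, which is what makes the pattern trivially parallelizable.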
Data integration vendors pick up on the processing angle and see Hadoop as an impoverished ETL tool. It's essentially processing over a file system, a step back to the mainframe days of batch processing, albeit with parallelism added. It lacks user interfaces to make programming simpler and more accessible to a broad range of people. It lacks any concept of data management, except for anemic metadata support in the form of an optional catalog. All of these are important features that a data integration tool or platform offers.
Yet Hadoop offers the same scale and cost benefits here. It essentially replaces the ETL engine with something extremely low cost and highly scalable, provided you know how to code for it, which few do. This is a different style of coding, closer to old-school data processing than to application programming, and it takes some training. As open source languages and ETL products start using Hadoop as their engine, ETL vendors are at significantly more risk than database vendors, but the primitive nature of the platform limits its deployability in most IT shops. The lack of important data management features means Hadoop can't simply replace a data integration platform outright, any more than it can replace a database outright.
One advantage Hadoop has over data integration tools is that it's accessible from a variety of programming languages, which means it can be used for arbitrary parallel coding, such as complex analytics. Vendors in the analytics market view Hadoop as a primitive version of their products. It has no user interface, no real data management and no direct analytic capabilities. Any analytics run in Hadoop must be manually coded, integrated from various libraries or run via third-party tools. There are no visual components, tools or interfaces, only additional projects to layer on top of Hadoop. Hadoop doesn't offer a tenth of what SAS or SPSS does.
Yet Hadoop offers things the analytics platforms don't. Scalability over large data volumes at low cost is the element most often touted, but the flexibility of the platform offers far more. There are libraries of code for many common, and many not-so-common, algorithms. Different programming languages can be used to hand-code a new algorithm or to integrate different libraries. It takes most vendors more than a year to add a handful of new techniques to their products; the combination of open source analytics projects and Hadoop means a new technique is usually available in this environment first. The scalability, plus the ability to process and transform data, means much more can be done in this environment than in an analytics product. The challenge is that it takes more technical expertise, and for the most common practices that means more time and expense than simply buying a product.
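What "manually coded analytics" looks like in practice is worth one small sketch. Here a least-squares line fit, a one-click operation in SAS or SPSS, is expressed MapReduce-style: the map step turns each point into sufficient statistics (n, Σx, Σy, Σx², Σxy), and the reduce step merges them by addition, which is what lets the computation spread across partitions. Function names are hypothetical, not any library's API.

```python
from functools import reduce

def map_stats(point):
    # Map: each (x, y) point contributes its sufficient statistics.
    x, y = point
    return (1, x, y, x * x, x * y)

def reduce_stats(a, b):
    # Reduce: sufficient statistics merge by element-wise addition,
    # so partial results from any number of partitions can combine.
    return tuple(ai + bi for ai, bi in zip(a, b))

def fit_line(points):
    # Closed-form least squares from the aggregated statistics.
    n, sx, sy, sxx, sxy = reduce(reduce_stats, map(map_stats, points))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept

# Points lying on y = 2x + 1.
slope, intercept = fit_line([(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)])
```

Recasting an algorithm into commutative, associative aggregations like this is the expertise the paragraph refers to: it is exactly the work an analytics product's packaged procedure hides from its user.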
What all of these vendors overlook is that the Hadoop stack is a processing platform. It combines data storage, retrieval and programming into a single, highly scalable package. This marriage of capabilities is what makes Hadoop unique. It's possible to duplicate the old tasks of any one domain in it, though probably not as easily or efficiently. It is, however, possible to combine those tasks in new and interesting ways, run them over data that was never accessible in traditional environments and deliver them in a new architecture better suited to the distributed nature of cloud environments, public or private.
Hadoop provides new capabilities and overlaps with old ones. We will see a gradual shift of some workloads away from databases, integration tools and analytics packages. Other workloads will stay put because they are best served by those platforms. The hard part for IT and data architects today is understanding which parts of their workloads should move, and how to integrate the systems to coordinate data movement and processing.