If you don’t track the evolution of Hadoop, then you may not be aware of YARN. But if you have an interest, you should be. YARN is the major innovation in Hadoop 2.0, which is now available.
Here’s a brief summary:
The original Hadoop was a marriage of the MapReduce environment with HDFS (the scale-out distributed file system that is, in our view, responsible for Hadoop’s popularity). Prior to YARN, MapReduce was the primary means of getting at HDFS data and developing Hadoop applications. YARN changes this. It allows you to use Hadoop’s storage and cluster management capabilities without going through MapReduce, although of course you can still do that. YARN provides an accessible resource management layer over HDFS.
What this means in practice is that Hadoop is no longer just a MapReduce-based batch environment. You will be able to run many applications on it concurrently. The goal is to cater for streaming applications (for example, data being analyzed and acted upon as it is streamed into Hadoop for storage and later use), interactive applications (for example, OLTP usage) and Big Data applications (extensive queries against high volumes of data and associated analytics workloads). This makes genuine sense, because all of these distinct categories of application could easily have an interest in the same data.
Incidentally, in case you had ever been concerned about it, Hadoop 2.0 also removes the often-mentioned single point of failure (the HDFS NameNode). It needed to be done if Hadoop systems were going to offer a production-level service.
YARN Plus Tez
We also need to mention another Hadoop component, Tez, which is in beta now and expected to be available in Spring 2014. Tez, which is Hindi for “speed,” provides a customizable framework for low-latency and high-throughput workloads. It will enable interactive (human-scale) response times for Apache Hive and Apache Pig. In effect, any SQL query can be expressed as a single job using Tez, and Tez will allocate resources to address the workload in order to optimize speed of response or throughput. Think OLTP applications on Hadoop.
Hortonworks and Actian
YARN and Tez may strengthen Hortonworks’ position in the Hadoop distribution business, although it’s still “early days,” with YARN only just released and Tez still waiting in the wings. We need to see how the Hadoop community reacts. Nevertheless, Cloudera’s Impala may become a sideshow, although it will no doubt be integrated with YARN.
Commercial vendors will, in our opinion, pile in on top of YARN and Tez. They long ago realized that Hadoop wasn’t going to go away, and now they can get to HDFS data through YARN.
Soon after YARN’s release, Actian announced a partnership with Hortonworks to integrate its ParAccel Big Data Analytics Platform with the Hortonworks Data Platform. Actian has an obvious interest by virtue of its ParAccel Dataflow, ETL and data cleansing offerings. The goal is to provide a foundation for high-performance analytics directly over Hadoop in an end-to-end manner. A Hortonworks/Actian reference architecture is the immediate outcome of this collaboration. In practice this means that the drag-and-drop (i.e., codeless) development and accelerated parallel performance provided by ParAccel Dataflow will be available directly on top of YARN, and later, Tez.
We expect other vendors to follow this lead. Hortonworks has a fairly extensive partner network, and other vendors – Actuate, Datameer, Elasticsearch, Concurrent, RedPoint, Protegrity and others – have been quick to get involved.
The Destiny of Hadoop
When you think about it, Hadoop had a dramatic impact on the software industry despite the fact that it was a limited environment. It was bound to MapReduce and it only worked in a batch manner. While it could be employed as a “data lake,” an ETL environment and an analytics environment, it was very constrained in every one of these roles. Even so, it spawned a healthy ecosystem – not just of open source components, but also complementary commercial products.
With Hadoop 2.0 we expect this ecosystem to grow like bamboo in springtime.