YARN: Weaving the Future of Hadoop

If you don’t track the evolution of Hadoop, then you may not be aware of YARN. But if you have an interest, you should be. YARN is the major innovation in Hadoop 2.0, which is now available.

Here’s a brief summary:

The original Hadoop was a marriage of the MapReduce environment with HDFS (the scale-out distributed file system that is, in our view, responsible for Hadoop’s popularity). Prior to YARN, MapReduce was the primary means of getting at HDFS data and developing Hadoop applications. YARN changes this. It allows you to use Hadoop’s storage and cluster management capabilities without going through MapReduce, although of course, you can still do that. YARN provides an accessible resource management layer over HDFS.

What this means in practice is that Hadoop is no longer just a MapReduce-based batch environment. You will be able to run many applications on it concurrently. The goal is to cater for streaming applications (for example, data being analyzed and acted upon as it is streamed into Hadoop for storage and later use), interactive applications (for example, OLTP usage) and Big Data applications (extensive queries against high volumes of data and associated analytics workloads). And this makes genuine sense, because all these distinct categories of application could easily have an interest in the same data.
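To make the idea of a shared resource layer concrete, here is a minimal toy sketch in Python. It is not the YARN API: the class `ToyResourceManager`, its methods and the application names are all hypothetical, invented purely for illustration. It only shows the principle that batch, streaming and interactive workloads can draw on one pool of cluster memory instead of each workload owning the cluster.

```python
from dataclasses import dataclass, field


@dataclass
class ToyResourceManager:
    """Toy stand-in for a cluster-wide resource manager (hypothetical, not YARN's API)."""
    total_memory_mb: int
    allocations: dict = field(default_factory=dict)  # app name -> MB granted

    @property
    def free_memory_mb(self) -> int:
        # Memory not currently granted to any application.
        return self.total_memory_mb - sum(self.allocations.values())

    def request(self, app: str, memory_mb: int) -> bool:
        """Grant the request only if enough memory is still free."""
        if memory_mb <= 0 or memory_mb > self.free_memory_mb:
            return False
        self.allocations[app] = self.allocations.get(app, 0) + memory_mb
        return True

    def release(self, app: str) -> None:
        """Return everything held by an application to the shared pool."""
        self.allocations.pop(app, None)


# Three different kinds of workload share one 16 GB pool.
rm = ToyResourceManager(total_memory_mb=16_384)
granted = {
    "mapreduce-batch": rm.request("mapreduce-batch", 8_192),
    "stream-ingest": rm.request("stream-ingest", 4_096),
    "interactive-sql": rm.request("interactive-sql", 4_096),
    "late-arrival": rm.request("late-arrival", 1_024),  # pool is exhausted
}
print(granted)
print("free MB:", rm.free_memory_mb)
```

In this sketch the fourth request is refused until one of the earlier applications releases its memory. A real scheduler is of course far richer (queues, priorities, CPU as well as memory), so treat this only as a picture of the sharing principle.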

Incidentally, in case you had ever been concerned about it, Hadoop 2.0 also addresses the often mentioned single point of failure, the HDFS NameNode, which can now run in a high-availability configuration. It needed to be done if Hadoop systems were going to offer a production-level service.

YARN Plus Tez

We also need to mention another Hadoop component, Tez, which is in beta now and expected to be available in Spring 2014. Tez, which is Hindi for “speed,” provides a customizable framework for low-latency and high-throughput workloads. It will enable interactive (human-scale) response times for Apache Hive and Apache Pig. In effect, any SQL query can be expressed as a single job using Tez, and Tez will allocate resources to the workload to optimize either speed of response or throughput. Think OLTP applications on Hadoop.

Hortonworks and Actian

YARN and Tez may strengthen Hortonworks’ position in the Hadoop distribution business, although it’s still “early days,” with YARN only just being released, and Tez still waiting in the wings. We need to see how the Hadoop community reacts. Nevertheless Cloudera’s Impala may become a sideshow, although it will no doubt be integrated with YARN.

Commercial vendors will, in our opinion, pile in on top of YARN and Tez. They long ago realized that Hadoop wasn’t going to go away, and now they can get to HDFS data through YARN.

Soon after YARN’s release, Actian announced a partnership with Hortonworks. Actian has an obvious interest by virtue of its ParAccel Dataflow, ETL and data cleansing offerings, and it moved quickly to integrate its ParAccel Big Data Analytics Platform with the Hortonworks Data Platform. The goal is to provide a foundation for high-performance analytics directly over Hadoop in an end-to-end manner. A Hortonworks/Actian reference architecture is the immediate outcome of this collaboration. In practice this means that the drag-and-drop (i.e., codeless) development and accelerated parallel performance provided by ParAccel Dataflow will be available directly on top of YARN, and later, Tez.

We expect other vendors to follow this lead. Hortonworks has a fairly extensive partner network and other vendors – Actuate, Datameer, Elasticsearch, Concurrent, RedPoint, Protegrity and others – have been quick to get involved.

The Destiny of Hadoop

When you think about it, Hadoop had a dramatic impact on the software industry despite the fact that it was a limited environment. It was bound to MapReduce and it only worked in a batch manner. While it could be employed as a “data lake,” an ETL environment and an analytics environment, it was very constrained in every one of these roles. Even so, it spawned a healthy ecosystem – not just of open source components, but also complementary commercial products.

With Hadoop 2.0 we expect this ecosystem to grow like bamboo in springtime.

Robin Bloor

About Robin Bloor

Robin is co-founder and Chief Analyst of The Bloor Group. He has more than 30 years of experience in the world of data and information management. He is the creator of the Information-Oriented Architecture, which is to data what the SOA is to services. He is the author of several books, including The Electronic B@zaar, From the Silk Road to the eRoad (a book on e-commerce) and three IT books in the Dummies series on SOA, Service Management and The Cloud. He is an international speaker on information management topics. As an analyst for Bloor Research and The Bloor Group, Robin has written scores of white papers, research reports and columns on a wide range of topics, from database evaluation to networking options and comparisons to the enterprise in transition.

7 Responses to "YARN: Weaving the Future of Hadoop"

  • ramesh indraghanti
    November 25, 2013 - 1:23 pm

    Robin,
    Very interesting to realise that YARN is what will truly take the Hadoop platforms to support the transactional semantics of the EDWs (you even take it to the OLTP world!!) with real-time high concurrency and low latency. I thought this was a reality that was 2-3 years away, but apparently not, judging from your article!!

    It would be interesting to hear which commercial vendor offerings, besides Hortonworks, are on Hadoop 2.0 and support or integrate with YARN. Any thoughts?

    Looking forward to more insights into YARN/Tez
    Thanks
    Ramesh

  • Matt Brandwein
    November 25, 2013 - 1:42 pm

    Ramesh,

    Cloudera already includes YARN and Impala will run on YARN in CDH 5. Cloudera employees actively participate in the YARN and Hive communities as well. For true interactive SQL in production environments today, Impala has seen dramatic industry adoption, including among other Hadoop distributions such as MapR and Amazon EMR, and still maintains a 10X+ performance advantage over even the latest version of Hive.

    It’s worth noting that it’s been possible to run multiple non-batch workloads on top of HDFS for some time now; YARN simplifies the scheduling and resource management. Examples: HBase has been running online applications in Hadoop for years, and Cloudera Impala and Cloudera Search already run production workloads directly on data in HDFS. YARN will help further accelerate bringing new applications to shared data.

    Cheers,
    Matt

    • Eric Kavanagh
      November 29, 2013 - 9:02 am

      Thanks so much for providing some great detail! We look forward to getting a detailed briefing from Cloudera soon!

  • Murali
    December 2, 2013 - 9:33 am

    “Nevertheless Cloudera’s Impala may become a sideshow,…”

    That’s a very strong line… Thoughts, Cloudera?

  • Amit
    December 12, 2013 - 4:23 am

    Robin,
    Very good overview of YARN and Tez. In 2014, we will see many such innovations and new concepts in Hadoop components. I think we are moving in the direction of replacing the basic database/RDBMS with new technology that provides quicker or similar performance and integrity on Big Data.
