What Hadoop Is. What Hadoop Isn’t.

The Hadoop stack is a data processing platform. It combines elements of databases, data integration tools and parallel coding environments into a new and interesting mix. The problem with the IT market today is that it distorts the view of Hadoop by looking at it as a replacement for one of these technologies. Database vendors see it as a database and challenge it on those grounds. Data integration vendors see it as an ETL tool and challenge it on those grounds. Analytics vendors see it as a replacement for their engines and challenge it through that lens. In doing so, each vendor community overestimates Hadoop’s potential to displace its product, while simultaneously underestimating the impact Hadoop will have on the environment and architecture that community operates in.

As Hadoop adoption grows, some vendors think they can subsume or contain it. This is as unlikely as Hadoop replacing databases and data warehouses, the bet other vendors are making. Some wish it would just go away so they can keep doing what they’ve always done. This is a natural state of affairs for any technology that offers both new capabilities and capabilities that overlap with existing technologies and products. The reality of the market is that the technology will settle in the areas where it offers new capabilities or a more effective or efficient replacement of the old. Vendors with products in areas of significant overlap need to integrate in new ways, extend their own tools or risk oblivion.

Database vendors, whose products sit at the center of most data architectures, felt the early brunt of the Hadoop market. With the arrival of Hive (a SQL interpreter that compiles SQL into a Hadoop job), the data warehouse seemed to be under direct assault. Hive offers a SQL interface to a freely available storage and processing platform. What it lacks is a database catalog, strong schema support, robust SQL, interactive response times and reasonable levels of interactive concurrency − all things needed in a data warehouse environment that delivers traditional BI functions. For this type of workload, Hadoop doesn’t come close to what a parallel analytic database can achieve, including scaling the workload into the petabyte range.
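To make that trade-off concrete, here is a minimal sketch of driving Hive from Python through its command-line interface. The table, columns and query are hypothetical, and the snippet is an illustration rather than a recommended pattern: the point is that a familiar SQL statement gets compiled into batch MapReduce jobs, with startup latency measured in seconds or minutes rather than the sub-second responses a warehouse can deliver.

```python
import subprocess

# A hypothetical aggregate query over a web_logs table. Hive compiles
# this SQL into one or more MapReduce jobs, so even a simple GROUP BY
# runs as a batch job rather than an interactive query.
query = """
    SELECT page, COUNT(*) AS hits
    FROM web_logs
    GROUP BY page
    ORDER BY hits DESC
    LIMIT 10;
"""

# 'hive -e' runs a query string from the shell; results print to stdout.
subprocess.run(["hive", "-e", query], check=True)
```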

Yet Hadoop offers features the database can’t: extremely low cost storage and retrieval, albeit through a limited SQL interface; easy compatibility with parallel programming models; extreme scalability for storing and retrieving data, provided it isn’t for interactive, concurrent, complex query use; a flexible concept of schema (as in, there is no schema other than what you impose after the fact); processing over the stored data free of SQL’s limitations, constrained only by the MapReduce model; compatibility with public or private cloud infrastructures; and a free or support-only cost model, a price point far below that of databases.
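The processing point − computation constrained only by the MapReduce model − deserves an illustration. Below is a minimal sketch using Hadoop Streaming, which lets any executable act as mapper or reducer; this is the canonical word-count example rather than anything specific to this article, and the launch command at the end uses hypothetical paths.

```python
#!/usr/bin/env python
# mapper.py - emit (word, 1) for every word read from stdin.
import sys

for line in sys.stdin:
    for word in line.split():
        print("%s\t1" % word)
```

```python
#!/usr/bin/env python
# reducer.py - sum the counts for each word. Hadoop sorts mapper output
# by key, so all lines for a given word arrive consecutively.
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t", 1)
    if word != current_word:
        if current_word is not None:
            print("%s\t%d" % (current_word, count))
        current_word, count = word, 0
    count += int(value)
if current_word is not None:
    print("%s\t%d" % (current_word, count))
```

A job like this is typically launched with the streaming jar, along the lines of hadoop jar hadoop-streaming.jar -input /data/in -output /data/out -mapper mapper.py -reducer reducer.py (the jar’s name and location vary by distribution and version).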

Data integration vendors pick up on the processing angle and see Hadoop as an impoverished ETL tool. It’s essentially processing over a file system, a step back to the mainframe days of batch processing, albeit with parallelism added. It lacks user interfaces to make programming simpler and more accessible to a broad range of people. It lacks any concept of data management, except for anemic metadata support in the form of an optional catalog. All are important features that a data integration tool or platform offers.

Yet Hadoop offers the same scale and cost benefits here. It essentially replaces the engine of ETL with something that is extremely low cost and high scale, provided you know how to code for it, which few do. This is a different style of coding, more like old-school data processing than application processing, and takes some training. As open source languages and ETL products start using Hadoop as the engine, ETL vendors are at significantly more risk than database vendors, but the primitive nature of the platform limits its deployability in most IT shops. The lack of important data management features means that Hadoop can’t simply replace a data integration platform outright, any more than it can replace a database outright.
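To make “replacing the engine of ETL” concrete, here is a minimal sketch of a map-only Hadoop Streaming job that cleans raw comma-delimited records: it drops malformed rows and emits only the fields downstream jobs need. The four-field layout is hypothetical. Note what is absent compared with a data integration tool: no GUI, no lineage, no metadata − just code.

```python
#!/usr/bin/env python
# clean_logs.py - a map-only ETL step for Hadoop Streaming.
# Hypothetical input: one "timestamp,user_id,url,status" record per line.
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split(",")
    if len(fields) != 4:
        continue                 # drop malformed records
    timestamp, user_id, url, status = fields
    if not status.isdigit():
        continue                 # drop rows with a non-numeric status code
    # Emit only the fields downstream jobs need, tab-delimited.
    print("\t".join([timestamp, user_id, url]))
```

Run with the reduce phase disabled (in classic Hadoop, -D mapred.reduce.tasks=0) so the cleaned mapper output is written straight to the output directory.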

One advantage Hadoop has over data integration tools is that it’s accessible to a variety of programming languages, which means it can be used for any arbitrary parallel coding, like complex analytics. The vendors in the analytics market view Hadoop as a primitive version of their products. It has no user interface, no real data management, no direct analytic capabilities. Any analytics run in Hadoop must be manually coded or integrated from various libraries or run via third-party tools. There are no visual components, tools or interfaces, only additional projects to integrate on top of Hadoop. Hadoop doesn’t offer a tenth of what SAS or SPSS does.

Yet Hadoop offers things the analytics platforms don’t. Scalability over large data volumes at low cost is the element most often touted, but the flexibility of the platform offers far more. There are many libraries of code for common, and many not so common, algorithms. Different programming languages can be used to hand-code a new algorithm or integrate different libraries. It takes most vendors more than a year to add a handful of new techniques to their products; the combination of open source analytics projects and Hadoop means a new technique is usually available in this environment first. The scalability and the ability to process and transform data mean that much more can be done in this environment than in an analytics product. The challenge is that it takes more technical expertise, and for most common tasks that means more time and expense than simply buying a product.
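As one small example of hand-coding an analytic instead of waiting for a vendor to ship it: per-group mean and variance can be computed in a single MapReduce pass by accumulating sufficient statistics (count, sum and sum of squares) in the mapper and combining them in the reducer. The “group,value” input layout here is hypothetical.

```python
#!/usr/bin/env python
# stats_mapper.py - emit sufficient statistics for each input record.
# Hypothetical input: one "group,value" record per line.
import sys

for line in sys.stdin:
    group, value = line.rstrip("\n").split(",")
    x = float(value)
    print("%s\t1,%f,%f" % (group, x, x * x))
```

```python
#!/usr/bin/env python
# stats_reducer.py - combine the statistics and derive mean and variance.
import sys

def flush(group, n, s, s2):
    mean = s / n
    variance = s2 / n - mean * mean   # population variance
    print("%s\t%f\t%f" % (group, mean, variance))

current, n, s, s2 = None, 0, 0.0, 0.0
for line in sys.stdin:
    group, stats = line.rstrip("\n").split("\t")
    cn, cs, cs2 = stats.split(",")
    if group != current:
        if current is not None:
            flush(current, n, s, s2)
        current, n, s, s2 = group, 0, 0.0, 0.0
    n += int(cn)
    s += float(cs)
    s2 += float(cs2)
if current is not None:
    flush(current, n, s, s2)
```

Because the statistics are additive, the same mapper output can also feed a combiner to cut shuffle volume − the kind of optimization an analytics product hides but Hadoop leaves to the programmer.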

What’s overlooked by all of these vendors is that the Hadoop stack is a processing platform. It combines data storage, retrieval and programming into a single highly scalable package. This marriage of capabilities is what makes Hadoop unique. It’s possible to duplicate old tasks in any domain, but probably not as easily or efficiently. It is, however, possible to combine those tasks in new and interesting ways, run them over data that was never accessible in traditional environments and deliver them in a new architecture better suited to the distributed nature of cloud environments, public or private.

Hadoop provides new capabilities and overlaps with old. We will see a gradual shift of some workloads away from database, integration tools or analytic packages. Other workloads will stay in place because they are best served by those platforms. The hard part for IT and data architects today is understanding what parts of their workloads should move, and how to integrate the systems to coordinate data movement and processing. 

Mark Madsen

About Mark Madsen

Mark, president of Third Nature, is a former CTO and CIO with experience working in both IT and vendors, including a stint at a company used as a Harvard Business School case study. Over the past decade Mark has received awards for his work in data warehousing, business intelligence and data integration from the American Productivity & Quality Center, the Smithsonian Institute and TDWI. He is co-author of “Clickstream Data Warehousing” and lectures and writes about business intelligence, emerging technology and data integration.

Responses to "What Hadoop Is. What Hadoop Isn’t."

  • Shalin Shah
    December 12, 2012 - 11:38 am

    Big Data refers to the massive amounts of highly structured and loosely structured data that is both “at rest” and “in motion.” The analysis of Big Data presents a tremendous opportunity to gain competitive advantage through better business and customer insight. However, most Big Data approaches are only able to analyze Big Data when it is at rest (i.e., persistent data). This means that only a fraction of the available data is analyzed, to the exclusion of the insights that could be derived from Big Data in motion (i.e., streaming data). Big Data in motion includes data from smart grid meters, RSS feeds, computer networks and social media sites. Agile organizations require insight into all available data sources. Even more so, they need these insights in time to gain a competitive advantage.

    Operational Intelligence provides real-time insight into both Big Data at rest and Big Data in motion. An Operational Intelligence platform analyzes Big Data from a wide variety of sources including Web feeds, legacy applications, and of course, Hadoop implementations.

  • Dan Linstedt
    February 6, 2013 - 10:43 am

    Hi Mark, Excellent insights once again. Thank you for this informative post. I agree with your statements and comparisons. I’d like to add a point to this discussion if I may…

    Processing Time.

    Today’s analytical departments expect sub-second response times to ad-hoc queries. This simply doesn’t happen when attempting to get data from a Hadoop+Hive platform. It takes just a bit longer to a) generate the code for the ad-hoc request, b) run the code across the distributed platform, and c) correlate and return the information.

    Businesses need to understand that Hadoop generally does not return sub-second responses to ad-hoc queries. Hence the vendors providing hybrid NoSQL (not only SQL) solution sets (like Cloudera, Cassandra, and others).

    I would also like to point out that tools like Pentaho and Informatica have begun to address the Big Data space by offering GUI development layers that generate “push down” Hadoop code from a drag-and-drop object design. These are wonderful advances, and will certainly add longevity to the traditional ETL tooling. But yes, I agree with you, part of the value proposition of Hadoop is to process unstructured and semi-structured data sets. Many of these ETL tools are rigidly tied to fixed structures − and must adapt to unstructured data in order to survive.

    Just my two cents,
    Dan Linstedt

  • Jinsu
    April 17, 2013 - 3:01 am

    I’m a student with an interest in big data and Hadoop. This article clarifies many of the grey areas regarding big data analysis and has helped me understand what Hadoop really is. Thank you for sharing your insights!

  • Raghu
    September 2, 2013 - 8:11 am

    Very useful analysis. Thank you.

  • Pierce Lamb
    October 8, 2013 - 12:24 pm

    Piggybacking on some of the comments here: I agree that enterprises have a need to cheaply analyze operational data. The emerging ‘in-memory data grid’ + Hadoop technology is meant to address this need. The idea is to combine all the in-memory computing innovations of the past decade with Hadoop technology so you can write your MapReduce code once and analyze both historical, static data and live, fast-changing data. Since in-memory data grids can connect to most data sources and can store data as CRUD-able objects in the grid, one can now run standard MapReduce code on operational data.

    Some of the products meeting or on their way to meeting this need now:

    ScaleOut hServer (disclaimer, I’m a ScaleOut employee):
    http://www.scaleoutsoftware.com/hserver/

    GridGain Hadoop Connector:
    http://www.gridgain.com/products/in-memory-hadoop-accelerator/

    Terracotta BigMemory Hadoop Connector:
    http://blog.terracotta.org/2013/04/02/hadoop-bigmemory-run-elephant-run/

    Pivotal:
    http://blog.gopivotal.com/products/in-memory-data-grid-hadoop-integrated-real-time-big-data-platform-previewed-at-springone-2gx-2013
