
Big Data, Huge Hype: Courtesy of Cloudera

If there were awards for IT marketing hype, then Cloudera would be hard to beat for the 2012 prize, even though there’s more than 6 months left in the year. To be fair, the company has the distinct advantage of being in both the Cloud business and the Big Data business, where well-tended hype has become a natural part of the foliage, so a sprig or two of hype is to be expected. Nevertheless, in my view, Cloudera has, with the embedded video that can currently be found on its home page, comprehensively outperformed the competition for the 2012 hype awards.

If you want the full multimedia experience, including the animation and the music, you’ll have to visit the web page itself. I provide a rough cut of the wording below.

You have 10 times more data than you did 3 years ago. But do you know 10x more about your business? Limitless insights. Variety, Velocity, Volume – permanently overwhelm traditional platforms. You’re only profiting from a fraction of your data. Apache Hadoop: A platform for all your data. Store, process and analyze data in any format, any quantity, quickly. Free from rigid schemas. Perform operations in parallel. With maximum flexibility. Scale economically on your hardware of choice. Integrate seamlessly. (databases, OS, ETL, storage, security, management systems…) Helping you profit from all your data, understanding your customers, markets, products, operations, etc.

Taking it point by point…

You have 10 times more data than you did 3 years ago.

No we don’t. This is a huge exaggeration. Consider some recent estimates:

  • IDC estimates a CAGR of 40% for data storage. That would suggest less than 3 times as much data as three years ago.
  • Dell: Unstructured data (videos, sound, images, text) growth expected to continue at a compound annual rate exceeding 60% by some estimates. This suggests 4 times as much data as 3 years ago.
  • Sepaton: Data growth continues unabated with fifty percent of respondents reporting their data was growing from twenty to seventy percent annually and an additional twenty percent reported even higher annual growth rates. Without getting too mathematical, this suggests pretty much the same data growth rate as Dell. So, 4 times as much data as 3 years ago.
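
For anyone who wants to check the arithmetic behind these estimates, here is a minimal Python sketch of compound growth over three years. The function name `growth_multiple` and the `estimates` dictionary are my own labels for illustration. The rates are the ones quoted above, with Sepaton's twenty to seventy percent range shown only by its two end points.

```python
def growth_multiple(annual_rate: float, years: int = 3) -> float:
    """Return how many times larger a quantity is after compounding annually."""
    return (1.0 + annual_rate) ** years


# Annual growth rates taken from the estimates quoted above.
# The Sepaton entries are just the two ends of its reported range.
estimates = {
    "IDC (40%)": 0.40,
    "Dell (60%)": 0.60,
    "Sepaton, low end (20%)": 0.20,
    "Sepaton, high end (70%)": 0.70,
}

for source, rate in estimates.items():
    print(f"{source}: {growth_multiple(rate):.2f}x after 3 years")
```

The IDC rate comes out at roughly 2.7 times and the Dell rate at roughly 4.1 times, which is where the "less than 3 times" and "4 times" figures come from.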

Cloudera would have us believe that the data growth rate is over 115% per year. In reality, some companies are experiencing huge data growth rates of that ilk – big companies like Amazon, Facebook, Twitter and so on. The telcos probably have big growth rates too. But such growth rates are the exception. By the way, ever heard of the bell curve (the normal distribution)? Well, the companies experiencing growth of this kind are in the thin tail at the right edge of the bell curve.
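
The 115% figure is simply the annual rate you need in order to reach 10 times the data in 3 years. A minimal sketch of that calculation follows. The function name `implied_annual_rate` is an illustrative choice of mine, not something taken from Cloudera or any of the sources above.

```python
def implied_annual_rate(multiple: float, years: int) -> float:
    """Return the compound annual growth rate needed to reach `multiple` in `years`."""
    return multiple ** (1.0 / years) - 1.0


# Cloudera's claim: 10 times more data than 3 years ago.
rate = implied_annual_rate(10.0, 3)
print(f"Implied annual growth: {rate:.1%}")  # prints "Implied annual growth: 115.4%"
```

Compare that with the 40% to 60% per year in the estimates quoted earlier.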

But do you know 10x more about your business?

Even if you had such growth rates, what the hell does this question even mean? It’s not about volume of information, it’s about the quality. Truth is, the valuable information about a business was the first thing that BI and Analytics were pointed at. You always go after the low-hanging fruit first. After that you probably get diminishing returns. But scaling out technology does make it possible to analyze data that was not analyzed before, so maybe there is some low-hanging fruit in Big Data. But please, 10x? Beyond ridiculous.

Limitless insights.

If they are truly limitless, we’ll never be able to access all those insights because there are far too many of them. We’ll be drowned in insights.

Variety, Velocity, Volume – permanently overwhelm traditional platforms. 

No they don’t. Traditional platforms can do well with variety. Not too bad at data velocity either, although when you have very fast data that needs analyzing immediately, then Hadoop is no good whatsoever. Neither are traditional platforms. You need a non-traditional CEP platform. As for volume, there are indeed volume challenges for traditional platforms, but the NewSQL and NoSQL databases seem to handle them well.

You’re only profiting from a fraction of your data.

True. If there is a single item of data anywhere that is not generating profit then this is true by definition, and it is probably so, and it will still be so if you deploy Hadoop.

Apache Hadoop: A platform for all your data. 

It absolutely is not a platform for all your data. This is such a ridiculous suggestion that it’s difficult to know where to start. Suffice it to say that companies are not throwing out their data warehouses and putting them onto Hadoop. Any company that is will, I expect, be searching for a new CIO after that project fails.

Store, process and analyze data in any format, any quantity, quickly. 

Just downright irresponsible. Hadoop can do some things quickly because of parallel scale-out. It has been built to do one query at a time. A concurrency of one. How is that quick? It isn’t. HBase improves matters a bit, enabling a low level of OLTP and turning Hadoop into a BigTable column store, but it would be severely embarrassed by a traditional RDBMS for OLTP.  And any format? This makes it sound as though Hadoop has some wonderfully rich data structuring capability – and, as far as I know, it has nothing better than HBase. In any event it was always possible to accommodate any format in any database that supported Blobs, simply by storing the data as a Blob and getting at it programmatically. In other words it would not be defined in metadata. Sure, Hadoop can do that kludgy thing too. Almost anything can.
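
To make the Blob point concrete, here is a minimal sketch using Python's built-in `sqlite3` module as a stand-in for "any database that supported Blobs". The table name `documents`, the column `payload` and the sample record are hypothetical and chosen only for illustration. The database stores opaque bytes and knows nothing about their structure, so the program has to decode them itself.

```python
import json
import sqlite3

# An in-memory database standing in for any Blob-capable RDBMS.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, payload BLOB)")

# Any format will do as long as it can be serialized to bytes; JSON is used here.
record = {"customer": "example", "clicks": [3, 1, 4]}
conn.execute(
    "INSERT INTO documents (payload) VALUES (?)",
    (json.dumps(record).encode("utf-8"),),
)

# The schema says only "BLOB"; the structure is recovered programmatically.
(raw,) = conn.execute("SELECT payload FROM documents WHERE id = 1").fetchone()
decoded = json.loads(raw.decode("utf-8"))
print(decoded["clicks"])
```

Nothing about the record's fields is defined in the database's metadata, which is exactly the kludge described above.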

Free from rigid schemas.

This pretty much damns the idea of a schema, having previously stated that Hadoop can accommodate data in any format. It also damns the idea of making data available to all applications through one of those horrible rigid schemas. So does Hadoop have a schema? Well, if you implement Hive you get a metadata repository. One can only presume that what this provides is a flimsy, rather than a rigid, schema.

Perform operations in parallel.

I hate to break this to Cloudera, but databases have been performing operations in parallel for decades.

With maximum flexibility.

I don’t think so – if this is about parallelism. Hadoop provides data parallelism but not process parallelism. It doesn’t offer the maximum flexibility of parallelism. However, MapReduce is effective for data segmentation functionality – as long as you know how to program it. Oh, didn’t they tell you? MapReduce programming is not as simple as writing Excel macros. You need skilled staff. And nearly all the experienced ones have jobs.
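
To give a flavor of the programming model, here is a toy, single-process Python sketch of the map, shuffle and reduce pattern applied to a word count. It is not Hadoop's API, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are my own. A real job also has to deal with input splits, serialization, cluster configuration and failures, none of which appear here.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key, as the framework does between map and reduce."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Collapse the values collected for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence on one machine."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data big hype", "hype scales"])
print(counts)
```

Even this trivial task has to be recast as key and value pairs flowing through separate phases, which is a long way from an Excel macro.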

Scale economically on your hardware of choice.

Hadoop is not the most economical platform for scaling. Period. You can pay more, true. You can also pay less.

Integrate seamlessly. (databases, OS, ETL, storage, security, management systems…)

The IT industry has never ever achieved seamless integration. Perhaps by “seamless” they mean it will seem less than you’d hoped for.

Helping you profit from all your data, understanding your customers, markets, products, operations, etc.

Yeah, yeah, yeah. But this is not as egregious as what precedes it, so we’ll let this one pass (you don’t profit directly from data unless that’s what you sell).

Bottom Line

In my opinion, Apache has done a wonderful job in assembling a functional set of Open Source components around Hadoop. It should be applauded. In its distribution, Cloudera provides all of the important components: Apache Hadoop, Flume, HBase, Hive, Oozie, Pig, Sqoop, Whirr, Zookeeper, DFS Module and the Hue Browser-based desktop interface.

Sadly, it has also added another component: Hadoop Marketing Hype. This component scales up and scales out impressively. We don’t know if the current version has ever been properly benchmarked, but we suspect that there is no limit to the number of outrageous claims that it can process and present to the unsuspecting customer.


8 Responses to "Big Data, Huge Hype: Courtesy of Cloudera"

  • Steve Ardire
    May 24, 2012 - 7:42 pm

    Well done !

  • John Furrier
    May 25, 2012 - 9:27 am

    Your misguided post actually complements Cloudera when you actually tried to trash them. They just hired a VP of marketing like a month ago. That hype you say is actually endorsement by the community and marketplace. Oh and you’re data is is wrong IDC numbers are not accurate. Wikibon.org has the latest accurate numbers.

    I discovered this post from our Hadoop system that I build – one gathering insights never before possible. Sir your post is total slander of Cloudera. Many points that you string together that appears as fact but is indeed total fiction.

    I could take this post and substitute “horse and buggy” to the upcoming new development of the “automobile”. This post is absolutely ridiculous. What’s worse is that you present it like you have actual inside knowledge of the business.

    Cloudera is the opposite of hype. Fact is they didn’t have a vp of marketing for over 1.5 years. 3 employees actually. You are mistaking what appears as “hype” with actual performance in the community. Cloudera is leading on all metrics in this new emerging category. Hortonworks was another company that tried big marketing tactics then backed down to just put out code and now they are performing well. The community keeps things honest not posts like this.

    I give you the points on the Datawarehouse market not going away soon (like OLTP etc) but the Hadoop economics suggest that disruption to those legacy systems both at a economic and technical performance level will be in market soon.

    • Robin Bloor
      May 29, 2012 - 2:50 pm

      In response:

      You seem a little upset.

      The word, by the way, is compliment (meaning to say something positive) rather than complement (meaning to add something that improves). I was careful only to criticize Cloudera marketing. And I didn’t so much compliment Cloudera as compliment the Apache components that make up the Hadoop ecosystem. Apache has done well.

      If Cloudera has only just hired a VP of marketing then the marketing hype in that video is likely not his fault.

      Couldn’t find the figures you refer to on Wikibon.org. Did find a figure suggesting that the big data market is growing at roughly 34%, most of which is a forward projection. Anyway, that’s less than the IDC 40% growth rate I quoted. And by the way, I quoted 3 different sources of data growth estimates, IDC being the lowest. I’d be very interested to see the Wikibon.org data growth statistic of 10x in 3 years, if such exists.

      Nonetheless, I did find the following sentence on Wikibon, which I find rather mystifying: “What’s critical to realize is that 35% more digital information is created today than the capacity exists to store it; and this number will jump to over 60% over the next several years.”

      If 35% more data is being created today than the capacity exists to store it, then where the hell is this data being created? You cannot create data without it being stored somewhere. In fact what does that sentence actually mean? I’ll not be using that site as an information source if that’s typical of its articulation.

      By the way, my post is not a slander of Cloudera, and never could be. Allow me to advise: the word slander refers to something spoken. The word you may have been searching for was “libel,” which refers to written words. Hopefully this piece of advice will help you improve your use of the English language.

  • Ellie Kesselman
    May 25, 2012 - 8:48 pm

    Delightful reading!

    Flume, HBase, Hive, Oozie, Pig, Sqoop, Whirr, Zookeeper

    and

    MapReduce programming is not as simple as writing Excel macros.

    Guess what: A lot of the world still uses Excel for day-to-day analysis. I still don’t understand how a database can be designed and actually used without entity relationship diagrams and a data dictionary. Agreed, a metadata repository DOES seem flimsy. But no one ever talks about that. Or performance benchmarks of the dreary functional sort.

    Big data hype scales up VERY well though!

  • Andrew Wright
    May 29, 2012 - 8:09 am

    Classic Bloor’isms (or perhaps they are Robin’isms) & spot on the money!

    Great piece …

  • Phil Cooper
    May 30, 2012 - 6:44 am

    The trouble is that far more people will watch the Cloudera video than will read this article, and many of those will doubtless go on to engage with them in the belief that the company is as good as their marketing spiel.

  • Wayne Kurtz
    June 5, 2012 - 12:24 pm

    Robin, I agree with you about the over-the-top nature of the Cloudera marketing pitch. In my experience, however, the really savvy IT decision makers have become almost immune to such pitches and actually have come to expect them. The louder the din, the more stridently a company has to pitch in order to be heard. It’s true of politics and true of business, which means someone has to understand what really can and cannot be done with specific products. Not an easy task, but it’s how we earn our bread. Thank you for the insight; I’m glad someone points out the absurdity of uncontrolled hype.
