Inside Analysis

The Long and Winding Road to Open-Source Analytics

How do you eat an elephant? One bite at a time! That’s the advice afforded by optimists everywhere, and it might just be particularly appropriate in the world of open-source analytics. Hadoop — represented by that cute elephant of Doug Cutting’s family fame — embodies not just a platform for analyzing data. In fact, it has grown to incorporate an entire ecosystem of open-source projects which collectively threaten the status quo in the data-driven world.

The long-standing center of gravity in the universe of data remains the enterprise data warehouse. Prodigious, expensive, carefully wrought — this remarkable feat of relational engineering has fed business analysts for decades. Like all legacy software, however, the traditional data warehouse has gotten somewhat long in the tooth. And perhaps more critically, many data warehousing programs have tended to calcify, with politics often driving policy.

That story is nothing new. More than a decade ago, visionaries like Luke Lonergan of Greenplum and Foster Hinshaw of Netezza identified what became a massive opportunity: data warehouse appliances. Yours truly interviewed both gentlemen at that time, and they were decidedly sanguine about the opportunity they were addressing. When asked what would prevent someone else from building such a solution, Lonergan told me: “Nothing.”

 There was another visionary who was hot on the trail of something exciting back then. Dr. Michael Stonebraker of Vertica was actively promoting something considered esoteric at the time: column-oriented database technology. Though Sybase IQ had been in production since 1995, the market was still dominated by the relational databases offered by Oracle, IBM and Microsoft. Stonebraker told me in a 2005 interview that the “one size fits all” mantra would soon bend to a much more diverse field of database products. Boy, was he right!

Stonebraker’s Vertica went on to capture serious market share in the analytics space. Now part of the HPE family, it continues to deliver high-powered analytics in just about every industry. The value proposition is simple: columns contain data that is similar, and can thus be compressed much more effectively than rows. Compressed columns mean less data to read for each query, which creates the speed necessary to crunch numbers quickly and effectively, thus enabling analysts to ask hard questions and get answers in rapid fashion.
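
To make the compression argument concrete, here is a minimal sketch in Python. It uses run-length encoding, one simple compression scheme chosen purely for illustration; nothing above says which schemes Vertica itself applies. The sample table, its column meanings (region, status, year) and the helper name run_length_encode are hypothetical, and the rows are assumed to be stored in sorted order so that similar values sit next to each other.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive repeated values into (value, count) pairs."""
    return [(value, sum(1 for _ in group)) for value, group in groupby(values)]


# Hypothetical sample table: (region, status, year), assumed sorted by region.
rows = [
    ("east", "open", 2015),
    ("east", "open", 2015),
    ("east", "closed", 2015),
    ("west", "closed", 2015),
    ("west", "closed", 2016),
    ("west", "closed", 2016),
]

# Row-oriented layout: the values of each record are stored side by side,
# so neighbouring values come from different columns and rarely repeat.
row_stream = [value for row in rows for value in row]

# Column-oriented layout: each column is stored contiguously,
# so neighbouring values come from the same column and often repeat.
column_stream = [value for column in zip(*rows) for value in column]

row_runs = run_length_encode(row_stream)
column_runs = run_length_encode(column_stream)

print(len(row_stream), "values in either layout")
print(len(row_runs), "runs when stored by row")
print(len(column_runs), "runs when stored by column")
```

Both layouts hold the same values, but only the column layout places similar values next to each other, which is what lets a simple encoder collapse them. Real data is rarely this tidy, so the ratio here should not be read as a benchmark.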

There was yet another set of visionaries in the mid-2000s working on an idea that had been gestating since at least the early 1990s: open-source software. Pioneered by Linus Torvalds with his Linux operating system, open-source was long considered the activity of hobbyists, not serious programmers. IBM changed all that by investing a billion dollars in the platform, largely as a means of circumventing Microsoft, whose ever-shifting sands of new operating systems were causing Big Blue (and everyone else) serious headaches.

In the mid-90s, the Apache Web Server hit the scene and quickly became a hot ticket. By 2005, it had scored bragging rights as the most popular Web server on the planet. That’s when this journalist first appreciated the power of open-source, while researching an article that would later be published in The Public Manager, entitled Citizen Auditors: Web-Enabled, Open-Source Government. Within a year, even the federal government got open-source religion, enacting the Federal Funding Accountability and Transparency Act of 2006.

By 2009, the Apache Web Server earned the distinction of powering 100 million Web sites. But that was just the beginning of the Apache steam engine. Fast forward a few years, and the Apache Software Foundation now finds itself at the epicenter of enterprise software innovation. As of press time, the Apache Web site counts more than 300 open-source initiatives, including 172 committees managing 287 projects.

The best-known project would have to be Hadoop, bestowed upon the open-source community in 2007 by Yahoo! The man who helped conceive this big idea, having read two groundbreaking papers by Google, was Doug Cutting. His vision was not lost on another Silicon Valley luminary, Mike Olson, who in 2008 co-founded Cloudera.

Well, giant oaks from tiny acorns grow. What began as an upstart with attitude quickly grew into a data-driven powerhouse, today employing well over 1,000 people, while partnering with 1,200 firms, including those listed in this article. By 2010, Cloudera was already grabbing headlines as the new kid on the block. At the time, many in the data world conflated Hadoop with the data warehouse. This misconception continued for several years, and still somewhat lingers, even though the two are remarkably different entities in many ways.

The data warehouse was built around a number of constraints. Analysts realized that doing queries on live operational systems was a non-starter. Platforms like Enterprise Resource Planning solutions from SAP, IBM and Oracle were simply not designed to enable ad hoc analysis. Their design point was focused on operations, not analytics. Another solution would be required to enable effective analysis.

That’s where the idea of the data warehouse originated. Back then, processors were relatively slow, parallel computing was largely an academic exercise, storage was rather expensive, and moving data was a costly and brittle process dominated by Extract-Transform-Load (ETL). Still, the demand for useful analytics drove most large organizations to invest millions of dollars in building Enterprise Data Warehouses.

By the time Hadoop entered the scene, the constraints of old had largely dissipated. Processors were much faster, even on commodity boxes; parallel computing was front and center, having been mastered by Google’s MapReduce paradigm; and a new architectural approach to data had begun to unwind the stranglehold that ETL had hitherto held on the data-driven world. We were at the beginning of a new age in data management, dubbed by many as the era of Big Data.

Nonetheless, the data warehouse did not, and likely will not, subside in importance. Rather, today’s forward-looking companies are realizing that Hadoop and the data warehouse form two components of a very powerful data strategy. The warehouse still serves as the curator of trusted, certified “small” data, while Hadoop provides context via a wide range of additional information sets drawn from the world of Big Data.

And no discussion of open-source analytics would be complete without mentioning two highly popular programming languages: R and Python. Formally launched in 1997 as a spinoff from the S programming language (which dates back to Bell Labs in the mid 1970s), R quickly became a favorite of statisticians in universities and research organizations. Python, meanwhile, got its start in the late 1980s, and has caught fire in the last half-decade as a powerful means for creating applications using a syntax that is more intuitive and thus readable by programmers.

 The challenge for today’s analysts is therefore to assemble the right combination of technologies — combined with a solid team and the appropriate data sets, of course. Being able to synthesize the available tools in a way that makes sense is a key to success. Which brings us back to Cloudera. As one of the major Hadoop platform players, Cloudera focused from the get-go on leveraging this new platform for data management. They realized that companies would need a solution flexible enough to incorporate the open-source innovation, but stable enough to deliver the kind of service levels that the data warehousing world enabled.

 In a recent InsideAnalysis interview, Cloudera’s Sean Anderson noted the company’s view of Hadoop as largely complementary to the data warehouse. “We see Apache Hadoop and Cloudera’s capabilities to really extend a traditional warehouse environment to handle a lot of new and more complex types of data, and also fill in some of the gaps around some of the current limitations of modern architectures.”

The vision is therefore to curate and deliver a package of functionality that allows companies to flesh out their data strategy, using the warehouse as a cornerstone to the information architecture. By remaining heavily committed to many of the Apache open-source projects, Cloudera can keep its eye on the innovation ball, while at the same time solidifying its core platform. The task at hand is evolving, and the pace of change may actually increase over the next couple of years.

As veteran analyst Dr. Robin Bloor recently noted, the wave of innovation will continue. The disruption caused by open-source is still fully in motion. Stay tuned…

Eric Kavanagh

About Eric Kavanagh

Eric has more than 20 years of experience as a career journalist with a keen focus on enterprise technologies. He designs and moderates a variety of New Media programs, including The Briefing Room, Information Management’s DM Radio and Espresso Series, as well as GARP’s Leadership and Research Webcasts. His mission is to help people leverage the power of software, methodologies and politics in order to get things done.

