Inside Analysis

Continued Evolution of Hadoop – Spark, SQL, and Security

Change is inevitable. Change is all around us. In fact, it has been said that "the only thing that is constant is change." ― Heraclitus

With regard to the wonderful world of big data, this is no different. Hadoop is growing up. Hadoop is evolving. Hadoop is maturing to the point where it is no longer a niche solution for data scientists and academics. In the areas of processing speed, data access, and security, Hadoop is becoming an enterprise offering that appeals to the CIO and CSO as well as the CFO and the CMO.

Check out this episode of The Briefing Room with John Myers and MapR

Spark: Speeding Up Processing

Hadoop environments were "born" with the MapReduce processing engine. MapReduce excelled at making the most of the commodity hardware common in Hadoop clusters: it pulls apart a processing request and works on the information in relatively small chunks across a wide number of processors—the "map" component—and then brings those smaller results back together again for the end result—the "reduce" component. This was a great model when we wanted to explore information stored in a Hadoop file system and make new discoveries. Speed and processing latency—the amount of time between when you hit "enter" and when you get your results back—were not core requirements of those workloads.
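To make the model concrete, here is a minimal, single-process sketch of the classic word-count job in Python. In a real cluster, the framework distributes the map work across nodes and shuffles/sorts between the two phases; the function names here are purely illustrative.

```python
import sys
from itertools import groupby

def mapper(lines):
    # "Map" phase: break the input into small, independent chunks of
    # work -- one (word, 1) pair per word -- that could run in
    # parallel across many processors.
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reducer(pairs):
    # "Reduce" phase: bring the small results back together, summing
    # the counts for each word into the end result.
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    # Usage: cat somefile.txt | python wordcount.py
    for word, total in reducer(mapper(sys.stdin)):
        print(word, total, sep="\t")
```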

However, when Hadoop was pressed into service for more low-latency processing workloads, it became important to develop faster and more "iterative" processing engines. Apache Spark has been the answer to that particular question for Hadoop. Spark is designed around an in-memory approach to processing, which gives it two advantages over MapReduce: it can be much faster, since it does not have to go to disk between steps, and it benefits from the programming and implementation lessons learned since the creation of the Hadoop ecosystem. Together, these engines give Hadoop a maturing set of processing options and let implementers choose between batch AND real-time/low-latency processing.
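As a rough sketch of why the in-memory approach matters, consider an analysis that makes several passes over the same data set. The PySpark code below (the file path and column name are hypothetical) reads the data once and caches it in memory, so each subsequent pass avoids the disk round-trip that a chain of MapReduce jobs would incur.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iterative-sketch").getOrCreate()

# Read once, then pin the working set in memory; later passes reuse
# the cached copy instead of re-reading from HDFS.
events = spark.read.parquet("hdfs:///data/events.parquet").cache()

# Several passes over the same cached data -- the "iterative" pattern
# where in-memory processing shines over disk-bound MapReduce.
for status in ("ok", "warn", "error"):
    print(status, events.filter(events.status == status).count())

spark.stop()
```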

SQL: Lingua Franca of Business

EMA research says that approximately 50% of all data consumers from big data projects are business stakeholders. Whether they are line-of-business executives or business analysts, these stakeholders want and need access to the information that resides within a Hadoop cluster. Oftentimes the information is multistructured in nature, and the legacy Hadoop data access methods were aligned more with NoSQL-style "programmatic" interfaces—Java or scripting languages from the Hadoop command line—or with the Apache Hive project and its HiveQL (HQL) query language. While Hive and scripts are excellent for a more technically oriented data consumer, the business stakeholders in EMA research are more aligned with the SQL-based access methods of data visualization and analytics tools. Many of these platforms use JDBC/ODBC drivers to push queries to the data management platform and return the results in a structured format. And therein lies the rub . . . you had massive amounts of data within the Hadoop environment, but limited ways for business stakeholders to effectively access that information.
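As a rough illustration, here is what reaching Hive programmatically looks like: a Python sketch using the third-party PyHive client against HiveServer2 (the host, username, and sales table are placeholders; 10000 is the default HiveServer2 port). A BI tool does essentially the same thing through its JDBC/ODBC driver, hidden behind a drag-and-drop interface.

```python
from pyhive import hive  # third-party HiveServer2 client

# Placeholder connection details; 10000 is the HiveServer2 default.
conn = hive.connect(host="hive.example.com", port=10000, username="analyst")
cursor = conn.cursor()

# A HiveQL query of the kind a visualization tool would push down
# through its JDBC/ODBC driver.
cursor.execute("""
    SELECT region, COUNT(*) AS orders
    FROM sales
    GROUP BY region
""")

for region, orders in cursor.fetchall():
    print(region, orders)
```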

This gap created the "SQL on Hadoop" wave that we have seen recently: Impala from Cloudera; HAWQ, now HDB, from Pivotal; Presto, supported by Teradata and others; and Apache Drill are just some of the answers to the business stakeholder's issue with command-line and scripting access to information on Hadoop. Each of these solutions supports the ANSI SQL standards to varying degrees, ranging from SQL-92 to SQL-99 to SQL-2003. Each one helps bridge the gulf between data stored in a Hadoop file system and the data visualization and analytical platforms on the desktops of the line of business.
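For a flavor of what that ANSI SQL support buys, the sketch below runs a SQL-2003-style window function through impyla, a Python DB-API client for Impala (the host and table are hypothetical; 21050 is Impala's default HiveServer2-compatible port). A comparable statement could reach Presto or Drill through their own drivers.

```python
from impala.dbapi import connect  # impyla, a DB-API client for Impala

# Placeholder host; 21050 is Impala's HiveServer2-compatible port.
conn = connect(host="impala.example.com", port=21050)
cursor = conn.cursor()

# RANK() OVER (...) is a SQL-2003 analytic (window) function -- the
# kind of ANSI SQL feature these engines increasingly support.
cursor.execute("""
    SELECT region, order_id, amount,
           RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS rnk
    FROM sales
""")

for row in cursor.fetchall():
    print(row)
```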

Securing the Future

As stated above, Hadoop began life as an interesting project to solve problems deep within the data center and—more often than not—problems where there was a significant "barrier" between where the data was stored and processed and the individuals using the end results. Yes, there were command-line access points; however, for the most part, information stored on a Hadoop cluster had to be accessed, processed, and then transferred out before end users could consume it. In deep data center environments, a security model based on the "old" Unix "user/group/everyone, read/write/execute" model is an acceptable solution. Yet, with the growth of the information going into a Hadoop cluster and the number of individuals and platforms accessing that information across a wide range of access points, it has become more important to secure the data on Hadoop more fully.

Hadoop has strengthened its approach to security across many different fronts, including:

• Network-safe authentication
• Network encryption
• Enhanced authorization using access control
• Tighter permissions on local file system elements (a sketch of these last two items follows the list)
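To ground those last two items, here is a small Python sketch that shells out to the standard hdfs CLI to tighten POSIX-style permissions and layer an HDFS ACL on top. The path and user are hypothetical, and HDFS ACLs must be enabled on the NameNode (dfs.namenode.acls.enabled=true) for the setfacl commands to work.

```python
import subprocess

def run(cmd):
    # Echo and execute an hdfs CLI command (illustrative wrapper).
    print("$", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Tighten POSIX-style permissions: owner and group only.
run(["hdfs", "dfs", "-chmod", "750", "/data/finance"])

# Grant one analyst read access via an HDFS ACL entry instead of
# widening the whole group.
run(["hdfs", "dfs", "-setfacl", "-m", "user:alice:r-x", "/data/finance"])

# Inspect the resulting ACL.
run(["hdfs", "dfs", "-getfacl", "/data/finance"])
```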

And while the Apache community and Hadoop distribution vendors have more distance to cover before Hadoop meets all the needs (if not wants) of the CSO, Hadoop data is more secure today than at any point in the history of the project and can more easily meet those demands.

Conclusion

If you consider that most data management platforms have an approximately 40-year maturity curve, Hadoop is a "toddler" in its growth. Only 8-12 years into its development, Hadoop has come a long way in terms of development, acceptance, and adoption. Some might say that Hadoop has grown faster than any previous data management paradigm during that time. Through developments in processing engines such as Spark, data access layers that support the line of business's visualization and analytical platforms, and the security to keep all of this information from "spilling" into the wrong areas, Hadoop has matured enough to justify the acceptance and adoption it has enjoyed.

About John Myers

John Myers is Research Director of Business Intelligence at Enterprise Management Associates. In this role, John delivers comprehensive coverage of the business intelligence and data warehouse industry with a focus on database management, data integration, data visualization and process management solutions. John has years of experience working in areas related to business analytics in professional services consulting and product development roles, as well as helping organizations solve their business analytics problems, whether they relate to operational platforms, such as customer care or billing, or applied analytical applications, such as revenue assurance or fraud management.
