Inside Analysis

Containing Big Data

Dr. Geoffrey Malafsky

CEO, Technik Interlytics LLC,

Chief Data Scientist, The Bloor Group

Container technology is one of the popular new Development and Operations (DevOps) approaches. As is typical in Information Technology (IT), its capabilities and market rise are described as anything from meteoric (on the low end) to world changing (the moderates). After more than forty years as a scientist, reviewer of technology internals (hardware, software, algorithms), Big Data appliance developer, enterprise data software developer, and complex data processing solution provider, I usually take the excited utterances and divide by ten to reenter the world of engineering reality.

Even with this downsampling, it is true that containers have very rapidly entered the industry’s common “I want to use this” perspective[1]. Evidently, there is something substantial behind the boisterous street hawkers’ claims. What is it? It centers on the intuitive meaning of the word container and on the slow, complicated, laborious way applications have to be carefully tuned to the specific runtime environment of hardware and software each time they are moved or something is changed. As enterprises moved from the very slow procedural approach to application development and deployment to a faster and riskier use of rapid market offerings, the onerous marriage of environment to new applications had to improve.

Any faster DevOps approach has to embrace the critical need for security. Security has multiple facets. One is cybersecurity relating to unauthorized access of data and systems. Another is security of operational harmony, meaning we don’t want to crash servers either in production or development when making new or changed applications. Neither do we wish to forgo the benefits of new applications. This is where containers have an excellent fit.

Containers are actually Linux containers (LXC), which share the Linux OS kernel but isolate applications and their requisite libraries (versions) and configuration information[2]. It is worth repeating to the frenzied IT community that containers did not just show up one day from the brilliant programming of a handful of people. Rather, they have a long history[3] involving many people, companies, ideas, failed approaches, and incremental capability, just like all technology. A container eases the burden of developing and testing applications on small computers and then moving them into full-scale test and production[4]. While containers use open source technology, they have been popularized by Docker, which emphasizes lightweight portability, especially relative to Virtual Machines (VMs)[5]. As its diagram below shows, VMs enable multiple virtual machines per physical machine, each acting as a full system sharing abstracted hardware. In contrast, containers share the same OS but enable separated applications and specialized libraries.

Figure 1  Comparison of containers and virtual machines from [4].
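One simple way to see the shared-kernel point for yourself is to ask the operating system which kernel it is running. The sketch below is only an illustration, using Python's standard `platform` module; the function name `kernel_release` is made up for this example. Run on a Linux host and again inside a container on that host, it should report the same kernel release, whereas inside a VM it reports the guest's own kernel.

```python
import platform


def kernel_release() -> str:
    """Return the release string of the kernel this process is running on."""
    return platform.release()


if __name__ == "__main__":
    # On Linux, a container reports the host's kernel release because it has
    # no kernel of its own; a VM reports the kernel of its guest OS instead.
    print(f"{platform.system()} kernel release: {kernel_release()}")
```

Comparing the output from the host, a container, and a VM makes the difference in the diagram tangible without any special tooling.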

By itself, container technology is an important DevOps capability. However, it intersects with several other important technologies influencing both capability and strategy in corporate data management. These are:

  1. Big Data: this is a group of technologies focused on handling extremely large quantities of data in storage, processing, and query at speeds common to prior-generation data technology handling moderate data set sizes. This includes Hadoop and its composite group of tools, in-memory computing, parallel computing, and Machine Learning.
  2. Cloud environments: these are hosted and managed computers provided in cluster mode with virtualized levels of storage and functionality. It is essentially the next step in outsourced computing, taking advantage of Big Data and other cluster computing technology, as well as service support for technology refresh, development and operations (DevOps), and cybersecurity. It specifically takes advantage of the tremendous growth of widespread high-speed internet networking and high Quality of Service (QoS).
  3. Low cost business models: service providers do not have to offer very low, often free, prices for the use of advanced computing resources, but this has become the most common approach to online supplied services. This makes it very cost effective to shift to internet-supplied capabilities instead of on-premises only.

For Cloud, the new emphasis is on Hybrid Cloud architectures with seamless management among distributed data and applications in multiple Cloud services and on-premises data centers. A Hybrid Cloud should allow exploiting the best characteristics for a given organization’s need and be flexible to changes over time. The predominant concern with putting data into a public Cloud is security and its associated risks. These risks include well-publicized data breaches with their large remediation costs, as well as more mundane issues such as managing access and preventing inadvertent corruption of work products due to the easier self-service nature of Cloud platforms. Other issues pertain to both Cloud and on-premises: data and application architectures; network, application, and data security; and risk mitigation. Cloud, as well as well-managed internal data centers, should lower costs for technology modernization and infrastructure management, and reduce the need to maintain highly expert computing staff. These objectives are the basis of the new Open Hybrid Architecture Initiative[6], whose announcement lays out three phases:

  • Phase 1: Containerization of HDP and HDF workloads with DPS driving the new interaction model for orchestrating workloads by programmatic spin-up/down of workload-specific clusters (different versions of Hive, Spark, NiFi, etc.) for users and workflows.
  • Phase 2: Separation of storage and compute by adopting scalable file-system and object-store interfaces via the Apache Hadoop HDFS Ozone project.
  • Phase 3: Containerization for portability of big data services, leveraging technologies such as Kubernetes for containerized HDP and HDF. Red Hat and IBM partner with us on this journey to accelerate containerized big data workloads for hybrid. As part of this phase, we will certify HDP, HDF and DPS as Red Hat Certified Containers on RedHat OpenShift, an industry-leading enterprise container and Kubernetes application platform. This allows customers to more easily adopt a hybrid architecture for big data applications and analytics, all with the common and trusted security, data governance and operations that enterprises require.

Containers are an important part of the Hybrid Cloud plan. Note they are mentioned specifically in both phases 1 and 3. There are two directions for coupling Big Data and containers:

  • Hadoop runs inside containers with multiple containers comprising the Hadoop cluster
  • Applications deployed by YARN run inside containers (i.e., Docker containers instead of the built-in YARN containers)

The Hybrid Cloud plan and several DevOps approaches have addressed the first item, Hadoop inside containers. A good overview of one organization’s Lessons Learned (LL)[7] on this issue is available, and it is worthwhile studying their warnings of intricate technical challenges. Completing the technology components so that this is production quality will be a key part of achieving the seamless hybrid architecture.

The second approach brings the benefits of containers inside the Big Data cluster, where data storage and compute resources are enabled at very large scale and managed for parallel operations. Here Docker containers run as YARN-managed jobs, with each container combining isolated, independent applications with controlled access to HDFS files and to node CPU and memory. This is an evolving technology, with Apache deploying early-phase capabilities while tracking development through open source JIRA tickets[8]. A primary issue being addressed is security control: the container needs access restrictions that prevent it from becoming a vehicle for larger system attacks, while still allowing cluster-wide communications and data access to meet application needs[9].
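To make this concrete, here is a minimal sketch of how a job might ask YARN for the Docker runtime. It assumes a Hadoop 3.1 or later cluster where an administrator has already enabled Docker support, and it uses a Spark job as the example application. The environment variable names follow the Apache Hadoop documentation on launching applications using Docker containers; the helper function, image name, mount string, and application path are hypothetical and only for illustration.

```python
from typing import List, Optional


def docker_on_yarn_submit_args(
    image: str, app_path: str, mounts: Optional[str] = None
) -> List[str]:
    """Build a spark-submit argument list that asks YARN to run both the
    application master and the executors inside the given Docker image."""
    runtime_env = {
        "YARN_CONTAINER_RUNTIME_TYPE": "docker",
        "YARN_CONTAINER_RUNTIME_DOCKER_IMAGE": image,
    }
    if mounts:
        # Optional bind mounts, written as source:destination:mode.
        runtime_env["YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS"] = mounts

    args = ["spark-submit", "--master", "yarn", "--deploy-mode", "cluster"]
    for name, value in runtime_env.items():
        # The same settings are passed to the application master and executors.
        args += ["--conf", f"spark.yarn.appMasterEnv.{name}={value}"]
        args += ["--conf", f"spark.executorEnv.{name}={value}"]
    args.append(app_path)
    return args


if __name__ == "__main__":
    # Hypothetical image and job locations, for illustration only.
    command = docker_on_yarn_submit_args(
        image="example/spark-app:latest",
        app_path="hdfs:///apps/example_job.py",
        mounts="/etc/passwd:/etc/passwd:ro",
    )
    print(" ".join(command))
```

The sketch only assembles the command line; whether the cluster accepts a given image or mount depends on what its administrators have allowed, which is exactly the security control discussed above.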

Running Docker containers via YARN can serve multiple applications such as HBase, web servers, and custom applications[10]. Yet there are many issues to address to ensure these applications run properly, exploit the functionality of Hadoop clusters, do not introduce severe security problems, and do not cause cluster-wide problems if they crash. Much work has already been done, such that complicated applications like Spark can run successfully[11]. However, more items need work and are being tracked in YARN development, such as software-defined networking, timeline logging, Docker profiles, and user management[12]. Several of the developers themselves describe these development issues well[13], including:

  • OS stability
  • Fat (i.e., Uber) containers and microservices
  • State (stateful or stateless)
  • Networking

One final point of clarification. There are frequent discussions about containers being stateless, meaning they are self-contained, do not need to store specific information for future actions, and can be shut down without causing problems for other applications or services. In truth, real applications are stateful, and even web applications are increasingly so in order to provide better interactions for humans, who can be both forgetful of what they have done before and unwilling to look for this information elsewhere. Hence, YARN Docker containers will need to support cluster-wide stateful applications.
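The distinction can be shown with a toy example. The sketch below is not how YARN or Docker manage state; it simply contrasts a handler that remembers nothing with one that keeps per-user history in a file assumed to live outside the container (for instance, on a mounted volume). The class names and the file layout are invented for this illustration.

```python
import json
import tempfile
from pathlib import Path


class StatelessGreeter:
    """Keeps nothing between calls, so any replica can answer any request."""

    def handle(self, user: str) -> str:
        return f"hello {user}"


class StatefulVisitCounter:
    """Remembers per-user visit counts in a file that outlives the process."""

    def __init__(self, state_file: Path) -> None:
        self.state_file = state_file
        # Reload earlier state if a previous instance left any behind.
        self.visits = (
            json.loads(state_file.read_text()) if state_file.exists() else {}
        )

    def handle(self, user: str) -> str:
        self.visits[user] = self.visits.get(user, 0) + 1
        self.state_file.write_text(json.dumps(self.visits))
        return f"hello {user}, visit {self.visits[user]}"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as volume:
        state = Path(volume) / "visits.json"
        first = StatefulVisitCounter(state)
        print(first.handle("ana"))   # hello ana, visit 1
        # Simulate the container being shut down and a fresh one started.
        second = StatefulVisitCounter(state)
        print(second.handle("ana"))  # hello ana, visit 2
```

The stateless handler can be stopped at any time with no loss, while the stateful one only survives a restart because its data sits somewhere the replacement can reach, which is the kind of support a cluster has to provide.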

[1] P. Rubens, “What are containers and why do you need them?,” 17 06 2017.  Available:

[2] “What’s LXC?,”  Available:

[3] T. Hildred, “The History of Containers,” 28 08 2015.  Available:

[4] Redhat, “Understanding Linux Containers,”  Available:

[5] Docker, “What is a Container?,”  Available:

[6] A. Murthy, “Introducing the Open Hybrid Architecture Initiative,” 10 09 2018.  Available:

[7] T. Phelan, “Lessons Learned Running Hadoop and Spark in Docker,” 29 09 2016.  Available:

[8] Apache, “YARN-3611: Support Docker Containers in LinuxContainerExecutor,” 15 06 2015.  Available:

[9] Apache, “Launching Applications Using Docker Containers,” 13 11 2018.  Available:

[10] S. Kumpf, V. Vavilapalli and S. Buragohain, “Trying out Containerized Applications on Apache Hadoop YARN 3.1,” 16 05 2018.  Available:

[11] M. Shivaprasad and M. Muralidharan, “Containerized Apache Spark on YARN in Apache Hadoop 3.1,” 24 05 2018.  Available:

[12] Apache, “YARN-8472: YARN Container Phase 2,” 28 06 2018.  Available:

[13] B. R. and S. Kumpf, “Containers and Big Data, DataWorks Summit,” 20 06 2018.  Available:
