Inside Analysis

Getting Going With Hadoop and Spark, the ClusterGX Option

To my mind, a great deal of what has happened with Hadoop over the past ten years or more, and more recently with Spark, was predictable and should have been expected. My perspective is this: what Hadoop (and the Johnny-come-lately Spark) brought to the IT world was the possibility of a scale-out, server-side “operating environment.” Of course, Hadoop wasn’t presented as such; the marketing message was all about huge scalability and the potential for big data.

But Hadoop, once it landed and began to expand, was a really low-level environment. As a broad set of functionality it was incomplete in its early days, although it soon gained useful components like Hive and Pig. Back then it had nothing like the rich environment that now surrounds it. Above all else, it was low-level infrastructure software, written by software engineers who were used to working at that level and knew how to make things work at that level.

Hadoop began life like MS-DOS or the early IBM mainframe OS/360: quite limited, but nevertheless useful. It was awkward because it required low-level expertise. It could take months to set up and learn to manage even a small cluster, and it still can. The Hadoop distributors (MapR, Hortonworks and Cloudera) have done a great deal to simplify matters, and so have Amazon with EMR and Microsoft with Azure HDInsight. All have added Spark into the mix, which has proved hugely popular, partly because it doesn’t employ or need MapReduce, and partly because it can be much faster than Hadoop's MapReduce. But it, too, requires a level of technical expertise.

ClusterGX

This is where we now stand. The infant distributed environments are growing up; they can walk on their own and even tie their own shoelaces. Still, Hadoop and Spark are distributed systems, and we never previously had to implement and manage such complex systems. The tasks of installation, configuration, resource management, capacity planning, maintenance and upgrade are distinctly different to what went before. There is a need for greater simplicity. This is what Galactic Exchange, Inc. provides with ClusterGX.

The goal, pure and simple, is to deliver the easiest and quickest possible deployment and management of container-enabled Hadoop/Spark compute clusters. On-premises or in the cloud, the ClusterGX platform can be deployed in minutes, with no prior clustering, container or big data experience required. In essence, it is a virtualized cluster compute platform based on Docker containers. It comes with its own AppStore, which links clusters and the data they manage to containerized big data applications that can be deployed as needed.

ClusterGX runs on x86 devices running Linux (Ubuntu or CentOS), Windows or even macOS. To get started, open an account on the web site http://galacticexchange.io/, create your login credentials, then download ClusterGX. The set-up process installs Hadoop, Spark and the big data software tools you need, creating containers as necessary. The process takes minutes, and there is no effective limit to the size of a cluster that can be created in this way.

If desired, applications can be installed manually, but it is easier to launch them with a single click via the integrated AppStore. A growing selection of commercial applications is already integrated into the AppStore. In the near term the AppStore will also link directly to containerized application repositories such as GitHub, allowing direct access to tens of thousands of container-ready application micro-services, which are becoming more popular daily.

Irrespective of whether ClusterGX runs on-premises or with a preferred cloud compute provider, Galactic Exchange hosts the master node functions as a cloud service. Hadoop and Spark require master nodes to run on dedicated machines that are configured and maintained separately from the other cluster nodes. Galactic Exchange spins up these master node services automatically in its cloud service for each cluster that is created. The rest of the cluster, wherever located, consists of Docker-virtualized containers. Whenever you launch a new application, ClusterGX spins up the required containers in parallel, allowing for all associated dependencies: not just Hadoop or Spark, but also Hive, Pig, Kafka, Impala, etc. The underlying technical details of this are invisible to, or better put, hidden from, the user.
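To make the idea of "allowing for all associated dependencies" concrete, here is a minimal Python sketch of how a launcher might work out which containers an application needs and in what order they could be started. It is not ClusterGX's actual mechanism, which the vendor keeps hidden; the dependency map, the application name `my-analytics-app` and the function `containers_for` are all hypothetical and exist only for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency map (an assumption, not taken from ClusterGX):
# each service lists the services that must be available before it starts.
DEPENDENCIES = {
    "hadoop": [],
    "spark": ["hadoop"],
    "hive": ["hadoop"],
    "kafka": [],
    "impala": ["hive"],
    "my-analytics-app": ["spark", "impala", "kafka"],
}


def containers_for(app, deps=DEPENDENCIES):
    """Return every service `app` needs, with dependencies listed first."""
    # Walk the map to collect the app and all of its transitive dependencies.
    needed = set()
    stack = [app]
    while stack:
        name = stack.pop()
        if name not in needed:
            needed.add(name)
            stack.extend(deps[name])

    # Order the collected services so that each one appears after
    # everything it depends on.
    graph = {name: deps[name] for name in needed}
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    print(containers_for("my-analytics-app"))
```

Services with no dependency on one another (Kafka and Hadoop in this made-up map) could be started at the same time, which is one plausible reading of spinning up containers "in parallel."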

Remarkably, ClusterGX can support different versions of Hadoop or Spark on the same cluster. In practice this removes the need for separate clusters, or even separate virtual machines, for applications that require, or simply run better with, different versions of Hadoop or Spark.
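The reason containers make this possible is that each container carries its own copy of the framework, so two versions never share an installation. The following Python sketch models that idea only; the class names, application names and version numbers are illustrative assumptions and do not describe ClusterGX internals.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Container:
    """One container: an application bundled with its own framework version."""
    app: str
    framework: str
    version: str  # illustrative version string, not taken from the article


@dataclass
class Node:
    """A cluster node (VM or bare metal server) hosting several containers."""
    name: str
    containers: list = field(default_factory=list)


def versions_in_use(nodes, framework):
    """List the distinct versions of `framework` running across the nodes."""
    return sorted(
        {c.version for n in nodes for c in n.containers if c.framework == framework}
    )


if __name__ == "__main__":
    # Two hypothetical applications on the same node, each with its own Spark.
    node = Node("node-1", [
        Container("legacy-reports", "spark", "1.6.1"),
        Container("new-pipeline", "spark", "2.0.0"),
    ])
    print(versions_in_use([node], "spark"))
```

Because the version belongs to the container rather than to the node, adding a third application with yet another version changes nothing for the two already running.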

The Possibilities

ClusterGX is both easy to use and versatile, making efficient use of the designated resources through container virtualization. New clusters and new applications can be spun up and taken down with ease. It is thus an excellent environment for companies that are just commencing their journey into the big data world. They can start small and scale at their own pace. It is also a useful environment for proof-of-concept testing of specific applications, simplifying installation significantly and accelerating such projects through the review, test and deploy process.

Container virtualization scores heavily on efficiency, making it possible to run multiple containers in the same VM or bare metal server at no extra cost. It provides close control of the server infrastructure. The knock-on effect is reduced server costs or, alternatively, reduced cloud costs. At the business level, it facilitates the sharing of access to clusters within the business, turning Hadoop/Spark and the associated applications into a company-wide asset.

What Galactic Exchange is doing is an inevitable evolution for Hadoop and Spark. Just as all the other dominant operating environments have evolved towards ease of use, efficiency and greater manageability, so must Hadoop and Spark.

Robin Bloor

About Robin Bloor

Robin is co-founder and Chief Analyst of The Bloor Group. He has more than 30 years of experience in the world of data and information management. He is the creator of the Information-Oriented Architecture, which is to data what the SOA is to services. He is the author of several books, including The Electronic B@zaar, From the Silk Road to the eRoad, a book on e-commerce, and three IT books in the Dummies series, on SOA, Service Management and The Cloud. He is an international speaker on information management topics. As an analyst for Bloor Research and The Bloor Group, Robin has written scores of white papers, research reports and columns on a wide range of topics from database evaluation to networking options and comparisons to the enterprise in transition.

