The Journey into a Parallel Universe
Maybe the last time you heard the word “parallel,” before all the Hadoop MapReduce ballyhoo, was in geometry class at school. Nowadays, however, the word seems to pepper a good deal of tech talk. Remember when “scale out” suddenly became a feature of all new database products? That was a parallelism thing, and if you investigate, you’ll find it predates Hadoop. It all began in earnest when x86 CPUs started shipping with multiple cores in the mid-2000s.
Parallelism is in the natural order of things. The industrial revolution, i.e., automated manufacturing, was about pipeline parallelism. Supermarket checkouts embody another kind, segmentation parallelism: an array of parallel processors that split the workload between them. We knew about parallelism before there were any computers; after we invented computers, we didn’t exploit it much. But now we can, and now we will.
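The checkout analogy can be sketched in a few lines of Python (the “carts” and prices here are invented for illustration): a stream of independent jobs is split across an array of identical workers, each handling its share.

```python
from concurrent.futures import ThreadPoolExecutor

def checkout(cart):
    # One "lane" rings up one cart: sum of item prices.
    return sum(cart)

carts = [[2.5, 3.0], [1.0], [4.0, 0.5, 2.0]]

# Segmentation parallelism: the workload is divided across an
# array of identical workers (three checkout lanes).
with ThreadPoolExecutor(max_workers=3) as lanes:
    totals = list(lanes.map(checkout, carts))

print(totals)  # [5.5, 1.0, 6.5]
```

Because the carts are independent, no coordination between lanes is needed, which is what makes segmentation the easiest form of parallelism to exploit.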
Specifically, what we didn’t do was:
- Create programming languages or development tools that automatically parallelized the work
- Build genuine scale-out database products
- Build software architectures that took advantage of parallelism
This is now changing, and in fact it appears to be changing rather swiftly. In our view the whole IT industry, willy-nilly, is now going parallel. There are some products we follow partly because of their use of parallelism. We mention them below, one by one:
1. SequenceL
This development environment comes from Texas Multicore Technologies (TMT). SequenceL enables developers to write parallel applications without any need to reference the physical execution environment. You can think of this as “fine grain” software parallelism. The neat thing is that the developer writes code in a declarative high-level language (one that is Turing-complete) and the compiler produces low-level parallel code.
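SequenceL’s syntax is proprietary, so as a rough analogy only, here is a Python sketch of the underlying idea: when the developer writes a pure function over independent data (the declarative “what”), a runtime is free to choose the parallel “how,” and the result is identical either way. The function and data below are invented for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# A pure function: no shared state, no side effects, so a runtime
# may evaluate the calls in any order, or all at once.
def normalize(x):
    return (x - 10) / 2

data = [10, 12, 14, 16]

serial = [normalize(x) for x in data]           # what the developer writes
with ThreadPoolExecutor() as pool:              # what a parallelizing
    parallel = list(pool.map(normalize, data))  # runtime may do instead

assert serial == parallel  # same answer, different execution plan
```

The point is that purity, not annotation, is what licenses the parallel execution; the developer never references the physical environment.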
2. Intel Parallel Studio
Intel Parallel Studio is Intel’s toolset for fine grain parallelism in C++/C or Fortran. It is not, in our view, as neat as the SequenceL IDE, but it’s Intel and therefore it’s being used fairly extensively. Parallel Studio includes many components: Intel Parallel Composer, Intel Parallel Advisor, Intel VTune Amplifier and Intel Parallel Inspector. These clip in fairly neatly with Microsoft development environments and the Eclipse open source platform. You might not have noticed, but this toolset first came to market in 2008. It has added a good deal of sophistication since then.
3. Actian DataFlow
This product from Actian is also fairly mature. It used to go by the name DataRush, but the name changed when Actian acquired Pervasive (a branding thing). It is a high-level development product for building analytic and data flow (ETL, cleansing, etc.) applications. The neat thing is that the developer doesn’t need to know anything about parallelism, because the product automates the parallel implementation.
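The dataflow style of parallelism such ETL products automate can be sketched in plain Python (the stages, queues, and sample records below are invented for illustration, not Actian’s API): each stage runs concurrently and records flow through as soon as they are produced, which is pipeline parallelism.

```python
import threading, queue

DONE = object()  # sentinel marking end of stream

def extract(out_q):
    # Stage 1: pull raw records from a source.
    for rec in ["  alice ", "BOB", " carol"]:
        out_q.put(rec)
    out_q.put(DONE)

def transform(in_q, out_q):
    # Stage 2: cleanse each record as it arrives.
    while (rec := in_q.get()) is not DONE:
        out_q.put(rec.strip().title())
    out_q.put(DONE)

def load(in_q, sink):
    # Stage 3: write cleansed records to the target.
    while (rec := in_q.get()) is not DONE:
        sink.append(rec)

q1, q2, sink = queue.Queue(), queue.Queue(), []
stages = [threading.Thread(target=extract, args=(q1,)),
          threading.Thread(target=transform, args=(q1, q2)),
          threading.Thread(target=load, args=(q2, sink))]
for t in stages: t.start()
for t in stages: t.join()
print(sink)  # ['Alice', 'Bob', 'Carol']
```

In a real dataflow engine each stage could also be replicated across nodes (segmentation), which is exactly the kind of plumbing these products generate so the developer never writes it.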
4. HDFS (Hadoop and friends)
You cannot write about parallelism without mentioning Hadoop. Since the release of MapReduce 2.0 I have not written about Hadoop without mentioning YARN. It is a scheduling capability that enables multiple workloads to run at the same time on Hadoop. It would be important for that reason alone (Hadoop used to walk in handcuffs and it no longer does). However, it also cuts the link between MapReduce and HDFS, which means that HDFS can exist on its own as a scale-out file system. We are inclined to believe that this changes the world. Think about it: a global file system that spans the enterprise. This is what HDFS can become and we expect it will. We just have to watch to see if it does.
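For readers who have only heard the name, the MapReduce model that historically sat on top of HDFS is easy to sketch in Python (the word-count example below is the classic illustration; the “blocks” stand in for file splits that HDFS would store on different nodes):

```python
from collections import defaultdict
from itertools import chain

# Two "blocks" of a file, as HDFS would split it across nodes.
blocks = ["the quick brown fox", "the lazy dog the end"]

# Map phase: each block is processed independently (in parallel
# on a real cluster), emitting (word, 1) pairs.
def map_block(block):
    return [(word, 1) for word in block.split()]

mapped = [map_block(b) for b in blocks]

# Shuffle: group the pairs by key across all mappers.
groups = defaultdict(list)
for word, count in chain.from_iterable(mapped):
    groups[word].append(count)

# Reduce phase: one reducer per key, again parallelizable.
counts = {word: sum(vals) for word, vals in groups.items()}
print(counts["the"])  # 3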
Since we mentioned YARN, it’s only fair to mention Cascading from Concurrent. Cascading is an application framework for developing data management and data analytics applications that can happily sit over Hadoop. While YARN is primarily a scheduling capability, Cascading is an in-depth management capability as well as an application framework – and it can work symbiotically with YARN. We have recently asked ourselves what, in the long run, will manage the Hadoop environment. It may well turn out to be Cascading.
This is a technology and company that I have been following for at least 3 years. This is not pointed at the Big Data space, but at the “Big Distribution” space. A very subtle way to take parallel advantage of compute power is to distribute workloads across many nodes of a network. Sounds simple, and indeed, the idea is simple; it’s the implementation of it that is hard. Anyway this is what EnterpriseWeb does. It’s a development and integration platform in the sense that it can both build new applications, and link existing applications and capabilities together. But it is also performs remarkably well – not just because it has been built for speed, but because it is truly distributed.
Pneuron is another distributed development platform which chooses to “take the processing to the data rather than taking the data to the processing.” It focuses, at the moment, on complex Big Data applications but can also do OLTP. One of the smart things about this particular product is that it monitors all the resources over which it is deployed and dynamically distributes its executable components to take best advantage of available resources.
8. DB Lytix
DB Lytix from Fuzzy Logix is technology you might be using but not know about, mainly because it is embedded in a number of analytic databases. It comprises hundreds of mathematical and analytics functions that can be called within a database and which will execute in parallel when called. The existence of this library has made it a good deal easier for databases to add fast analytic capability to their query engines.
This is another product I’ve kept my eye on for quite a while. It processes data streams in a neat way by running SQL queries against them – and by the way, it can handle unstructured data in this manner. The truly smart thing about the product is that it does parallelism right, mixing pipeline and data segmentation parallelism in the correct manner and thus executing extremely fast.
You’ve got to hand it to IBM with its Watson “cognitive computing” initiative. It’s an impressive integration of a whole set of different analytic and AI capabilities that may well change the face of computing as it evolves. It may be a little early to make such pronouncements, but what Watson does is unique, and by the way, it uses parallel processing extensively.
If you are wondering why I haven’t mentioned any databases among this list of 10 products it is primarily because databases were the only class of product that always did exploit parallelism. But there is another reason: there are far too many products that deserve a mention.