The data lake has become the driving force of strategic IT in many companies. The heartfelt cry is: “We want the company to be data driven!” Now, if you want to be pedantic you can argue convincingly that IT has always been data driven in one way or another, but the promise of the data lake is far more beguiling than the idea of digital automation. The glittering possibility is that the executive team, ably assisted by a data scientist or two, will delve into mountains of data and, after a voyage of exploration and discovery, reimagine the business, then steer it towards a brighter future.
The Disheartening Hindrance of Legacy Technologies
What stopped anyone from doing this before?
The truth is, the technology wasn’t up to it. But this is not a simple truth; it has multiple dimensions. You cannot blame the mathematicians. Almost all the fancy statistical algorithms and machine learning techniques that we hear so much about nowadays were invented long ago. For the sake of brevity, let’s just list the main factors that blocked the exploitation of these mathematical delights:
- The hardware was too slow (and too expensive)
- Databases did not scale well (and were too expensive)
- Very little software existed for managing very large volumes of data (and what there was came at a high price)
- The fast configuration of hardware, software and data wasn’t feasible until the advent of the cloud
We can attribute the possibilities of Big Data to three things: the advent of commodity hardware, the open source software of the Hadoop ecosystem, which was built to run in a powerful parallel manner on that hardware, and the cloud.
Let’s fast forward to now and examine what has become possible. The data lake was a big idea whose time quickly arrived. It heralded a far more comprehensive and fluid data world. Very quickly big companies began deploying such lacustrine wonders, then various software pioneers began trying to integrate their Hadoop-Hive-HBase implementations with Spark. They pioneered the marriage of stream processing with Big Data batch processing.
And thus, the data lake became the repository of a great deal of data, both from within the corporation and from the burgeoning sources of external data – free government sources, specialist providers, social media sites and data brokers. A multitude of possibilities were in play here. Data that had previously been difficult or impossible to fit into data warehouses – so called unstructured data – could be dropped with ease into a data lake. Log file data, which had previously been too voluminous to handle, could happily flow into the data lake. The data lake could suck in batches of data but was equally happy to be fed by constant streams. The users were looking at a data resource that was far richer than they’d ever been able to contemplate.
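To make the ingestion point concrete, here is a minimal sketch (all names hypothetical, not any particular product’s API) of a raw landing zone that accepts both batch drops and streamed log events into the same store, with no schema enforced up front:

```python
from collections import defaultdict

class RawZone:
    """Toy data lake landing zone: records land as-is, keyed by source.
    No schema is imposed on ingest -- the 'schema on read' idea."""

    def __init__(self):
        self.store = defaultdict(list)

    def ingest_batch(self, source, records):
        # A batch drop, e.g. a nightly export from an internal system.
        self.store[source].extend(records)

    def ingest_event(self, source, record):
        # A streamed event, e.g. one log line arriving continuously.
        self.store[source].append(record)

lake = RawZone()
lake.ingest_batch("crm_export", [{"customer": 1}, {"customer": 2}])
lake.ingest_event("web_logs", {"path": "/home", "status": 200})
print(len(lake.store["crm_export"]))  # 2
```

The point of the sketch is that batch and streaming sources feed one repository through the same simple interface, which is exactly what made the lake a richer resource than the warehouse.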
Data self-service became much more prevalent. It is not entirely trivial to organize. Best practices demand that effective access-management security and, where necessary, encryption are in place. There may also be a need for metadata capture software and data cleansing software. However, the pay-off is significant.
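The access-management point can be illustrated with a toy sketch (hypothetical class, role and dataset names): self-service works only if every read is checked against an explicit grant.

```python
class DataCatalog:
    """Toy self-service catalog: a read succeeds only if the caller's
    role has been granted access to the dataset."""

    def __init__(self):
        self.datasets = {}  # dataset name -> records
        self.grants = {}    # dataset name -> set of allowed roles

    def register(self, name, records, allowed_roles):
        self.datasets[name] = records
        self.grants[name] = set(allowed_roles)

    def read(self, name, role):
        if role not in self.grants.get(name, set()):
            raise PermissionError(f"role '{role}' may not read '{name}'")
        return self.datasets[name]

catalog = DataCatalog()
catalog.register("sales_2016", [{"region": "EMEA", "total": 1200}], ["analyst"])
print(catalog.read("sales_2016", "analyst"))
```

A real deployment would back this with the cluster’s security layer rather than an in-memory dictionary, but the shape of the check is the same.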
The main dynamic of this is that the user no longer needs to go cap-in-hand to some IT developer to get access to data. In most organizations, there are limits to what can be held in a data warehouse and there may even be onerous procedures for getting at that data. Adding new data sources to the data warehouse was often prohibitively expensive. The difference with a data lake can be startling. The data lake is, or should be, a single staging area for new data within the organization. It is extensible. Even if the capacity of the large cluster it occupies becomes saturated, with the judicious use of Kafka and another cluster, a kind of “data lake mart” can be set up for a department or workgroup. This is similar to the idea of a data warehouse and data marts, but there is a big difference.
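The “data lake mart” idea boils down to filtering the central stream into a departmental store. In a real deployment the stream would be a Kafka topic consumed onto a second cluster; the sketch below (hypothetical names) simulates that routing with a plain iterable:

```python
def route_to_mart(stream, predicate):
    """Feed a departmental 'data lake mart' by filtering the central stream.
    In practice the stream would be a Kafka topic; here it is a plain list."""
    return [record for record in stream if predicate(record)]

central_stream = [
    {"dept": "marketing", "event": "campaign_click"},
    {"dept": "finance", "event": "invoice_paid"},
    {"dept": "marketing", "event": "page_view"},
]

# Only marketing's records flow into its mart.
marketing_mart = route_to_mart(central_stream, lambda r: r["dept"] == "marketing")
print(len(marketing_mart))  # 2
```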
Data lakes are inexpensive – at least the hardware and software are inexpensive – and that means that the ROI is likely to be far greater than with any data warehouse-based initiative. The reality is that users love such arrangements. Given a sensible configuration they get control of the data they care about for the first time ever. What they choose to do with it varies. Some may simply want to feed spreadsheets and personal BI tools with data. Others may want to initiate collaborative BI activities. Yet others may want to simply explore the data and, possibly with the assistance of consultants, do analytical investigations.
The difference it makes to the user is so significant that it is difficult to compare to the legacy data warehouse world with all its inflexibility. The difference it makes to the data scientist is even greater, since he or she can bring the full force of powerful machine learning and AI techniques to the data.
The Potential Obstacles to Next-Gen Nirvana
So that’s the big data self-service story and it is appealing. But the data lakes still have to be built and managed – on premise or in the cloud. Applications still have to be configured and run. There needs to be a mechanism for defining data streams. The data security will not suddenly manifest out of thin air.
Despite the fact that the base technology is inexpensive, organizations spend their way through big dollars trying to make big data and data self-service a reality. The money goes to contract engineers, consultants, integrators and service providers. In fact, the statistics suggest that although the big data and analytics market is already consuming over $100bn per annum, more than half of that goes to services. Historically, this is what we have come to expect from a relatively new IT market. Certainly, it was that way with the advent of databases, the advent of websites and the advent of mobile applications. The initial market is heavily dependent on skilled consultants of various sorts.
What usually happens is that software vendors emerge who automate a great deal of the work that consultants cover. This is what is happening with the data lake. One company we spoke to recently, Galactic Exchange, focuses on this area. It enables its customers to deploy clusters easily and quickly (on premise or in the cloud), to manage security, to set up data streams and to deploy applications. It even has an embedded app store which provides ready-made applications that can be deployed at the click of a mouse.
There are other companies entering this market as well. Indeed, we expect this to be an area of focus in 2017 – not just the adoption of data lake technology, but the use of ready-made frameworks so that data lakes can be easily deployed and quickly provide benefits.