First it was punches on a paper card – chads. Next it was magnetic tape files spinning on a reel. Then it was core memory, strung together on X, Y, and Z wires. Soon there was disk storage. And then, with miniaturization, there were massive amounts of disk storage.
Along this progression, the cost of storage has shrunk and processing power has multiplied exponentially. With these changes, the type of data stored and the uses of that data have changed dramatically as well.
In the earliest days, data was used for simple applications such as accounts payable, accounts receivable, and payroll. Then online transaction processing became the norm, and a whole new type of data appeared – repetitive transaction-based data. Soon archival data began to appear, the result of transaction data that was never purged. An entirely new explosion of data occurred when log tapes, Internet business mapping, telephone exchange records, email, tweets, and the like came into the corporate domain. The result today is a volume of data the likes of which was never imagined a decade ago.
This explosion of data volumes has resulted in what we are calling “Big Data.” Some of the issues associated with Big Data are familiar. First, there is the issue of cost. Despite the continuing drop in the unit cost of storage, the rate at which data is being collected far surpasses the falling price of storage. So even though storage is getting less expensive on a unit basis, the total cost of storage continues to go up and continues to be an issue. Then there is the issue of the technology required to manage and control Big Data. Once uniprocessor technology was acceptable, and even SMP technology sufficed. But with the volumes of data encountered today, raw shared-nothing parallelism is required. In a world where 16-bit addressability once sufficed, Big Data now faces the prospect of 64-bit, 128-bit, and even 512-bit addressability.
Still within living memory are the early programs, written for machines that offered only 1 MB of memory. Those programs and that technology have long since passed into museums and dusty attics.
But with Big Data come some challenges not encountered before. With Big Data comes the issue of irrelevant data. When one looks at Big Data, it quickly becomes obvious that much of the data captured has little or no business relevance. Take the trail left by an Internet session. The system is quite capable of tracking every squiggle and every movement of the cursor, and these minute changes are captured. But what is the business relevance of those changes? Is there real value in knowing that a person’s hand accidentally hit the mouse? Is there some hidden business meaning in the accidental slippage of the mouse? And in the email arena there are spam and blather. Spam consists of messages introduced into the email stream by an outside agency on topics that have nothing to do with the business of the organization. Blather consists of emails generated internally that have no business relevance. When a person emails his girlfriend – “Let’s go out on Saturday night” – that email has no relevance to the business of the corporation. It is blather.
Yet spam and blather are a standard part of the email stream, even though they contribute nothing to the organization’s ability to make better business decisions. Perhaps there were spam and blather in an earlier day and age, but they were so innocuous and so minute that they were never noticed. In today’s world, however, spam and blather are a very real part of Big Data.
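To make the distinction concrete, here is a minimal sketch of screening an email stream for spam and blather before analysis. The keyword lists and the classification rules are illustrative assumptions, not a real filter; production systems would use far more sophisticated techniques.

```python
# Hypothetical sketch of separating spam and blather from business-relevant
# email. The marker and term lists below are made-up assumptions.

BUSINESS_TERMS = {"invoice", "contract", "shipment", "quarterly", "payroll"}
SPAM_MARKERS = {"lottery", "winner", "click here", "free offer"}

def classify(message: str) -> str:
    """Label a message as spam, blather, or business-relevant."""
    text = message.lower()
    if any(marker in text for marker in SPAM_MARKERS):
        return "spam"      # introduced by an outside agency
    if not any(term in text for term in BUSINESS_TERMS):
        return "blather"   # internally generated, no business relevance
    return "relevant"

print(classify("You are a lottery winner, click here!"))  # spam
print(classify("Let's go out on Saturday night"))         # blather
print(classify("Please approve the quarterly invoice"))   # relevant
```

Even a toy rule set like this illustrates the point of the section: a large share of the stream is identified as noise before any analysis begins.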
So the relevance of the data that lies within Big Data is an issue that has not really appeared before.
Lack of Repetition
Another issue is the lack of repetition in the data found in Big Data. Once upon a time, data could be nicely and neatly organized into repetitive records. One thinks of ATM activities or airline reservations when it comes to repetition. In truth, one ATM activity is just like every other ATM activity when it comes to the structure of the data. The only thing that changes from one ATM activity to the next is the contents of the fields of data found within the record. The same is true for airline reservations and many other records of data created by the execution of transactions.
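The point about repetition can be sketched in code: every record shares one fixed structure, and only the field contents vary from one transaction to the next. The field names here are assumptions chosen for illustration, not an actual ATM record layout.

```python
from dataclasses import dataclass

# Illustrative sketch: every ATM record has the same structure;
# only the contents of the fields differ between transactions.
# The field names are hypothetical.

@dataclass
class AtmRecord:
    account: str
    timestamp: str
    transaction_type: str
    amount_cents: int

tx1 = AtmRecord("12345", "2012-06-01T09:15", "withdrawal", 10000)
tx2 = AtmRecord("67890", "2012-06-01T09:17", "deposit", 25000)

# Identical schema for both records, different contents
assert tx1.__dataclass_fields__.keys() == tx2.__dataclass_fields__.keys()
```

It is exactly this fixed, shared schema that lets repetitive transaction data fit so naturally into a standard database management system.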
While some of Big Data is repetitive, much of it is not. Much of Big Data appears as text, and traditionally text is not repetitive at all. The challenge in handling text is that, because text is not repetitive, it does not fit normally and naturally into standard database management systems. In addition, text has its own unique set of issues. For example, most text is subject to the vagaries of terminology. In one part of the U.S. an object is called one thing; in another part of the U.S. the same object is called something else. If text is taken literally, the connection will never be made that people are talking about the same thing. Take the term “broken bone,” for example. Doctors tell us that the condition can be expressed in at least 20 ways. When analyzing Big Data, there must be a means of resolving the terminology found within it.
Another issue is the formality of text. Look at the text found in a textbook and chances are good that an English teacher would give the author an A for proper spelling, correct punctuation, and the correct use of verbs, nouns, prepositions, and adverbs. But look at the notes a doctor takes and the same English teacher would give the doctor an F. Why? Because doctors write their notes in their own shorthand, happily and willfully breaking all the rules the English teacher taught them. Yet doctors’ notes are a perfectly legitimate form of text and cannot be discarded just because the English teacher doesn’t like what has been written.
The world of Big Data is full of new challenges and new opportunities. Some of the same old issues appear and other new issues have arisen. The challenges of Big Data are just now coming into focus.