In today’s world, many companies are in the throes of creating a customer-intelligent front office, where the objective is to ensure that customers have the same personalized experience across both digital and traditional channels. More and more focus is also going into digitalization, with the introduction of digital channels and the Internet of Things (IoT) dominating the agenda.
Digital channel adoption has seen web, mobile and social commerce all introduced into the enterprise as more and more people prefer to transact online from desktop and mobile devices and through corporate social network pages. This is creating huge amounts of so-called “digital exhaust” data, for example, clickstream data in web server logs and non-transactional shopping cart data often stored in NoSQL databases.
Watch author Mike Ferguson in The Briefing Room with Teradata’s Technology Director, Tho Nguyen, as they discuss the idea of bringing analytics to the data. Register here.
Clickstreams record everything we click on with a mouse or via a touch of a mobile device screen. There is a record for every click, with all this data held in web log files. That means we can precisely track every visitor’s navigational path through each web site. We can see what pages they looked at, where they went next and much more. Obviously, the more visitors that browse a web site, the more clickstream data is generated. With tens of thousands of visitors or more, it is not long before that volume becomes enormous.
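As an illustration of the idea, the sketch below reconstructs per-visitor navigational paths from click records. The records here are invented and heavily simplified (real web server logs would first need parsing from a format such as Apache combined log format, and clicks would typically be sessionized by time gaps as well):

```python
from collections import defaultdict

# Hypothetical, simplified click records: (visitor_id, timestamp, page).
log_records = [
    ("v1", "2016-01-10T09:00:01", "/home"),
    ("v2", "2016-01-10T09:00:03", "/home"),
    ("v1", "2016-01-10T09:00:05", "/products"),
    ("v1", "2016-01-10T09:00:09", "/products/widget-42"),
    ("v2", "2016-01-10T09:00:11", "/search?q=widgets"),
]

def navigation_paths(records):
    """Group click records by visitor and order them by timestamp,
    yielding each visitor's navigational path through the site."""
    paths = defaultdict(list)
    # ISO 8601 timestamps sort correctly as strings.
    for visitor, ts, page in sorted(records, key=lambda r: (r[0], r[1])):
        paths[visitor].append(page)
    return dict(paths)

print(navigation_paths(log_records))
# → {'v1': ['/home', '/products', '/products/widget-42'],
#    'v2': ['/home', '/search?q=widgets']}
```

At web scale this same grouping would of course run as a distributed job (e.g., in Spark) rather than in memory, but the logic is the same.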
Similarly, shopping cart data records a history of what you put into your shopping cart, what you took out, what you put in again, and so on, all before you buy. That means we can see what products and services a visitor or customer might be interested in even if they didn’t actually buy them.
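Replaying that event history is how interest can be inferred from non-transactional cart data. A minimal sketch, assuming an invented event stream for one visitor:

```python
# Hypothetical cart event stream for a single visitor: (action, product).
events = [
    ("add", "headphones"),
    ("add", "speaker"),
    ("remove", "speaker"),
    ("add", "speaker"),
    ("remove", "headphones"),
    ("purchase", "speaker"),
]

def interest_signals(cart_events):
    """Replay cart events to find products the visitor showed interest in
    (added to the cart at least once) but never actually purchased."""
    added, purchased = set(), set()
    for action, product in cart_events:
        if action == "add":
            added.add(product)
        elif action == "purchase":
            purchased.add(product)
    return added - purchased

print(interest_signals(events))
# → {'headphones'}
```

Signals like these are exactly what gets lost if only the final purchase transaction reaches the data warehouse.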
With respect to IoT, more and more products are now carrying sensors that emit data. Good examples include mobile phones with GPS sensors, watches, fitness wristbands, cars, industrial equipment, fridges, etc. The list goes on, and here too, the volumes of data being generated are huge.
Companies want this data because it can tell us much more than we can get from transaction activity data stored in data warehouses. Companies can learn new things from this new data, especially when they combine it with customer data.
How Do You Become a Data-Driven Enterprise?
The objective is clear. It is to become data driven, such that company direction is driven by evidence-based insights produced by analyzing digital and traditional data. The objective is to see new opportunities that allow organizations to disrupt existing and new markets. Therefore, becoming data driven is about being led by data and analytics. The key question is how do you achieve this? What is it that you need to do to become data driven? How do you deal with the deluge of data now pouring into the enterprise? What kinds of analytical platform or platforms do you need? Should you use the cloud, on-premises systems or both? How do you maximize the potential of predictive and advanced analytics? How do you deal with the Internet of Things? How do you integrate Big Data into your existing analytical architecture? How do you stay agile in a world where data is becoming increasingly distributed and therefore harder to access? How do you handle data governance when data is scattered across OLTP systems, NoSQL databases, analytical RDBMSs, Hadoop clusters and other file systems? How do you overcome the potential chaos of business-led, self-service BI and self-service data preparation? How do you harness shadow IT and turn it into citizen data science? It is a major challenge, and it is difficult not to get overwhelmed by it.
Accommodate, Don’t Replace
Yet among the chaos, common sense must prevail. This is not about replacing what you have. This is about extending existing analytical environments to accommodate new data and new analytical workloads in order to produce new insights to add to what you already know. In the past, the data warehouse was the analytical platform. Today that analytical platform is way more than that. It includes:
- Real-time analysis of high velocity, live streaming data
- High volume ingest technologies
- New analytical data stores like Hadoop HDFS, Amazon S3 and NoSQL graph databases
- Technologies for scalable exploratory analysis on large volumes of internal and external multistructured data, e.g., Hadoop and Apache Spark
- Advanced analytics, e.g., machine learning, text analysis, graph analytics
- End-to-end data management, scalable ETL and bi-modal collaborative data governance
- A combination of self-service and IT-based development
- Simplified access to multiple analytical data stores
The Extended Analytical Architecture Ecosystem
This new extended analytical ecosystem (shown in Figure 1) is architecturally more complex because it now includes all of the above in addition to a data warehouse. However, it has to function as if it were fully integrated. We need to:
- Manage scalable ingestion of data
- Create an organized data reservoir (potentially of multiple data stores) and manage it as if it were centralized even though it may be distributed
- Automate the cataloging and profiling of new data coming into the reservoir
- Introduce collaboration into data management to help classify data in the reservoir in terms of quality, sensitivity, business value
- Be able to refine data at scale via ELT processing by defining how we want to transform and integrate data independently of where we want those jobs to execute (e.g., in Hadoop, in DW staging tables, in the cloud, etc.)
- Encourage agility through exploratory data science sandboxes, self-service data preparation and self-service analysis
- Integrate self-service data integration with enterprise-level data integration tools and initiatives to enable bi-modal data governance
- Provide transparent access to data and insights, produced by data scientists, irrespective of whether that data is in a data warehouse, in Hadoop, in a live data stream or a combination of all of these; also irrespective of whether that data is on-premises, in the cloud or both. The way this can be achieved is via SQL and data virtualization to create a virtual logical data warehouse layer across all underlying analytical data stores, irrespective of whether they are relational or Hadoop-based.
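To make the cataloging and profiling point concrete, automated profiling usually means computing basic statistics on each column of newly ingested data so the catalog can describe it. A minimal sketch of a single-column profiler (the sample values are invented):

```python
def profile_column(values):
    """Basic automated profile for one column of newly ingested data:
    row count, null count, distinct count, and min/max of non-null values.
    A real catalog would add type inference, patterns, histograms, etc."""
    non_null = [v for v in values if v is not None]
    return {
        "rows": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
    }

sample = [120, 85, None, 85, 240]
print(profile_column(sample))
# → {'rows': 5, 'nulls': 1, 'distinct': 3, 'min': 85, 'max': 240}
```

Stored in an information catalog alongside collaborative tags for quality, sensitivity and business value, profiles like this are what let people judge whether reservoir data is fit for reuse.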
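The logical data warehouse idea in the last point above can be illustrated in miniature. The sketch below uses two separate SQLite databases purely as stand-ins for a relational warehouse and a Hadoop-side store (real data virtualization products federate far more heterogeneous systems); the point is that one SQL layer joins data from physically separate stores, so the consumer never needs to know where the data lives. All table names and data are invented:

```python
import sqlite3

# "main" stands in for the data warehouse; an attached second database
# stands in for a Hadoop-based store. The single connection plays the
# role of the virtualization layer.
con = sqlite3.connect(":memory:")
con.execute("ATTACH DATABASE ':memory:' AS lake")

con.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
con.executemany("INSERT INTO customers VALUES (?, ?)",
                [(1, "Ada"), (2, "Ben")])

con.execute("CREATE TABLE lake.clicks (customer_id INTEGER, page TEXT)")
con.executemany("INSERT INTO lake.clicks VALUES (?, ?)",
                [(1, "/home"), (1, "/products"), (2, "/home")])

# A temp view spanning both stores gives consumers one virtual table
# (in SQLite, TEMP views may reference any attached database).
con.execute("""
    CREATE TEMP VIEW customer_clicks AS
    SELECT c.name, COUNT(*) AS clicks
    FROM customers c JOIN lake.clicks k ON k.customer_id = c.id
    GROUP BY c.name
""")
print(con.execute("SELECT * FROM customer_clicks ORDER BY name").fetchall())
# → [('Ada', 2), ('Ben', 1)]
```

The query against `customer_clicks` hides which store each table lives in, which is exactly the transparency the logical data warehouse layer is meant to provide.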
At present we are in the middle of building this new extended analytical ecosystem and even though some software components are still incomplete, many companies are already using technologies like Hadoop and Spark to build new analytical applications. Critical success factors include business alignment (i.e., making sure candidate projects are aligned with strategic business goals), increasing automation so that you don’t have to write code to prepare and analyze data, an information catalog to govern what data is available for reuse, and organizing for success to enable citizen data science to rapidly produce new insights.