Based on an Inside Analysis webinar sponsored by Dremio “Fast and Flexible: Interactive BI Arrives”
Written by Lindsay Conger with Deidre Hudson
“Then the Data Lake Evaporated into the Cloud”
The cloud has emerged as a tremendous resource not only for storing data but also for analyzing it. As an extension, the multicloud environment is the next step in cloud computing, and it is here to stay. It reaches almost every department in a company: from marketing to HR to sales automation, multicloud caters to a variety of environments with different apps. The amount of information in the cloud is astonishing, but the question now is how to manage a multicloud environment, especially from an analytical perspective.
What fueled this innovation?
Hardware is usually where innovation starts. Based on trends in hardware, performance is accelerating and costs continue to fall:
- Memory is at least 1,000x faster than spinning disks but used to be very expensive. The huge drop in cost allows companies to focus exclusively on in-memory architectures, which are perfect for leveraging a cloud-based environment.
- Gamers fueled a tremendous surge of investment in GPUs, which are excellent for machine learning.
- Parallelism – the changing nature of information management allows multiple processes to take place simultaneously.
Innovation Equals Access
The software world never stands still, though. Innovations are constantly being introduced, as we see with the triumph of open source as a business model: now, if you want to develop software, you reach for open source first. The data cloud is dominating, and costs are collapsing to a more easily accessible point. The software world is now a cloud-first environment.
However, this doesn’t mean the cloud is without flaws. Data analytics is a multi-disciplinary, end-to-end process, not a single activity, and being able to access data in real time is essential for that data to be useful. We are now starting to see the emergence of scalable technology and open source tools. The future is here; it just isn’t widely accessible yet. In tech pockets like Silicon Valley or Austin, Texas, there are huge leaps of innovation, and this technology is slowly trickling down to the rest of us as the cloud democratizes access to data.
Data Chain Problems
ETL (extract, transform, load) is slowly becoming obsolete as new cloud-native tools and platforms arrive on the scene. Data needed to become more accessible, but that meant more ETL scripts and batch processes. With so many data sets floating around, it was nearly impossible to keep all the data in one data warehouse.
Then data lakes were introduced, and they are now much more effective at keeping all the data in one place. This process is not cheap, but it is improving. Companies build data lakes because of their flexibility, cost model, elasticity, scalability, and support for data science workloads. A lot of effort had to go into the data we put in warehouses; the data lake is much more flexible, allowing us to take data in its raw form and keep it in one place without shaping it first.
Data scientists and data engineers are capable of using any APIs available against the data lake. However, BI users (who outnumber data scientists and engineers) cannot use standard tools like Tableau, Power BI, or Excel directly on the data lake. The data needs to be transformed so it can actually be used, and so it is cleaner and higher quality for BI users.
ETL tools work with data prep tools to shape and refine the data, guided by a data catalog that shows what data is in the data lake in both its raw and curated forms. But the data lake is still slow, so BI users need BI acceleration (cubing technology that precomputes common aggregations) and ad-hoc acceleration, which speeds up exploratory queries. Once these pieces are combined, tools like Tableau, Power BI, and Excel can run against the lake.
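The core idea behind cubing-style BI acceleration is to aggregate the raw data once, so that interactive queries read a small summary instead of scanning the lake. A toy sketch in plain Python (an illustration of the concept, not Dremio's implementation; the table and names are invented):

```python
from collections import defaultdict

# Toy "raw" fact table sitting in a data lake: (region, product, amount).
raw_sales = [
    ("east", "widget", 100),
    ("east", "gadget", 250),
    ("west", "widget", 300),
    ("west", "widget", 50),
]

# Build the aggregate once (the "cube"), keyed by region.
cube = defaultdict(int)
for region, _product, amount in raw_sales:
    cube[region] += amount

# An interactive BI query now reads the small cube, not the raw rows.
def total_sales(region):
    return cube[region]

print(total_sales("west"))  # 350
```

The trade-off is the classic one: the cube must be refreshed when raw data changes, in exchange for queries that no longer touch the raw rows at all.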
Dremio: The Missing Link
With data lakes becoming increasingly complex, there had to be a better way to close that gap, and Dremio is here to fill that missing link. Dremio is a fundamentally different approach, and it works with any data lake, BI, or data science tool. It offers 10x-1000x data acceleration, a self-service semantic layer, and zero-copy data curation; it is elastic, scales to 1,000+ nodes, and is open source. Dremio runs directly against the data lake and provides acceleration capabilities and a self-service semantic layer, so users can describe data in their own terms while data lineage is automatically tracked. This is a completely new tier in data analytics.
Most companies don’t have a central catalog where all their data is kept, and that data may be spread across clusters and multiple cloud services. With Dremio, you can connect to any of those sources, whether it is a relational database, MongoDB, or Hadoop. Dremio automatically recognizes the schema, categorizes it, and builds an index so the data is searchable. When working with Dremio, where the data is held no longer matters, and access to the data is effectively instantaneous.
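Automatic schema recognition over semi-structured records can be sketched as scanning sample records and recording which types appear in each field. This is a simplified illustration of the idea only; it does not show Dremio's actual mechanism, and the sample data is invented:

```python
def infer_schema(records):
    """Infer a field -> set-of-type-names mapping from sample records."""
    schema = {}
    for record in records:
        for field, value in record.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema

samples = [
    {"id": 1, "name": "sensor-a", "reading": 3.2},
    {"id": 2, "name": "sensor-b", "reading": 4},  # reading is sometimes an int
]

schema = infer_schema(samples)
```

Here `schema["reading"]` comes out as `{"float", "int"}`, showing why an engine must reconcile mixed types before the field can be exposed to a catalog or search index.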
Before, IT would have to create a dataset for you. This weeks- or months-long process could mean the data was obsolete by the time it got to you, and as IT creates more data sets, security risks multiply. Dremio lets you create data sets more quickly and effectively, and it makes changing a query easy and interactive. All changes are done in a virtual context, without writing code or moving any data.
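The "virtual context" idea, a curated dataset defined as a query over the source with no rows copied, is closely analogous to a SQL view. A minimal sketch using SQLite rather than Dremio itself (the table and view names are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, country TEXT, amount REAL)")
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", [
    ("alice", "US", 10.0),
    ("bob", "DE", 20.0),
    ("carol", "US", 30.0),
])

# A "virtual dataset": a curated view defined over the raw table.
# No rows are copied; changing the curation means redefining the query.
conn.execute(
    "CREATE VIEW us_events AS "
    "SELECT user, amount FROM events WHERE country = 'US'"
)

print(conn.execute("SELECT COUNT(*), SUM(amount) FROM us_events").fetchone())
# (2, 40.0)
```

Because the view is just a stored query, redefining it is instant, which is the zero-copy property the article describes: curation changes never require rewriting or moving the underlying data.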
Dremio has acceleration capabilities that are not available in data virtualization technologies. Dremio can work with MongoDB, S3, and many other modern data sources, whereas data virtualization technologies were designed for relational databases. Data virtualization technologies are also IT products rather than self-service products like Dremio, which makes them much more expensive. Without something like Dremio, it would be very difficult to quickly analyze data joined across two systems.
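What "joining data across two systems" means can be shown with a toy example: one document-shaped source (in the style of MongoDB) and one flat tabular source (in the style of a file on S3), combined by a shared key. This is a sketch of the concept only, with invented data, not Dremio's engine:

```python
# Document-style source: user profiles (as a MongoDB collection might return).
docs = [
    {"user_id": 1, "name": "alice"},
    {"user_id": 2, "name": "bob"},
]

# Flat tabular source: (user_id, order_amount) rows, as from a CSV file.
rows = [
    (1, 99.5),
    (1, 10.0),
    (2, 5.0),
]

# Hash join: build a lookup from the smaller source, then probe it.
profiles = {d["user_id"]: d["name"] for d in docs}
joined = [(profiles[uid], amt) for uid, amt in rows if uid in profiles]

print(joined)  # [('alice', 99.5), ('alice', 10.0), ('bob', 5.0)]
```

The hard part in practice is not the join itself but presenting both sources through one query interface with acceptable speed, which is the gap the article says this class of tool fills.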
Multifaceted programs like Dremio are like a highly versatile lens that allows you to look through data sets more effectively. As information landscapes get broader and wider, we need simpler ways to view and explore them. By adding virtualization components, Dremio has tackled one of the key requirements of the data software business: enabling people to create their own view of their world without changing the source or moving data around. Dremio allows data to be analyzed at the speed of thought, making everyone more productive.
For more information on Dremio, visit dremio.com.