The growth in data complexity and the sheer number of disparate datasets that organizations process today make it imperative to choose wisely when gathering, storing, and analyzing data. The traditional approach to analytics and reporting has been to use data warehouses, which store structured data for analytic processing. Today, however, we see a large volume and variety of data arriving from disparate sources such as websites, social media, mobile devices, and IoT devices. The need to process this ‘Big Data’ has led to the creation of data lakes.
What features set data lakes apart from data warehouses, and which is the right solution for your business needs? This post attempts to demystify the concepts behind the two.
By definition, a data lake is meant to hold massive quantities of raw data, much of it unstructured, whose structure is not defined until the data is extracted (an approach often called schema on read). A data warehouse, by contrast, is a schema-on-write system, optimized for analytic processing rather than transaction handling. Here are a few of the differences.
With a data warehouse, a large amount of time is spent analyzing, processing, and profiling data up front. This is done mainly to simplify the model, conserve space, and lower costs. The compromise is that a large amount of data is excluded simply because it does not answer a specific question or find a place in a defined report.
A data lake retains all sorts of data, whether structured, semi-structured, or unstructured, in large amounts. This is possible because the two run on different hardware: off-the-shelf servers combined with economical storage options make scaling a data lake far less of a concern.
In a data warehouse, the data generally consists of quantitative metrics and the attributes that describe them. New types of data keep being introduced, of course, but storing them in a warehouse is both expensive and difficult.
Data lakes, on the other hand, embrace non-traditional types of data such as web server logs, sensor data, social network activity, and images. All data, regardless of its source or structure, is kept in its raw form, ready to be transformed when the need arises.
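To make the idea of "kept in raw form, transformed when the need arises" a little more concrete, here is a minimal Python sketch. The sample records, the field names, and the `read_with_schema` helper are all hypothetical and exist only for illustration; a real lake would sit on distributed storage rather than a Python list.

```python
import json

# Raw events as they might land in a lake: mixed shapes, stored untouched.
# (Hypothetical sample records, for illustration only.)
RAW_EVENTS = [
    '{"source": "web", "path": "/pricing", "status": 200}',
    '{"source": "sensor", "device_id": "t-17", "temp_c": 21.5}',
    '{"source": "web", "path": "/signup", "status": 500}',
]


def read_with_schema(raw_lines, source, fields):
    """Apply a schema at read time: keep only the records that come from
    `source` and project each one onto the requested `fields`."""
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        if record.get("source") == source:
            rows.append({name: record.get(name) for name in fields})
    return rows


# Two different "views" over the same untouched raw data.
web_rows = read_with_schema(RAW_EVENTS, "web", ["path", "status"])
sensor_rows = read_with_schema(RAW_EVENTS, "sensor", ["device_id", "temp_c"])
```

The raw list is never modified. Each consumer decides at read time which records and fields matter, which is the essence of the raw-first approach described above.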
A data warehouse is a highly structured repository. Changing its structure is not necessarily arduous, but it is definitely time-consuming. Even a well-designed warehouse that adapts well to change suffers from the complexity of the data loading process, which makes the system resource-heavy and slower to respond. The ever-increasing need for faster answers is what has given rise to the concept of self-service business intelligence.
A data lake, however, does not impose a structure, and since the raw data is always accessible, developers and data scientists can configure and reconfigure models and apps on the fly. Constant changes, whether deletions or additions, can be carried out without changing any structure or employing additional resources.
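The contrast in adaptability can be sketched in a few lines of Python. This is only an illustration under stated assumptions: an in-memory SQLite table stands in for the warehouse, a plain list stands in for the lake, and the `sales` table and `coupon` attribute are hypothetical.

```python
import json
import sqlite3

# Warehouse stand-in: a table whose columns are fixed when it is created.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (order_id INTEGER, amount REAL)")

# A record that carries a new attribute the table does not know about.
new_record = {"order_id": 1, "amount": 9.99, "coupon": "SPRING"}
insert_sql = "INSERT INTO sales (order_id, amount, coupon) VALUES (?, ?, ?)"
values = (new_record["order_id"], new_record["amount"], new_record["coupon"])

try:
    warehouse.execute(insert_sql, values)
    rejected = False
except sqlite3.OperationalError:
    # SQLite refuses the insert because the column does not exist yet.
    rejected = True

# The structure has to change before the new attribute can be stored.
warehouse.execute("ALTER TABLE sales ADD COLUMN coupon TEXT")
warehouse.execute(insert_sql, values)

# Lake stand-in: the raw record is appended as is, with no structural change.
lake = []
lake.append(json.dumps(new_record))
```

In a real warehouse, the `ALTER TABLE` step usually also means revisiting the loading process and downstream reports, which is where most of the time goes.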
Larger user base
In an organization, there are three main sets of data users: operational users, analytic users, and modelers. Operational users make up the bulk, often around 80% of the user base. They use data for routine tasks such as viewing reports, tracking key performance metrics, and editing data in spreadsheets. Analytic users, who form around 10%, use the data warehouse as a source but often require additional data which is stored outside the organization. Unlike operational users, they need to go beyond the data warehouse. Modelers differ greatly from these two groups: they perform deeper analysis, which requires far more than a data warehouse can offer. Most often, these data scientists bypass the data warehouse and use advanced analytic tools and capabilities such as statistical analysis and predictive modeling.
A data lake approach supports all three groups of users equally well. The data scientists can go to the lake and work with the very large and varied data sets they need, while other users work with more structured views of the data prepared for them.
Since data lakes contain all sorts of data in raw form, they enable users to get to their results faster than the traditional data warehouse approach. However, this speed comes at a cost: the user is expected to transform, cleanse, and structure the data as they see fit. The approach therefore works mainly for the analytic users and modelers described above, as operational users may not have the skills or expertise to deal with such large amounts of unstructured data.
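As a rough picture of what "transform, cleanse, and structure" can mean in practice, consider the Python sketch below. The sensor readings and the `cleanse` helper are hypothetical, and the cleansing rules (skip unparseable lines, skip missing values, normalize identifiers) are just one possible set of choices a user might make.

```python
import json

# Raw sensor readings as they might sit in a lake (hypothetical samples).
RAW_READINGS = [
    '{"device_id": "t-17", "temp_c": "21.5"}',
    '{"device_id": "t-17", "temp_c": null}',
    'not valid json',
    '{"device_id": " T-18 ", "temp_c": 19}',
]


def cleanse(raw_lines):
    """Return structured readings, dropping lines that cannot be used."""
    clean = []
    for line in raw_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        temp = record.get("temp_c")
        if temp is None:
            continue  # skip readings with no temperature
        clean.append({
            # Normalize the identifier: trim whitespace, lower-case it.
            "device_id": record["device_id"].strip().lower(),
            # Temperatures arrive as strings or numbers; store floats.
            "temp_c": float(temp),
        })
    return clean


readings = cleanse(RAW_READINGS)
```

Every rule in `cleanse` is a decision the user has to make and maintain, which is the cost that comes with the speed of working directly on raw data.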
While it may seem that a data lake approach is the way to go, you don’t necessarily have to do away with your current warehouse solution. It is important to note that the data lake is not a better version of a warehouse, nor is it a replacement. Each is optimized for a different purpose, and the goal is to use each one for what it was designed to do. It is quite possible to use both in tandem.
A data lake can be used for sandboxing, allowing users to experiment with different data models and transformations before setting up a new schema in a data warehouse. It can also serve as a staging area from which data is supplied to a data warehouse.
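The staging-area pattern might look something like the following Python sketch. It is a simplified illustration, not a production pipeline: a list of JSON lines stands in for files in the lake, an in-memory SQLite database stands in for the warehouse, and the `orders` table and `load_into_warehouse` helper are hypothetical.

```python
import json
import sqlite3

# Raw order records staged in the lake (hypothetical samples). The second
# record has an extra "note" field that the warehouse schema ignores.
RAW_ORDERS = [
    '{"order_id": 1, "amount": "10.50", "channel": "web"}',
    '{"order_id": 2, "amount": "20.00", "channel": "mobile", "note": "gift"}',
]


def load_into_warehouse(raw_lines, connection):
    """Shape staged raw records to the warehouse schema and load them."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS orders ("
        "order_id INTEGER PRIMARY KEY, amount REAL, channel TEXT)"
    )
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        # Keep only the fields the warehouse schema defines.
        rows.append(
            (record["order_id"], float(record["amount"]), record["channel"])
        )
    connection.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    connection.commit()
    return len(rows)


warehouse = sqlite3.connect(":memory:")
loaded = load_into_warehouse(RAW_ORDERS, warehouse)
total = warehouse.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
```

The raw records, including fields the warehouse ignores, stay available in the lake, so a different schema can be tried later without collecting the data again.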
By using each where it fits, it is possible to find the right solution to your data storage needs.