Bring structure to your data

Explaining the most commonly used forms of data storage

Article
Data Engineering
Data warehousing

Oskar van den Berg

Data Engineer

7 min

02 Feb 2024

There are many different forms of data storage. In practice, a (relational) database, a data warehouse, and a data lake are the most commonly used and often confused with each other. In this article, you will read about what they entail and how to use them.

What forms of data storage are there?

In practice, these three forms of data storage are most commonly used:

Relational Database

Typically referred to simply as a database, although there are also other forms of databases.
Often used for operational processes.
Data in a relational database is structured in a fixed data model and is frequently reused by other operational processes.
Therefore, it is crucial that the data is transcribed accurately and reliably.

Data Warehouse

A storage space for data not directly used by operational systems.
Has a structured method of storage and is primarily used for analytical purposes.

Data Lake

Shares many similarities with a data warehouse but provides more flexibility in how data is stored.
Essentially, a data lake is a "regular" file storage, supporting various file types.
Unlike the other two storage forms, a data lake does not have a built-in engine to process data.

What is a relational database?

A relational database is constructed according to a relational data model. Data is stored in tables, with rows and columns having a fixed structure (the 'schema'). Relational databases represent a traditional database structure.

The fixed structure ensures efficiency and consistency, resulting in high data reliability. Retrieving data from relational databases is done using Standard Query Language (SQL). Another characteristic of relational databases is their ability to be used by many users simultaneously.

The abbreviation OLTP is often associated with a description of relational databases. This abbreviation stands for 'online transactional processing.' The term 'transaction' indicates that relational databases excel in processing structured data, such as transactions from a store or online shop.

When to use a relational database?

When precision is crucial, such as processing transactions or financial data.
With smaller data volumes (i.e., not big data).
When you need the ability to make easy adjustments.

When not to use a relational database?

For data analysis, relational databases are often too slow.
When dealing with large data volumes: relational databases are not distributed databases, and therefore, they can only scale vertically by adding more storage to the machine itself.
If you require a flexible schema.
Relational databases are not designed to handle unstructured data. If your data comes in various formats (json, xml), a NoSQL database might be a better alternative.

What is a data warehouse?

A data warehouse is highly suitable for data analysis purposes. In a data warehouse, data from various sources is loaded and merged. While a data warehouse operates with a relational structure, it can contain diverse types of data and is specifically structured for queries and analyses.

Data retrieval often occurs in batches, and not every time a transaction takes place in a source system, as is frequently the case in a relational database.

Traditionally, data warehouses employ an ETL (Extract, Transform, Load) pipeline:

Extract – Data is retrieved from various sources (data lake, database, etc.).
Transform – Data from different sources is combined and transformed into a data model optimised based on your goals.
Load – Data from a data warehouse is often loaded into a BI or dashboarding tool.

The data model for a data warehouse must be designed to make it easy for users to work with the data and improve query performance.

Today, data processing is not always executed within the data warehouse itself but is applied using DBT (Data Build Tool). This technique involves storing all operations separately and largely agnostic from the data warehouse. It features interesting aspects, such as automatically conducting data quality tests and providing a quick and intuitive way of documentation. It also allows for the reuse of models and operations. However, you still use the query engine of the data warehouse and can leverage functionalities like the web interface and user management.

When to use a data warehouse?

When you want to combine multiple data sources.
When there is a need for a central location for a single source of truth.
When better performance for analyses is required.
When access to data needs to be centrally managed because all data is stored in one place.
With large data volumes.
When you have data that can be structured.

When not to use a data warehouse?

While a data warehouse is powerful and modern cloud data warehouses are often cost-effective, there are situations where it might not be the right solution:

Operational data, such as transactions, is often better suited for a relational database. However, a data warehouse can be the right solution for analysis of this data.
If the data is unstructured and does not fit into a tight schema, a data lake might be a better alternative.
If your data has different formats (json, xml), a data lake or a NoSQL database could be a better alternative.
For analytical purposes where timeliness is crucial, such as real-time analyses.

What is a data lake?

A data lake is a centralised storage where you can store both structured and unstructured data. Data lakes are capable of storing very large volumes of data with diverse structures.

A data lake can be compared to a folder on a computer where various files (text, CSV, JSON, photos, videos, audio, etc.) can be stored without the need for the data to be structured beforehand.

You can perform different types of analyses, create dashboards and visualisations, apply (big) data processing, and run machine learning models on data stored in a data lake. However, you do need a separate query engine for these operations. The convergence of these two technologies is referred to as a data lake house.

When to use a data lake?

When you want to analyse unstructured data.
With large volumes of data.
When you desire the freedom to explore datasets and conduct ad-hoc analyses (common among Data Scientists).
When you expect your data analysis use case to evolve in the future. A data lake is well-suited for storing data historically before processing it into a structured data model.
When you want to analyse data without inserting it into a pre-defined schema. With "schema-on-read," you can run queries on files without creating a table beforehand.

When not to use a data lake?

While data lakes are flexible and versatile, they may not always be the ideal solution for every scenario:

Operational data, such as transactions, is often better suited for a relational database due to data accessibility.
Data analysis on structured data is sometimes simpler and faster in a data warehouse with a clearly defined data model.
The unstructured nature of a data lake requires user-side structure to maintain an overview.

Storing data for analytical purposes? use a data model

When you intend to process data for analytical purposes in a data infrastructure, you need to store the data in a data model.

A data model can offer a lot of freedom, like storing files randomly on a laptop, or it can provide a structured order, such as in a database where all data is stored in a fixed schema.

The more freedom allowed in the storage method, the more challenging it becomes to process the data later for analytical purposes. Having a well-thought-out and organised data model is crucial for how the data can be used later. Queries that could have been simple may become very complex if the data model is not well-thought-out.

Data modeling is not a fixed process; it is iterative as new requirements and data are introduced. Flexibility is beneficial as new information becomes available.

The right data model for your organisation

The data model for a data warehouse can be developed by data architects, data engineers, analytics engineers and other data professionals with expertise in data modeling.

Each approach has its own advantages and disadvantages, and there is no one-size-fits-all solution, even though some proponents of a particular type may suggest otherwise. Often, beneath the one-size-fits-all approaches, there are assumptions that are impractical in practice. This can include well-defined and infrequently changing business data definitions.

Analytics engineers are specialists who possess knowledge in both creating a data model and implementing it. This involves gathering the requirements for the data model from the business, processing them into a model, and implementing it using techniques such as SQL and DBT.

What is a data infrastructure?

Behind a well-designed data model lies a robust data architecture, which establishes generic guidelines on how a data model should be created. These generic guidelines are generally sufficient, but specific data often requires tailored solutions.

As long as careful consideration is given to why a particular solution is needed for that data – and why there is a deviation from the generically prescribed solution – it can fit well within a mature data architecture.

An architecture is typically established by a solution architect. They can translate the business context and functional requirements into a technical design.

Need help or advice on storing data within your organisation?

Our data experts are happy to collaborate with you! Schedule an online meeting to discuss your challenges and explore solutions.

This is an article by Oskar

Oskar van den Berg is a Data Engineer and Team Lead with 7 years of experience at Digital Power. With his work experience as a Data Engineer in both startup and enterprise organisations, he can advise clients on technical and strategic levels.

Oskar van den Berg

Data Engineeroskar.vandenberg@digital-power.com

Receive data insights, use cases and behind-the-scenes peeks once a month?

You might find this interesting too:

Migration to the cloud: How does this work in practice?

In the past, all data from companies was stored locally in an on-premise environment. More and more companies are migrating their data infrastructure to the cloud. Cloud computing utilises servers managed and maintained by cloud service providers such as Amazon Web Services, Microsoft Azure, and Google Cloud Platform. In this article, you will read the answers to the questions you may have when considering a migration to the cloud.

What is machine learning operations (MLOps)?

Bringing machine learning models to production has proven to be a complex task in practice. MLOps assists organisations that want to develop and maintain models themselves in ensuring the quality and continuity. Read this article and get answers to the most frequently asked questions on this topic.

Webinar: Data Governance

In this webinar, we discuss the maturity model that we apply to quantify the maturity of different dimensions of data governance. Additionally, we provide concrete steps and implementation tips to start providing added value through data management.

20% fewer complaints thanks to data-driven maintenance reports

An essential part of Otis's business operations is the maintenance of their elevators. To time this effectively and proactively inform customers about the status of their elevator, Otis wanted to implement continuous monitoring. They saw great potential in predictive maintenance and remote maintenance.

Valuable insights from Microsoft Dynamics 365

Agrico is a cooperative of potato growers. They cultivate potatoes for various purposes such as consumption and planting future crops. These potatoes are exported worldwide through various subsidiaries. All logistical and operational data is stored in their ERP system, Microsoft Dynamics 365. Due to the complexity of this system with its many features, the data is not suitable for direct use in reporting. Agrico asked us to help make their ERP data understandable and develop clear reports.

Kubernetes-based event-driven autoscaling with KEDA: a practical guide

This article explains the essence of Kubernetes Event Driven Autoscaling (KEDA). Subsequently, we configure a local development environment enabling the demonstration of KEDA using Docker and Minikube. Following this, we expound upon the scenario that will be implemented to showcase KEDA, and we guide through each step of this scenario. By the end of the article, you will have a clear understanding of what KEDA entails and how they can personally implement an architecture with KEDA.

AWS (Amazon Web Services) vs GCP (Google Cloud Platform) for Apache Airflow

This article provides a comparison between these two managed services Cloud Composer & MWAA. This will help you understand the similarities, differences, and factors to consider when choosing them. Note that there are other good options when it comes to hosting a managed airflow implementation, such as the one offered by Microsoft Azure. The two being compared in this article are chosen due to my hands-on experience using both managed services and their respective ecosystems.

Transitioning from Universal Analytics 360 to Google Analytics 4 and Streamlining Data Analysis

There are currently a lot of developments surrounding Google Analytics, including user privacy (GDPR) and the sunset of Universal Analytics. For Miele X, the digital branch of Miele, GA4 was also one of the topics on their agenda as part of their bigger plans towards a more privacy-centric and vendor-agnostic way of data collection. They enlisted our help to support them with the transition from Universal Analytics 360 to GA4.

Insight into the complete sales funnel thanks to a data warehouse with dbt

Our consultants log the assignments they take on for our clients in our ERP system AFAS. In our CRM system HubSpot, we can see all the information relevant before signing a collaboration agreement. When we close a deal, all the information from HubSpot automatically transfers to AFAS. So, HubSpot is mainly used for the process before entering a collaboration, while AFAS is used for the subsequent phase. To tighten our people's planning and improve our financial forecasts, we decided to set up a data warehouse to integrate data from both data sources.

Data quality: the foundation for effective data-driven work

Data projects often need to deliver results quickly. The field is relatively new, and to gain support, it must first prove its value. As a result, many organisations build data solutions without giving much thought to their robustness, often overlooking data quality. What are the risks if your data quality is not in order, and how can you improve it? Find the answers to the key questions about data quality in this article.

The all-round profile of the modern data engineer

Since the field of big data emerged, many elements of the modern data stack became the data engineers' responsibility. What are these elements, and how should you build your data team?