Bring structure to your data
Explaining the most commonly used forms of data storage
- Article
- Data Engineering
There are many different forms of data storage. In practice, a (relational) database, a data warehouse, and a data lake are the most commonly used and often confused with each other. In this article, you will read about what they entail and how to use them.
What forms of data storage are there?
In practice, these three forms of data storage are most commonly used:
Relational Database
- Typically referred to simply as a database, although there are also other forms of databases.
- Often used for operational processes.
- Data in a relational database is structured in a fixed data model and is frequently reused by other operational processes.
- Therefore, it is crucial that the data is transcribed accurately and reliably.
Data Warehouse
- A storage space for data not directly used by operational systems.
- Has a structured method of storage and is primarily used for analytical purposes.
Data Lake
- Shares many similarities with a data warehouse but provides more flexibility in how data is stored.
- Essentially, a data lake is a "regular" file storage, supporting various file types.
- Unlike the other two storage forms, a data lake does not have a built-in engine to process data.
What is a relational database?
A relational database is constructed according to a relational data model. Data is stored in tables, with rows and columns having a fixed structure (the 'schema'). Relational databases represent a traditional database structure.
The fixed structure ensures efficiency and consistency, resulting in high data reliability. Retrieving data from relational databases is done using Standard Query Language (SQL). Another characteristic of relational databases is their ability to be used by many users simultaneously.
The abbreviation OLTP is often associated with a description of relational databases. This abbreviation stands for 'online transactional processing.' The term 'transaction' indicates that relational databases excel in processing structured data, such as transactions from a store or online shop.
When to use a relational database?
- When precision is crucial, such as processing transactions or financial data.
- With smaller data volumes (i.e., not big data).
- When you need the ability to make easy adjustments.
When not to use a relational database?
- For data analysis, relational databases are often too slow.
- When dealing with large data volumes: relational databases are not distributed databases, and therefore, they can only scale vertically by adding more storage to the machine itself.
- If you require a flexible schema.
- Relational databases are not designed to handle unstructured data. If your data comes in various formats (json, xml), a NoSQL database might be a better alternative.
What is a data warehouse?
A data warehouse is highly suitable for data analysis purposes. In a data warehouse, data from various sources is loaded and merged. While a data warehouse operates with a relational structure, it can contain diverse types of data and is specifically structured for queries and analyses.
Data retrieval often occurs in batches, and not every time a transaction takes place in a source system, as is frequently the case in a relational database.
Traditionally, data warehouses employ an ETL (Extract, Transform, Load) pipeline:
- Extract – Data is retrieved from various sources (data lake, database, etc.).
- Transform – Data from different sources is combined and transformed into a data model optimised based on your goals.
- Load – Data from a data warehouse is often loaded into a BI or dashboarding tool.
The data model for a data warehouse must be designed to make it easy for users to work with the data and improve query performance.
Today, data processing is not always executed within the data warehouse itself but is applied using DBT (Data Build Tool). This technique involves storing all operations separately and largely agnostic from the data warehouse. It features interesting aspects, such as automatically conducting data quality tests and providing a quick and intuitive way of documentation. It also allows for the reuse of models and operations. However, you still use the query engine of the data warehouse and can leverage functionalities like the web interface and user management.
When to use a data warehouse?
- When you want to combine multiple data sources.
- When there is a need for a central location for a single source of truth.
- When better performance for analyses is required.
- When access to data needs to be centrally managed because all data is stored in one place.
- With large data volumes.
- When you have data that can be structured.
When not to use a data warehouse?
While a data warehouse is powerful and modern cloud data warehouses are often cost-effective, there are situations where it might not be the right solution:
- Operational data, such as transactions, is often better suited for a relational database. However, a data warehouse can be the right solution for analysis of this data.
- If the data is unstructured and does not fit into a tight schema, a data lake might be a better alternative.
- If your data has different formats (json, xml), a data lake or a NoSQL database could be a better alternative.
- For analytical purposes where timeliness is crucial, such as real-time analyses.
What is a data lake?
A data lake is a centralised storage where you can store both structured and unstructured data. Data lakes are capable of storing very large volumes of data with diverse structures.
A data lake can be compared to a folder on a computer where various files (text, CSV, JSON, photos, videos, audio, etc.) can be stored without the need for the data to be structured beforehand.
You can perform different types of analyses, create dashboards and visualisations, apply (big) data processing, and run machine learning models on data stored in a data lake. However, you do need a separate query engine for these operations. The convergence of these two technologies is referred to as a data lake house.
When to use a data lake?
- When you want to analyse unstructured data.
- With large volumes of data.
- When you desire the freedom to explore datasets and conduct ad-hoc analyses (common among Data Scientists).
- When you expect your data analysis use case to evolve in the future. A data lake is well-suited for storing data historically before processing it into a structured data model.
- When you want to analyse data without inserting it into a pre-defined schema. With "schema-on-read," you can run queries on files without creating a table beforehand.
When not to use a data lake?
While data lakes are flexible and versatile, they may not always be the ideal solution for every scenario:
- Operational data, such as transactions, is often better suited for a relational database due to data accessibility.
- Data analysis on structured data is sometimes simpler and faster in a data warehouse with a clearly defined data model.
- The unstructured nature of a data lake requires user-side structure to maintain an overview.
Storing data for analytical purposes? use a data model
When you intend to process data for analytical purposes in a data infrastructure, you need to store the data in a data model.
A data model can offer a lot of freedom, like storing files randomly on a laptop, or it can provide a structured order, such as in a database where all data is stored in a fixed schema.
The more freedom allowed in the storage method, the more challenging it becomes to process the data later for analytical purposes. Having a well-thought-out and organised data model is crucial for how the data can be used later. Queries that could have been simple may become very complex if the data model is not well-thought-out.
Data modeling is not a fixed process; it is iterative as new requirements and data are introduced. Flexibility is beneficial as new information becomes available.
The right data model for your organisation
The data model for a data warehouse can be developed by data architects, data engineers, analytics engineers and other data professionals with expertise in data modeling.
Each approach has its own advantages and disadvantages, and there is no one-size-fits-all solution, even though some proponents of a particular type may suggest otherwise. Often, beneath the one-size-fits-all approaches, there are assumptions that are impractical in practice. This can include well-defined and infrequently changing business data definitions.
Analytics engineers are specialists who possess knowledge in both creating a data model and implementing it. This involves gathering the requirements for the data model from the business, processing them into a model, and implementing it using techniques such as SQL and DBT.
What is a data infrastructure?
Behind a well-designed data model lies a robust data architecture, which establishes generic guidelines on how a data model should be created. These generic guidelines are generally sufficient, but specific data often requires tailored solutions.
As long as careful consideration is given to why a particular solution is needed for that data – and why there is a deviation from the generically prescribed solution – it can fit well within a mature data architecture.
An architecture is typically established by a solution architect. They can translate the business context and functional requirements into a technical design.
Need help or advice on storing data within your organisation?
Our data experts are happy to collaborate with you! Schedule an online meeting to discuss your challenges and explore solutions.
This is an article by Oskar
Oskar van den Berg is a Data Engineer and Team Lead with 7 years of experience at Digital Power. With his work experience as a Data Engineer in both startup and enterprise organisations, he can advise clients on technical and strategic levels.
Data Engineeroskar.vandenberg@digital-power.com
Receive data insights, use cases and behind-the-scenes peeks once a month?
Sign up for our email list and stay 'up to data':