Implementing a Data Platform

Guidelines according to trends in 2022

Oscar Mike Claure Cabrera
Data Engineer/Data Scientist
29 Apr 2022

The purpose of this blog is to share our know-how and experience with the community by describing guidelines for implementing a data platform in an organization. We understand that the specific needs of every organization are different, that they will have an impact on the technologies used, and that a single architecture satisfying all of them makes no sense. So, in this blog we will keep things as general as we can.

What is a (Big) Data Platform?

A Data Platform is a solution for acquiring, storing, processing, analyzing and delivering analytical data across an organization. It helps the organization adopt end-to-end data-driven approaches, solve internal needs and automate tasks and decisions.

What is Data Architecture?

Data Architecture is the framework for an organization’s data environment. It is not the data platform itself, but the plan to follow when performing data operations like data acquisition, data transformation and data distribution. It includes the data models, policies, rules and standards that govern which data is collected and how it is stored and integrated into the organization’s systems. The data platform, in turn, is the set of technologies that performs those tasks while obeying the rules defined in the architecture.

There is always a data architecture, even if it is not explicitly defined. If the architecture has not been made explicit, it is the compilation of the often fragmented, implicit and informal rules the organization follows when building, maintaining and running its data platform.

Considerations while designing a data platform

Data Architecture

As mentioned before, the Data Architecture contains the plan for working with data. It is the starting point: it helps us define which components will be needed and how we will structure them together.

Data

Of course, data is a vital component of a Data Platform, and to characterize it there are three main characteristics to consider, also known as the 3 Vs:

  • Variety: Type and nature of data. How big is the variation in the technical form and layout of the data, and in the structure (or lack thereof) of its content? A bigger variety requires more specialized tools.
  • Volume: Amount of data. Millions of rows per hour or only hundreds? A large volume requires distributed processing, and a large variation in volume requires on-demand provisioning of infrastructure.
  • Velocity: Speed of data processing. How fast must you process the data? Batch or streaming? The closer to real time your information need is, the more responsive a data platform must be to incoming data.

A modern data platform must be able to cope with the growth of its needs along those three axes.

The following image shows the difference between batch and streaming processing. Batch processing handles data events that have been stored or buffered over a period of time, while streaming processes them as soon as the data events become available.

[Figure: streaming versus batch processing]
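
To make the contrast concrete, here is a minimal Python sketch of our own (the event shape and window size are illustrative, not from the article) that runs the same event source through both modes:

```python
import time
from typing import Iterable, List

def process(events: List[dict]) -> None:
    # Placeholder transformation: in a real platform this could be an
    # aggregation, an enrichment step, or a load into storage.
    print(f"processed {len(events)} event(s)")

def run_batch(source: Iterable[dict], window_size: int = 100) -> None:
    # Batch: buffer events over a period of time, then process them together.
    buffer: List[dict] = []
    for event in source:
        buffer.append(event)
        if len(buffer) >= window_size:
            process(buffer)      # one call per accumulated window
            buffer.clear()
    if buffer:
        process(buffer)          # flush the final partial window

def run_streaming(source: Iterable[dict]) -> None:
    # Streaming: process each event as soon as it becomes available.
    for event in source:
        process([event])         # one call per individual event

if __name__ == "__main__":
    events = ({"id": i, "ts": time.time()} for i in range(250))
    run_batch(events, window_size=100)   # prints 100, 100, 50
```

The trade-off is latency versus overhead: the batch variant touches downstream systems once per window, the streaming variant once per event.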

Data Delivery (Democratized access)

The main purpose of implementing a data platform is to adopt end-to-end data-driven approaches, so making data available to every team across the organization is essential.

Of course, not everybody should have access to all your data, but by following a proper data architecture we can define and limit which portions of the data each team needs access to in order to do their jobs as efficiently as possible. A minimal sketch of such grants follows below.
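
The team and dataset names below are hypothetical, and a real platform would delegate this to its IAM policies or data catalog rather than application code; the sketch only illustrates the principle of explicit, limited grants:

```python
# Hypothetical team-to-dataset grants. In practice this lives in IAM
# policies or a data catalog, not in a Python dictionary.
GRANTS = {
    "marketing": {"curated/customers", "curated/campaigns"},
    "finance": {"curated/invoices"},
    "data_science": {"raw/events", "curated/customers"},
}

def can_read(team: str, dataset: str) -> bool:
    # A team may read a dataset only if it was explicitly granted.
    return dataset in GRANTS.get(team, set())

assert can_read("finance", "curated/invoices")
assert not can_read("finance", "raw/events")   # nothing outside the grant
```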

Scalability

Scalability refers to the ability of a system to adapt to changes in demand: to grow, or scale up, when demand increases, as well as to scale down when demand decreases. There are two approaches to scalability: horizontal scaling and vertical scaling.

Planning the scalability of your platform must be part of the design. Relying on vertical scaling will eventually run into the computing or storage limits of a single machine. Horizontal scaling, on the other hand, allows us to scale more easily thanks to its data distribution principle, but it adds a layer of complexity due to the orchestration between the different nodes. The sketch below shows the horizontal idea in miniature.
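
In this toy Python sketch, the same code serves more demand by adding workers (think nodes) rather than a bigger machine; the per-record workload is a placeholder of our own:

```python
from multiprocessing import Pool
from typing import List

def transform(record: int) -> int:
    # Placeholder per-record work; assumed CPU-bound for this sketch.
    return record * record

def run_distributed(records: List[int], workers: int = 4) -> List[int]:
    # Scaling out means raising `workers`, not buying a bigger machine.
    with Pool(processes=workers) as pool:
        return pool.map(transform, records, chunksize=256)

if __name__ == "__main__":
    results = run_distributed(list(range(10_000)), workers=8)
    print(sum(results))
```

The orchestration cost mentioned above is hidden here inside `Pool`; across real nodes it becomes schedulers, data partitioning and failure handling.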

Trends in data platform architectures (Q1 2022)

Trends in software and infrastructure architectures are also important when implementing a data application and the infrastructure that will run it. In this section we will mention two of them, touching upon the application and the infrastructure layers.

From monolithic to microservices

Monolithic

When designing an application, most of the time you start with a modular architecture containing components such as data acquisition, storage, business logic and APIs for integration. But despite having a modular design, all functions can end up being managed and served in one place.

[Figure: monolithic architecture]

Here are some advantages and disadvantages of monolithic architectures:

[Table: advantages and disadvantages of monolithic architectures]

Microservices

A microservice architecture is a design that structures an application as a collection of services that are highly maintainable, loosely coupled, independently deployable and owned by a small team. A minimal sketch of a single microservice follows this overview.

[Figure: microservice architecture]

Here are some advantages and disadvantages of microservice architectures:

[Table: advantages and disadvantages of microservice architectures]
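
As a minimal illustration, here is a single "users" microservice sketched with Flask (assuming Flask is installed; the endpoints and in-memory store are ours, purely for illustration). It owns its own data and exposes a narrow API, so it can be built, deployed and scaled independently of other services:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)
users = {}  # stand-in for the service's own private database

@app.post("/users")
def create_user():
    # Create a user; a real service would validate and persist the payload.
    payload = request.get_json(force=True)
    user_id = len(users) + 1
    users[user_id] = payload["name"]
    return jsonify({"id": user_id, "name": users[user_id]}), 201

@app.get("/users/<int:user_id>")
def get_user(user_id):
    if user_id not in users:
        return jsonify({"error": "not found"}), 404
    return jsonify({"id": user_id, "name": users[user_id]})

if __name__ == "__main__":
    app.run(port=5000)
```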

From Server to Serverless Architectures

Server architectures remain a valid option for some organizations, especially in environments where custom security and data protection must be implemented. The main drawback of this architecture is the cost: it is expensive to hire a team to take care of your servers, plan capacity increases and install software and security updates.

Serverless architectures follow the principle of letting a third-party organization take care of your infrastructure, allowing your team to focus on the applications.

Azure, GCP and AWS are typical examples of this. They provide the computing power, storage capabilities and interconnections at any scale.

The image below illustrates the difference between server and serverless architectures. The grey boxes show the items where the teams need to put their effort to maintain the main applications.

[Figure: difference between server and serverless architectures]

Containers and FaaS (Functions as a Service) are two examples of serverless architectures.

Containers are a solution for running your software reliably when it is moved from one computing environment to another. They are ‘lightweight packages of your application code together with dependencies such as specific versions of programming language runtimes and libraries required to run your software services’.

Containers allow developers to focus on the application and its software dependencies, abstracting away the hardware resources as well as the security. Some people may consider containers and serverless to be two different approaches, but when we dig into the details they are quite alike. When a development team develops an application and packages it as a pod (a set of one or more containers with their dependencies), they can deploy it on a cloud-hosted Kubernetes cluster, which will provide the hardware to run it, abstracting the management of infrastructure away from the development of the application.
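
As a small illustration of that abstraction, the sketch below uses the Docker SDK for Python (`pip install docker`; it assumes a local Docker daemon is running) to launch a pinned runtime; the image and command are examples only:

```python
import docker

client = docker.from_env()

# The image pins the runtime and its dependencies, so the same package
# behaves identically on a laptop, a server or a Kubernetes node.
output = client.containers.run(
    image="python:3.10-slim",
    command=["python", "-c", "print('hello from a container')"],
    remove=True,  # clean up the container after it exits
)
print(output.decode())
```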

FaaS, on the other hand, allows developers to put their effort into the application by abstracting it down to its functions. Functions are the individual components that make up your application, for example individual tasks like adding new data to a SQL table, authenticating a user or adding new users. These functions are triggered by an event, have a finite duration, and run on stateless containers (computing power) provisioned by the cloud provider based on current demand.

Examples of FaaS are Azure Functions in Azure, Lambda in AWS and Cloud Functions in GCP.
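
A minimal AWS-Lambda-style handler in Python looks like the sketch below; the event shape is hypothetical, since each real trigger (HTTP, queue, storage event) defines its own payload:

```python
import json

def handler(event, context):
    # One finite task per invocation, e.g. registering a new user.
    user = event.get("user", "anonymous")
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"user {user} registered"}),
    }

if __name__ == "__main__":
    # Local smoke test; in production the cloud provider calls handler().
    print(handler({"user": "alice"}, context=None))
```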

Diagram of a general data platform

Now that some concepts and trends have been discussed, we can start with a schema of how our data platform will look, which components it should have, and how it will manage and decouple the distinct steps in a managed data flow.

[Figure: diagram of a general data platform]

As shown in the above diagram, we decided to split it into five main components:

  1. Data Ingestion. – This block is intended to manage data acquisition from external sources. It should be able to acquire data in batch by scheduling ingestion tasks and/or manage streaming sources if required (a minimal batch-ingestion sketch follows this list). Scalability is also an important feature of this block: it should have the flexibility to add multiple sources of data as well as handle larger amounts of it.

  2. Data Processing. – This block is dedicated to data processing. It should bring data to its final form as well as enable team members to perform exploration/testing at a larger scale.

  3. Data Storage. – Data should be stored in a central module (but not a single database or filesystem!) to avoid silos across the organization. This block must be able to serve different endpoints. Teams, as well as applications, should only be able to access data to which they have been granted access. Data transactions should be ACID when possible.

  4. Data Serving. – This general block represents the application(s) that we will serve with data. It could include visualization engines, AI & ML, data analytics, APIs or other endpoints.

  5. Data Governance. – Represents the capability of the organization to ensure data quality throughout the data’s complete lifecycle. Its focus should be data availability, usability, integrity and security.
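
To make the Data Ingestion block more tangible, here is a minimal batch-ingestion sketch; the source URL, paths and hourly loop are hypothetical, and a production platform would typically use an orchestrator such as Airflow instead of a sleep loop:

```python
import json
import time
import urllib.request
from datetime import datetime, timezone
from pathlib import Path

SOURCE_URL = "https://example.com/api/orders"  # hypothetical external source
LANDING = Path("landing/orders")               # landing zone of the storage layer

def ingest_once() -> Path:
    with urllib.request.urlopen(SOURCE_URL, timeout=30) as resp:
        payload = json.load(resp)
    # Partition by ingestion time so a rerun never overwrites an earlier load.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    target = LANDING / f"orders_{stamp}.json"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(payload))
    return target

if __name__ == "__main__":
    while True:  # crude hourly schedule, for the sketch only
        print("ingested", ingest_once())
        time.sleep(3600)
```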

Extra notes on the Data Storage component

As mentioned at the beginning of this blog, we intend to describe a general data platform architecture, so without being too specific about needs, we can go one level deeper and describe the Data Storage component. A Data Storage module must contain all your data, but separated into different zones according to the nature of the data.

The following schema shows four different data zones in our storage component, as well as how data should flow when a batch ingestion approach is used:

  • Landing Data Zone: A zone into which third-party data sources can push data without compromising the data that your platform has already ingested.
  • Raw Data Zone: The zone where your raw data will be stored.
  • Curated Zone: The zone where your curated/processed/analyzed/modified data will be stored and from where it will be available to a wider audience across the organization.
  • Analytics Sandbox Zone: A playground where analytics teams can pull parts of raw and curated data and do explorative work with it.

[Figure: batch data flow through the storage zones]
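
The sketch below mimics that batch flow with local directories standing in for the zones; the file names and promotion logic are illustrative only:

```python
import shutil
from pathlib import Path

ZONES = {name: Path(name) for name in ("landing", "raw", "curated", "sandbox")}

def promote(filename: str, src_zone: str, dst_zone: str) -> Path:
    # Copy (never move) so each zone keeps an immutable record of what it
    # received; real curation steps would transform the data along the way.
    src = ZONES[src_zone] / filename
    dst = ZONES[dst_zone] / filename
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, dst)
    return dst

if __name__ == "__main__":
    sample = ZONES["landing"] / "orders_20220401.json"
    sample.parent.mkdir(parents=True, exist_ok=True)
    sample.write_text('{"order_id": 1}')  # stand-in for a pushed landing file
    promote("orders_20220401.json", "landing", "raw")
    promote("orders_20220401.json", "raw", "curated")
```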

In comparison to Data Storage, the Data Ingestion, Data Processing, Data Serving and Data Governance components depend on the specifics of the organization’s requirements, so we will not go into any further detail in this blog. We will be happy to have a discussion with your organization if there is a specific need that you would like to address with us. Contact us here.

Join our team as a Data Engineering Consultant

Want to implement data platforms for various types of customers, while being part of a team of more than 100 highly motivated data professionals? View our vacancy and apply here.

This article was written by Oscar Mike Claure Cabrera, Data Engineer/Data Scientist, Digital Power

Oscar Mike is a data professional who enjoys working in hard-core data analytics teams. With a background in different fields of engineering, he is a T-shaped Data Engineer/Data Scientist with experience in the telecom, manufacturing and aerospace industries.
