Implementing a Data Platform

Guidelines according to trends in 2022

Oscar Mike Claure Cabrera
Data Engineer/Data Scientist
29 Apr 2022

The purpose of this blog is to share our know-how and experience with the community by describing guidelines for implementing a data platform in an organization. We understand that the specific needs of every organization are different, that they will have an impact on the technologies used, and that a single architecture satisfying all of them makes no sense. So, in this blog we will keep things as general as we can.

What is a (Big) Data Platform?

A Data Platform is a solution for acquiring, storing, processing, analyzing and delivering analytical data across an organization. It helps the organization adopt end-to-end data-driven approaches, solve internal needs and automate tasks and decisions.

What is Data Architecture?

Data Architecture is the framework for an organization’s data environment. It is not the data platform itself, but the plan to follow when performing data operations like data acquisition, data transformation and data distribution. It includes the data models, policies, rules and standards that govern which data is collected and how it is stored and integrated into the organization’s systems. The data platform, in turn, is the set of technologies that performs those tasks while obeying the rules defined in the architecture.

There is always a data architecture, even if it is not explicitly defined. If the architecture has not been made explicit, it is the compilation of the often fragmented, implicit and informal rules the organization follows when building, maintaining and running its data platform.

Considerations while designing a data platform

Data Architecture

As mentioned before, the Data Architecture contains the plan for working with data. It is the starting point: it helps us define which components will be needed and how we will structure them together.

Data

Of course, data is a vital component of a Data Platform, and to characterize it there are three main characteristics to consider, also known as the 3 Vs:

  • Variety: Type and nature of data. How big is the variation in the technical form and layout of the data, and in the structure (or lack thereof) of its content? A bigger variety requires more specialized tools.
  • Volume: Amount of data. Millions of rows per hour or only hundreds? A large volume requires distributed processing, and a large variation in volume requires on-demand provisioning of infrastructure.
  • Velocity: Speed of data processing. How fast must you process the data? Batch or streaming? The closer to real time your information need is, the more responsive a data platform must be to incoming data.

A modern data platform must be able to cope with the growth of its needs along those three axes.

The following image shows the difference between batch and streaming processing. Batch processing handles data events that have been stored or buffered over a period of time, while streaming processes them as soon as the data events become available.

[Figure: streaming versus batch processing]
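
To make the contrast concrete, here is a minimal Python sketch of our own (the event shape and window size are illustrative, not from the article) that runs the same event source through both modes:

```python
import time
from typing import Iterable, List

def process(events: List[dict]) -> None:
    # Placeholder transformation: in a real platform this could be an
    # aggregation, an enrichment step, or a load into storage.
    print(f"processed {len(events)} event(s)")

def run_batch(source: Iterable[dict], window_size: int = 100) -> None:
    # Batch: buffer events over a period of time, then process them together.
    buffer: List[dict] = []
    for event in source:
        buffer.append(event)
        if len(buffer) >= window_size:
            process(buffer)      # one call per accumulated window
            buffer.clear()
    if buffer:
        process(buffer)          # flush the final partial window

def run_streaming(source: Iterable[dict]) -> None:
    # Streaming: process each event as soon as it becomes available.
    for event in source:
        process([event])         # one call per individual event

if __name__ == "__main__":
    events = ({"id": i, "ts": time.time()} for i in range(250))
    run_batch(events, window_size=100)   # prints 100, 100, 50
```

The trade-off is latency versus overhead: the batch variant touches downstream systems once per window, the streaming variant once per event.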

Data Delivery (Democratized access)

The main purpose of implementing a data platform is to adopt end-to-end data-driven approaches, so making data available to every team across the organization is essential.

Of course, not everybody should have access to all your data, but by following a proper data architecture we can define and limit which portions of the data each team needs access to in order to do their jobs as efficiently as possible. A minimal sketch of such grants follows below.
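
The team and dataset names below are hypothetical, and a real platform would delegate this to its IAM policies or data catalog rather than application code; the sketch only illustrates the principle of explicit, limited grants:

```python
# Hypothetical team-to-dataset grants. In practice this lives in IAM
# policies or a data catalog, not in a Python dictionary.
GRANTS = {
    "marketing": {"curated/customers", "curated/campaigns"},
    "finance": {"curated/invoices"},
    "data_science": {"raw/events", "curated/customers"},
}

def can_read(team: str, dataset: str) -> bool:
    # A team may read a dataset only if it was explicitly granted.
    return dataset in GRANTS.get(team, set())

assert can_read("finance", "curated/invoices")
assert not can_read("finance", "raw/events")   # nothing outside the grant
```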

Scalability

Scalability refers to the ability of a system to adapt to changes in demand: to grow, or scale up, when demand increases, as well as to scale down when demand decreases. There are two approaches to scalability: horizontal scaling and vertical scaling.

Planning the scalability of your platform must be part of the design. Relying on vertical scaling will eventually run into the computing or storage limits of a single machine. Horizontal scaling, on the other hand, allows us to scale more easily thanks to its data distribution principle, but it adds a layer of complexity due to the orchestration between the different nodes. The sketch below shows the horizontal idea in miniature.
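
In this toy Python sketch, the same code serves more demand by adding workers (think nodes) rather than a bigger machine; the per-record workload is a placeholder of our own:

```python
from multiprocessing import Pool
from typing import List

def transform(record: int) -> int:
    # Placeholder per-record work; assumed CPU-bound for this sketch.
    return record * record

def run_distributed(records: List[int], workers: int = 4) -> List[int]:
    # Scaling out means raising `workers`, not buying a bigger machine.
    with Pool(processes=workers) as pool:
        return pool.map(transform, records, chunksize=256)

if __name__ == "__main__":
    results = run_distributed(list(range(10_000)), workers=8)
    print(sum(results))
```

The orchestration cost mentioned above is hidden here inside `Pool`; across real nodes it becomes schedulers, data partitioning and failure handling.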

Trends in data platform architectures (Q1 2022)

Trends in software and infrastructure architectures are also important when implementing a data application and the infrastructure that will run it. In this section we will mention two of them, touching upon the application and the infrastructure layers.

From monolithic to microservices

Monolithic

When designing an application, most of the time you start with a modular architecture containing components such as data acquisition, storage, business logic and APIs for integration. But despite having a modular design, all functions can end up being managed and served in one place.

[Figure: monolithic architecture]

Here are some advantages and disadvantages of monolithic architectures:

[Table: advantages and disadvantages of monolithic architectures]

Microservices

A microservice architecture is a design that structures an application as a collection of services that are highly maintainable, loosely coupled, independently deployable and owned by a small team. A minimal sketch of a single microservice follows this overview.

[Figure: microservice architecture]

Here are some advantages and disadvantages of microservice architectures:

[Table: advantages and disadvantages of microservice architectures]
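
As a minimal illustration, here is a single "users" microservice sketched with Flask (assuming Flask is installed; the endpoints and in-memory store are ours, purely for illustration). It owns its own data and exposes a narrow API, so it can be built, deployed and scaled independently of other services:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)
users = {}  # stand-in for the service's own private database

@app.post("/users")
def create_user():
    # Create a user; a real service would validate and persist the payload.
    payload = request.get_json(force=True)
    user_id = len(users) + 1
    users[user_id] = payload["name"]
    return jsonify({"id": user_id, "name": users[user_id]}), 201

@app.get("/users/<int:user_id>")
def get_user(user_id):
    if user_id not in users:
        return jsonify({"error": "not found"}), 404
    return jsonify({"id": user_id, "name": users[user_id]})

if __name__ == "__main__":
    app.run(port=5000)
```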

From Server to Serverless Architectures

Server architectures remain a valid option for some organizations, especially in environments where custom security and data protection must be implemented. The main drawback of this architecture is the cost: it is expensive to hire a team to take care of your servers, plan capacity increases and install software and security updates.

Serverless architectures follow the principle of letting a third-party organization take care of your infrastructure, allowing your team to focus on the applications.

Azure, GCP and AWS are typical examples of this. They provide the computing power, storage capabilities and interconnections at any scale.

The image below illustrates the difference between server and serverless architectures. The grey boxes show the items where the teams need to put their effort to maintain the main applications.

[Figure: difference between server and serverless architectures]

Containers and FaaS (Functions as a Service) are two examples of serverless architectures.

Containers are a solution for running your software reliably when it is moved from one computing environment to another. They are ‘lightweight packages of your application code together with dependencies such as specific versions of programming language runtimes and libraries required to run your software services’.

Containers allow developers to focus on the application and its software dependencies, abstracting away the hardware resources as well as the security. Some people may consider containers and serverless to be two different approaches, but when we dig into the details they are quite alike. When a development team develops an application and packages it as a pod (a set of one or more containers with their dependencies), they can deploy it on a cloud-hosted Kubernetes cluster, which will provide the hardware to run it, abstracting the management of infrastructure away from the development of the application.
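
As a small illustration of that abstraction, the sketch below uses the Docker SDK for Python (`pip install docker`; it assumes a local Docker daemon is running) to launch a pinned runtime; the image and command are examples only:

```python
import docker

client = docker.from_env()

# The image pins the runtime and its dependencies, so the same package
# behaves identically on a laptop, a server or a Kubernetes node.
output = client.containers.run(
    image="python:3.10-slim",
    command=["python", "-c", "print('hello from a container')"],
    remove=True,  # clean up the container after it exits
)
print(output.decode())
```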

FaaS, on the other hand, allows developers to put their effort into the application by abstracting it down to its functions. Functions are the individual components that make up your application, for example individual tasks like adding new data to a SQL table, authenticating a user or adding new users. These functions are triggered by an event, have a finite duration, and run on stateless containers (computing power) provisioned by the cloud provider based on current demand.

Examples of FaaS are Azure Functions in Azure, Lambda in AWS and Cloud Functions in GCP.
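
A minimal AWS-Lambda-style handler in Python looks like the sketch below; the event shape is hypothetical, since each real trigger (HTTP, queue, storage event) defines its own payload:

```python
import json

def handler(event, context):
    # One finite task per invocation, e.g. registering a new user.
    user = event.get("user", "anonymous")
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"user {user} registered"}),
    }

if __name__ == "__main__":
    # Local smoke test; in production the cloud provider calls handler().
    print(handler({"user": "alice"}, context=None))
```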

Diagram of a general data platform

Now that some concepts and trends have been discussed, we can start with a schema of how our data platform will look, which components it should have, and how it will manage and decouple the distinct steps in a managed data flow.

[Figure: diagram of a general data platform]

As shown in the above diagram, we decided to split it into five main components:

  1. Data Ingestion. – This block is intended to manage data acquisition from external sources. It should be able to acquire data in batch by scheduling ingestion tasks and/or manage streaming sources if required (a minimal batch-ingestion sketch follows this list). Scalability is also an important feature of this block: it should have the flexibility to add multiple sources of data as well as handle larger amounts of it.

  2. Data Processing. – This block is dedicated to data processing. It should bring data to its final form as well as enable team members to perform exploration/testing at a larger scale.

  3. Data Storage. – Data should be stored in a central module (but not a single database or filesystem!) to avoid silos across the organization. This block must be able to serve different endpoints. Teams, as well as applications, should only be able to access data to which they have been granted access. Data transactions should be ACID when possible.

  4. Data Serving. – This general block represents the application(s) that we will serve with data. It could include visualization engines, AI & ML, data analytics, APIs or other endpoints.

  5. Data Governance. – Represents the capability of the organization to ensure data quality throughout the data’s complete lifecycle. Its focus should be data availability, usability, integrity and security.
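
To make the Data Ingestion block more tangible, here is a minimal batch-ingestion sketch; the source URL, paths and hourly loop are hypothetical, and a production platform would typically use an orchestrator such as Airflow instead of a sleep loop:

```python
import json
import time
import urllib.request
from datetime import datetime, timezone
from pathlib import Path

SOURCE_URL = "https://example.com/api/orders"  # hypothetical external source
LANDING = Path("landing/orders")               # landing zone of the storage layer

def ingest_once() -> Path:
    with urllib.request.urlopen(SOURCE_URL, timeout=30) as resp:
        payload = json.load(resp)
    # Partition by ingestion time so a rerun never overwrites an earlier load.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    target = LANDING / f"orders_{stamp}.json"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(payload))
    return target

if __name__ == "__main__":
    while True:  # crude hourly schedule, for the sketch only
        print("ingested", ingest_once())
        time.sleep(3600)
```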

Extra notes on the Data Storage component

As mentioned at the beginning of this blog, we intend to describe a general data platform architecture, so without being too specific about needs, we can go one level deeper and describe the Data Storage component. A Data Storage module must contain all your data, but separated into different zones according to the nature of the data.

The following schema shows four different data zones in our storage component, as well as how data should flow when a batch ingestion approach is used:

  • Landing Data Zone: A zone into which third-party data sources can push data without compromising the data that your platform has already ingested.
  • Raw Data Zone: The zone where your raw data will be stored.
  • Curated Zone: The zone where your curated/processed/analyzed/modified data will be stored and from where it will be available to a wider audience across the organization.
  • Analytics Sandbox Zone: A playground where analytics teams can pull parts of raw and curated data and do explorative work with it.

[Figure: batch data flow through the storage zones]
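
The sketch below mimics that batch flow with local directories standing in for the zones; the file names and promotion logic are illustrative only:

```python
import shutil
from pathlib import Path

ZONES = {name: Path(name) for name in ("landing", "raw", "curated", "sandbox")}

def promote(filename: str, src_zone: str, dst_zone: str) -> Path:
    # Copy (never move) so each zone keeps an immutable record of what it
    # received; real curation steps would transform the data along the way.
    src = ZONES[src_zone] / filename
    dst = ZONES[dst_zone] / filename
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, dst)
    return dst

if __name__ == "__main__":
    sample = ZONES["landing"] / "orders_20220401.json"
    sample.parent.mkdir(parents=True, exist_ok=True)
    sample.write_text('{"order_id": 1}')  # stand-in for a pushed landing file
    promote("orders_20220401.json", "landing", "raw")
    promote("orders_20220401.json", "raw", "curated")
```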

In comparison to Data Storage, the Data Ingestion, Data Processing, Data Serving and Data Governance components depend on the specifics of the organization’s requirements, so we will not go into any further detail in this blog. We will be happy to have a discussion with your organization if there is a specific need that you would like to address with us. Contact us here.

Join our team as a Data Engineering Consultant

Want to implement data platforms for various types of customers, while being part of a team of more than 100 highly motivated data professionals? View our vacancy and apply here.

This article was written by Oscar Mike Claure Cabrera, Data Engineer/Data Scientist, Digital Power

Oscar Mike is a data professional who enjoys working in hard-core data analytics teams. With a background in different fields of engineering, he is a T-shaped Data Engineer/Data Scientist with experience in the telecom, manufacturing and aerospace industries.
