The foundation for Data Engineering: solid data pipelines

Everything you need to know about data pipelines

Victor van den Broek
Data Scientist
4 min
06 Aug 2020

Basically, Data Engineers work on data pipelines: processes that retrieve data from one place and write it to another. In this article you can read how data pipelines work and why they are so important for a solid data infrastructure.

Processing steps often take place before, during or after these data processes. Such steps can be simple, like extracting a date from a timestamp, or complex, like adding a prediction from a data science model. Regardless of what exactly happens, the pattern is the same: data is retrieved somewhere, processing takes place, and data is written somewhere else. This is done using data pipelines.
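A simple processing step like extracting a date from a timestamp can be just a few lines of code. A minimal sketch in Python, assuming log records arrive as dictionaries with an ISO-formatted timestamp field (the record shape is invented for illustration):

```python
from datetime import datetime

# A record as it might arrive from a website log (hypothetical shape).
record = {"page": "/home", "timestamp": "2020-08-06T14:32:00"}

def add_date(rec: dict) -> dict:
    """Simple processing step: derive a date field from the raw timestamp."""
    ts = datetime.fromisoformat(rec["timestamp"])
    return {**rec, "date": ts.date().isoformat()}

add_date(record)
# {'page': '/home', 'timestamp': '2020-08-06T14:32:00', 'date': '2020-08-06'}
```

A complex step, such as scoring each record with a data science model, would slot into the pipeline in exactly the same place: a function that takes records in and hands enriched records on.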

Data pipelines: bridging systems 

These pipelines form the basis of a solid data infrastructure and are therefore executed on a regular schedule. Usually they run from a source system to a data lake or data warehouse.

A data lake or data warehouse is a storage location for data that is not (directly) used by operational systems. A website, for instance, has its own data storage, but it can be useful to store some of that data elsewhere for analysis purposes, for example so that it can be combined with data from your CRM or warehouse system. Building a data lake or data warehouse is done using exactly these kinds of data pipelines.

Example: a data pipeline for an online store 

Let's take the example of an online store's website. The store wants to develop a recommender system, so that relevant articles are suggested while a visitor is viewing another article.

To develop this system, the website's logs must be usable for analysis. To make these logs available, a Data Engineer can build a pipeline that, for example, retrieves all page-visit data once a day and stores it in a data lake.
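Such a daily pipeline can be sketched in a few lines. This is an illustration, not a production implementation: `fetch_page_visits` stands in for whatever call would actually query the website's log system, and the data lake is modelled as a plain directory partitioned by date:

```python
import json
from datetime import date
from pathlib import Path

def fetch_page_visits(day: date) -> list[dict]:
    """Stand-in for a call to the website's log system (hypothetical).
    In practice this would query an API or database for one day of visits."""
    return [{"page": "/product/42", "visited_at": f"{day}T09:15:00"}]

def run_daily_pipeline(day: date, lake_root: Path) -> Path:
    """Retrieve one day of page visits and write them to the data lake,
    partitioned by date so each daily run has its own location."""
    visits = fetch_page_visits(day)
    target = lake_root / "page_visits" / f"date={day}" / "visits.json"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(visits))
    return target
```

Partitioning the lake by date is a common convention: each run writes to its own location, which already helps with the rerun-safety discussed below.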

Solid data pipelines 

Do you want to be able to trust that your data is processed properly? Then your data pipelines must be solid.

Example: no duplicate data 

Solid means, for example, that your pipeline can safely run twice without producing duplicate data, and that if a run is missed, the next run retrieves the missing data.

This can be achieved, for example, by keeping track of the point in time up to which log files have already been retrieved. When the pipeline runs, it fetches everything from that point onwards and then advances the marker. Running the pipeline again therefore does not retrieve duplicate data. A pipeline with this property is called 'idempotent': running it multiple times gives the same result as running it once.
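One way to sketch this watermark approach in Python (the log source, record shape, and file-based state are all invented for illustration):

```python
from pathlib import Path

# Fake source data, standing in for the website's log system (hypothetical).
LOG = [
    {"page": "/home", "visited_at": "2020-08-06T09:00:00"},
    {"page": "/cart", "visited_at": "2020-08-06T10:30:00"},
]

def fetch_since(since: str) -> list[dict]:
    """Return only records newer than the given timestamp."""
    return [r for r in LOG if r["visited_at"] > since]

def run_incremental(state_file: Path) -> list[dict]:
    """Fetch everything after the stored watermark, then advance it.
    Running the pipeline twice in a row retrieves no duplicates."""
    since = state_file.read_text() if state_file.exists() else "1970-01-01T00:00:00"
    records = fetch_since(since)
    if records:
        state_file.write_text(max(r["visited_at"] for r in records))
    return records
```

The first run finds no watermark and fetches everything; the second run starts from the stored timestamp and comes back empty, so no duplicates enter the data lake.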

Example: error messages 

Another aspect of a solid pipeline is that it returns clear error messages, whether they are technical in nature or concern the content of the data.

A technical error message could be, for example, that authentication failed: it was not possible to access a database with a given username and password.

A data-content error message could be, for example, that much less data was retrieved than expected.
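A data-content check like this can be a small guard inside the pipeline. A sketch (the function name and threshold are illustrative; technical errors such as failed authentication would typically surface as exceptions from the database client itself):

```python
def validate_row_count(rows: list, expected_min: int) -> list:
    """Data-content check: fail loudly if far fewer rows arrived than
    expected, so the problem shows up in monitoring instead of silently
    leaving a gap in the data lake."""
    if len(rows) < expected_min:
        raise ValueError(
            f"Data-content error: got {len(rows)} rows, "
            f"expected at least {expected_min}"
        )
    return rows
```

Raising an exception (rather than logging and continuing) ensures the scheduler marks the run as failed and someone gets alerted.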

A solid data infrastructure 

Of course there are many more ways to make a pipeline solid. Before you proceed, first think carefully about what you want to achieve, in this case making the website's log data regularly available.

Once you have defined that clearly, you investigate the ways in which things can go wrong, how likely each failure is, and what you can do about it.

By applying such controls and risk management measures in multiple places, the entire data infrastructure eventually becomes solid and reliable. 

Do you need advice? 

Do you want to start setting up data pipelines and a data architecture or do you see an opportunity to improve your current architecture? We are happy to help! Contact us directly. 

This is an article by Victor van den Broek, Senior Data Science Engineer at Digital Power

Victor is an experienced Data Scientist with a sharp business focus. From his entrepreneurial background, he is always looking for the application of data in your business processes and how you can get maximum value from it, while the organisation remains flexible and agile.
