Improved data quality thanks to a new data pipeline
Aa en Maas Water Authority
- Customer case
- Data Engineering
At Royal HaskoningDHV, the number of requests from customers with Data Engineering issues continue to climb. The new department they have set up for this, is growing. So they asked us to temporarily offer their Data Engineering team more capacity. One of the issues we offered help with involved the Aa en Maas Water Authority.
The hydrologists of the Aa en Maas Water Authority use many different databases for their work. One of those databases contains data from sensor measurements, for example of the water level at various places in the region. It was not certain whether these data were always correct. They asked Royal HaskoningDHV to make a validation and to get the data quality in order. In this way the sensor measurements can be improved and the Water Authority can repair faulty sensors.
Aa en Maas Water Authority already had access to a data platform in Azure. On behalf of Royal HaskoningDHV, we developed an additional data pipeline that can be integrated with the existing platform.
We started with accessing the data from the database using Azure Data Factory. For this we used the layered system in the data lake. The raw data is further cleaned and enriched through the three layers of bronze, silver and gold:
- In the bronze layer we make the data usable for the structure of the existing data platform. This means that the raw data is converted to a table structure. For this we used Databricks and programmed in PySpark.
- In the silver layer we validate the data quality. For this we use a Python library made by RHDHV. It contains several data science components, including a functionality to perform data validation. We collect the data from 90 sensors that take a measurement every 15 minutes. Six quality labels are added to each of those measurements. It is how we study six possible data issues. If one of these six labels is set to true, this is an indication that something is wrong with the data.
- In the gold layer we convert the data to an XML format. This way the data can be returned to the original database, which can only receive data in this format. In this way we can add the enriched data in the database. The files are then moved to the serving layer of the data lake. The database administrators have access to this layer so that they can retrieve the enriched data.
When developing the data pipeline, we took on a few extra things to increase the robustness of the pipeline:
- We recorded which sensors go through the enrichment process at what time. This way you can always see when the model has been running and which sensors and time series have been validated.
- We identified changes in the metadata of the sensors. With each run, the metadata of the sensor data is now split and it is automatically checked whether anything has changed in the properties of the sensor (Slowly Changing Dimensions (SCD-II)).
- The data is processed daily in the data platform. We built a separate pipeline that makes it possible to retrieve historical data from specific sensors. In-depth analysis is also possible.
After going through the bronze, silver and gold layer, the data goes back to the original database. The data is then fully validated and enriched. The customer can see exactly which data is completely correct and which data may contain errors.
RoyalHaskoning DHV proved that it is technically possible to improve the data quality of the Aa en Maas Water Authority. We made an important contribution to this. We tested the pipeline for 90 sensors and built a fully scalable process for easy scaling in the future.
Want to know more?
Zev will be happy to tell you more about this assignment.
Business Manager+31(0)20 308 43 90+31(0)6 13 06 48 firstname.lastname@example.org
Receive data insights, use cases and behind-the-scenes peeks once a month?
Sign up for our email list and stay 'up to data':
You might find this interesting too:
The COVID-19 Violence Tracker
The outbreak of the corona pandemic in early 2020 has turned the world upside down. In addition to countless infections, hospitalisations and deaths, we also saw an outbreak of violence in many countries. Citizens took to the streets, sometimes violently, to protest against the measures taken, but domestic violence also increased in many places and fear and frustration played a role in racism.
Deliver reliable and meaningful data to everyone from a solid, scalable infrastructure.
5 reasons to use Infrastructure as Code (IaC)
Infrastructure as Code has proven itself as a reliable technique for setting up platforms in the cloud. However, it does require an additional investment of time from the developers involved. In which cases does the extra effort pay off? Find out in this article.
A well-organised data infrastructure
FysioHolland is an umbrella organisation for physiotherapists in the Netherlands. A central service team relieves therapists of additional work, so that they can mainly focus on providing the best care. In addition to organic growth, FysioHolland is connecting new practices to the organisation. Each of these has its own systems, work processes and treatment codes. This has made FysioHolland's data management large and complex.
Implementing a Data Platform
Based on our know-how, the purpose of this blog is to transmit our knowledge and experience to the community by describing guidelines for implementing a data platform in an organization. We understand that the specific needs of every organization are different, that they will have an impact on the technologies used and that a single architecture satisfying all of them makes no sense. So, in this blog we will keep it as general as we can.
Making impact measurable
The Designathon Works foundation organises Design Hackathons (Designathons) for children aged 8 to 12. The target? Teaching children from all over the world skills to become a 'changemaker'. They are challenged to design solutions for a better world, for example to combat climate change. From the Datahub, we helped Designathon Works fine-tune the impact measurements free of charge. We also made a first move towards automating data collection, analysis and visualisation.
A scalable machine-learning platform for predicting billboard impressions
The Neuron provides a programmatic bidding platform to plan, buy and manage digital Out-Of-Home ads in real-time. They asked us to predict the number of expected impressions for digital advertising on billboards in a scalable and efficient way.
Why do I need Data Engineers when I have Data Scientists?
It is now clear to most companies: data-driven decisions by Data Science add concrete value to business operations. Whether your goal is to build better marketing campaigns, perform preventive maintenance on your machines or fight fraud more effectively, there are applications for Data Science in every industry.