Central data storage with a new data infrastructure
- Customer case
- Data Engineering
Dedimo is a collaboration of five mental healthcare initiatives. To continuously improve the quality of their care, they are organising their internal processes more efficiently, drawing on insights from the data that is available internally. Previously, they extracted this data themselves from various source systems using ad hoc scripts. They asked for our help to make this process more robust and efficient, and to professionalise it further. Specifically, they asked us to facilitate the central storage of their data in a cloud data warehouse. Since they were already used to working with Google Cloud Platform (GCP), the goal was to set up the data infrastructure within that environment.
To keep the project manageable during the start-up phase, we began by delivering an MVP: we selected a single data source to connect to the data pipelines we were building. Dedimo's long-term goal is to extend the infrastructure to additional data sources, so the MVP will need to be expanded in the future. That is why we took the scalability of the solution into account when developing the MVP.
To ensure scalability, we set up a scalable scheduler that kicks off the data pipelines at regular intervals: Apache's open-source scheduler Airflow. Initially, we ran Airflow locally in the development environment using Docker. For the production environment, we used a GCP-managed Airflow instance in the form of Cloud Composer.
As a data warehouse, we selected BigQuery, GCP's serverless data warehouse solution. A major advantage is that the data warehouse also scales automatically, and you only pay for actual use. In addition, a data lake was set up in Cloud Storage, in which all data exported from the source databases is stored centrally.
We built the data pipelines in such a way that the ETL process (exporting to the data lake, transforming the data and loading it into the data warehouse) is performed entirely by the BigQuery engine. Airflow is used only for what it is designed for, namely as a pure scheduler. This keeps the implementation simple and elegant, and therefore robust.
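As a sketch of this pattern: the ETL steps can be expressed as plain BigQuery SQL statements that the scheduler merely submits in order. The snippet below illustrates the idea in Python; all dataset, table and bucket names are hypothetical, not the project's actual configuration.

```python
# Sketch of the ETL steps as plain BigQuery SQL, composed in Python.
# The scheduler only submits these statements in order; BigQuery does
# the actual work. All dataset, table and bucket names are hypothetical.

def build_etl_statements(export_uri: str, staging_table: str,
                         target_table: str) -> list[str]:
    """Return the ordered SQL statements for one pipeline run."""
    # 1. Load the exported files from the Cloud Storage data lake
    #    into a staging table.
    load = (
        f"LOAD DATA OVERWRITE {staging_table} "
        f"FROM FILES (format = 'CSV', uris = ['{export_uri}']);"
    )
    # 2. Transform the staged data into the warehouse table.
    transform = (
        f"CREATE OR REPLACE TABLE {target_table} AS "
        f"SELECT *, CURRENT_TIMESTAMP() AS loaded_at "
        f"FROM {staging_table};"
    )
    return [load, transform]

statements = build_etl_statements(
    export_uri="gs://dedimo-lake/source_a/*.csv",
    staging_table="staging.source_a",
    target_table="warehouse.source_a",
)
```

Because each step is just a SQL statement handed to BigQuery, the scheduler never touches the data itself, which is what keeps the orchestration layer thin.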
To drive BigQuery, we chose SQL statements that are generated dynamically by a number of Python scripts, which in turn are triggered by Airflow: a scalable and flexible solution. With this approach, the pipelines can be modified with a small number of configuration parameters, which is effective when connecting new source databases or handling changes in the source data. The Python scripts then translate the configuration parameters into SQL commands that the BigQuery engine understands.
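A minimal sketch of this config-driven approach, under assumed configuration keys and table names (none of which come from the actual project): one small configuration entry per source table is rendered into the SQL that BigQuery executes, so adding a source means editing configuration rather than code.

```python
# Minimal sketch of config-driven SQL generation. The configuration keys,
# tables and columns below are hypothetical examples.

SOURCE_CONFIG = {
    "clients": {
        "staging": "staging.clients",
        "target": "warehouse.clients",
        "key": "client_id",
        "columns": ["name", "city"],
    },
}

def render_merge_sql(cfg: dict) -> str:
    """Translate a configuration entry into a BigQuery MERGE statement."""
    updates = ", ".join(f"t.{c} = s.{c}" for c in cfg["columns"])
    return (
        f"MERGE {cfg['target']} t "
        f"USING {cfg['staging']} s ON t.{cfg['key']} = s.{cfg['key']} "
        f"WHEN MATCHED THEN UPDATE SET {updates} "
        f"WHEN NOT MATCHED THEN INSERT ROW;"
    )

merge_sql = render_merge_sql(SOURCE_CONFIG["clients"])
```

Changing the source data then only requires adjusting the `columns` list; the generated MERGE statement follows automatically.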
With Cloud Build, Google's serverless CI/CD platform, we built the foundation of a deployment pipeline that connects the local development environment to the production environment.
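A pipeline of this kind is described in a `cloudbuild.yaml` file. The fragment below is an illustrative sketch only, with assumed step images and an assumed Composer bucket name, showing the general shape of testing the code and then deploying DAGs to Cloud Composer:

```yaml
# Hypothetical cloudbuild.yaml sketch: run the test suite, then sync the
# DAG files to the Cloud Composer environment's bucket.
steps:
  - name: 'python:3.11'
    entrypoint: 'pip'
    args: ['install', '-r', 'requirements.txt', '--user']
  - name: 'python:3.11'
    entrypoint: 'python'
    args: ['-m', 'pytest', 'tests/']
  - name: 'gcr.io/cloud-builders/gsutil'
    args: ['rsync', '-r', 'dags/', 'gs://example-composer-bucket/dags/']
```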
The delivered MVP is a first version of a production environment. This allows Dedimo to continue testing internally, which in turn clarifies their own data needs so that the current proof of concept can be steered and developed further. To support Dedimo in adopting the production environment, we organised training in the use of BigQuery for their internal employees.
Moreover, we provided a development environment with which the project can be expanded further, using the MVP as a starting point. This growth path was made possible by prioritising scalability in the infrastructure architecture choices.
In addition, by keeping the architecture simple and having the processing performed centrally by BigQuery, we laid a robust and reliable foundation for the future. In this way, digitisation allows Dedimo to consistently focus on what it does best: caring for its clients.