The outbreak of the corona pandemic in early 2020 turned the world upside down. In addition to countless infections, hospitalisations and deaths, many countries also saw an outbreak of violence. Citizens took to the streets, sometimes violently, to protest against the measures taken. Domestic violence also increased in many places, and fear and frustration contributed to racism.
In May 2020, PeaceTech Lab (US) approached PeaceTech Lab NL and Digital Power with a prototype of the COVID-19 Violence Tracker. Assisted by volunteers, PeaceTech Lab had manually collected reports on corona-related violence and visualised the insights in a dashboard (see figure 1). The lab then wanted to automate the collection of news items. This turned out to be an excellent challenge for our consultants!
From May 2020, various Data Scientists, Data Engineers and Data Analysts from our team contributed to building the tracker. We donated over 200 hours of work through our foundation, the Digital Power Datahub. We took charge of most of the technical implementation and discussed progress with the PeaceTech Lab teams on a weekly basis. The project ultimately consisted of eight phases:
We started with a (text) analysis of the vocabulary in the manually collected news items. The word cloud below is a visualisation of that. We looked at the most common words in the news stories and often came across, for example, "domestic", "violence" and "police".
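To give an idea of what such a vocabulary analysis can look like, here is a minimal sketch in Python. It is not the code used in the project: the headlines and the short stop-word list are made up for illustration, and the function name `top_words` is hypothetical.

```python
import re
from collections import Counter

# Small, hypothetical stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"a", "an", "and", "at", "the", "of", "in", "on", "to", "by", "for", "during"}


def top_words(texts, n=3):
    """Return the n most common non-stop-words across all texts."""
    counts = Counter()
    for text in texts:
        # Lowercase the text and keep alphabetic tokens only.
        tokens = re.findall(r"[a-z]+", text.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(n)


# Made-up headlines, for illustration only.
headlines = [
    "Domestic violence reports rise during lockdown",
    "Police respond to violence at lockdown protest",
    "Domestic violence hotline calls increase",
]

print(top_words(headlines))
```

The resulting word counts are the kind of input a word cloud is typically drawn from.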
After some research, we chose an appropriate method for data collection: social listening with the Brandwatch tool. This allowed us to automatically collect news items from the web that contained certain words (such as "domestic" and "violence").
We wrote a (search) query with the most relevant (combinations of) English words, for example "covid"+"violence".
From July 2020, we continuously collected news items via Brandwatch. In total we collected more than 9 million messages.
Although Brandwatch is a useful tool for data collection, it did not offer all the possibilities we were looking for in the analysis of our data. That is why we developed a separate data infrastructure in Google Cloud. The data from Brandwatch was automatically exported there, allowing us to work with the data ourselves.
As soon as we started the first analyses of the data, we found that there was a lot of noise (irrelevant news items) in our dataset. We wanted to get rid of this. That is why we held three validation rounds with volunteers who indicated which messages truly were about corona-related violence. By doing this, we started a major data cleanup.
Based on insights from these validation rounds, we optimised our query in Brandwatch. We removed words with double meanings from the search, such as "beat" (which can mean hitting, but also winning in sports). These words were largely responsible for the noise.
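The effect of dropping an ambiguous word can be illustrated with a small sketch in plain Python. This is not Brandwatch query syntax and not our actual query: the term lists and the function `matches` are hypothetical, and the two example texts are made up.

```python
import re

# Hypothetical term lists; the project's actual query is not reproduced here.
CONTEXT_TERMS = {"covid", "corona", "lockdown"}
VIOLENCE_TERMS = {"violence", "assault", "beat"}
AMBIGUOUS_TERMS = {"beat"}  # "beat" also appears in sports reporting


def tokenize(text):
    """Split a text into a set of lowercase alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def matches(text, violence_terms):
    """True if the text has at least one context term AND one violence term."""
    tokens = tokenize(text)
    return bool(tokens & CONTEXT_TERMS) and bool(tokens & violence_terms)


# The refined query simply leaves out the ambiguous terms.
refined_terms = VIOLENCE_TERMS - AMBIGUOUS_TERMS

# Made-up example texts.
sports = "Ajax beat rivals in first match after covid break"
news = "Reports of domestic violence rise during covid lockdown"

print(matches(sports, VIOLENCE_TERMS))  # matched by the original term list
print(matches(sports, refined_terms))   # no longer matched after refinement
print(matches(news, refined_terms))     # relevant item is still matched
```

The trade-off is that removing a term can also drop some relevant items that only used that word, which is one reason a refined query alone rarely removes all noise.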
Even with these adjustments, there was still a lot of noise in our dataset. So it was time to bring out the big guns: using our validated datasets, we developed an NLP model that learned to distinguish between relevant and irrelevant news items. Ultimately, we concluded the project with a properly cleaned dataset.
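For readers who want a feel for how a model can learn this distinction from labelled examples, below is a minimal sketch of a text classifier. It uses multinomial Naive Bayes purely as an illustration; this is an assumption on our part for the example and not necessarily the type of model used in the project. The training texts and labels are made up, and a real model would need far more validated examples.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split a text into lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = {label: Counter() for label in self.class_counts}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = set()
        for counts in self.word_counts.values():
            self.vocab.update(counts)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            # Start from the log prior of the class.
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokenize(text):
                if token not in self.vocab:
                    continue  # ignore words never seen in training
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up training examples standing in for the validated datasets.
train_texts = [
    "man attacked over face mask dispute",
    "domestic violence rises during lockdown",
    "police clash with lockdown protesters",
    "team beat rivals after covid break",
    "stock market hit by covid fears",
    "striker returns after lockdown break",
]
train_labels = ["relevant"] * 3 + ["irrelevant"] * 3

model = NaiveBayes().fit(train_texts, train_labels)
print(model.predict("violence at lockdown protest"))
print(model.predict("team beat rivals"))
```

A classifier like this complements the query: the query casts a wide net, and the model filters out items that match the keywords but are not actually about corona-related violence.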
In September 2021, 16 Digital Power data specialists were given 3 hours to extract insights from the dataset that could be of interest to policymakers.
While this hackathon once again exposed data quality issues, our consultants also discovered some interesting opportunities. One of the teams, for example, worked on a text analysis to map the language use around various themes (such as racism).
A second team looked at geographic patterns in reporting on corona-related violence, and explored possible links between (the volume of) reporting and press freedom in different countries. In short: although the quality of the dataset is still not perfect, there is plenty to investigate with the COVID-19 Violence Tracker!
Curious about the dataset? Download it here.
We carried out this project via our Datahub foundation, together with PeaceTech Lab. In addition, the following partners contributed: