COVID-19 — Beyond the Dashboards
Data Scientist Capstone Project
2020 will go down in history as the year the SARS-CoV-2 pandemic had a firm grip on the world.
From a tech perspective, and on a smaller (yet significant) scale, it also marks a turning point after which knowing what a dashboard is became general knowledge. People are flocking to WHO, Johns Hopkins and other national and/or private dashboards in order to keep informed about the daily development of the worldwide pandemic.
Live dashboards mostly focus on the accumulated numbers of infections and deaths. Sometimes detailed information like case mortality per age group or gender is included. That way, they provide a quick overview of the situation and provoke a certain sensationalism in people: 5,000 new cases today! Already 2% of the population vaccinated!
In my home country Germany, most of the COVID-related legislation is implemented by federal states and local authorities. Hence, dashboards are also very well suited to keep track of whether you might expect stricter rules in daily life or when commuting: if your destination currently suffers high infection rates, travel restrictions or extra rules like wearing a mask in inner city areas may apply.
What happens behind a flashing red map and more or less exponentially rising numbers though remains a mystery for many — and a critical one in my opinion. Authorities are struggling with people’s acceptance of harsh measures dictating their lives and interesting theories like this one have been on the rise:
“If we tested less, we would see less cases. If we had less cases, there wouldn’t be a pandemic.”
While it might be a hopeless endeavour to argue with such flawless logic, it actually points to a valid question: In what ways does the virus actually spread and how accurate is the data we have? Officials were overwhelmed with the challenges of case tracing and especially large demonstrations attracting people from all over the country raised further concerns of infections spreading fast and numbers of undetected cases rising.
Problem Statement & Approach
Discussing potential dark figures and fostering better understanding of the COVID-19 pandemic is not only crucial to effectively protect others — it is also the core issue I wanted to tackle. Hence, this project focuses on the following leading questions in order to understand the development of the Coronavirus pandemic in Germany in 2020:
- How did Coronavirus infections spread across the country and what were the affected groups?
- Based on this, how (well) can we estimate “dark figures”, i.e. the number of undetected Coronavirus infections using a statistical approach?
To address the first question, the given data is analyzed using general interpretation and visualization, as well as descriptive statistics and further research on detected patterns within the development of the pandemic over time. The main focus here is to assess to what extent different demographic groups were affected both during phases of low case numbers and during the so-called pandemic waves, when infections spread exponentially. In detail, case distributions across age groups are taken into account as well as documented breakout clusters and general information on clinical reporting and test rates.
For the second question, these observations are discussed with regard to the underlying assumptions on the demographic distribution of infections, on which the estimation of dark figures is based. Finally, the estimation of weekly undetected Coronavirus cases is implemented and evaluated with respect to the above findings. The implemented approach contains no further modeling or prediction of future data.
Restrictions, challenges and potential improvements are discussed within the respective sections.
By their very definition, estimates of undetected cases can hardly be validated, i.e. measured directly with a specific metric. The discussion above will therefore be used to assess the results. Furthermore, I will compare the method used here to another approach applied to the same data source.
The main analysis and the estimation of dark figures were carried out in a Jupyter Notebook. Custom functions for data preprocessing and visualizations were moved to adjacent Python modules.
Estimating Dark Figures — Approach
There are different methods for calculating the number of undetected cases, or dark figures, of an infectious disease. Many take into account different parameters like case mortality and the prevalence of symptoms among different age groups or between sexes. All methods are based on one thing though: assumptions. Especially for a novel disease like the Coronavirus, there is (as yet) little data from studies. This makes it all the more important to analyze given data with respect to the validity of assumptions made for the estimation of undetected cases.
I implemented a statistical approach based on age, weighted by the influence of social contacts on the spread of infectious diseases. Further uncertainty is added by asymptomatic cases. The approach was introduced in this blog by a team from the Fraunhofer Institute for Industrial Mathematics and applied to cases in Germany and Italy in April 2020.
Statistical Estimation of Undetected Coronavirus cases based on
The basic assumption is that statistically, infections (whilst having spread to the whole population) are likely to be more prevalent in younger people, due to higher mobility and a higher number of social contacts. At the same time, symptomatic cases are more likely to be detected. For the Coronavirus, it has been confirmed that elderly people are at a higher risk of developing severe symptoms when infected.
Assuming that this leads to a higher number of cases being detected, cases among the elderly population can be taken as a benchmark for undetected infections among the whole population. The relative number of social contacts is then used as a factor in all other groups. In a last step, a further increase of undetected cases, or uncertainty, is added to account for asymptomatic cases.
Hence, the major steps to carry out for the estimation itself are:
- get reported case incidences across all age groups over time
- replace reported incidences with appropriate benchmark incidence
- factorize case incidences with the relative number of social contacts
- calculate estimated total case numbers over time using population data
- account for additional uncertainty due to asymptomatic cases
Details are described in the Implementation section below.
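In compact notation, the steps above can be summarized as follows. The symbols are my own shorthand for this sketch and are not taken from the original description of the approach: $i_{a,w}$ is the reported incidence per 100,000 inhabitants in age group $a$ and week $w$, $b_w$ the benchmark incidence that replaces it, $c_a$ the relative number of social contacts (scaled so that the benchmark group has $c = 1$), $P_a$ the population of the age group, and $u$ the assumed additional share of asymptomatic cases.

```latex
\hat{\imath}_{a,w} = c_a \, b_w
\qquad\qquad
\hat{N}_w = \sum_a \hat{\imath}_{a,w} \, \frac{P_a}{100\,000}
\qquad\qquad
\hat{N}_w \;\le\; N_w^{\text{est}} \;\le\; (1 + u)\, \hat{N}_w , \quad 0 \le u \le 0.4
```

Reading the asymptomatic uncertainty as a multiplicative factor $(1 + u)$ is my interpretation of "adding up to 40%" to the estimate.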
Data Sources & Wrangling
Due to the relevance of the topic, there are numerous sources of information on the Coronavirus. For reliability and to avoid undesirable effects on my estimation, official data was drawn from the Robert Koch Institute (yes, they have a dashboard, too). For those unfamiliar with this institution, a little background on the data itself and its collection process:
The Robert Koch Institute (RKI) is a German federal government agency and research institute responsible for disease control and prevention, subordinate to the Federal Ministry of Health. The institute researches and monitors public health and as such, supplies a large pool of publicly accessible data on both infectious and non-communicable diseases. The RKI also acts as a national advisor for vaccinations and other disease control policies.
RKI data on Coronavirus development in Germany can be found here. The following characteristics and restrictions apply:
- The RKI supplies official records from the local health authorities, i.e. no estimations, predictions or other sources such as social media. These records only include cases where an infection (regardless of the presence of symptoms) was confirmed by a laboratory PCR test. This limitation is crucial to avoid undesirable effects on our own estimation of undetected cases.
- The location of an infection is the district or city of the reporting local health authority — we have to assume that this usually is the place of residence of the person infected.
- Dashboard data (daily new cases and deaths per age and region) is processed and updated daily, but is subject to delays in the reporting process. A few days can pass between notification of local authorities, testing and case reporting to the RKI. Weekly statistics are updated on different days of the week, so again, the most recent data might be unreliable.
- While the raw number of confirmed cases and deaths has been collected since the beginning of the year, systematic reporting of detailed information (i.e. detailed classification of age groups, tracking of infection sources, symptoms etc.) has been implemented later.
Due to the latter restrictions, the main focus of the analysis is the period between late February and December of 2020 (data was downloaded on January 18th).
Supporting data on population was drawn from the German Federal Statistical Office:
Socio-demographic parameters for the calculation of undetected cases were taken from this paper on Social Contacts and Mixing Patterns Relevant to the Spread of Infectious Diseases (Mossong et al., 2008).
Data Quality & Preprocessing
While it was helpful that there was a variety of publicly available information beyond the content of the RKI dashboard, the data suffered from inconsistent formatting and, as stated above, missing information especially at the beginning of the year.
The main challenge during preprocessing was that RKI data, supplied as Excel/CSV tables, was often formatted inconsistently: sometimes different punctuation (2.000 vs. 2,000) was used, date formats varied or numbers were given as strings (e.g. “<4” instead of 0…3). Hence, cleaning mainly included string extractions and replacements in different columns before numeric conversion could be done. Furthermore, I renamed columns where necessary to be suitable for pandas operations and merged information from different tables where possible.
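To give an idea of what such a cleaning step can look like, here is a minimal sketch. It is not the code from my preprocessing module: the function name `clean_count_column` is hypothetical, and it assumes that the column holds whole-number counts, so that both "." and "," can only be thousands separators. How to treat censored entries like "<4" is a judgment call; here they become missing values unless a replacement is passed in.

```python
import pandas as pd


def clean_count_column(raw: pd.Series, censored_value: float = float("nan")) -> pd.Series:
    """Convert count strings such as '2.000', '2,000' or '<4' to numbers.

    Assumption: values are integer counts, so '.' and ',' are thousands
    separators and never decimal marks.
    """
    text = raw.astype(str).str.strip()
    # Entries like '<4' only give an upper bound, not an exact count.
    censored = text.str.startswith("<")
    # Drop thousands separators and stray whitespace before conversion.
    digits = text.str.replace(r"[.,\s]", "", regex=True)
    numbers = pd.to_numeric(digits, errors="coerce")
    return numbers.mask(censored, censored_value)


raw = pd.Series(["2.000", "2,000", "<4", "17"])
cleaned = clean_count_column(raw)
cleaned_zero_filled = clean_count_column(raw, censored_value=0)
```

Keeping censored entries as missing values avoids silently inventing a count; filling them with 0 (or any other value) is a choice that should be made per table.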
Since daily and weekly case numbers were mixed in this dataset, I also made sure every table included the calendar week as a common time-based reference for comparison between tables (with week 53 mapped to the year 2020).
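One way to derive such a common reference is the ISO calendar, sketched below with made-up case numbers and hypothetical column names. With ISO weeks, the first days of January 2021 automatically fall into week 53 of the ISO year 2020, which matches the mapping described above. I am assuming here that ISO weeks are the right convention for the source tables.

```python
import pandas as pd

# Made-up daily records; column names are hypothetical.
daily = pd.DataFrame(
    {
        "date": pd.to_datetime(["2020-03-02", "2020-12-31", "2021-01-03"]),
        "new_cases": [120, 80, 50],
    }
)

# ISO year and ISO week for every date.
iso = daily["date"].dt.isocalendar()
daily["iso_year"] = iso["year"].astype(int)
daily["calendar_week"] = iso["week"].astype(int)

# Aggregate daily numbers to one row per calendar week.
weekly = daily.groupby(["iso_year", "calendar_week"], as_index=False)["new_cases"].sum()
```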
Preprocessing yielded the following tables for my analysis and estimation:
- overview: daily and weekly development in new cases and deaths over time in all federal states and per general age group and sex over time, including both the infection and the reporting date.
- cases_age: weekly number of new cases and case incidences per age group in detailed 5-year intervals
- clinical: weekly numbers of reported clinical indications (hospitalization, symptom prevalence) and deaths per sex
- deaths: weekly number of deaths per age group and sex
- breakouts: weekly number of cases in breakouts (defined as 2 or more related cases) over time, including the respective environment in which the infection occurred
- tests: weekly testing capacities and backlogs
Analysis: How did Coronavirus infections spread across the country and what were the affected groups?
In 2020, Germany saw 2.04 million Coronavirus cases and went through two major pandemic waves: the first one in March and April and the second, more severe one starting in autumn after calendar week 40. During summer (week 20 through 40), case numbers were as low as 2,345 per week. Weekly new infections peaked during week 51, with a maximum of 175,000 new cases:
Geographic & Demographic Aspects
In order to analyze geographic and demographic aspects of the development of infections over time in detail, I chose heatmap visualizations instead of the line plots and interactive maps popular in various dashboards. Due to the large number of age groups and regions to compare, this provides a comprehensive overview of the development over time:
This geographic development over time already presents some interesting details. During the first wave, infections mostly coincided with areas of high population or high population density: both total case numbers and incidences, i.e. cases per 100,000 inhabitants, were highest in those states with the largest and youngest overall population as well as in large cities like Hamburg or Berlin. One exception here is Saarland, which borders France (a country that was hit severely during the first wave). While we cannot draw specific conclusions here, these are first indications of the relevance of age along with mobility and the number of social contacts in the development of the first wave.
During the second wave, incidences were high everywhere, indicating that the virus had by then spread across the whole country instead of occurring in local clusters. While the sparsely populated North (Mecklenburg-Western Pomerania and Schleswig-Holstein) still had the lowest incidences, two neighbouring states in eastern Germany, Thuringia and Saxony, were disproportionately hit. There are still lively discussions about possible socio-demographic reasons behind this.
The development of Coronavirus infections among age groups over time supports the conclusion from above that by the second wave, the virus had spread across the whole population. Looking at the age structure of Germany, it makes sense at first sight that the largest number of cases occurs in the largest groups of 50–60 and 25–35 years of age. So to compare how different age groups were affected, we have to look at the case incidence again, this time within the respective age groups.
By case incidence, the elderly population of 90+ years of age is most heavily affected especially in the second wave. But due to the huge range of case incidences across both age groups and time, we don’t see a lot of detail on the heatmap especially in summer 2020.
In order to tackle this issue and compare age groups more effectively, I thus divided the Coronavirus case incidence for each individual age group by the case incidence among the general population over time (which is shown in the first line in the two plots above):
This yields the proportion to which a specific age group is affected in comparison to the general population. Bear in mind though that the latter is influenced heavily by the larger population groups (i.e. the population 50–65 years of age), so these groups tend to be near a relative value of 1.
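The normalization itself is a single division per week. The following sketch uses made-up incidences and hypothetical row labels (with the general population stored in a row called "total"), so it illustrates the idea rather than reproducing my notebook:

```python
import pandas as pd

# Made-up incidences per 100,000; rows are age groups, columns are weeks.
incidence = pd.DataFrame(
    {"week_15": [40.0, 30.0, 150.0], "week_30": [5.0, 10.0, 2.5]},
    index=["total", "20-24", "90+"],
)

# Divide every age group by the general incidence of the same week.
relative = incidence.drop(index="total").div(incidence.loc["total"], axis="columns")
```

A value of 1 means the group is affected exactly like the general population in that week; in this toy example the 90+ group sits at 3.75 in week 15 and at 0.5 in week 30.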
Now we can see that during both pandemic waves, incidences among those aged 90+ are up to 5 times higher in comparison to the overall population, with a peak during the first wave.
During summer though, there is a shift: as most restrictions (except for bans on large events) were lifted, people traveled and the more mobile younger generations had the highest incidences. Especially those aged 20–30 were affected during summer months. In week 41/42, at the beginning of the second wave, infections quickly spread to all other age groups again.
During the whole year, overall reported infections did not occur evenly across age groups. The fact that younger generations show a much higher case incidence during summer makes adjustments to the original approach chosen for the estimation of dark figures necessary. These are discussed in the Implementation section.
Overall, the above figure still supports the assumption made for our approach to estimate dark figures, that infections are (at times) disproportionately prevalent among the elderly population. Yet, we still need to find suitable benchmarks for the summer, take into account the reliability of case documentation and investigate other aspects of the pandemic development over time.
On a side note: Society seems to have done a good job in protecting the younger part of the risk group. Case incidences among those aged 65–75 are consistently below the general incidence. Multiple factors could make social distancing easier for them than for others, including:
- this group has reached retirement age, and potential kids most likely have moved out of the household
- at the same time, this group has the highest rate of home owners among all age groups, so #stayathome can be comfortably implemented for many
- contact with facilities like hospitals and nursing homes only increases later in life
Traced Sources of Infections and Breakouts — Where Did Coronavirus Infections Occur?
Apart from age groups and sex, authorities started to record where infections occurred in order to be able to trace and break infection chains when the first pandemic wave became imminent. These breakouts, defined as two or more related cases, were reported together with the specific setting in which they occurred. For a first overview, I visualized the documented breakouts in terms of total case numbers per setting and the share each infection setting has in the overall situation at the time:
According to this data, private households continuously account for a large number of infections. Similarly, both pandemic waves come with a massive surge of cases in retirement/nursing homes along with other care and medical facilities such as general hospitals, both in total and relative to other sources. Of all breakouts recorded, those in retirement/nursing homes account for up to 40% of cases. This coincides with the fact that people over 90 were disproportionately affected during these times (1 in 4 among this age group is in a nursing home), so we can suspect a high correlation here.
We can also assume that numbers in medical facilities are among the most reliable: especially nursing homes implemented severe restrictions and routine testing processes throughout the year, and official rules on isolation imposed by authorities are stricter. Hence, using case incidences among the elderly as a benchmark for estimating dark figures can be regarded as a reasonable choice. From the statistics above it is unclear though whether infections among medical personnel are allocated to the workplace category or the medical facility they work in.
During summer, when overall case numbers were low and restrictions lifted, the share of infection sources which can be attributed to private households, traveling and social life grows. This goes hand in hand with the higher incidences among younger generations (20–30 year olds had the highest incidence) mentioned earlier.
Actual correlations back the above observations: elderly people being disproportionately affected by Coronavirus infections (in comparison to the general population) correlates highly with breakouts in nursing homes and medical facilities. Younger people on the other hand seem to rather catch infections in settings related to social life. As the “babyboomers” continuously have a relative case incidence close to the general population, we would need another type of visualization to investigate Coronavirus infections among this group.
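For illustration, this is how such a correlation can be computed once relative incidences and breakout shares share the same weekly index. The numbers below are invented to show the mechanics only, and the column names are hypothetical; they are not the values from the RKI tables.

```python
import pandas as pd

# Invented weekly values, one row per calendar week.
weekly = pd.DataFrame(
    {
        "relative_incidence_90plus": [4.0, 1.0, 0.5, 3.0],
        "share_nursing_homes": [0.40, 0.12, 0.05, 0.28],
        "share_leisure": [0.05, 0.20, 0.30, 0.08],
    }
)

# Pairwise Pearson correlation between all columns.
correlations = weekly.corr(method="pearson")
r_nursing = correlations.loc["relative_incidence_90plus", "share_nursing_homes"]
r_leisure = correlations.loc["relative_incidence_90plus", "share_leisure"]
```

With only a few dozen weekly data points and strongly autocorrelated series, such coefficients should be read as a rough indication, not as evidence of a causal link.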
Abnormal Observations and Implications on Data Reliability
Apart from this, we can observe other specific effects throughout the year together with the heatmap of relative case incidences across age groups: relative incidences have slightly visible peaks for those aged 44–54 in spring and for those aged 40–44 and 20–29 in June 2020.
- At the same time in summer, there is a distinct peak in cases occurring in workplace settings. This is directly linked to a single breakout in a meat factory with over 2000 infections among employees and relatives in the respective age ranges. The vast majority of the infected was asymptomatic or had only mild symptoms. Extensive contact tracing was set in place to prevent any spread of infections to the general population.
- The massive share of Coronavirus infections occurring in early spring can be traced back to the first major breakout in Germany during a carnival party in late February. As this event is generally considered to mark a turning point for Coronavirus in Germany and the beginning of the first pandemic wave, there is even a Wikipedia article.
While it is interesting to see events like these in the data, there are implications for calculating undetected cases as well. Political measures taken to stop the spread of the disease (like rigorous shutdowns of leisure facilities, hotels and gastronomy) are clearly visible in the number of reported breakouts and case incidences both in spring and winter.
This is unprecedented for any infectious disease in the recent past and most likely has a damping influence on the relative number of social contacts used to estimate undetected cases especially for younger age groups.
Untraced Sources of Infections and Breakouts
Since only two or more related cases are counted as a breakout in the records above, and since there is no data on infection sources of individual cases, we have to assume that the latter are unknown. Plotting total case numbers throughout the year as the sum of untracked individual cases and those attributed to a breakout visualizes this missing information:
On average, 30% of cases are attributed to a breakout setting. When infection rates rose the fastest, i.e. at the beginning of both pandemic waves, this percentage drops significantly, to 14.1% during the first wave and 12.6% during the second. Only in February and during summer, when infection rates were low, a larger share of cases was traced back to a specific infection setting. The beginning of the first wave and the well-documented meat factory incident make up for significant peaks here, again.
Adding individual cases to the “unknown” category of infection settings yields a more realistic picture of the status of case tracing. The vast majority of cases now comes from unknown sources: during the second wave, around 85%. This indicates that especially during times of high infection rates, authorities were overwhelmed with case tracing and would hence be at a higher risk of missing infection chains. This in turn would drive up the number of undetected cases.
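The split into traced and untraced cases is simple arithmetic on two weekly series. A minimal sketch with invented numbers and hypothetical column names:

```python
import pandas as pd

# Invented weekly totals and cases attributed to documented breakouts.
cases = pd.DataFrame(
    {"total": [1_000, 20_000, 400], "in_breakouts": [300, 2_800, 200]},
    index=pd.Index([9, 13, 30], name="calendar_week"),
)

# Everything not attributed to a breakout counts as "unknown source".
cases["untraced"] = cases["total"] - cases["in_breakouts"]
cases["traced_share"] = cases["in_breakouts"] / cases["total"]
```

Note that this treats every case outside a documented breakout as untraced, which is the assumption stated above and may overstate the unknown share if single cases with a known source exist.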
Testing and Reporting
This is further supported by the number of administered PCR tests and the number of cases for which the prevalence of symptoms and the hospitalization status were documented. Despite almost exponential development of infections especially during the second pandemic wave, testing capacities only increase linearly over the year, with a distinct drop over the Christmas holidays. At the same time, the documentation of patients’ medical status steadily decreases throughout the year, from around 90% to as low as 50%.
The effects of both pandemic waves are also clearly visible in the relation between positive test results and total number of tests administered: this rate drops from 18% at the beginning (limited test capacities, testing only symptomatic cases) to an average of below 0.6% during the whole summer, when broad testing for travellers, contact tracing and routine testing of potential contact persons without symptoms were implemented. Together with the low overall numbers of infections, we could hence assume a higher accuracy of official data during that time. During the second wave, the rates of positive test results increase again to up to 10%.
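When averaging the rate of positive tests over a period, there are two reasonable definitions, and they do not give the same number. The sketch below uses invented test counts and hypothetical column names to show both for the weeks 20 through 40:

```python
import pandas as pd

# Invented weekly test statistics.
tests = pd.DataFrame(
    {"tests": [100_000, 500_000, 1_000_000], "positive": [9_000, 3_000, 5_000]},
    index=pd.Index([14, 25, 35], name="calendar_week"),
)
tests["positive_rate"] = tests["positive"] / tests["tests"]

# Restrict to the summer period (label-based slice on the sorted week index).
summer = tests.loc[20:40]

# Option 1: average of the weekly rates (every week counts equally).
mean_of_weekly_rates = summer["positive_rate"].mean()
# Option 2: pooled rate (every test counts equally).
pooled_rate = summer["positive"].sum() / summer["tests"].sum()
```

The pooled rate gives more weight to weeks with many tests, so it is worth stating which of the two an "average rate" refers to.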
Does the Analysis Support the Approach taken to Estimate Dark Figures?
Coronavirus infections have indeed quickly spread to all age groups. Especially during the second pandemic wave, case incidences were high across nearly all age groups and federal states. Over time and due to various events and measures to break infection chains, there are several shifts in case incidences and in the proportions to which different age groups were affected. This challenges the original approach chosen for the estimation of dark figures, which simply assumed the elderly population as the benchmark.
Given the high (relative) incidences among those over 85 and 90 years of age along with the high number of documented cases in related medical settings, it still seems a reasonable choice to take this group as the benchmark for estimating dark figures during both pandemic waves.
During the summer though, we have to take into account the shift in case incidences mentioned above, where younger generations are more heavily affected than the general population. Given the low case numbers and the extremely low average rate of confirmed cases among all PCR tests of 0.56% in weeks 20 to 40, we assume that during this time official numbers were indeed quite reliable. Therefore, we switch to the actual reported cases per age group as a benchmark here. As many socio-economic restrictions were temporarily lifted over the summer, we still factorize the given numbers with the relative number of social contacts for the respective age groups.
To account for further uncertainty due to asymptomatic cases, the Fraunhofer team (ITWM) added a range of 0–40% of cases to their original estimation. This is supported by epidemiological studies cited on the RKI website suggesting that overall, around half of the virus transmissions can be attributed to asymptomatic cases or cases where symptoms have not developed yet.
We have seen that in breakout clusters, rigorous contact tracing and testing routines can actually uncover many otherwise undetected infections. Given the high numbers of cases with an unknown infection source though, the chosen range remains a reasonable choice.
Implementation & Results — Estimating Dark Figures Across Age Groups
For our calculation, we need the case incidences across age groups as well as the relative number of social contacts to multiply the case incidences with. The original table for the latter by Mossong et al. is shown below.
This table uses the age group 0–4 as reference, so as a first step, we have to rescale the relative number of social contacts so that the desired benchmark group of elderly people has a factor of 1. Furthermore, this table does not differentiate any age groups above 70 years, and some other age groups are given in 10-year intervals. As Coronavirus case incidences per age group are provided in 5-year intervals, we thus have to either restructure the age groups within our case incidence tables or apply some factors more than once: for example, age groups 70–74, 75–79 etc. all get the same benchmark factor of 1.
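A minimal sketch of this rescaling and of applying one factor to several 5-year groups is shown below. The contact values are placeholders that I made up for illustration, not the figures from the paper, and the group labels are hypothetical.

```python
import pandas as pd

# Placeholder relative contact numbers with age group 0-4 as reference (= 1.0).
# These are NOT the published values.
contacts_ref_0_4 = pd.Series({"0-4": 1.0, "20-29": 1.3, "70+": 0.8})

# Rescale so that the benchmark group gets a factor of exactly 1.
benchmark_group = "70+"
contacts = contacts_ref_0_4 / contacts_ref_0_4[benchmark_group]

# Map the 5-year groups of the case tables onto the coarser contact groups.
group_map = {
    "0-4": "0-4",
    "20-24": "20-29",
    "25-29": "20-29",
    "70-74": "70+",
    "75-79": "70+",
}
factors = pd.Series(group_map).map(contacts)
```

Applying the same factor to several 5-year groups is the simpler of the two options mentioned above; the alternative would be to aggregate the case tables to the coarser age groups first.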
We then assume the benchmark incidence of those 70+ years of age for all age groups, with the limitation that if any other group had a higher incidence at the time, we would suspend this mechanism. As expected, this yields a table where all case incidences are overwritten by the benchmark incidence during both pandemic waves, while incidences during the summer remain unchanged. This way, we avoid replacing any case incidence with a lower-than-reported value.
In the next step, we multiply case incidences by the relative number of social contacts to get the estimated incidences of Coronavirus cases in all age groups. It quickly becomes apparent that the factorization with estimated social contacts leads to a massive surge in cases among younger generations during the second wave:
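The replacement rule can be sketched as follows. The incidences are invented, and for brevity the elderly population is collapsed into a single hypothetical "70+" row; the week-level switch is my reading of the rule described above.

```python
import pandas as pd

# Invented incidences per 100,000; rows are age groups, columns are weeks.
incidence = pd.DataFrame(
    {"week_15": [20.0, 35.0, 60.0], "week_30": [12.0, 4.0, 3.0]},
    index=["20-24", "50-54", "70+"],
)

benchmark = incidence.loc["70+"]
# Highest incidence among all non-benchmark groups, per week.
max_other = incidence.drop(index="70+").max(axis="index")
# The benchmark is only applied in weeks where no other group exceeds it.
use_benchmark = benchmark >= max_other

adjusted = incidence.copy()
for week in incidence.columns:
    if use_benchmark[week]:
        # Overwrite all age groups with the benchmark incidence of that week.
        adjusted[week] = benchmark[week]
```

In this toy example, week 15 (a "wave" week) is overwritten with the benchmark, while week 30 (a "summer" week) keeps the reported values, so no entry ever ends up below what was reported.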
Then, the total number of estimated cases is calculated using the population data for each demographic group:
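Both steps, the weighting with social contacts and the conversion from incidences to case numbers, are element-wise multiplications along the age groups. A sketch with invented factors and population figures (none of them are the real values):

```python
import pandas as pd

groups = ["20-24", "50-54", "70+"]

# Benchmark-adjusted incidence per 100,000 for one week (invented).
adjusted_incidence = pd.DataFrame({"week_15": [60.0, 60.0, 60.0]}, index=groups)
# Relative number of social contacts, benchmark group = 1 (invented).
contact_factor = pd.Series([1.5, 1.2, 1.0], index=groups)
# Population per age group (invented).
population = pd.Series([4_000_000, 6_000_000, 10_000_000], index=groups)

# Step 1: weight incidences with the relative number of social contacts.
estimated_incidence = adjusted_incidence.mul(contact_factor, axis="index")
# Step 2: convert incidences per 100,000 into absolute case numbers.
estimated_cases = estimated_incidence.mul(population, axis="index") / 100_000
# Sum over all age groups to get the estimated weekly total.
weekly_total = estimated_cases.sum(axis="index")
```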
Plotting estimated cases over confirmed infections visualizes the surge in case numbers: up to 5 times more Coronavirus infections than reported are estimated by our approach during the second pandemic wave, up to 10 times in the first one. The 95% confidence intervals shown in the plot are attributed to the factorization with the relative numbers of social contacts.
Zooming in on case numbers during summer reveals that even with official case incidences and no benchmarking between age groups, estimated case numbers are much lower in comparison, but still at least twice as high as indicated by records.
Finally, the 40% uncertainty to account for asymptomatic cases is added, yielding a potential maximum of over 1.3 million undetected weekly cases during the second wave:
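As a sketch of this last step, the estimate can be turned into a range by scaling it with the assumed asymptomatic share. The numbers are invented, and treating "undetected" as the difference between the estimate and the reported cases is my assumption for this illustration.

```python
import pandas as pd

weeks = ["week_15", "week_30"]
# Invented weekly estimates and reported case numbers.
estimated = pd.Series([13_920.0, 2_500.0], index=weeks)
reported = pd.Series([3_000.0, 1_200.0], index=weeks)

# Upper end of the 0-40% range added for asymptomatic cases.
asymptomatic_share = 0.4

bounds = pd.DataFrame(
    {"lower": estimated, "upper": estimated * (1 + asymptomatic_share)}
)
# Assumption: undetected cases = estimated cases minus reported cases.
undetected_upper = bounds["upper"] - reported
```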
Evaluation & Improvements
Evaluation of the Implemented Approach
In order to assess my estimations, I compared the results with this dashboard for another approach for estimating undetected Coronavirus infections based on case mortality. The implemented approach uses differential equations, wherein the case mortality can be chosen as a configurable parameter by the user.
This implementation specifically models state transitions between those infected (both symptomatic and asymptomatic), deaths and recoveries. That way, the number of social contacts is taken out of the equation — but another uncertainty is introduced with the assumption of the duration of an infection.
The dashboard focuses on the cumulative numbers as well as on currently active cases. For a mortality rate of 0.6%, the model shows up to 5 times as many active cases as officially reported at the peak of the second wave in CW 51. Hence, our statistical approach and the model come to conclusions on dark figures which are within the same order of magnitude.
Numbers of undetected cases more than double if the case mortality parameter is decreased. Further research and data would be needed here in order to more accurately parametrize the model and hence improve the comparability with the approach taken here.
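The sensitivity to the mortality parameter can be made plausible with a back-of-the-envelope calculation. This is deliberately not the dashboard's differential-equation model: it ignores reporting delays, recoveries and the time between infection and death, and simply divides a made-up number of deaths by the assumed mortality. The function name is hypothetical.

```python
def estimated_infections(deaths: float, mortality: float) -> float:
    """Rough number of infections implied by a death count and a mortality rate."""
    if not 0 < mortality <= 1:
        raise ValueError("mortality must be in the interval (0, 1]")
    return deaths / mortality


deaths = 600  # made-up number of deaths

at_0_6_percent = estimated_infections(deaths, 0.006)
at_0_3_percent = estimated_infections(deaths, 0.003)
ratio = at_0_3_percent / at_0_6_percent
```

Because the estimate is inversely proportional to the mortality, halving the parameter doubles the implied number of infections in this simplified view, which shows how strongly such a model depends on a value that is itself uncertain.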
The approach taken only takes into account the influence of social contacts as a mere weighting factor. This is a rather broad simplification and could be improved further by taking into account other parameters, e.g.:
- the standard duration of an infection (until symptoms are developed and the infected person is isolated)
- the infectiousness of people (especially in asymptomatic cases)
In this regard, one could set up a time based model instead of using a static approach. As the reliability of current data is limited due to incubation times, reporting errors/delays etc., I personally would refrain from approaches to estimate daily or future numbers of new infections.
Regardless of these refinements, further data is needed in order to adjust the relative number of social contacts, especially with regards to the effects of the political measures taken in 2020. Apart from smaller age intervals, a distinction between (elderly) people in stationary care facilities and those who can still enjoy the pleasures of retirement in their own home would make sense.
Also, the idea of a dashboard with configurable input parameters would be appealing: While this does not immediately improve the approach as such, it would allow the community to participate and allow a data driven discussion of dark figures in general.
In this article, I analyzed the development of the Coronavirus pandemic in Germany in 2020 and, based on this, estimated undetected numbers of infections.
General Aspects & Restrictions
This global pandemic was, and still is, an overwhelming experience to which we had to adapt in many ways. Analyzing such a novel disease poses some major restrictions to making estimations:
- Systematic reporting and collection of data was implemented over the course of the pandemic. Documentation of breakouts, test rates and other information points to an overload of authorities, putting into question the reliability and completeness of given data during massive increases in case numbers.
- This already limited amount of data (less than 1 year) is divided into very different phases, dominated either by specific breakouts or by diffuse occurrence of infections with exponential growth.
- Given the novelty of the disease, there are a lot of uncertainties in general, including the prevalence of symptoms with regard to mechanisms of transmission, duration of infections etc.
Development of the Pandemic & Estimation of Dark Figures
How well Coronavirus cases are documented seems to strongly depend on the context they occur in. Distinct breakouts in closed settings like the described meat factory breakout in June 2020 were clearly visible in the data. In the same way, rigorous testing routines in medical environments indicate a comparably high number of reliably documented cases, especially among the elderly population. On the other hand, the vast majority of infections even among documented cases remains untraced, so we can assume significant numbers of undetected cases, especially asymptomatic ones.
Social patterns such as the relative number of contacts used to calculate dark figures might apply well during summer, when most restrictions were lifted and age groups with a higher mobility did indeed show higher case incidences. But the socio-economic measures taken to combat the spread of the pandemic in total are still likely to limit the validity of these parameters in the context of calculating dark figures for the Coronavirus. These unprecedented measures taken to contain the pandemic generally pose a major restriction to the parameterization and validation of both estimation approaches discussed here.
To sum it up: due to the very nature of dark figures it is hard to verify estimations in this context, and the conditions under which to estimate them are not easy for a novel virus like the Coronavirus. Yet, closely analyzing what is known about the development of this pandemic already helps to promote informed discussions and get a glimpse beyond the current dashboards…
Stay safe, stay curious and don’t forget to check out my code here.