On Coronavirus Analytics and Reporting
The evolution of the COVID-19 coronavirus epidemic has triggered enormous interest in crunching the available data. Plenty of data on the topic is readily available, which makes the task appear particularly easy.
Both official institutions and individuals have created numerous visualisations, of varying quality and varying practical usefulness. I must admit that after having a look at a couple of them I was quite disappointed. Most of them, while often visually appealing, have significant shortcomings. We will not discuss the details here; those interested in the topic can find some critical thoughts here. The most important takeaway is that most of these visualisations do not really facilitate any decision-making process. Even if they present trustworthy data, they often present it the wrong way. Typically, they answer “what has already happened?” or “what does it look like now?” – which is not bad as such. At the same time, they make it difficult to find out “how has the situation been changing?”, not to mention the most important question: “what will, or can, happen in the future?”
Of course, many attempts have been made to answer the latter question. The availability of the various data tempted many amateurs – in both epidemiology and data handling – to build and publish their “models”. This activity was particularly visible in the early stages of the epidemic. The “models” were very often little more than an exponential regression fitted to the few available data points. Very simple, nothing could go wrong, right? Apparently, many people did seemingly the same thing, on ostensibly the same data, and they… got totally different results. The only common finding was that the curve was rising quickly. But that was visible even without a “model”. We will mercifully not point to any examples. In fact, the most outrageous ones no longer seem to be easily available on the Internet.
We will not try to show yet another visualisation here – there are enough of them. Nor will we attempt to build another predictive model of the course of the epidemic. While we are experts in data, we definitely prefer to leave disease spread modelling to professionals in epidemiology. Instead, we start with something very simple: the basic facts and metrics. Why? Because they are the cornerstone of all business analytics, not only coronavirus-related ones.
Facts, dimensions and metrics
In the case of coronavirus disease the atomic event is an infection. In other words, infections are our facts. Now, what is the most basic metric on these facts? It is, of course, the number of infections, and we will focus on that one. Each infection occurs at some point in time, at some geographical location, and to a person with certain characteristics (age, gender, profession, …). These are the potential dimensions. The important one here, and the most common in nearly every business analysis, is time. That is why we take it as our example.
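To make the idea concrete, here is a minimal sketch of such a fact table in pandas: each row is one atomic event (an infection), with time, location and person attributes as candidate dimensions. The column names and values are purely illustrative, not taken from any real dataset.

```python
import pandas as pd

# One row per fact (infection); the other columns are dimensions.
infections = pd.DataFrame({
    "case_id": [1, 2, 3],
    "date":    pd.to_datetime(["2020-03-01", "2020-03-01", "2020-03-02"]),
    "region":  ["Berlin", "Bayern", "Berlin"],
    "age":     [34, 67, 51],
})

# The most basic metric: count the facts, here along the time dimension.
infections_per_day = infections.groupby("date").size()
```

Every metric discussed below is, at its core, a variant of this count aggregated along one dimension or another.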
Immediately we can conceive the first metric: the total number of infections, or more precisely the cumulative number of infections to date. This is the metric most frequently shown in tables and visualisations – typically with the ‘date’ being ‘today’. But if we have a closer look at this metric, we notice that it is similar to the total number of website visits, the total number of customers, or the total amount of sales (to date). These are all vanity metrics. By definition, they will always grow. At the same time they carry very little useful information about how we are really doing.
Thus, we are not that interested in the cumulative number of infections to date. We would rather look at the number of (new) infections per day. The business equivalents would be, for example, the number of website visits per day or the daily amount of sales. From this metric we can easily see the trend (increasing, stable, decreasing), or whether Monday is always the worst day of the week. This one is useful.
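The conversion between the two metrics is trivial: the daily figure is simply the day-over-day difference of the cumulative one. A minimal sketch, with made-up numbers:

```python
import pandas as pd

# Cumulative infections to date -- the "vanity" metric.
cumulative = pd.Series(
    [10, 25, 45, 80, 130],
    index=pd.date_range("2020-03-01", periods=5, freq="D"),
)

# New infections per day: the day-over-day difference. The first day
# has no predecessor, so we take the cumulative value itself there.
daily_new = cumulative.diff().fillna(cumulative.iloc[0]).astype(int)

print(daily_new.tolist())  # [10, 15, 20, 35, 50]
```

Note that the derivation only works in this direction without loss; going from daily back to cumulative is a running sum (`daily_new.cumsum()`), which is why publishing the daily series is strictly more informative.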
So, let’s see how this looks on a chart. As mentioned previously, the data are easily available, and so are the charts. One of the most popular services showing curated data and charts is Worldometer. Since we are located in Berlin, let’s have a look at the data concerning Germany. The official source of German data is the Robert Koch Institute (RKI). There is also a lot of other information, including scientific studies, on their website, so if you understand German it is well worth reading.
The Worldometer Germany Daily New Cases chart is updated daily, and no history is provided. At the time of writing it looked like this:
The respective RKI charts are given in the Daily Situation Reports (updated daily, archive available). The one from the time of writing looked like this:
Not all dates are equal
It does not take much time to realise that they are substantially different. It is not just a matter of a different scale on the Y-axis and a different graphical representation. They look as if they were reporting on completely different data. But this is not the case – Worldometer takes the official data from RKI. Beyond any doubt, the source data are the same.
In both cases we use seemingly the same metric, the number of (new) infections, aggregated and charted against seemingly the same dimension: date. If we have a closer look, we realise that for Worldometer “date” means “date of reporting”. RKI, on the other hand, uses “date of onset of symptoms” (if known) or “date of reporting” (if the date of onset of symptoms is not known). In fact, these are two distinctly different dimensions, disguised under the same name: “date”. This simple observation is of crucial importance and we will discuss it in a separate post.
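This is easy to demonstrate on a toy fact table. The same infection records, aggregated by two different date columns, produce two genuinely different time series. The records below are invented for illustration:

```python
import pandas as pd

# The same four facts, each carrying two candidate "date" dimensions.
# Case 3 has no known onset of symptoms, as often happens in practice.
infections = pd.DataFrame({
    "case_id":     [1, 2, 3, 4],
    "onset_date":  ["2020-04-01", "2020-04-02", None, "2020-04-02"],
    "report_date": ["2020-04-05", "2020-04-05", "2020-04-06", "2020-04-06"],
})

# "Worldometer-style": aggregate by the date of reporting.
by_report = infections.groupby("report_date").size()

# Onset-based: aggregate by the date of onset, where it is known.
by_onset = infections.dropna(subset=["onset_date"]).groupby("onset_date").size()

print(by_report.to_dict())  # {'2020-04-05': 2, '2020-04-06': 2}
print(by_onset.to_dict())   # {'2020-04-01': 1, '2020-04-02': 2}
```

Same facts, same metric, two different charts – exactly the discrepancy visible between the two published charts above.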
To be exact, the values on the RKI chart are plotted against two different dimensions, which are marked on the same X-axis. This is a very rare thing to do and normally we would discourage such attempts. However, in this case the presentation serves a purpose.
Understand the process
To understand why it is important, we first need to understand the measurement and reporting process. In an ideal world we would like to report on the date of infection. Obviously, in reality we do not know this date. In science, if we cannot measure or calculate a certain value directly, we can try to estimate it. So, let us think about what the best estimate of it would be.
The sequence of events in the whole infection and diagnosis process (greatly simplified for the purpose of this post) may be as follows: one gets infected, after a few days develops symptoms, samples are taken, delivered to the lab, processed there, the results are obtained, and finally the results are reported through the official channels. As we understand it, in the general case the symptoms may develop at any stage of this sequence – theoretically even after the case is officially reported – or not at all. Those interested in the details can read more on the RKI website.
What is important for our further considerations is that there are multiple steps and consequently multiple potential timestamps to use. The delays between the steps are measured in hours or days, are variable, and are significant from the perspective of the dynamics of the process we are trying to depict. In other words, a person whose case is reported today may have been infected 3 days ago or 10 days ago. If we look again at the simplified sequence of events above, it should be clear that these variable delays come mostly from the diagnosing and reporting procedure. Once we realise that, we can explain the sudden spikes and the weekly variations on the Worldometer chart.
As we can see, the date of reporting is the very last date in the whole process. It is also very different from the date of infection we are looking for. The closest to it is the date of the onset of symptoms, so RKI logically takes that one as the best estimate. As a result we get a smoother chart, without the spikes or drops caused by irregularities in the testing and reporting process. It depicts the process we are interested in much better. The problem is that in many cases the date of onset of symptoms is not known, or cannot be given. These cases cannot be ignored or omitted, so they are charted by the only certain and available date – the date of reporting. As mentioned above, we are sceptical about mixing two distinct dimensions on the same report and would normally discourage the practice in any business context. At the same time, we clearly see the advantage of using the best estimate whenever possible.
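In data-engineering terms, the RKI approach is a coalesce: take the onset date where known, and fall back to the reporting date otherwise. A minimal sketch, with invented records:

```python
import pandas as pd

# Three cases; the second has no known date of onset of symptoms.
cases = pd.DataFrame({
    "onset_date":  pd.to_datetime(["2020-04-01", None, "2020-04-03"]),
    "report_date": pd.to_datetime(["2020-04-05", "2020-04-06", "2020-04-07"]),
})

# Best available estimate of the infection date: the onset date where
# known, else the reporting date. In SQL: COALESCE(onset, report).
cases["best_date"] = cases["onset_date"].fillna(cases["report_date"])
```

Charting against `best_date` gives the RKI-style series; an honest chart would additionally mark which of the two dimensions each bar actually comes from, as RKI does.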
The RKI way of charting has one more important feature: it can change retrospectively. One of the cases reported today might have had its onset of symptoms 8 days ago; in another, the symptoms might (in theory, at least) appear 2 days from now. In both cases we would need to update the whole chart, not just the most recent date.
As an illustration, here is how the RKI report changed over one week, from 2020-04-21 to 2020-04-28. The retrospective changes in the most recent days are remarkable, and adjustments to numbers going back as far as a few weeks can also be noticed.
We have seen it already
Such things are not unusual; in many applications events are reported with a significant delay and not necessarily in the order in which they occur. Many modern data processing systems have built-in features that handle such late arrivals easily. The consequence, however, is that a report for the last month generated today will be substantially different from the same report generated a week from now. Many business stakeholders find such volatility in their reporting hard to deal with. This often leads to picking simple and stable “Worldometer-like” reports over more adequate, but also more elaborate and volatile, “RKI-like” ones – even though the latter provide a better basis for data-driven decision making.
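The mechanics of this volatility are easy to reproduce. In the sketch below (with invented data), one late-arriving record carries an event date in the past, so re-running the same aggregation changes a historical value:

```python
import pandas as pd

# Facts known as of Monday, keyed by the best-estimate event date.
cases = pd.DataFrame({
    "best_date": pd.to_datetime(["2020-04-20", "2020-04-21", "2020-04-21"]),
})
report_monday = cases.groupby("best_date").size()

# One late arrival: reported on Tuesday, but with onset back on the 20th.
late = pd.DataFrame({"best_date": pd.to_datetime(["2020-04-20"])})
report_tuesday = pd.concat([cases, late]).groupby("best_date").size()

# The count for 2020-04-20 has changed retrospectively: 1 -> 2.
```

A report keyed by the date of reporting never exhibits this: past values are frozen by construction, which is exactly why the stable-but-less-informative variant is so tempting.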
As shown above, even in a relatively simple data analysis there are multiple decisions to be made. Depending on how they are made, and on how the whole data collection and analysis process is conducted, reports of (seemingly) the same metrics, based on exactly the same data, may differ significantly. A clear definition of the relevant metrics and dimensions is the starting point for any business reporting. Done right, it can greatly facilitate the decision-making process. Neglected or done carelessly, it will lead to discrepancies, misunderstandings and false conclusions.