Notes on dimensional modelling
In the other post on Coronavirus reporting we touched briefly the topic of dimensions. In particular we discussed about the temporal dimension. Since dimensional modelling is a basic concept in the data warehouse design it deserves a separate post.
The temporal dimension is commonly used in practically every business and science domain. Time / date is commonly considered as an example of a conformed dimension. In the end the 7th of November will always be the 7th of November and not the 25th of October, irrespectively of the context . In reality, we typically deal with multiple different time/date dimensions – date of order is generally different that date of delivery and different than date of payment. An attempt to combine these 3 facts in one drill-across report using a single ‘date’ dimension will inevitably produce nonsense. These are 3 different, conformed dimensions. In practice it is incredibly important to clearly distinguish different kinds of dates or timestamps.
Know your process
In some cases this distinction is relatively easy, as the difference comes from the business process. In the example above it easily understandable for everyone, that the delivery cannot occur before the order and there may be a remarkable delay between these two events. It is obvious, the delivery date and the order date are two different things. This is an inherent feature of a business process we are describing using analytical tools.
In other cases – and they are often more difficult to handle – the difference comes from how we measure certain quantities or report on events. In the previously described Coronavirus example we were really interested to know the date of infection. But this was not an event we could monitor or measure directly. The event we could easily monitor was the submission of the report of the identified case. There is a significant delay between the two events. Moreover, this delay is not constant and can vary from case to case. It depends heavily on the details of the whole diagnostics and reporting procedure. These details and their influence on the results should be evident for an insider, but very likely will be unknown to the final user of the analysis prepared on the base of the collected data.
How to buy on-line without a single website visit
Let’s consider a process of making a purchase in an on-line shop. Typically, a user visits the website, add some products to the basket, initiates check-out, finalises it and pays. Of course, there may be many different scenarios, but for sake of simplicity let’s restrict our consideration to these five steps. In the ideal world we can register and timestamp the events corresponding to each step. Very likely each of the steps will be handled by a different (micro-)service or even by an external provider. The data about the website visits would be collected and stored differently than the data about payments. And, of course, one might report on the number of website visits per day and, independently, on the number of transactions per day.
It is not difficult to figure out that it takes some time from the website visit to the payment. Hence, the timestamp of the visit event will never be the same as the timestamp of the payment event. It does not take much to realise that a user can browse the site for many hours, then add some products to basket, then leave this basket for another couple of hours or even days, and only then continue with checkout.
Obviously, if we try to combine our facts about visits and facts about transactions in one report using “date” – supposedly a conformed dimension – we will obtain a mess. This mess will probably not be immediately visible on a very high level. But on a specific dimension splits we might see a theoretically impossible situation of transactions without a visit at some dates.
Clearly, there is no single “date” dimension. There are several: visit date, transaction date, e.t.c. We may report the numbers by visit date or by transaction date, but we cannot use them interchangeably and should not mix them together. As a logical consequence, we should clearly distinguish the different types of dates and timestamps by using clear and consistent naming in the data warehouse. But naming is a topic by itself and one day we may cover it separately.
What about other dimensions?
While time and date are important, they are not the only dimension typically used. The other common example of a dimension is “customer”. Every business has customers. But “a customer” may have a different meaning depending who we speak to. In a CRM system one would often store both actual customers and prospects customers. If you talk to a sales representative or an account manager he would use the word “customer” to name both. For finance department “a customer” very likely will be someone they can issue an invoice (or have issued one in the past). For legal “a customer” might be someone we have a valid contract with – so no prospects, and no former customers. Some other department might even define it as “someone, who has bought from us in the last 3 months”.
However, in this case we can define a superset of all “customers”. This superset will contain all the prospects, actual customers, former customers, e.t.c. If we manage to collect all the relevant attributes for such a customer dimension and we have a single primary key to identify an entry – we have a conformed dimension. Despite the fact we use “customer” in many different contexts, we have a single conformed dimension, which we can use to create coherent drill-across reports.
The problem with “customer” is not that it can mean different things, the problem is the data about customers are present in many different systems. In general, different departments, and consequently the systems they use, may need various subsets of data. The data of a particular system may be incomplete, outdated of simply false. Some of the data in system A may contradict the data in system B. Thus, apart from having a single data warehouse definition of what a customer is, we also need a single source of truth about its attributes. But this is a separate topic of Master Data Management.
Don’t expect magic from dbt and don’t expect it will fix all your problems. Instead, expect to get a stable framework that makes your project as simple as possible. The only side effect of simplification I noticed is that I face much fewer problems and the ones I encounter are usually simple to debug. Bear >>
That’s dbt? WOW!
If you’re data enthusiast, it’s definitely worth trying. DBT (data build tool) is relatively fresh (v. 1.2) and open-source tool that perfectly fills the gap in data engineering stack. It’s dedicated for data engineers that love to code and hate to drag-and-drop. Particularly, for analytics engineers that work somewhere between data engineers and data analysts >>