What is a data warehouse?
The concept of a data warehouse is quite old; it originates from the late 1980s. Yet in our practice we find that the term “data warehouse” is often misunderstood. The existence of a similar term, “data lake”, does not make things easier, and the two concepts are frequently confused or mixed up.
There are two canonical definitions of a data warehouse:
“A data warehouse is a copy of transaction data specifically structured for query and analysis.” (Ralph Kimball)
“A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management’s decision making process.” (Bill Inmon)
Both of these definitions are very concise and rather high-level. They say what properties a data warehouse has, or what conditions a data store should fulfil to be a data warehouse. They do not directly describe what a data warehouse looks like, nor how to build one. But if we read them carefully, we will know precisely enough what is and what is not a data warehouse.
What is not a data warehouse
If we read the above definitions again, we can see they both define a data warehouse as a kind of data collection (“copy of transaction data”, “collection of data”). So it is certainly not a process of collecting, transforming or analysing data. It is not a data integration tool. Nor is it a reporting tool. It is a data store, and that is its key property.
But not every data store is a data warehouse. The data must serve a specific purpose. Kimball says it is for “query and analysis”; Inmon describes it as “support of management’s decision making process”. Obviously, a collection of data unsuitable for these purposes cannot be called a data warehouse. A folder of scanned invoices from the last 5 years, although it certainly contains valuable information, is completely unsuitable for analysis. An S3 bucket with a collection of logs or JSON-structured website event details is hardly usable for such a purpose either. None of these data stores can be considered a data warehouse.
Even a transactional database, holding all the transactional information in a neatly designed 3rd Normal Form, does not fulfil our conditions. It is made for operating a system or application and for storing (business) transaction data. It is specifically structured for recording transactions, not for query and analysis. Although technically one can run queries against a transactional database, it is not designed for analytical queries. It also contains the transaction data itself, not a copy of transaction data as Kimball requires. Inmon’s conditions of non-volatility and integration are not fulfilled either.
Getting to the details
An important feature of a data warehouse is that it is “subject-oriented”. The data should be organised around the subjects they describe or refer to, rather than the systems they originate from. As Kimball et al. wrote in their classic Data Warehouse Toolkit Classics, the key point is to identify and describe the business processes. These processes, not the organisational structure of the company or the architecture of the computer systems used, dictate how the data should be structured.
For example, a typical sales process is composed of several steps: placing an order, payment, shipment, etc. Each of these steps may be handled by a separate organisational unit or even by an external provider. Each of them may use different databases or information systems to do the job. The role of the data warehouse is to describe the complete process, not its separate components. The organisation of the data revolves around the process, not around departments or the origins of the data.
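In Kimball’s dimensional approach, such a process is typically modelled as a fact table surrounded by dimension tables. Below is a minimal sketch using SQLite; all table and column names are illustrative, not taken from any real system:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT, country TEXT);
CREATE TABLE dim_product  (product_key  INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_date     (date_key     INTEGER PRIMARY KEY, full_date TEXT, year INTEGER);

-- The fact table follows the whole sales process (order -> shipment),
-- not the individual systems that handled each step.
CREATE TABLE fact_sales (
    customer_key      INTEGER REFERENCES dim_customer(customer_key),
    product_key       INTEGER REFERENCES dim_product(product_key),
    order_date_key    INTEGER REFERENCES dim_date(date_key),
    shipment_date_key INTEGER REFERENCES dim_date(date_key),
    quantity INTEGER,
    amount   REAL
);
""")
conn.execute("INSERT INTO dim_customer VALUES (1, 'ACME Ltd', 'PL')")
conn.execute("INSERT INTO dim_product VALUES (1, 'Widget', 'Hardware')")
conn.execute("INSERT INTO dim_date VALUES (20240101, '2024-01-01', 2024), "
             "(20240105, '2024-01-05', 2024)")
conn.execute("INSERT INTO fact_sales VALUES (1, 1, 20240101, 20240105, 10, 99.90)")

# An analytical query joins the fact table with its dimensions.
row = conn.execute("""
    SELECT c.name, SUM(f.amount)
    FROM fact_sales f JOIN dim_customer c USING (customer_key)
    GROUP BY c.name
""").fetchone()
print(row)  # ('ACME Ltd', 99.9)
```

Note how the fact table spans the whole process: order and shipment dates sit side by side in one row, even though they would come from different source systems.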
The fact that the data come from various sources brings us to the next feature: integrated. This means more than just a multitude of sources. More importantly, the data coming from different systems must be aligned and unified. This includes introducing a single measurement system, uniform data formats and a consistent naming convention. Conflicts between data originating from different sources should be resolved.
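As a toy illustration of such unification, assume two hypothetical source systems report the same order with different field names, date formats and weight units; integration maps both into one agreed shape (the records and conversion below are invented for the example):

```python
from datetime import datetime

# Hypothetical raw records from two source systems.
crm_record = {"client": "ACME Ltd", "order_dt": "01/31/2024", "weight_lb": 22.0}
erp_record = {"customer_name": "ACME Ltd.", "order_date": "2024-01-31", "weight_kg": 10.0}

def from_crm(rec):
    return {
        "customer": rec["client"].rstrip("."),                 # consistent naming
        "order_date": datetime.strptime(rec["order_dt"],
                                        "%m/%d/%Y").date().isoformat(),  # uniform format
        "weight_kg": round(rec["weight_lb"] * 0.45359237, 2),  # single measurement system
    }

def from_erp(rec):
    return {
        "customer": rec["customer_name"].rstrip("."),
        "order_date": rec["order_date"],
        "weight_kg": rec["weight_kg"],
    }

unified = [from_crm(crm_record), from_erp(erp_record)]
print(unified[0])  # {'customer': 'ACME Ltd', 'order_date': '2024-01-31', 'weight_kg': 9.98}
```

After this step both records use the same keys, ISO dates and kilograms, so they can land in one table and be compared directly.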
The “non-volatile” attribute entails permanent storage of the data. So no system which collects, transforms or generates the data “on the fly” can be considered a data warehouse according to the definition. Moreover, non-volatility means the data should generally only be added, not removed. That is logical: the data depict some business process at a given point in time. Even if the state of the process changes in the future, the data collected so far carry the facts about the past states. If we want to analyse the history of such a process, or its changes over time, we should not delete or alter that information.
The restriction on altering information is not absolute and, contrary to popular belief, is not the same as never altering the underlying data. It does not entail read-only access to the data apart from loading. It does not mean old data can never be overwritten by new data. It certainly does not mean that executing an ALTER statement is prohibited. However, any such operation must not destroy historical information, and it should be well justified. Classic examples of perfectly legitimate overwriting or altering of data are the handling of Type 1 and Type 2 slowly changing dimensions.
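A minimal sketch of the two techniques, using SQLite and a hypothetical customer dimension (the column names and the `valid_from`/`valid_to`/`is_current` bookkeeping shown are just one common Type 2 convention):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE dim_customer (
    customer_key INTEGER PRIMARY KEY,
    customer_id  TEXT,    -- natural key from the source system
    city         TEXT,
    valid_from   TEXT,
    valid_to     TEXT,    -- NULL marks the current version
    is_current   INTEGER
)""")
conn.execute("INSERT INTO dim_customer VALUES "
             "(1, 'C-42', 'Krakow', '2020-01-01', NULL, 1)")

# Type 1: correcting a misspelling simply overwrites the value; no history kept.
conn.execute("UPDATE dim_customer SET city = 'Kraków' WHERE customer_id = 'C-42'")

# Type 2: an actual relocation closes the old row and inserts a new version,
# so the historical fact (where the customer used to live) is preserved.
conn.execute("""UPDATE dim_customer
                SET valid_to = '2024-06-30', is_current = 0
                WHERE customer_id = 'C-42' AND is_current = 1""")
conn.execute("INSERT INTO dim_customer VALUES "
             "(2, 'C-42', 'Warszawa', '2024-07-01', NULL, 1)")

rows = conn.execute(
    "SELECT city, is_current FROM dim_customer ORDER BY customer_key").fetchall()
print(rows)  # [('Kraków', 0), ('Warszawa', 1)]
```

Both operations rewrite data, yet only the Type 1 overwrite discards information, and it does so deliberately, for an attribute whose history nobody needs.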
A snapshot of the transaction data, properly integrated and structured, fulfils all the conditions discussed above. At first sight it matches Kimball’s definition completely. But how useful would such a snapshot really be for analysis or decision making? The analysis would be limited to the specific time the snapshot was taken and (maybe) the history before it. As time passes, the utility of such a snapshot quickly degrades. Obviously, for making decisions everyone wants the most up-to-date information. For the data warehouse to serve its purpose, its content must be updated regularly and frequently.
Certainly, if one attempts to answer the same question based on two different sets of information, the result may be two very different answers. If the data in the data warehouse change over time, then the results of queries will generally depend on when they are executed. Technically, the data warehouse is a time-variant system. In practice this means it is not a mere store of past data; it is live, it changes.
The time-variant attribute from Inmon’s definition is commonly misinterpreted as “referring to historical data” or “focusing on change over time”. Of course, the data warehouse stores historical data, and over a much longer horizon than transactional systems do. The analysis of changes over time plays an important role in decision making, and it is often how the collected data are used. But time-variant really means the output changes over time, because the data in the warehouse are not static: they are constantly being added and updated.
Is a data warehouse a database?
Neither definition explicitly mentions a database. Kimball only says the data must be “specifically structured”. Until recently it was fairly obvious that if data are stored for the purpose of analysis or decision support, they must be stored in a database. In fact, both Kimball and Inmon described in their methods the implementation of a data warehouse in a relational database.
However, their original works on this topic are from the 1990s, with some updates in the 2000s. These were pre-Hadoop times, and there were practically no other storage options suitable “for query and analysis”. Today it is possible to complete the task even without a database, using alternative storage options and a query engine. Still, implementations of a data warehouse on a relational database (like Oracle or PostgreSQL) or a Massively Parallel Processing (MPP) database (like Amazon Redshift, Netezza, Vertica or Greenplum) are by far the most common.
Today, with the development of Hadoop, the cloud and related tools, there are more options. There have been many data warehouse implementations with data stored as files on HDFS, with Hive or Impala used as the query engine to provide a SQL-like experience. Similar tools exist in the public cloud, Amazon Redshift Spectrum and Athena being just two examples.
Irrespective of the storage system and query engine used, the principles do not change: the data must be specifically structured for query and analysis. Dumping anything onto HDFS or S3, no matter what and no matter how, will result in trouble, not a data warehouse.
Not all databases are equal
Although it should be obvious, it is worth repeating that not every database is a data warehouse or can be treated as one, even if it contains a lot of historical data about various topics. A distinction must be made between On-Line Transaction Processing (OLTP) and On-Line Analytical Processing (OLAP) systems. The former are made to quickly execute a predefined set of operations, each acting on a single record or at most a few. The typical operations are writing single records or updating existing ones. OLTP systems may handle a very large number of such operations, but each individual operation is rather simple.
In contrast, OLAP systems, the category a data warehouse falls into, are made for analysing large sets of historical data. The queries are relatively infrequent, but they often operate on millions of records. They are almost exclusively read queries, hence the data warehouse should be optimised for read access. The data load processes typically write whole batches of data rather than single records.
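The contrast can be sketched with a toy table: an OLTP-style operation touches one record identified by its key, while an OLAP-style query scans and aggregates the whole set. The table and data below are made up for the illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(i, "shipped" if i % 2 else "open", float(i)) for i in range(1, 1001)])

# OLTP-style: touch exactly one record, located by its key.
conn.execute("UPDATE orders SET status = 'paid' WHERE id = 7")

# OLAP-style: scan and aggregate over the whole history.
total, cnt = conn.execute(
    "SELECT SUM(amount), COUNT(*) FROM orders WHERE status = 'shipped'").fetchone()
print(total, cnt)  # 249993.0 499
```

A database engine tuned for the first pattern (point lookups, short write transactions) is organised very differently from one tuned for the second (large sequential scans and aggregations), which is why the two workloads belong on separate systems.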
This list of differences is not complete, but it should already be evident that these two kinds of databases must be designed in completely different ways. No wonder there exist Relational Database Management Systems (RDBMS) targeted at OLTP or “general purpose” applications, and those made specifically for data warehousing. Mixing the two purposes, or running analytical queries directly on an OLTP database (even on a “read-only” replica), cannot lead to good results, especially at scale.
But how to build it?
So far we have discussed only what a data warehouse is and what it is not. Apart from a few hints, we have not given a clear recipe for how such a system should be built. Even a very brief guide would have to be many pages long, and it would only cover the basics and the most common concepts. Moreover, there are three distinct methodologies for tackling the task, and there are also specific system- and tool-dependent variations of them.
Those interested in the details should refer to the classic materials:
- Data Warehouse Toolkit Classics by R. Kimball et al.
- Building the Data Warehouse by W.H. Inmon
- Data Vault Series by D. Linstedt
Of course, you can also contact us and we will be happy to advise and assist you.
The term “data warehouse” appeared in the 1988 paper “An architecture for a business and information system”, but very likely had been in use long before.
Even Oracle, in their otherwise very good Data Warehouse Guide, makes this mistake.