Birth of Next Gen Data Warehouse
Päivitetty: 12. kesä 2019
The development of data warehouses started a few decades ago. Through the years multiple different views of designing and usage of data warehouses have emerged. The designing has traditionally been divided into Kimball or Inmon, and uses vary from organisation-wide centralised data integration solutions to system-specific report databases.
Traditional Data Warehouse
In Kimball’s architecture the design starts from forming the data warehouse from datamarts. Datamarts will usually be modelled in a star schema, so that they will support reporting tools as well as possible. A star schema can also be called a snowflake model, where different stars attach to each other with conformed dimensions. Before data is loaded into datamarts data is staged into a staging area. Staging area can easily become too complex when data from different sources will be combined to a format that is suitable for reporting (Picture 1). Then the storage and history of data is on datamarts’ shoulders.
Inmon warehouses are usually databases of third form which largely follow the database structure of the source system. Then combining data in the data warehouse and data pipeline designing can be quite difficult, because the relations between will define the loading order of the data. The basic idea of the Inmon approach is that a data warehouse is not primarily a reporting database, but a place to store and integrate data (Picture 2). This is why a new layer from views and database tables will be made on top of the data warehouse.
A New Generation of Data Warehouses
Both of the traditional data warehouse architectures were a starting point to the newest and the only design method, which was meant for data warehouses. The purpose was to gather the pros and cons from the previous architectures, discard the cons and create a new method for design. With this method data from different source systems can easily be integrated and warehouse can be built incrementally and agile. Data will stay as it was loaded to the warehouse, traceability is preserved and data can be loaded from different sources simultaneously, without any effort to sort the order of loading. Data Vault and its newest version (2.0) was created by Dan Linstedt.
Data Vault is a design method which consists of several rules. The most important rule for design is to identify the business keys which are used to model hub tables. When hubs are being modelled, a conceptual model should be used as a starting point. If the conceptual modelling is done well, the tables for the Data Vault -database can be created automatically (with modern ETL-tools). Other table types for Data Vault include link tables, satellite tables and reference tables. Link tables describe the relations of the business keys - the hubs - and satellite tables contain the attribute information and history (Picture 3.). Reference tables contain reference information such as code sets.
With Data Vault data warehouses have generally gone in the Inmon way, where data warehouse layer is built with Data Vault and on top of that the datamarts (or infomarts) are built, where the data will be used from by reporting systems etc. At the same time ELT (extract, load, transform) has come from ETL (extract, transform, load). The difference in these abbreviations is that in ETL the data is extracted first, transformed into a usable format and loaded to the data warehouse after that. In ELT, however, the transformation of the data will be delayed as much as possible when data is brought from the source system to data warehouse and the transformation will be made when taking forth from data warehouse, for example in Data Vault’s case to the Business Data Vault or to infomarts.
Building A New Generation Data Warehouse
Data warehouse projects made with traditional methods have a bad reputation for a reason. Usually they drag on too long and sometimes development projects never end up in production, and if they do the data warehouse is not used enough. Luckily with Data Vault and ELT automation, the evolution of data warehouse is finally at the point that agility, scalability and advanced automation can be talked about data warehouses.
Data Vault works perfectly for data warehouse modeling and transferring data to data warehouse is a lot simpler than with data warehouses designed with traditional methods. Often most of the work with data warehouse and is ELT -work, because almost everything has to be done by hand. Of course, in the case of Data Vault, it has already been possible - at least theoretically - to reduce this work due to the repetitive nature of ELT -work. Mechanical work can be automated with modern ELT -tools, when the ELT -work portion of the overall workload will drop several tens of percentages, not to mention maintenance and development costs.
At the same time project methods have developed and nowadays agile methods are the cornerstones of a successful project. The agile method chosen as such is quite irrelevant as long as it is agile. In practice, small enough implementation sets will be worked on and if some are found to be faulty, they will be canceled and fixed (Picture 4). “ Fail quickly” can be a good rule for the implementation phase.
Why design and implement data warehouses with a new generation method? If one thing has to be chosen, it is the lead times and their speed. Here is a practical development example of a data warehouse implementation with a similar architecture:
The customer's existing modern data warehouse lacked the data needed for reporting. The data existed at the source, so obtaining data from the reporting interface required data search from the source system (an existing interface), export to the Data Vault and from there to the datamart.
Work steps briefly:
1. Adding new fields to the interface of the source system
2. Field mapping to the existing logical model
3. Automated generating of staging area and Data Vault on the basis of existing conversion rules of Data Vault
4. Adding necessary changes to datamart tables and loads.
5. Testing and loading of changed ELT -loads so that the data can be seen by reporting
This took a total of two hours of work - with all of the mistakes and their fixes made - after which all of the data was usable by the reporting tools.
In practice all of the changes and additions to the data warehouse are done basically in the same way as in the example above. The architecture is beneficial to technical employees (manual work is minimised and the need for testing decreases) as well as management and other utilisers of the data (the data generated by the organisation's operational systems is made faster to be used). This will inevitably lead to savings in the long run.