Implementation of Next Gen Data Warehouse
Päivitetty: 12. kesä 2019
Building a data warehouse should be started with conceptual modeling because it will create a durable and long lasting base for data warehouse. If the build is based on organization-wide common business concepts, data from different source systems can be integrated. Thus the utilisers of the data warehouse will get additional value. When conceptual modeling is done, designing the data warehouse can begin. Many important things need to be taken account in the design, such as data integration and historization, building in manageable portions and scalability of the data warehouse. All of these requirements are met in one method: Data Vault 2.0 (DV2), where modeling is based on a set of rules. Because of its set of rules, the design can be automated with dedicated tools.
One good tool for automatization is WhereScape, which enables the design in a model driven approach. In this case, a logical model is generated from the conceptual model based on conversion rules entered into the tool. The desired data mapping is added to the logical model from the source system interfaces, after which they are categorized with DV2 -categories. After categorizing, the design and implementation of the DV2 data warehouse can basically be left to WhereScape software.
In the following chapters, the three-layer architecture of a modern data warehouse, namely the conceptual model, the logical model, and automation of the staging area and the data vault (physical layer consists of staging are and DV) will be further elaborated. These will be done in a chronological order according to the implementation.
In conceptual modeling of a data warehouse, concepts - which have data behind them - are being modeled. The goal behind the modeling is to find common organization-wide concepts. The conceptual model is primarily a communication tool between different parties in a data warehouse project, as it shows what is supposed to be brought to the data warehouse or what already is there in a one glance.
In a conceptual model (Picture 1) concepts can be divided into different categories. Masterdata is a register type data, such as a customer, a concept which is common to the whole organization. Contract data is data, which has a clear expiration date, such as customership. Event data contains, for example, a receipt linked to a customer. If event data contains two-level information, for example receipt-row of a receipt, it can be divided into two different concept categories: header - detail. Lastly reference data which is usually codes, information that rarely changes.
When a concept is brought to a conceptual model, it will be attached to others with relations. Relations are usually from one to many, for example customer - receipt, where a customer can have several receipts, but a single receipt is only for one customer.
In a conceptual model an organization wide business key information can be linked to a concept. If the concepts have been modeled to represent the business, the change of a source system will not require substantial changes in the data warehouse. Thus business keys are the most important components of the data warehouse.
When concepts and their relations have been modeled, a logical model of the data warehouse can be derived from the conceptual model. With WhereScape -software a logical model can be directly converted from the conceptual model using the conversion rules (Picture 2). After the logical model is done, a connection will be made to the data in the source systems. Corresponding key fields from the tables in the source system interface will be mapped to all of the automatically generated key fields of the logical model.
In a logical model we need to know what DV2-category each field of the logical model corresponds to. Key fields for the concepts generated in the conversion will be categorised during the conversion on the basis of the conversion rules. Also, for the categorising of the attributes added from the source system, a new set of conversion rules can be made, which can then be run after the import of the attributes. Most common and essential DV2-categories are: business key, unit of work, reference attribute and low (or other) velocity attribute. After the categorising the logical model is finished, so that the staging area can be formed.
A staging area will be generated from the logical model using conversion rules (Picture 3), where one of the concepts of the logical model corresponds a table in the staging area. Fields in the table are formed from fields defined in the logical model and fields essential to DV2-modeling, such as hash fields and timestamps. There are two types of hash fields: those, which are formed from business keys (_hk), and those which are used in historization (dss_change_hash -fields). For codes, or reference data, fore mentioned fields will not be made. Reference tables can be left in the staging area on the modeling level.
Staging area can be already created into the database at this point. WhereScape -software will do it automatically, generate table creation phrases and run them in the database. In addition to the creation phrase, the software also creates loading scripts and thus there is no need for manual ETL/ELT -work.
Data Vault 2.0
If the staging area is created correctly, we can go forth on forming a DV2-model (Picture 4.). When forming the model, a hub table and a satellite table are generated from one of the tables of the staging area, if attributes have been set to the business key. When concepts - or two business keys - have relations, a link table is generated. A link table can also show the relations of several different keys.
From the logical model of picture 2, the tables for DV2-model are generated as follows:
From customer -masterdata: hub_customer and sat_customer are created
From customership contract data: hub_customership and sat_customership are created
From the receipt event: hub_receipt and sat_receipt are created
From the row of a receipt event: hub_rowofreceipt and sat_rowofreceipt are created
From a customer-customer relation: link_customership is created
From customer-receipt relation: link_receipt is created
From receipt - row of a receipt relation: link_rowofreceipt is created
Tables to the DV2-model are not created from the City and Store -reference data. The tables are already created in the staging area model
When DV2-model is created, its tables and loading scripts are created automatically, similarly as the staging area. After this the data warehouse has been modeled and implemented regarding to all of the concepts of the conceptual model.
Example case of work load of implementation
A new important data entity was requested to be added to the data warehouse, which was implemented with the new generation architecture. This new entity included 4 contractual, 2 event, 1 masterdata and a dozen of reference data concepts. The implementation could begin when the source system, MS Azure datalake store, contained all of the necessary data and it was loaded to the load table area. The area was used in the implementation in mapping the data from the source system to the logical model. Before this the logical model was generated from the concept model.
Mapping the logical model took the longest, because it is the only phase where manual work cannot be avoided. Someone still has to manually input what the context behind every field is. Still the total work time was a lot less than in traditional data warehouse projects. The work time for each portion was as follows:
Creating the conceptual model to WhereScape and generation of the logical model: 2h
Data search from Azure Datalake store to the platform: 3h
Mapping the data to the logical model: 15h
Generating the staging area and DV2-area: 1h
Transfer to the physical implementation (automated creation of the database structures and loading scripts: 1h
ELT/ETL -work: 0h
Comparing to traditional architecture, for example 3NF-modeling and Informatica-ETL-tool, is unfair, because automatization of the new generation architecture takes care of the mechanical work and does it in mere seconds. Something that use to take weeks to get the same data entities to data warehouses.
If the data in source systems is integrated and necessary information can be recognized, the new generation data warehouse architecture guarantees that the implementation of the data warehouse can be done with agility and incrementally on the basis of conceptual models. In addition, automatization will reduce costs significantly by minimizing the manual work load.