Dimensional Modeling … Design Methodology for Analytics Oriented Data Warehouse | Data Warehouse
Data warehouses has been around since the 80s. Throughout these years, it has proven its capabilities to support decision making and business analysis. Data warehouses allow Integrating many source systems such as databases, spreadsheets, and flat files. Cleansing and Transformation can be applied to these data after integration then organizes it in a way that can serve the reporting layer.
How did Data Warehouse affect decision making?
Data warehouses acts as a single point of truth for the end users who seek to take decisions based on precise information. The huge amount of data that can give precise answers to this question cannot fit in a regular database, Instead Data warehouses gave the users the chance to analysis and query data without affecting the operational systems. It can also keep the complete history and retrieve in a timely manner as it does not require real time validation. Looking at the history and use it to predict the trend in the future is now possible with data warehouses.
Before building a data warehouse that can achieve its goal effectively, you need to understand the design process of it which is called dimensional modeling. The Dimensional Model consists of Fact table and dimensions.
Together in This article, we will go through
- What is dimensional modeling?
- The 4 steps of designing a dimensional model.
- Differences between star schema and snowflake model.
What is Dimensional Modeling?
Dimensional modeling is a design process for the Data warehouse that is different from the standard way of designing a relational model for the operational databases. The goal of a dimensional model is to optimize the database for faster data retrieval and allow the users to query the data to answers analytical questions and support decision making.
The four steps of dimensional modeling
As stated by Ralph Kimball the father of Data warehouse, the 4 steps of designing a dimensional model are as following
- Select the Business Process
- Declare the Grain [Level of Details]
- Identify the Dimensions
- Identify the Facts [Measures]
Select the Business Process
Understanding the business need is the first step that is required to implement a successful data warehouse solution. In this step, you will select the business processes that will be included in the data warehouse using the bus matrix to include all the processes to achieve the target of the Data Warehouse.
Bus matrix templates list the required business processes and the involved organization departments
An example of a business process is Analysis Passengers Experience in Airline Trips.
Declare the Grain
The main question here is, What does the row in the fact table represents?
The Grain identifies the level of details in the data warehouse. It is advised to start declaring the grains in its most atomic grain. Because if only summarized data is available then you can never provide answers for users who need to dig deeper into the details of the data or aggregate it in different ways.
Declaring the grain is very critical in designing a dimensional model because it defines the representation of each row in the data warehouse. Declaring a wrong grain – Mixed Granularity – reflects in the next steps as it directly affects in the design of the fact and dimensions tables, It can also cause inconsistency that will reflect on the BI Reports and will not be sufficient to fulfill business requests
Example: The Individual Transactions made by each passenger.
In this example, the fact table is a transactional Fact table where a row will exist if only a transaction is made by a passenger.
For farther clarification, Incorrect grain would be something like this:
Assume you have defined your grain as following: The Accumulative Transactions made by Passengers in a Year.
Then after designing the Schema, the end-user asks you to know the transactions made by passengers weekly. You will not be able to answer this question because you have previously declared your grain in non-atomic format and aggregated it per year. This means you have ignored a piece of information that was needed by the business users later.
Note: We can have multiple grains in different fact tables, for each grain we can have a different fact.
In this step, we define the entities of interest in our model and represent them as tables that contain descriptive information for our business needs. Dimension tables are usually small in size and reusable in case there is a change that needs to be applied, only this table will be affected.
Asking questions like who? where? when? and what? can help to identify the dimensions tables. It is common to include hierarchy in your dimension tables when needed. Also, it is highly recommended that you use surrogate keys in each dimension table.
In our example, in the case of passenger dimensions could be passenger age, gender, address. Other dimensions could be destination and source continent, country, city,..etc.
Identify Facts (Measures)
Facts are the business process measures which are usually numerical values that are important to business users. In other words, Facts are the KPIs that the business is interested to learn about. It is important to ensure that all the measures are of the same grain.
In our example, measures could be Avg Flying Hours, Avg Waiting Time, Total No. Of Reserved trips, Total No. of Complaints.
Dimensional modeling Schema Design
When it comes to the level of details inside each dimension and how to connect facts and dimensions together, here we are talking about schema design of dimensional modeling, and in this area, we have two main option Start Schema and Snowflake Schema, and you can choose which to use based on the following comparison
|Key Difference||Star Schema||Snowflake|
|Schema||Denormalized Schema Consists of Fact table surrounded by multiple dimensions.||Normalized Schema Consists of fact table and multiple dimensions that are divided into hierarchies.|
|Usage||Used when dimensions tables are small and all the information fits in it without causing redundancy.||Used when dimension tables are large and suffer from space issue and data redundancy.|
|Number of Join||Few Joins||More Joins|
|Complexity||Low level of Complexity, A single join is required between fact table and any dimension.||High Level of Complexity, Data is spread between multiple dimension which required more joins using Foreign keys.|
|Performance||High performance because number of joins is limited||Slower Performance due to high number of joins and complexity of dimensions hierarchy|