Data warehouses have been around since the 80s. Throughout these years, it has proven its capabilities to support decision-making and business analysis. Data warehouses allow the Integrating of many source systems such as databases, spreadsheets, and flat files. Cleansing and Transformation can be applied to these data after integration and then organizes in a way that can serve the reporting layer.
How did Data Warehouse affect decision-making?
Data warehouses act as a single point of truth for the end users who seek to take decisions based on precise information. The huge amount of data that can give precise answers to this question cannot fit in a regular database, Instead Data warehouses gave users the chance to analyze and query data without affecting the operational systems. It can also keep the complete history and retrieve it in a timely manner as it does not require real-time validation. Looking at the history and using it to predict the trend in the future is now possible with data warehouses.
Before building a data warehouse that can achieve its goal effectively, you need to understand its design process of it which is called dimensional modeling. The Dimensional Model consists of a Fact table and dimensions.
Together in This article, we will go through
- What is dimensional modeling?
- The 4 steps of designing a dimensional model.
- Differences between star schema and snowflake model.
What is Dimensional Modeling?
Dimensional modeling is a design process for the Data warehouse that is different from the standard way of designing a relational model for the operational databases. The goal of a dimensional model is to optimize the database for faster data retrieval and allow the users to query the data to answer analytical questions and support decision-making.
The four steps of dimensional modeling
As stated by Ralph Kimball the father of Data warehouse, the 4 steps of designing a dimensional model are as follows
- Select the Business Process
- Declare the Grain [Level of Details]
- Identify the Dimensions
- Identify the Facts [Measures]
Select the Business Process
Understanding the business need is the first step that is required to implement a successful data warehouse solution. In this step, you will select the business processes that will be included in the data warehouse using the bus matrix to include all the processes to achieve the target of the Data Warehouse.
Bus matrix templates list the required business processes and the involved organization departments
An example of a business process is the Analysis of Passengers’ Experience on Airline Trips.
Declare the Grain
The main question here is, What does the row in the fact table represent?
The Grain identifies the level of detail in the data warehouse. It is advised to start declaring the grains in their most atomic grain. Because if only summarized data is available then you can never provide answers for users who need to dig deeper into the details of the data or aggregate it in different ways.
Declaring the grain is very critical in designing a dimensional model because it defines the representation of each row in the data warehouse. Declaring a wrong grain – Mixed Granularity – reflects in the next steps as it directly affects the design of the fact and dimensions tables, It can also cause inconsistency that will reflect on the BI Reports and will not be sufficient to fulfill business requests
Example: The Individual Transactions made by each passenger.
In this example, the fact table is a transactional Fact table where a row will exist if only a transaction is made by a passenger.
For further clarification, Incorrect grain would be something like this:
Assume you have defined your grain as follows: The Accumulative Transactions made by Passengers in a Year.
Then after designing the Schema, the end-user asks you to know the transactions made by passengers weekly. You will not be able to answer this question because you have previously declared your grain in non-atomic format and aggregated it per year. This means you have ignored a piece of information that was needed by the business users later.
Note: We can have multiple grains in different fact tables, for each grain we can have a different fact.
In this step, we define the entities of interest in our model and represent them as tables that contain descriptive information for our business needs. Dimension tables are usually small in size and reusable in case there is a change that needs to be applied, only this table will be affected.
Asking questions like who? where? when? and what? can help to identify the dimensions tables. It is common to include hierarchy in your dimension tables when needed. Also, it is highly recommended that you use surrogate keys in each dimension table.
In our example, in the case of passenger dimensions could be passenger age, gender, and address. Other dimensions could be destination and source continent, country, city,..etc.
Identify Facts (Measures)
Facts are the business process measures which are usually numerical values that are important to business users. In other words, Facts are the KPIs that the business is interested to learn about. It is important to ensure that all the measures are of the same grain.
In our example, measures could be Avg Flying Hours, Avg Waiting Time, Total No. Of Reserved trips, Total No. of Complaints.
Dimensional modeling Schema Design
When it comes to the level of detail inside each dimension and how to connect facts and dimensions together, here we are talking about schema design of dimensional modeling, and in this area, we have two main options Start Schema and Snowflake Schema, and you can choose which to use based on the following comparison
|Key Difference||Star Schema||Snowflake|
|Schema||Denormalized Schema Consists of a Fact table surrounded by multiple dimensions.||Normalized Schema Consists of a fact table and multiple dimensions that are divided into hierarchies.|
|Usage||Used when dimensions tables are small and all the information fits in it without causing redundancy.||Used when dimension tables are large and suffer from space issues and data redundancy.|
|Number of Join||Few Joins||More Joins|
|Complexity||Low level of Complexity, A single join is required between the fact table and any dimension.||High Level of Complexity, Data is spread between multiple dimensions which required more joins using Foreign keys.|
|Performance||High performance because the number of joins is limited||Slower Performance due to the high number of joins and complexity of the dimensions hierarchy|