Dimensional Modeling |Part 1: Introduction and Fact Types
Dimensional modeling is one of the data modeling techniques used for designing the data warehouses, It also considered a suitable technique for representing analytic data, because it understandably delivers data for users and is optimized for query performance which increases the data retrieval speed.
Normalized databases are very useful in transactional processing because any data manipulation operation hits the database in one place due to table normalization. But the normalized form is complicated for BI queries and data warehousing (read-intensive), due to normalization the query performance isn’t optimal because when we need to retrieve data from multiple tables to satisfy user need we have to write many join queries which will overwhelm the performance and retrieval time. Fortunately, Dimensional modeling addresses this problem and solves it with its de-normalized format.
Dimensional model is created using the normalized system information itself, and it’s all about de-normalizing the 3NF system to improve query performance and increase end-user understandability.
Benefits of Dimensional modeling
- Maximize understandability to the end-user.
- The de-normalized model improves query performance.
- Easily handle new data.
There are two key components of dimensional modeling: Fact and Dimensions tables.
So now We can say that dimensional modeling is a special data model that can be considered as an alternative to a 3NF model, a Model that consisting of ER information itself, but combines the data in an abbreviated form that makes the model more understandable to perform the queries efficiently, and has the flexibility to change.
The fact table is the main table in our dimensional model which Located at the center of our dimensional schema and surrounded by dimensions. Fact table stores the measurements resulting from an organization’s business process events.
The term fact represents a business measure.
Each row in a fact table corresponds to a measurement event. Ex: Every row in our fact table could be a product sold in a transactional process. Facts Defined by their grain or a specific level of detail.
Grain: Is the level of detail or the depth of the information that’s stored in the data warehouse.
Grain answer this type of questions:
- Do we store all products or specific categories?
- We will use data from week or month or year?
- We will hold sales per day or per product or per store?
Only facts consistent with the declared grain are allowed.
All the rows in the same fact table must be at the same grain, Different grains must not be mixed in the same fact table, For example, we cannot have the same Fact table that has measures about sales and about use complaints, these are two different grains cannot co-exist in the same Fact table.
Dimension tables are the Complementary factors of the fact table. Dimension tables stored the context related to the business process measurement. It answers the questions (who, what, where, when, how, and why) to cover all aspects of the business process.
Example: When the user needs to see all sold products from a specific store, We should have this store as a dimension attribute in our dimensional model. Another Example: If the user needs to calculate the total profit from some products in a specific year, we should have this year and products as dimension attributes.
Dimension table defined by a single primary key which ensures the referential integrity with the fact table to which it is joined.
All fact tables have foreign keys that refers to the dimension tables primary keys to easily connect them to produce specific measure.
Each dimension is an entry point to the fact table, which gives us great handling of the complexity of queries.
If our business process is calculating the sales amount of each product in each store on a daily basis. So we have two aspects of the process:
- Measures: Quantity Value, Value-added tax, and Total order value.
- Business process details.
- What was the product? (product name)
- When did the transaction happen? (Date)
- Where did the product been sold? (Shop)
- Who did the people that create the transaction? (Customer) and (Employee)
According to the above details, we can say that:
- The measurements or facts (Quantity Value, VAT, and Total order value).
- The Dimensions that answered the questions about the business process (Date, Employee, Customer, Shop, Product).
In this example, Fact grain set at a single selling transaction.
The Dimensional modeling for this business process.
Types of Fact Tables
There are three main types of fact tables:
- Periodic Snapshot
- Accumulated Snapshot
Transaction Fact Table
The transaction fact table is the most basic and the simplest type, In the Transaction fact table, every row corresponding to a measurement event at a point in space and time which means a grain set at a single transaction. The row is added only if there’s a transaction is happened by a customer or for a product.
Row in Transaction fact table represents transaction and it dimensions.
Example: If we have a retail store, on Sunday we sold 40 items, and on Monday we sold 15 items, So on Sunday there are 40 transaction rows had been added to our fact table and on Monday there are 15 transaction rows had been added to our table, There are no aggregate values, we are storing the transactions data.
An example of a transaction fact table, the user wants all selling transactions that are happening in the stores, Our dimensions will be the date or time of transaction -> (when), store -> (where), product -> (what), and customers -> (who).
Every row in our transaction fact table is corresponding to a transaction, Let us see the first row in the above table Transactional_Sales_Fact, The row reflects transaction that happened by customer 157 when he bought product 59 from store 3 at time 13 and the item cost was 33$ (the fact), So our row covers all aspects of the single transaction. And like we said before every foreign key in our fact table refers to one of the dimension tables.
Because of low granularity, we can monitor detailed business processes. But because we have rows for each transaction that happened, It causes performance issues due to the large size of data.
Periodic Snapshot Fact Table
In the periodic table, We have lower granularity which means that the row in the periodic table is a summarization of data over a period of time (day, week, month, etc). Our grain here is periodic data summarization, not single transactions. It helps to review the aggregate performance of the business process at intervals of time.
Example: If we want to know the quantity that been sold from specific product through the last week. Our grain is a week.
Rows in Periodic table represent performance of an activity at the end of specific period.
An example of Periodic Snapshot, If the user asks for the total quantity of each product that has been sold from our stores in month grain. It’s obvious that the dimensions are items(what), stores(where), periods of time(months, grain, when).
Now every row in our Periodic fact table is corresponding to summarized measurements of product, Let us see the first row in the above table QuantityFact, The row reflects the Total quantity that been sold of product 5 from store 12 at time 4.
Because of summarized data, our performance now is better than the transaction fact table. But now we have higher grain, So we loss the detailed business processes that we had in the transaction fact table.
Accumulated Snapshot Fact Table
In Accumulating snapshot fact table, Row represents an Entire process, which means that our row corresponding to measurements that occurred at defined steps between the beginning and end of the process. we use it when users need to perform pipeline and workflow analysis like Order fulfillment.
Row in Accumulating snapshot represents an Entire process
There is a date foreign key in the fact table for each critical activity in the process.
An example of Accumulating snapshot fact table
Order fulfillment covers the complete process from when a sale process takes place all the way through delivery to the customer. So here we have multiple activities in the process, First when we receive an order from a customer, then send the order to the inventory to organize it, then move the order to the shipment activity, and finally, the customer receives his order.
In this case, we have many activities each of them has a predictable start date and end date, So row in our accumulating snapshot fact table, corresponding for instance to a line on an order, Row is inserted when the order line is created. As workflow progress occurs, the accumulating fact table row is revisited and updated. This means we first create our row with empty attributes, then when the activity is finished we update the date in the fact table and repeat till the pipeline or our order line is finished successfully.
In the table rows below, We will use real values for dateKey instead of IDs (keys) to make it more clear and easy to be understood, but in the real-life you should use the IDs that refer to the Date Dimension of each date of them, and get the values through it ( Like we did in the above examples).
After placement of the order
After organizing the order to be shipped
After shipping the order
After the order is received by the customer
If we look at the process progress we will see that we have a dateKey for each activity that is set to NULL till the activity is achieved, This is the explanation of what we said above “there is a date foreign key in the fact table for each critical activity in the process”. And we have the statue to keep track of the process progress.
Accumulating Snapshot fact table helps us in complex analysis, workflow, and pipeline process, on another hand It needs high ETL process complexity.
|Row||Transaction and It dimensions.||summarized data over a period of time.||Entire process activities.|
|Granularity||Lowest granularity 1 row / Transaction||Higher than transaction.1 row / period||Highest granularity 1 Row / Entire process|
|Table Size||Largest||Smaller than Transaction||Smallest|
|Example||Sales amount of products on a daily basis.||Total sales amount of product through May.||Order fulfillment|
Factless Fact Table
Factless facts are fact tables that haven’t any measures, It only has a foreign key for each dimension. We can say that Factless fact is only an intersection of Dimensions Or A bridge between dimension keys.
This fact table shows the attendance of students, There are no measurements here, Only the foreign keys that refer to dimensions.
Types of Measures
There are 4 main types of Measures in Dimensional modeling:
Let us demonstrate every type of them.
In the examples that we would mention below for each type, We would use real values for each dimension instead of IDs (keys) that refer to the dimension table to make it more clear and easy to be understood, But in the real life you should use the IDs (Keys) that refer to the Dimension tables, and get the value of it attributes through it (Like we did in the above examples EXCEPT accumulating snapshot).
Additive measures are the most flexible and useful measures because they can be summed across any of the dimensions associated with the fact table. If we have 4 Dimensions associated with the fact table, Additive measures can be summed through 4 of them.
Example Our fact table represents the sales amount of each product in each store on a daily basis. Our dimensions (Store, Date, Product)
Here, Sales_amount measure can be summed through all dimensions (Store, Product, Date), We can get sales_amount per specific store, date, and product. So Sales_amount is an example of the additive fact.
The semi-additive measure can be summed across some dimensions associated with the fact table, but not all. If we have 4 Dimensions associated with the fact table, Semi-Additive measure can be summed through 1,2,3 of them but not all of them.
Example Our fact table is the same as the above but we add measure about the quantity of product in our inventory.
Here, Items_Inventory measure can be summed through product, But Items_Inventory can’t be summed through date. So Items_Inventory is an example of the semi-additive measure.
Non-additive measures cannot be summed up for any of the dimensions associated with the fact table. If we have 4 Dimensions associated with the fact table, Non-Additive fact cannot be summed through 4 of them.
Example Our fact table is the same as the above.
Here, unit_price measure cannot be summed through any dimension (Store, Product, Date). So unit_price is an example of Non-additive fact.