Denormalization: when, why, and how?
- March 25, 2021
- Category: Data Engineering, Data Modeling
What is de-normalization?
De-normalization is an optimization technique to make our database respond faster to queries by reducing the number of joins needed to satisfy user needs.
- In de-normalization, we mainly aim to reduce the number of tables needed by pre-joining tables together and adding redundant data.
- De-normalization is commonly used with read-intensive systems such as data warehouses.
Read-intensive = low number of updates + high number of read queries.
De-normalization doesn’t mean that we won’t normalize our tables. As we said before, it’s an optimization technique applied after normalizing our tables to make them faster in some cases.
For example, if we have a customers table and an orders table, the orders table has only customer_id (as a foreign key) that references the customers table; it doesn’t have customer_name.
When we need to retrieve a list of all orders with the customer name, we will have to join these tables together.
This design has real advantages: if we need to update customer_name, we update it in one place, and there are no data-manipulation anomalies.
But what if our tables are very large? Joining them will be costly, and we will spend a significant amount of time on the join.
- In this case, we will go for de-normalization. It’s okay to have some redundant data when, in return, you get fewer joins and more efficient queries.
Solution: we will use customer_id (the foreign key) to add customer_name to the orders table.
- We will describe de-normalization techniques in detail below.
So briefly, de-normalization is used to:
- Reduce the number of joins and the need for them.
- Reduce the number of needed tables.
- Reduce the number of foreign keys in your database.
Before doing de-normalization, you have to make sure of two things:
– The performance of the normalized system doesn’t satisfy the user.
– De-normalization is the right solution for this performance issue.
To de-normalize our normalized tables, we will follow the methods discussed below.
- Adding redundant columns.
- Combining tables.
- Adding derived columns.
- Partitioning relations:
- Horizontal partitioning.
- Vertical partitioning.
- Materialized views.
2.1 Adding Redundant Columns
We apply it by adding commonly used join columns to the referencing table, reducing or eliminating join operations.
As we saw before, consider the customers table and the orders table.
If we frequently need to generate a report that shows all orders with the customer name, we will have to join these tables together.
SELECT C.CUSTOMER_NAME, O.ORDER_NAME FROM CUSTOMERS C JOIN ORDERS O ON C.CUSTOMER_ID = O.CUSTOMER_ID;
To de-normalize this table, we will add a redundant customer_name column to orders, which will improve performance since we won’t need the join anymore.
And the customers table will remain the same.
SELECT O.CUSTOMER_NAME, O.ORDER_NAME FROM ORDERS O;
Drawbacks of this method:
- Maintenance will be costly and update overhead will increase, because updates must be made in two tables: customers(name) and orders(customer_name).
- More storage will be needed because customer_name is duplicated now.
- Increase in Table Size.
- If one value of a customer changes, you have to update all the records of that customer in the referencing table, which is a very costly operation for big tables (such as fact tables).
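The technique above can be sketched with SQLite; the customers/orders schema below is illustrative, not from any real database, and in practice the copy would be kept in sync by application code or triggers:

```python
import sqlite3

# A minimal sketch of the "redundant column" technique.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, customer_name TEXT);
CREATE TABLE orders (
    order_id      INTEGER PRIMARY KEY,
    order_name    TEXT,
    customer_id   INTEGER REFERENCES customers(customer_id),
    customer_name TEXT  -- redundant copy to avoid the join
);
INSERT INTO customers VALUES (1, 'Alice');
INSERT INTO orders VALUES (10, 'Books', 1, 'Alice');
""")

# The report no longer needs a join:
rows = con.execute("SELECT customer_name, order_name FROM orders").fetchall()
print(rows)  # [('Alice', 'Books')]

# The cost: every rename must now touch both tables to stay consistent.
con.execute("UPDATE customers SET customer_name = 'Alicia' WHERE customer_id = 1")
con.execute("UPDATE orders    SET customer_name = 'Alicia' WHERE customer_id = 1")
```

The pair of UPDATE statements at the end is exactly the maintenance overhead the drawbacks list warns about.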
2.2 Combining Tables
We apply it by combining tables that are frequently joined together into one table, which eliminates the join and can increase performance significantly. For example, consider the employees table and the addresses table.
If we frequently need to generate a report that shows all employees with their full addresses, we will have to join these tables together.
SELECT E.*, A.CITY || ', ' || A.STATE || ', ' || A.DISTRICT || ', ' || A.ZIP AS FULL_ADDRESS FROM EMPLOYEES E JOIN ADDRESSES A ON A.LOCATION_ID = E.LOCATION_ID;
To de-normalize this situation, we will combine the employees table and the addresses table into one table, and combine the address attributes into a single attribute (full_address) to make querying easier. This will increase performance significantly and eliminate the costly join.
SELECT * FROM EMPLOYEES;
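A sketch of this combine step in SQLite (the schema and data are illustrative; `CREATE TABLE ... AS SELECT` materializes the join once, at build time):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE addresses (location_id INTEGER PRIMARY KEY,
                        city TEXT, state TEXT, zip TEXT);
CREATE TABLE employees (emp_id INTEGER PRIMARY KEY, name TEXT,
                        location_id INTEGER REFERENCES addresses(location_id));
INSERT INTO addresses VALUES (1, 'Cairo', 'C', '11511');
INSERT INTO employees VALUES (100, 'Sara', 1);

-- One-time combine: the join runs once, and readers never join again.
CREATE TABLE employees_denorm AS
SELECT e.emp_id, e.name,
       a.city || ', ' || a.state || ', ' || a.zip AS full_address
FROM employees e JOIN addresses a ON a.location_id = e.location_id;
""")
print(con.execute("SELECT * FROM employees_denorm").fetchall())
# [(100, 'Sara', 'Cairo, C, 11511')]
```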
2.3 Adding Derived Column
Adding derived columns to our table can help us eliminate joins and improve the performance of aggregating our data.
A derived column is an attribute whose value is derived from another attribute. Example: using a date_of_birth attribute to generate an age attribute.
For example, consider the students table and the grades table.
If we need to generate a report containing students and their grades, we will join the students table with the grades table, comparing marks against grade boundaries to determine each student’s grade.
SELECT S.NAME, S.MARKS, G.GRADE FROM STUDENTS S JOIN GRADES G ON S.MARKS BETWEEN G.MIN_MARK AND G.MAX_MARK;
To de-normalize this situation, we will compare the marks in the students table against predefined values to generate grades, with no need to join grades with students.
- This solution uses attributes already in the table to generate results that would otherwise require joins, improving query performance.
CASE WHEN is a SQL expression used to generate derived columns under specific conditions using the available columns; think of it as an if/else for SQL.
CASE WHEN comparison_condition THEN result1
WHEN comparison_condition THEN result2
END AS derived_column_name
SELECT NAME, MARKS, CASE WHEN MARKS >= 85 AND MARKS <= 100 THEN 'A' WHEN MARKS >= 75 AND MARKS < 85 THEN 'B' WHEN MARKS >= 65 AND MARKS < 75 THEN 'C' WHEN MARKS >= 50 AND MARKS < 65 THEN 'D' ELSE 'F' END AS GRADE FROM STUDENTS;
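Running the derived-column query end to end in SQLite (sample students and grade boundaries are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE students (name TEXT, marks INTEGER)")
con.executemany("INSERT INTO students VALUES (?, ?)",
                [("Ali", 90), ("Mona", 70), ("Omar", 40)])

# The grade is derived from marks on the fly -- no join with a grades table.
rows = con.execute("""
    SELECT name, marks,
           CASE WHEN marks >= 85 THEN 'A'
                WHEN marks >= 75 THEN 'B'
                WHEN marks >= 65 THEN 'C'
                WHEN marks >= 50 THEN 'D'
                ELSE 'F' END AS grade
    FROM students
""").fetchall()
print(rows)  # [('Ali', 90, 'A'), ('Mona', 70, 'C'), ('Omar', 40, 'F')]
```

Note that ordering the WHEN branches from highest boundary down lets each branch drop the upper-bound check, since CASE stops at the first matching condition.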
2.4 Partitioning Relations
In this approach, we won’t combine tables; instead we decompose them into multiple smaller, more manageable tables. This decreases the amount of data we have to read, which can improve the performance of operations in a meaningful way.
Two main types:
- Horizontal partitioning
- Vertical partitioning.
2.4.1 Horizontal Partitioning (Row Splitting)
Split the main table’s rows into smaller partitions (tables) that all have the same columns.
For example, consider the employees table. We can split it into multiple tables with the same attributes, partitioned by department: one partition for each department’s employees. This makes querying employees by department more efficient.
This approach aims to make the WHERE clause more efficient by having it search a smaller amount of data: the filter touches only the subset of the table related to the query, not the whole table, reducing I/O overhead.
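A toy sketch of row splitting in SQLite (real systems would use the database engine's native partitioning; the per-department tables and data here are purely illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE employees (emp_id INTEGER, name TEXT, dept TEXT)")
con.executemany("INSERT INTO employees VALUES (?, ?, ?)",
                [(1, "Ali", "sales"), (2, "Mona", "hr"), (3, "Omar", "sales")])

# Build one partition per department; every partition keeps the same columns.
# (dept values come from a trusted list here, so inlining them is safe.)
for dept in ("sales", "hr"):
    con.execute(f"CREATE TABLE employees_{dept} AS "
                f"SELECT * FROM employees WHERE dept = '{dept}'")

# A department query now scans only the relevant partition.
print(con.execute("SELECT name FROM employees_sales").fetchall())
```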
2.4.2 Vertical Partitioning (Column Splitting)
In vertical partitioning, we distribute a table’s attributes across multiple partitions, duplicating the primary key in each partition to make reconstructing the original table easier. We partition the table into frequently used attributes and rarely used attributes.
For example, suppose column1 is accessed frequently while column2 is rarely accessed. We apply vertical partitioning by splitting the table into two tables: one with the primary key and column1 for frequent access, and another with the primary key and column2 for rare access.
We use this approach when some columns are accessed much more frequently than others, to reduce row size and retrieve only the required attributes.
If you have requirements that need the data combined, you will have to join the tables back together, which can cause performance problems.
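A sketch of column splitting in SQLite (the products table is illustrative; note that `CREATE TABLE ... AS SELECT` does not carry over the PRIMARY KEY constraint, so a production version would declare the partitions explicitly):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE products (product_id INTEGER PRIMARY KEY,
                       name TEXT, long_description TEXT);
INSERT INTO products VALUES (1, 'Lamp', 'A very long, rarely read description');

-- Hot partition: the frequently accessed columns.
CREATE TABLE products_hot  AS SELECT product_id, name FROM products;
-- Cold partition: the rarely accessed column, with the key duplicated.
CREATE TABLE products_cold AS SELECT product_id, long_description FROM products;
""")
print(con.execute("SELECT * FROM products_hot").fetchall())  # [(1, 'Lamp')]

# Reconstructing the full row requires a join on the shared key:
full = con.execute("""
    SELECT h.product_id, h.name, c.long_description
    FROM products_hot h JOIN products_cold c USING (product_id)
""").fetchall()
```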
2.5 Materialized Views
Materialized views can improve performance and decrease time spent significantly, by precomputing the result of costly queries (such as joins and aggregations) and storing it on disk as a view for future use.
A materialized view is all about running once and reading many times.
Let us take the previous example of customers and orders, where a user needs to generate a report of all customers with their orders.
You can run the join query once and store the result in a materialized view. This solves the performance issue: when the user wants to generate the report again, the result comes back from the view, with no need to recompute it every time.
CREATE MATERIALIZED VIEW CUSTOMER_ORDERS AS SELECT C.CUSTOMER_NAME, O.ORDER_NAME FROM CUSTOMERS C JOIN ORDERS O ON C.CUSTOMER_ID = O.CUSTOMER_ID;
SELECT * FROM CUSTOMER_ORDERS;
So now we have the query that generates the customer-orders report stored as a materialized view. When we need the report, we select directly from the view, with no need to run the join query every time.
When you need a query frequently, you can store its result as a materialized view. In the future, you retrieve the result directly from the view stored on disk, so you don’t need to recompute the query again.
Drawbacks of this method:
- Data is computed once; to see new data, you have to refresh the view by re-running the query.
- If a source table is unavailable, the view cannot be refreshed.
- Data is replicated, which needs more storage.
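SQLite has no native materialized views, so the sketch below simulates one with a plain table that is rebuilt on demand; the `refresh_customer_orders` helper is a name invented for this illustration, and engines like PostgreSQL provide `REFRESH MATERIALIZED VIEW` for the same purpose:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, customer_name TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, order_name TEXT,
                     customer_id INTEGER REFERENCES customers(customer_id));
INSERT INTO customers VALUES (1, 'Alice');
INSERT INTO orders VALUES (10, 'Books', 1);
""")

def refresh_customer_orders(con):
    # Re-running the costly join is the "refresh" step; until then,
    # readers get the stored result -- run once, read many times.
    con.executescript("""
    DROP TABLE IF EXISTS customer_orders;
    CREATE TABLE customer_orders AS
    SELECT c.customer_name, o.order_name
    FROM customers c JOIN orders o ON c.customer_id = o.customer_id;
    """)

refresh_customer_orders(con)
print(con.execute("SELECT * FROM customer_orders").fetchall())
# [('Alice', 'Books')]
```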
Drawbacks of Denormalization
- De-normalization can slow down updates because of the extra update overhead.
- De-normalization can increase your table and database size.
- De-normalization can, in some cases, make querying more complex instead of easier.
- More storage will be needed for duplicated data.
De-normalization and data warehouses
De-normalization is well established and commonly used with data warehouses. A data warehouse is a data repository specially created for decision making; it holds a large repository of historical data related to the organization. The typical data warehouse is a subject-oriented corporate database involving multiple data models implemented on multiple platforms and architectures.
There are some aspects to consider when building data warehouses:
– Extracting data from several sources that may be homogeneous or heterogeneous.
– Transforming data for compatibility in the data warehouse.
– Cleaning the data to ensure validity, checked against the source database from which the data were taken.
– Monitoring and controlling the data warehouse while it is loading data.
– Updating the data periodically.
De-normalization can increase retrieval speed and optimize query performance for data warehouses, with the drawback of update anomalies. This is not a big problem, because a data warehouse is not typically used for updates; it is mostly used for read operations that serve analysis and decision making, as we said. Hence, a data warehouse is a great candidate for de-normalization, because its attributes are rarely updated.
Use Case: Star Schema
Dimensional model: an alternative to the entity-relationship (ER) model that carries the same information, but combines the data in a simplified form that makes the model easier to understand, efficient to query, and flexible to change. (The dimensional model is the modeling approach of data warehouses.)
The star schema is the simplest and easiest dimensional model for data warehouses, which makes it the most suitable schema for query processing. It is highly de-normalized; its drawback is that it needs a large amount of space to store data.
Star schema consists of a fact table with a single table for each dimension.
- A fact table in a pure star schema consists of multiple foreign keys, each paired with the primary key of a dimension, together with the facts containing the measurements.
- The fact table is typically normalized.
- Dimension tables are not joined to each other.
- They are joined through the fact table, which holds a foreign key for each dimension.
- Dimension tables are typically heavily de-normalized.
For our use case, We will consider the Sakila database, and we will de-normalize it to star schema.
The Sakila database is a normalized schema for a DVD rental store, created by MySQL.
– For the full description of Sakila -> link
The 3NF schema Sakila
Let us consider a query that sums movie rental revenue, once before de-normalizing the database to a star schema and once after, to see the performance difference between the two cases.
To get each film title with its total sales amount, we join the film and payment tables through the rental and inventory tables: payment -> rental -> inventory -> film.
- For better understanding look at our 3NF schema above.
SELECT F.TITLE, SUM(P.AMOUNT) AS REVENUE FROM PAYMENT P JOIN RENTAL R ON ( P.RENTAL_ID = R.RENTAL_ID ) JOIN INVENTORY I ON ( R.INVENTORY_ID = I.INVENTORY_ID ) JOIN FILM F ON ( I.FILM_ID = F.FILM_ID ) GROUP BY F.TITLE;
As we can see, we need 3 join operations, which increases query processing time and hurts performance. Now let us look at the modified query after de-normalizing our database to a star schema.
We need to build a dimensional model (star schema) from this 3NF Sakila schema, to help decision-makers at Sakila analyze sales data and aggregate it easily, with low query processing time and high performance. We will derive our dimensions by answering these questions:
- Who? customers.
- Where? store.
- What? film.
- When? date.
Let’s start creating our star schema:
- We will create the customer dimension using the customer, address, city, country and rental tables.
- We will create the store dimension using the store, staff, address, city, country and rental tables.
- We will create the movie dimension using the film, rental and language tables.
- We will create the date dimension using payment_date from the payment table.
Then we will create our fact table, which will contain a foreign key for each of our dimensions, plus sales_amount.
- factsales(saleskey, customerKey, storekey, moviekey, datekey, sales_amount)
Our Star Schema
Let us consider the movie rental revenue query again. Now, if we need to get the film title with the sum of the sales amount for each film, we will join dimmovie with factsales directly: dimmovie -> factsales.
SELECT M.TITLE, SUM(S.SALES_AMOUNT) AS REVENUE FROM FACTSALES S JOIN DIMMOVIE M ON ( S.MOVIEKEY = M.MOVIEKEY ) GROUP BY M.TITLE;
As we can see, we need only 1 join operation, which performs much better than the normalized query (3 joins): results come back faster with less query processing time.
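A tiny runnable version of this star-schema join in SQLite (the single movie and the sales amounts are made-up sample data, not from the real Sakila set):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dimmovie  (moviekey INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE factsales (saleskey INTEGER PRIMARY KEY,
                        moviekey INTEGER REFERENCES dimmovie(moviekey),
                        sales_amount REAL);
INSERT INTO dimmovie  VALUES (1, 'ACADEMY DINOSAUR');
INSERT INTO factsales VALUES (1, 1, 1.0), (2, 1, 2.5);
""")

# One join from the fact table to the movie dimension replaces the
# four-table chain needed in the 3NF schema.
rows = con.execute("""
    SELECT m.title, SUM(s.sales_amount) AS revenue
    FROM factsales s JOIN dimmovie m ON s.moviekey = m.moviekey
    GROUP BY m.title
""").fetchall()
print(rows)  # [('ACADEMY DINOSAUR', 3.5)]
```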
De-normalization adds redundancy to your database for better performance. It is a great optimization technique that helps you decrease query processing time, with the drawbacks of reducing the integrity of your system, slowing data-manipulation operations, and needing more space to store redundant data.