Designing and understanding a data model is all about understanding the concepts and the options you have in your use case and what is the best use case for each design option you have, in this article we will go through the normalization types and understand how to implement each option and pros and cons of each type.
What we will cover
- What is normalization
- Some anomalies of manipulations
- What is 1NF
- What is 2NF
- What is 3NF
- What is Boyce-Codd
- What is 4NF
- Pros. and Cons. of normalization
What is Normalization?
Let’s start by defining what is normalization, Normalization is a process of making analysis, modifications , and enhancements for our relational schema using its primary keys and Foreign Keys to achieve the following:
- Reduce redundant groups and make redundancy few as possible.
- Make Data modification more efficient with no problems or errors (anomalies).
We can consider normalization as purification or filtering of our schema.
Primary Key is a single column value used to uniquely identify a specific record in the database. -row unique identifier
Foreign Key points to the primary key of another table Thus, we can connect the tables with each other.
What is Anomalies?
Anomalies are the problems that occur in the update, delete and insert operations in poorly designed or unnormalized data when all data stored in one table (Flat file).
Example of Anomalies:
courseNo | Tutor | Lab | lab_size |
300 | Ahmed | C3 | 150 |
301 | John | A1 | 210 |
302 | Kamal | C3 | 150 |
insert anomaly What if we built a new lab(e.g. lab A4), but it’s not assigned to any courses or Tutors yet so we won’t be able to insert it to our table because (lab) comes with (course) and (Tutor) because it doesn’t have separated table.
Delete anomaly What if we need to delete (courseNo:300) that means we will delete the details of (lab:C3) also and that’s not what we want.
Update anomaly What if we improved (Lab:C3) and now it (size:250), to update it we will have to update all other columns where (Lab:C3)
In Simple words we can say anomaly is when you have one table which has multiple related informations, and you cannot do any kind of operation on a single information.
The normalization process takes our relational schema throw a series or pipeline of tests to make sure that’s it satisfy a certain normal form, this process proceeds in a top-down manner by evaluating our relational schema against the criteria of normal forms.
Functional dependency: when we say column1 is identified by column2 (column 2 is a primary key), then column1 functionally dependent on column2, can be represented graphically as (column2 —> column1).
example: (customerID) —> {name,salary}
- name and salary are identified by customerID and functionally dependent on it.
- custIomerID determines name and salary.
1. First Normal Form (1NF)
First Normal Form is the first step towards a full normalization of your data, to apply 1NF you shouldn’t have the following in your data:
- Multi-valued attributes
- Nested relations.
- Composite attributes.
Finally, the attribute must include only atomic values (each cell is single-valued; there are not repeating groups or arrays), let’s now check each type of attributes to understand how to handle it
1.1 Multi-valued attributes
Multi-valued attributes is when you have attribute that has a set of multiple values for a specific entity. For example, phone number, employee can has 2 or more phone numbers.
Example:
This table has department and all of its locations in one cell
dept_name | dept_no | dept_location |
Administration | 3 | {cairo , alexandria , aswan} |
Financial | 2 | {cairo} |
if we look at (dept_location) where (dept_no =3), we will found that there are 3 locations at one cell (row) which violates 1NF, to achieve 1NF we have to normalize it.
Solution:
we separate this group record into many records to have only one value.
dept_name | dept_no | dept_location |
Administration | 3 | {cairo} |
Administration | 3 | {alexandria} |
Administration | 3 | {aswan} |
Financial | 2 | {cairo} |
Then we will combine (dept_location) with the primary key of the table (dept_no) to the new table and remove it from the old one to overcome redundant records and uniquely identify rows, now we have:
- departments(dept_name,dept_no)
- dept_location(dept_no,dept_location)
now we have two tables at 1NF.
departments
dept_name | dept_no |
Administration | 3 |
Financial | 2 |
dept_location
dept_no | dept_location |
3 | {cairo} |
3 | {alexandria} |
3 | {aswan} |
2 | {cairo} |
1.2 Nested Relations
This table contains employees and their projects and their hours.(emp_proj)
emp_ID | emp_name | proj_no | hours |
22 | ahmed | 1 | 20.0 |
22 | ahmed | 2 | 23.0 |
22 | ahmed | 3 | 24.0 |
11 | mohamed | 3 | 28.0 |
13 | khaled | 2 | 11.0 |
13 | khaled | 4 | 22.0 |
you can find that there’s relation between (proj_no) and (hours), that (hours) depends on (proj_no) and this violates 1NF
- (proj_no , emp_ID) -> hours
To solve this issue:
Move this relation (proj_no, emp_no -> hours) to the new table with the primary key of the old table (emp_no) as a foreign key to connect the new table to our old table, then remove it from our old table and the result of that will be two tables in 1NF.
- emp_proj1(emp_id,emp_name)
- emp_proj2(emp_id,proj_no,hours)
emp_proj1
emp_ID | emp_name |
22 | ahmed |
11 | mohamed |
13 | khaled |
emp_proj2
emp_ID | proj_no | hours |
22 | 1 | 20.0 |
22 | 2 | 23.0 |
22 | 3 | 24.0 |
11 | 3 | 28.0 |
13 | 2 | 11.0 |
13 | 4 | 22.0 |
1.3 Composite attribute
This table contains employees information.(employees)
emp_no | location | dept_no | manager_no |
11 | cairo ,11511 | 2 | 2 |
24 | suez ,43511 | 1 | 7 |
45 | giza ,12511 | 5 | 4 |
like we see in (location) it’s a composite attribute consists of (city and zip code) which violates 1NF, so to normalize it we have to divide the column into sub-parts.
emp_no | city | zip | dept_no | manager_no |
11 | cairo | 11511 | 2 | 5 |
24 | suez | 43511 | 1 | 7 |
45 | giza | 12511 | 5 | 4 |
Composite key : It’s primary key, but it consists of more than one column to uniquely identify a specific record in the table. ex : We have two people with the same name mohamed ali , but they live in different locations. Hence, to uniquely identify the record, we can use (name) and (location) column as the Composite Key
2. Second Normal Form (2NF)
our schema to be in 2NF :
- It should be in 1NF.
- It shouldn’t have partial dependencies. Each non-prime attribute is full functionally dependent on the whole primary key (all prime attributes).
Prime attribute : member of the primary key OR one of attributes that is considered as primary key of table and uniquely identify rows.
if we have composite key, then we have more than one prime attribute.
Partial dependencies : when only one of the prime attributes determines another attribute with no exist of other prime attributes in this relation. OR when not all non-prime attributes depend on all prime attributes.
Ex : {x,y} are our prime attributes , z is non-prime attribute and x->z not {x,y}->z ,so it’s partial dependency because z is functionally dependent on only one prime attribute not all prime attributes.
This table contains employees and his projects information. (Emp_Proj)
emp_no | proj_no | hours | first_name | last_name | proj_name |
22 | 2 | 20.0 | ahmed | ali | projectXY |
10 | 5 | 23.0 | mohmaed | ahmed | projectCY |
34 | 2 | 21.0 | mohamed | tarek | projectXY |
From our understanding of table:
- (emp_no) and (proj_no) are prime attributes of our table.
- {emp_no , proj_no} -> Hours , both (emp_no) & (proj_no) used to identify (hours).
- emp_no -> {fname,lname}, (emp_no) only used to identify (fname) and (lname).
- (partial dependency exists)
- proj_no ->{proj_name}, (proj_no) only used to identify (proj_name).
- (partial dependency exists)
To normalize this table to 2NF we have to take any partial dependency out so:
- we will move (relations) that cause partial dependency to new table.
- (hours) will stay with (proj_no) and (emp_no) because it’s full functionally dependent on all prime attributes.
- emp_proj1(emp_no , proj_no , hours)
- (first_name) and (last_name) will be combined with (emp_no).
- emp_proj2(emp_no,fname,lname)
- (proj_name) will be combined with (proj_no).
- emp_proj3(proj_no,proj_name)
- (hours) will stay with (proj_no) and (emp_no) because it’s full functionally dependent on all prime attributes.
emp_proj1
emp_no | proj_no | hours |
22 | 2 | 20.0 |
10 | 5 | 23.0 |
34 | 2 | 21.0 |
emp_proj2
emp_no | first_name | last_name |
22 | ahmed | ali |
10 | mohmaed | ahmed |
34 | mohamed | tarek |
emp_proj3
proj_no | proj_name |
2 | projectXY |
5 | projectCY |
2 | projectXY |
3. Third Normal Form
our schema to be at 3NF
- It should be in 2NF.
- It shouldn’t have transitive functional dependencies.
Transitive functional dependencies : there’s non-prime attribute functionally dependent on another non-prime attribute OR It means that changing a value in one column leads to a change in another column -columns other than prime attributes.
EX : we say that a->c is transitive dependency if it generated from a->b & b->c not a->c directly.
In transitive dependency non-prime attribute determines another non-prime attribute, In partial dependency when only one of the prime attributes determines another attribute with no exist of other prime attributes ( because of that we called it partial dependency).
This table contains employees and their departments information. (emp_dept)
emp_no | first_name | last_name | emp_age | dept_no | dept_name |
13 | Ahmed | Ali | 42 | 3 | financial |
16 | Mohamed | Kamal | 31 | 1 | administration |
21 | Ali | Sherif | 36 | 5 | IT |
From our understanding of table:
- (emp_no) is our primary key.
- emp_no -> {first_name, last_name, emp_age, dept_no}
- dept_no -> dept_name
- (Transitive dependency exists).
To solve this issue:
- We move the relation that caused transitive dependency to new table (dept_info) and remove it from our table (emp_dept).
- the result will be 2 tables at 3NF.
- emp_dept( emp_no , emp_name , emp_age , dept_no )
- dept_info (dept_no , dept_name) dept_no become primary key of new table.
- the result will be 2 tables at 3NF.
emp_dept
emp_no | first_name | last_name | emp_age | dept_no |
13 | Ahmed | Ali | 42 | 3 |
16 | Mohamed | Kamal | 31 | 1 |
21 | Ali | Sherif | 36 | 5 |
dept_info
dept_no | dept_name |
3 | financial |
1 | administration |
5 | IT |
4. Boyce-Codd Normal Form (BCNF)
Our schema to be in BCNF:
- It should be in 3NF.
- Any attribute in table depends only on super-Key.
- a->z means (a) is super-key of (z) (even (z) is a prime attribute)
Some notes:
- If our table contains only one prime key, 3NF and BCNF are equivalent.
- BCNF is a special case of 3NF, and it also called 3.5NF.
Prime attributes are super keys, the opposite isn’t true.
This table contains employees and their projects and managers’ numbers. (emp_proj)
- The manager of an employee could be the manager of the project that the employee worked on.
emp_no | proj_no | manager_no |
11 | 1 | 2 |
24 | 4 | 7 |
45 | 2 | 4 |
15 | 4 | 7 |
From our understanding of table:
- (emp_no) and (proj_no) is our primary key (prime attributes).
- {emp_no , proj_no} -> manager_no no partial or transitive dependency.
- (manager_no) can determine proj_no so we can infer that manager_no -> proj_no
- here we have an issue that our table to be in BCNF because (manager_no) isn’t superkey (isn’t used to uniquly identify our rows) which violates BCNF.
To solve this issue:
- we will move this relation which violates BCNF to new table (proj_managers).
- result will be two tables in BCNF.
- emp_proj(emp_no , proj_no).
- proj_mangers(proj_no,manager_no).
emp_proj
emp_no | proj_no |
11 | 1 |
24 | 4 |
45 | 2 |
15 | 4 |
proj_managers
proj_no | manager_no |
1 | 2 |
4 | 7 |
2 | 4 |
5. Fourth Normal Form (4NF)
our schema to be at 4NF.
- It should be in BCNF.
- It shouldn’t have multi-valued dependencies.
Multi-valued dependencies : It usually a relationship consisting of 3 attributes (A, B, C). single value from (A) gives more than one value in (B), single value of (A) gives more than one value in (C), and (B) , (C) are independent of each other.
Ex : (emp_no , proj_no, dependents)
( employee ) do have many ( projects ), ( employee ) do have many ( dependents ) like his children and it’s obviously projects and his dependents are independent of each other, which means if we need to remove one of his projects we don’t have to delete one of his dependent. so we have here multi-valued dependency which violates 4NF.
This table contains employees and their projects and dependents. (Employee)
emp_no | proj_no | dependent |
3 | 1 | ahmed |
3 | 1 | mohamed |
3 | 1 | omar |
3 | 2 | ahmed |
3 | 2 | mohamed |
3 | 2 | omar |
From our understanding of table:
- emp_no is the primary key of table and emp_no -> { proj_no , dependent }
- we have multi-valued dependency because of:
- one record of (emp_no) can have many values for the (proj_on).
- one record of (emp_no) can have many many values for (dependent).
- projects and dependents are independents on each other.
To solve this multi-valued dependency:
- Each attribute of attributes that caused multi-valued dependency will be combined with the primary key (emp_no) in new table and result will be 2 tables in 4NF.
- emp_projects (emp_no , proj_no)
- emp_dependents (emp_no , dependent)
emp_projects
emp_no | proj_no |
3 | 1 |
3 | 2 |
emp_dependents
emp_no | dependent |
3 | ahmed |
3 | mohamed |
3 | omar |
Why do we need to normalize our tables?
When (A.C.I.D compliant) is required
It improves integrity and consistency of your data. (ACID = Atomicity.Consistency.Isolation.Durability)
Fewer storage needed
Since we eliminated repeated groups, and divided our tables, we reduced the size of our tables and database.
less logical I/O cost
When you need to retrieve data, you will retrieve smaller amount of data, and when you need to add or insert in tables, it will be easier, and more organized.
Queries become easier
If we have unnormalized table that has (location) attribute {City, Zip} as composite attribute, and we need to count the unique zip codes in our table, so we will access first location, then we will try to get zip, after normalize this table we will be able to access zip directly because location will be divided to two attributes (city) and (zip).
Write-intensive databases
Normalization increases the performance of write-intensive databases Significantly, because it reduces data modification anomalies, which make it easier to manipulate your database.
Drawbacks of Normalization
When we need to work with read-intensive databases, you may need to join data from multiple tables, and work with a huge amount of data. In normalized databases, you will need many join operations to combine data from multiple tables to satisfy user needs, which will increase time-consuming, and make itdifficult to work with a huge amount of data, so if you need to work with the read-intensive database it’s obviously normalization won’t be your optimal solution.