Normalization in Depth

DataValley Team
March 17, 2021
5:11 pm

Designing and understanding a data model is all about understanding the concepts and the options you have in your use case and what is the best use case for each design option you have, in this article we will go through the normalization types and understand how to implement each option and pros and cons of each type.

What we will cover

What is normalization
Some anomalies of manipulations
What is 1NF
What is 2NF
What is 3NF
What is Boyce-Codd
What is 4NF
Pros. and Cons. of normalization

What is Normalization?

Let’s start by defining what is normalization, Normalization is a process of making analysis, modifications , and enhancements for our relational schema using its primary keys and Foreign Keys to achieve the following:

Reduce redundant groups and make redundancy few as possible.
Make Data modification more efficient with no problems or errors (anomalies).

We can consider normalization as purification or filtering of our schema.

Primary Key is a single column value used to uniquely identify a specific record in the database. -row unique identifier

Foreign Key points to the primary key of another table Thus, we can connect the tables with each other.

What is Anomalies?

Anomalies are the problems that occur in the update, delete and insert operations in poorly designed or unnormalized data when all data stored in one table (Flat file).

Example of Anomalies:

courseNo	Tutor	Lab	lab_size
300	Ahmed	C3	150
301	John	A1	210
302	Kamal	C3	150

insert anomaly What if we built a new lab(e.g. lab A4), but it’s not assigned to any courses or Tutors yet so we won’t be able to insert it to our table because (lab) comes with (course) and (Tutor) because it doesn’t have separated table.

Delete anomaly What if we need to delete (courseNo:300) that means we will delete the details of (lab:C3) also and that’s not what we want.

Update anomaly What if we improved (Lab:C3) and now it (size:250), to update it we will have to update all other columns where (Lab:C3)

In Simple words we can say anomaly is when you have one table which has multiple related informations, and you cannot do any kind of operation on a single information.

The normalization process takes our relational schema throw a series or pipeline of tests to make sure that’s it satisfy a certain normal form, this process proceeds in a top-down manner by evaluating our relational schema against the criteria of normal forms.

Functional dependency: when we say column1 is identified by column2 (column 2 is a primary key), then column1 functionally dependent on column2, can be represented graphically as (column2 —> column1).
example: (customerID) —> {name,salary}

name and salary are identified by customerID and functionally dependent on it.
custIomerID determines name and salary.

1. First Normal Form (1NF)

First Normal Form is the first step towards a full normalization of your data, to apply 1NF you shouldn’t have the following in your data:

Multi-valued attributes
Nested relations.
Composite attributes.

Finally, the attribute must include only atomic values (each cell is single-valued; there are not repeating groups or arrays), let’s now check each type of attributes to understand how to handle it

1.1 Multi-valued attributes

Multi-valued attributes is when you have attribute that has a set of multiple values for a specific entity. For example, phone number, employee can has 2 or more phone numbers.

Example:

This table has department and all of its locations in one cell

dept_name	dept_no	dept_location
Administration	3	{cairo , alexandria , aswan}
Financial	2	{cairo}

if we look at (dept_location) where (dept_no =3), we will found that there are 3 locations at one cell (row) which violates 1NF, to achieve 1NF we have to normalize it.

Solution:

we separate this group record into many records to have only one value.

dept_name	dept_no	dept_location
Administration	3	{cairo}
Administration	3	{alexandria}
Administration	3	{aswan}
Financial	2	{cairo}

Then we will combine (dept_location) with the primary key of the table (dept_no) to the new table and remove it from the old one to overcome redundant records and uniquely identify rows, now we have:

departments(dept_name,dept_no)
dept_location(dept_no,dept_location)

now we have two tables at 1NF.

departments

dept_name	dept_no
Administration	3
Financial	2

dept_location

dept_no	dept_location
3	{cairo}
3	{alexandria}
3	{aswan}
2	{cairo}

1.2 Nested Relations

This table contains employees and their projects and their hours.(emp_proj)

emp_ID	emp_name	proj_no	hours
22	ahmed	1	20.0
22	ahmed	2	23.0
22	ahmed	3	24.0
11	mohamed	3	28.0
13	khaled	2	11.0
13	khaled	4	22.0

you can find that there’s relation between (proj_no) and (hours), that (hours) depends on (proj_no) and this violates 1NF

(proj_no , emp_ID) -> hours

To solve this issue:

Move this relation (proj_no, emp_no -> hours) to the new table with the primary key of the old table (emp_no) as a foreign key to connect the new table to our old table, then remove it from our old table and the result of that will be two tables in 1NF.

emp_proj1(emp_id,emp_name)
emp_proj2(emp_id,proj_no,hours)

emp_proj1

emp_ID	emp_name
22	ahmed
11	mohamed
13	khaled

emp_proj2

emp_ID	proj_no	hours
22	1	20.0
22	2	23.0
22	3	24.0
11	3	28.0
13	2	11.0
13	4	22.0

1.3 Composite attribute

This table contains employees information.(employees)

emp_no	location	dept_no	manager_no
11	cairo ,11511	2	2
24	suez ,43511	1	7
45	giza ,12511	5	4

like we see in (location) it’s a composite attribute consists of (city and zip code) which violates 1NF, so to normalize it we have to divide the column into sub-parts.

emp_no	city	zip	dept_no	manager_no
11	cairo	11511	2	5
24	suez	43511	1	7
45	giza	12511	5	4

Composite key : It’s primary key, but it consists of more than one column to uniquely identify a specific record in the table. ex : We have two people with the same name mohamed ali , but they live in different locations. Hence, to uniquely identify the record, we can use (name) and (location) column as the Composite Key

2. Second Normal Form (2NF)

our schema to be in 2NF :

It should be in 1NF.
It shouldn’t have partial dependencies. Each non-prime attribute is full functionally dependent on the whole primary key (all prime attributes).

Prime attribute : member of the primary key OR one of attributes that is considered as primary key of table and uniquely identify rows.
if we have composite key, then we have more than one prime attribute.

Partial dependencies : when only one of the prime attributes determines another attribute with no exist of other prime attributes in this relation. OR when not all non-prime attributes depend on all prime attributes.
Ex : {x,y} are our prime attributes , z is non-prime attribute and x->z not {x,y}->z ,so it’s partial dependency because z is functionally dependent on only one prime attribute not all prime attributes.

This table contains employees and his projects information. (Emp_Proj)

emp_no	proj_no	hours	first_name	last_name	proj_name
22	2	20.0	ahmed	ali	projectXY
10	5	23.0	mohmaed	ahmed	projectCY
34	2	21.0	mohamed	tarek	projectXY

From our understanding of table:

(emp_no) and (proj_no) are prime attributes of our table.
{emp_no , proj_no} -> Hours , both (emp_no) & (proj_no) used to identify (hours).
emp_no -> {fname,lname}, (emp_no) only used to identify (fname) and (lname).
- (partial dependency exists)
proj_no ->{proj_name}, (proj_no) only used to identify (proj_name).
- (partial dependency exists)

To normalize this table to 2NF we have to take any partial dependency out so:

we will move (relations) that cause partial dependency to new table.
- (hours) will stay with (proj_no) and (emp_no) because it’s full functionally dependent on all prime attributes.
  - emp_proj1(emp_no , proj_no , hours)
- (first_name) and (last_name) will be combined with (emp_no).
  - emp_proj2(emp_no,fname,lname)
- (proj_name) will be combined with (proj_no).
  - emp_proj3(proj_no,proj_name)

emp_proj1

emp_no	proj_no	hours
22	2	20.0
10	5	23.0
34	2	21.0

emp_proj2

emp_no	first_name	last_name
22	ahmed	ali
10	mohmaed	ahmed
34	mohamed	tarek

emp_proj3

proj_no	proj_name
2	projectXY
5	projectCY
2	projectXY

3. Third Normal Form

our schema to be at 3NF

It should be in 2NF.
It shouldn’t have transitive functional dependencies.

Transitive functional dependencies : there’s non-prime attribute functionally dependent on another non-prime attribute OR It means that changing a value in one column leads to a change in another column -columns other than prime attributes.
EX : we say that a->c is transitive dependency if it generated from a->b & b->c not a->c directly.

In transitive dependency non-prime attribute determines another non-prime attribute, In partial dependency when only one of the prime attributes determines another attribute with no exist of other prime attributes ( because of that we called it partial dependency).

This table contains employees and their departments information. (emp_dept)

emp_no	first_name	last_name	emp_age	dept_no	dept_name
13	Ahmed	Ali	42	3	financial
16	Mohamed	Kamal	31	1	administration
21	Ali	Sherif	36	5	IT

From our understanding of table:

(emp_no) is our primary key.
emp_no -> {first_name, last_name, emp_age, dept_no}
dept_no -> dept_name
- (Transitive dependency exists).

To solve this issue:

We move the relation that caused transitive dependency to new table (dept_info) and remove it from our table (emp_dept).
- the result will be 2 tables at 3NF.
  - emp_dept( emp_no , emp_name , emp_age , dept_no )
  - dept_info (dept_no , dept_name) dept_no become primary key of new table.

emp_dept

emp_no	first_name	last_name	emp_age	dept_no
13	Ahmed	Ali	42	3
16	Mohamed	Kamal	31	1
21	Ali	Sherif	36	5

dept_info

dept_no	dept_name
3	financial
1	administration
5	IT

4. Boyce-Codd Normal Form (BCNF)

Our schema to be in BCNF:

It should be in 3NF.
Any attribute in table depends only on super-Key.
- a->z means (a) is super-key of (z) (even (z) is a prime attribute)

Some notes:

If our table contains only one prime key, 3NF and BCNF are equivalent.
BCNF is a special case of 3NF, and it also called 3.5NF.

Prime attributes are super keys, the opposite isn’t true.

This table contains employees and their projects and managers’ numbers. (emp_proj)

The manager of an employee could be the manager of the project that the employee worked on.

emp_no	proj_no	manager_no
11	1	2
24	4	7
45	2	4
15	4	7

From our understanding of table:

(emp_no) and (proj_no) is our primary key (prime attributes).
{emp_no , proj_no} -> manager_no no partial or transitive dependency.
(manager_no) can determine proj_no so we can infer that manager_no -> proj_no
- here we have an issue that our table to be in BCNF because (manager_no) isn’t superkey (isn’t used to uniquly identify our rows) which violates BCNF.

To solve this issue:

we will move this relation which violates BCNF to new table (proj_managers).
result will be two tables in BCNF.
- emp_proj(emp_no , proj_no).
- proj_mangers(proj_no,manager_no).

emp_proj

emp_no	proj_no
11	1
24	4
45	2
15	4

proj_managers

proj_no	manager_no
1	2
4	7
2	4

5. Fourth Normal Form (4NF)

our schema to be at 4NF.

It should be in BCNF.
It shouldn’t have multi-valued dependencies.

Multi-valued dependencies : It usually a relationship consisting of 3 attributes (A, B, C). single value from (A) gives more than one value in (B), single value of (A) gives more than one value in (C), and (B) , (C) are independent of each other.
Ex : (emp_no , proj_no, dependents)
( employee ) do have many ( projects ), ( employee ) do have many ( dependents ) like his children and it’s obviously projects and his dependents are independent of each other, which means if we need to remove one of his projects we don’t have to delete one of his dependent. so we have here multi-valued dependency which violates 4NF.

This table contains employees and their projects and dependents. (Employee)

emp_no	proj_no	dependent
3	1	ahmed
3	1	mohamed
3	1	omar
3	2	ahmed
3	2	mohamed
3	2	omar

From our understanding of table:

emp_no is the primary key of table and emp_no -> { proj_no , dependent }
we have multi-valued dependency because of:
- one record of (emp_no) can have many values for the (proj_on).
- one record of (emp_no) can have many many values for (dependent).
- projects and dependents are independents on each other.

To solve this multi-valued dependency:

Each attribute of attributes that caused multi-valued dependency will be combined with the primary key (emp_no) in new table and result will be 2 tables in 4NF.
- emp_projects (emp_no , proj_no)
- emp_dependents (emp_no , dependent)

emp_projects

emp_no	proj_no
3	1
3	2

emp_dependents

emp_no	dependent
3	ahmed
3	mohamed
3	omar

Why do we need to normalize our tables?

When (A.C.I.D compliant) is required

It improves integrity and consistency of your data. (ACID = Atomicity.Consistency.Isolation.Durability)

Fewer storage needed

Since we eliminated repeated groups, and divided our tables, we reduced the size of our tables and database.

less logical I/O cost

When you need to retrieve data, you will retrieve smaller amount of data, and when you need to add or insert in tables, it will be easier, and more organized.

Queries become easier

If we have unnormalized table that has (location) attribute {City, Zip} as composite attribute, and we need to count the unique zip codes in our table, so we will access first location, then we will try to get zip, after normalize this table we will be able to access zip directly because location will be divided to two attributes (city) and (zip).

Write-intensive databases

Normalization increases the performance of write-intensive databases Significantly, because it reduces data modification anomalies, which make it easier to manipulate your database.

Drawbacks of Normalization

When we need to work with read-intensive databases, you may need to join data from multiple tables, and work with a huge amount of data. In normalized databases, you will need many join operations to combine data from multiple tables to satisfy user needs, which will increase time-consuming, and make itdifficult to work with a huge amount of data, so if you need to work with the read-intensive database it’s obviously normalization won’t be your optimal solution.

Normalization in Depth

What we will cover

What is Normalization?

What is Anomalies?

1. First Normal Form (1NF)

1.1 Multi-valued attributes

1.2 Nested Relations

1.3 Composite attribute

2. Second Normal Form (2NF)

3. Third Normal Form

4. Boyce-Codd Normal Form (BCNF)

5. Fourth Normal Form (4NF)

Why do we need to normalize our tables?

When (A.C.I.D compliant) is required

Fewer storage needed

less logical I/O cost

Queries become easier

Write-intensive databases

Drawbacks of Normalization

Unlimited access to educational materials for subscribers

Resources

Information

Social Media

We Accept