loader image

Normalization in Depth

Designing and understanding a data model is all about understanding the concepts and the options you have in your use case and what is the best use case for each design option you have, in this article we will go through the normalization types and understand how to implement each option and pros and cons of each type.

What we will cover

  1. What is normalization
  2. Some anomalies of manipulations
  3. What is 1NF
  4. What is 2NF
  5. What is 3NF
  6. What is Boyce-Codd
  7. What is 4NF
  8. Pros. and Cons. of normalization

What is Normalization?

Let’s start by defining what is normalization, Normalization is a process of making analysis, modifications , and enhancements for our relational schema using its primary keys and Foreign Keys to achieve the following:

  1.  Reduce redundant groups and make redundancy few as possible.
  2.  Make Data modification more efficient with no problems or errors (anomalies).

We can consider normalization as purification or filtering of our schema.

Primary Key is a single column value used to uniquely identify a specific record in the database. -row unique identifier

Foreign Key points to the primary key of another table Thus, we can connect the tables with each other.

What is Anomalies?

Anomalies are the problems that occur in the update, delete and insert operations in poorly designed or unnormalized data when all data stored in one table (Flat file).

Example of Anomalies:

courseNoTutorLablab_size
300AhmedC3150
301JohnA1210
302KamalC3150

insert anomaly What if we built a new lab(e.g. lab A4), but it’s not assigned to any courses or Tutors yet so we won’t be able to insert it to our table because (lab) comes with (course) and (Tutor) because it doesn’t have separated table.

Delete anomaly What if we need to delete (courseNo:300) that means we will delete the details of (lab:C3) also and that’s not what we want.

Update anomaly What if we improved (Lab:C3) and now it (size:250), to update it we will have to update all other columns where (Lab:C3)

In Simple words we can say anomaly is when you have one table which has multiple related informations, and you cannot do any kind of operation on a single information.

The normalization process takes our relational schema throw a series or pipeline of tests to make sure that’s it satisfy a certain normal form, this process proceeds in a top-down manner by evaluating our relational schema against the criteria of normal forms.

Functional dependency: when we say column1 is identified by column2 (column 2 is a primary key), then column1 functionally dependent on column2, can be represented graphically as (column2 —> column1).
example: (customerID) —> {name,salary}

  • name and salary are identified by customerID and functionally dependent on it.
  • custIomerID determines name and salary.

1. First Normal Form (1NF)

First Normal Form is the first step towards a full normalization of your data, to apply 1NF you shouldn’t have the following in your data:

  1. Multi-valued attributes
  2. Nested relations.
  3. Composite attributes.

Finally, the attribute must include only atomic values (each cell is single-valued; there are not repeating groups or arrays), let’s now check each type of attributes to understand how to handle it

1.1 Multi-valued attributes

Multi-valued attributes is when you have attribute that has a set of multiple values for a specific entity. For example, phone number, employee can has 2 or more phone numbers.

Example:

This table has department and all of its locations in one cell

dept_namedept_nodept_location
Administration3{cairo , alexandria , aswan}
Financial2{cairo}

if we look at (dept_location) where (dept_no =3), we will found that there are 3 locations at one cell (row) which violates 1NF, to achieve 1NF we have to normalize it.

Solution:

we separate this group record into many records to have only one value.

dept_namedept_nodept_location
Administration3{cairo}
Administration3{alexandria}
Administration3{aswan}
Financial2{cairo}

Then we will combine (dept_location) with the primary key of the table (dept_no) to the new table and remove it from the old one to overcome redundant records and uniquely identify rows, now we have:

  • departments(dept_name,dept_no)
  • dept_location(dept_no,dept_location)

now we have two tables at 1NF.

departments

dept_namedept_no
Administration3
Financial2

dept_location

dept_nodept_location
3{cairo}
3{alexandria}
3{aswan}
2{cairo}

1.2 Nested Relations

This table contains employees and their projects and their hours.(emp_proj)

emp_IDemp_nameproj_nohours
22ahmed120.0
22ahmed223.0
22ahmed324.0
11mohamed328.0
13khaled211.0
13khaled422.0

you can find that there’s relation between (proj_no) and (hours), that (hours) depends on (proj_no) and this violates 1NF

  • (proj_no , emp_ID) -> hours

To solve this issue:

Move this relation (proj_no, emp_no -> hours) to the new table with the primary key of the old table (emp_no) as a foreign key to connect the new table to our old table, then remove it from our old table and the result of that will be two tables in 1NF.

  • emp_proj1(emp_id,emp_name)
  • emp_proj2(emp_id,proj_no,hours)

emp_proj1

emp_IDemp_name
22ahmed
11mohamed
13khaled

emp_proj2

emp_IDproj_nohours
22120.0
22223.0
22324.0
11328.0
13211.0
13422.0

1.3 Composite attribute

This table contains employees information.(employees)

emp_nolocationdept_nomanager_no
11cairo ,1151122
24suez ,4351117
45giza ,1251154

like we see in (location) it’s a composite attribute consists of (city and zip code) which violates 1NF, so to normalize it we have to divide the column into sub-parts.

emp_nocityzipdept_nomanager_no
11cairo1151125
24suez4351117
45giza1251154

Composite key : It’s primary key, but it consists of more than one column to uniquely identify a specific record in the table. ex : We have two people with the same name mohamed ali , but they live in different locations. Hence, to uniquely identify the record, we can use (name) and (location) column as the Composite Key

2. Second Normal Form (2NF)

our schema to be in 2NF :

  1. It should be in 1NF.
  2. It shouldn’t have partial dependencies. Each non-prime attribute is full functionally dependent on the whole primary key (all prime attributes).

Prime attribute : member of the primary key OR one of attributes that is considered as primary key of table and uniquely identify rows.

if we have composite key, then we have more than one prime attribute.

Partial dependencies : when only one of the prime attributes determines another attribute with no exist of other prime attributes in this relation. OR when not all non-prime attributes depend on all prime attributes.

Ex : {x,y} are our prime attributes , z is non-prime attribute and x->z not {x,y}->z ,so it’s partial dependency because z is functionally dependent on only one prime attribute not all prime attributes.

This table contains employees and his projects information. (Emp_Proj)

emp_noproj_nohoursfirst_namelast_nameproj_name
22220.0ahmedaliprojectXY
10523.0mohmaedahmedprojectCY
34221.0mohamedtarekprojectXY

From our understanding of table:

  • (emp_no) and (proj_no) are prime attributes of our table.
  • {emp_no , proj_no} -> Hours , both (emp_no) & (proj_no) used to identify (hours).
  • emp_no -> {fname,lname}, (emp_no) only used to identify (fname) and (lname).
    • (partial dependency exists)
  • proj_no ->{proj_name}, (proj_no) only used to identify (proj_name).
    • (partial dependency exists)

To normalize this table to 2NF we have to take any partial dependency out so:

  • we will move (relations) that cause partial dependency to new table.
    • (hours) will stay with (proj_no) and (emp_no) because it’s full functionally dependent on all prime attributes.
      • emp_proj1(emp_no , proj_no , hours)
    • (first_name) and (last_name) will be combined with (emp_no).
      • emp_proj2(emp_no,fname,lname)
    • (proj_name) will be combined with (proj_no).
      • emp_proj3(proj_no,proj_name)

emp_proj1

emp_noproj_nohours
22220.0
10523.0
34221.0

emp_proj2

emp_nofirst_namelast_name
22ahmedali
10mohmaedahmed
34mohamedtarek

emp_proj3

proj_noproj_name
2projectXY
5projectCY
2projectXY

3. Third Normal Form

our schema to be at 3NF

  1. It should be in 2NF.
  2. It shouldn’t have transitive functional dependencies.

Transitive functional dependencies : there’s non-prime attribute functionally dependent on another non-prime attribute OR It means that changing a value in one column leads to a change in another column -columns other than prime attributes.

EX : we say that a->c is transitive dependency if it generated from a->b & b->c not a->c directly.

In transitive dependency non-prime attribute determines another non-prime attribute, In partial dependency when only one of the prime attributes determines another attribute with no exist of other prime attributes ( because of that we called it partial dependency).

This table contains employees and their departments information. (emp_dept)

emp_nofirst_namelast_nameemp_agedept_nodept_name
13AhmedAli423financial
16MohamedKamal311administration
21AliSherif365IT

From our understanding of table:

  • (emp_no) is our primary key.
  • emp_no -> {first_name, last_name, emp_age, dept_no}
  • dept_no -> dept_name
    • (Transitive dependency exists).

To solve this issue:

  • We move the relation that caused transitive dependency to new table (dept_info) and remove it from our table (emp_dept).
    • the result will be 2 tables at 3NF.
      • emp_dept( emp_no , emp_name , emp_age , dept_no )
      • dept_info (dept_no , dept_name) dept_no become primary key of new table.

emp_dept

emp_nofirst_namelast_nameemp_agedept_no
13AhmedAli423
16MohamedKamal311
21AliSherif365

dept_info

dept_nodept_name
3financial
1administration
5IT

4. Boyce-Codd Normal Form (BCNF)

Our schema to be in BCNF:

  1. It should be in 3NF.
  2. Any attribute in table depends only on super-Key.
    • a->z means (a) is super-key of (z) (even (z) is a prime attribute)

Some notes:

  • If our table contains only one prime key, 3NF and BCNF are equivalent.
  • BCNF is a special case of 3NF, and it also called 3.5NF.

Prime attributes are super keys, the opposite isn’t true.

This table contains employees and their projects and managers’ numbers. (emp_proj)

  • The manager of an employee could be the manager of the project that the employee worked on.
emp_noproj_nomanager_no
1112
2447
4524
1547

From our understanding of table:

  • (emp_no) and (proj_no) is our primary key (prime attributes).
  • {emp_no , proj_no} -> manager_no no partial or transitive dependency.
  • (manager_no) can determine proj_no so we can infer that manager_no -> proj_no
    • here we have an issue that our table to be in BCNF because (manager_no) isn’t superkey (isn’t used to uniquly identify our rows) which violates BCNF.

To solve this issue:

  • we will move this relation which violates BCNF to new table (proj_managers).
  • result will be two tables in BCNF.
    • emp_proj(emp_no , proj_no).
    • proj_mangers(proj_no,manager_no).

emp_proj

emp_noproj_no
111
244
452
154

proj_managers

proj_nomanager_no
12
47
24

5. Fourth Normal Form (4NF)

our schema to be at 4NF.

  1. It should be in BCNF.
  2. It shouldn’t have multi-valued dependencies.

Multi-valued dependencies : It usually a relationship consisting of 3 attributes (A, B, C). single value from (A) gives more than one value in (B), single value of (A) gives more than one value in (C), and (B) , (C) are independent of each other.
Ex : (emp_no , proj_no, dependents)
( employee ) do have many ( projects ), ( employee ) do have many ( dependents ) like his children and it’s obviously projects and his dependents are independent of each other, which means if we need to remove one of his projects we don’t have to delete one of his dependent. so we have here multi-valued dependency which violates 4NF.

This table contains employees and their projects and dependents. (Employee)

emp_noproj_nodependent
31ahmed
31mohamed
31omar
32ahmed
32mohamed
32omar

From our understanding of table:

  • emp_no is the primary key of table and emp_no -> { proj_no , dependent }
  • we have multi-valued dependency because of:
    • one record of (emp_no) can have many values for the (proj_on).
    • one record of (emp_no) can have many many values for (dependent).
    • projects and dependents are independents on each other.

To solve this multi-valued dependency:

  • Each attribute of attributes that caused multi-valued dependency will be combined with the primary key (emp_no) in new table and result will be 2 tables in 4NF.
    • emp_projects (emp_no , proj_no)
    • emp_dependents (emp_no , dependent)

emp_projects

emp_noproj_no
31
32

emp_dependents

emp_nodependent
3ahmed
3mohamed
3omar

Why do we need to normalize our tables?

When (A.C.I.D compliant) is required

It improves integrity and consistency of your data. (ACID = Atomicity.Consistency.Isolation.Durability)

Fewer storage needed

Since we eliminated repeated groups, and divided our tables, we reduced the size of our tables and database.

less logical I/O cost

When you need to retrieve data, you will retrieve smaller amount of data, and when you need to add or insert in tables, it will be easier, and more organized.

Queries become easier

If we have unnormalized table that has (location) attribute {City, Zip} as composite attribute, and we need to count the unique zip codes in our table, so we will access first location, then we will try to get zip, after normalize this table we will be able to access zip directly because location will be divided to two attributes (city) and (zip).

Write-intensive databases

Normalization increases the performance of write-intensive databases Significantly, because it reduces data modification anomalies, which make it easier to manipulate your database.

Drawbacks of Normalization

When we need to work with read-intensive databases, you may need to join data from multiple tables, and work with a huge amount of data. In normalized databases, you will need many join operations to combine data from multiple tables to satisfy user needs, which will increase time-consuming, and make itdifficult to work with a huge amount of data, so if you need to work with the read-intensive database it’s obviously normalization won’t be your optimal solution.

Facebook
Twitter

Leave a Reply

Your email address will not be published. Required fields are marked *

Unlimited access to educational materials for subscribers

Ask ChatGPT
Set ChatGPT API key
Find your Secret API key in your ChatGPT User settings and paste it here to connect ChatGPT with your Tutor LMS website.
Hi, Welcome back!
Forgot?
Don't have an account?  Register Now