Week 22.4 Normalization in Databases
In this offline
lecture, Harkirat covers database normalization
to reduce redundancy and improve integrity. He explains relationship types like one-to-one and many-to-many
. Harkirat then dives into normalization forms
: 1NF
for eliminating multi-valued attributes, 2NF
for addressing partial dependencies, and 3NF
for eliminating transitive dependencies. He provides examples and guidance on normalizing data into these forms.
Normalization in Databases
Normalization is a systematic approach to organizing data in a database to reduce redundancy and improve data integrity. It involves breaking down a database into multiple tables and defining relationships between them based on a set of rules or normal forms. By following these normal forms, we can eliminate redundant data and ensure that data is stored in a logical and efficient manner.
Redundancy
Redundant data refers to the duplication of data across multiple tables or records within a database. This can lead to several issues, such as:
- Data Inconsistency: If the same data is stored in multiple places, updating it in one location but not the others can lead to inconsistent data.
- Wasted Storage Space: Redundant data takes up unnecessary storage space, which can become problematic as the database grows.
- Update Anomalies: When redundant data needs to be updated, multiple updates are required, increasing the risk of errors and inconsistencies.
- Insertion Anomalies: In some cases, it may not be possible to insert new data due to the lack of certain information required by the database structure.
- Deletion Anomalies: Deleting data from one table may inadvertently delete related data from another table, leading to data loss.
To illustrate the concept of redundancy, let’s consider the example provided:
users+--------+----------+| user_id| username |+--------+----------+| 1 | john || 2 | jane |+--------+----------+
user_metadata+--------+----------+------------+| user_id| name | email |+--------+----------+------------+| 1 | John Doe | john@email.com || 2 | Jane Smith| jane@email.com|+--------+----------+------------+
orders+----------+----------+------------+| order_id | user_id | name |+----------+----------+------------+| 1 | 1 | John Doe || 2 | 2 | Jane Smith |+----------+----------+------------+
In this example, the name
column in the orders
table is redundant because it duplicates the information already present in the user_metadata
table. This redundancy can lead to data inconsistencies if the name is updated in one table but not the other.
Non-Full Proof Data
The term “non-full proof data” refers to a database design that lacks proper relationships between tables, making it impossible to retrieve or associate data correctly. The example provided illustrates this issue:
users+--------+----------+| user_id| name |+--------+----------+| 1 | John Doe || 2 | Jane Doe |+--------+----------+
orders+----------+----------+| order_id | name |+----------+----------+| 1 | John Doe || 2 | Jane Doe |+----------+----------+
In this schema, there is no relationship between the users
and orders
tables. The name
column in the orders
table does not provide a reliable way to associate an order with a specific user. If two users have the same name (e.g., “John Doe”), it becomes impossible to determine which user placed a particular order.
Normalization is applied to databases that are “full proof” (i.e., have proper relationships between tables) to remove redundancy and improve data integrity.
Types of Relationships
In the context of a library management system, various types of relationships can exist between different entities or tables. Let’s explore these relationships and their use cases.
1] One-to-One Relationship
A one-to-one relationship is when a record in one table is associated with at most one record in another table, and vice versa. In the library management system, a one-to-one relationship can exist between the Users
table and the Library Card
table.
Example: Each user in the library system can have a single library card, and each library card is associated with only one user.
Users+--------+----------+| user_id| username |+--------+----------+| 1 | john || 2 | jane |+--------+----------+
Library Card+----------+----------+| card_id | user_id |+----------+----------+| 1 | 1 || 2 | 2 |+----------+----------+
In this example, the user_id
column in the Library Card
table is a foreign key referencing the user_id
column in the Users
table, establishing a one-to-one relationship.
2] One-to-Many Relationship
A one-to-many relationship is when a record in one table can be associated with multiple records in another table, but each record in the other table is associated with only one record in the first table.
Example:
In the library system, a one-to-many relationship can exist between the Users
table and the Books
table, where each user can borrow multiple books, but each book can be borrowed by only one user at a time.
Users+--------+----------+| user_id| username |+--------+----------+| 1 | john || 2 | jane |+--------+----------+
Books+----------+----------+------------+| book_id | user_id | book_title |+----------+----------+------------+| 1 | 1 | Book A || 2 | 1 | Book B || 3 | 2 | Book C |+----------+----------+------------+
In this example, the user_id
column in the Books
table is a foreign key referencing the user_id
column in the Users
table, establishing a one-to-many relationship.
3] Many-to-One Relationship
A many-to-one relationship is the inverse of a one-to-many relationship. It occurs when multiple records in one table are associated with a single record in another table.
Example:
In the library system, a many-to-one relationship can exist between the Books
table and the Genre
table, where multiple books can belong to a single genre, but each book can have only one genre.
Books+----------+------------+| book_id | book_title |+----------+------------+| 1 | Book A || 2 | Book B || 3 | Book C |+----------+------------+
Genre+----------+-------------+| genre_id | genre_name |+----------+-------------+| 1 | Fiction || 2 | Non-Fiction |+----------+-------------+
Book_Genre+----------+----------+| book_id | genre_id |+----------+----------+| 1 | 1 || 2 | 1 || 3 | 2 |+----------+----------+
In this example, the Book_Genre
table is a junction table that establishes the many-to-one relationship between the Books
and Genre
tables.
4] Many-to-Many Relationship
A many-to-many relationship is when multiple records in one table can be associated with multiple records in another table, and vice versa.
Example:
In the library system, a many-to-many relationship can exist between the Books
table and the Authors
table, where a book can have multiple authors, and an author can write multiple books.
Books+----------+------------+| book_id | book_title |+----------+------------+| 1 | Book A || 2 | Book B || 3 | Book C |+----------+------------+
Authors+----------+---------------+| author_id| author_name |+----------+---------------+| 1 | Author X || 2 | Author Y || 3 | Author Z |+----------+---------------+
Book_Author+----------+----------+| book_id | author_id|+----------+----------+| 1 | 1 || 1 | 2 || 2 | 2 || 3 | 3 |+----------+----------+
In this example, the Book_Author
table is a junction table that establishes the many-to-many relationship between the Books
and Authors
tables.
5] Final Graph
Based on the relationships discussed above, the final graph representing the library management system could look like this:
+----------+ +----------+| Users |1 1|Library || | | Card |+----------+ +----------+ | |1 |\\ | \\ | \\ | \\ | \\ | \\ N 1 | |+----------+ +----------+| Books |1 N| Genre |+----------+ +----------+ N N | |+----------+| Authors |+----------+
This graph illustrates the one-to-one relationship between Users
and Library Card
, the one-to-many relationship between Users
and Books
, the many-to-one relationship between Books
and Genre
, and the many-to-many relationship between Books
and Authors
.
Normalizing Data
Normalization in databases is a systematic approach to organizing data in a logical and efficient manner. It involves decomposing tables to eliminate data redundancy and improve data integrity. The normalization process typically progresses through several normal forms, each building upon the previous one. The goal is to achieve a database design that minimizes data anomalies and ensures data consistency.
Normal Forms
When analyzing a database schema, you can identify its level of normalization based on the following normal forms:
- First Normal Form (1NF)
- Second Normal Form (2NF)
- Third Normal Form (3NF)
- Boyce-Codd Normal Form (BCNF)
- Fourth Normal Form (4NF)
- Fifth Normal Form (5NF)
In practice, most database designs aim to reach the Third Normal Form (3NF) or the Boyce-Codd Normal Form (BCNF). Going beyond these forms can lead to excessive joins and potentially impact performance.
First Normal Form (1NF)
A table is in the First Normal Form (1NF) if it meets the following criteria:
-
Atomicity: A single cell must not hold more than one value. This rule ensures that each column of a database table holds only atomic (indivisible) values, and multi-valued attributes are split into separate columns.
Example: If a column is meant to store phone numbers, and a person has multiple phone numbers, each number should be in a separate row, not as a list or set in a single cell.
# Not in 1NF+--------+---------------+| user_id| phone_numbers |+--------+---------------+| 1 | 123-456-7890, 987-654-3210 || 2 | 555-123-4567 |+--------+---------------+# In 1NF+--------+---------------+| user_id| phone_number |+--------+---------------+| 1 | 123-456-7890 || 1 | 987-654-3210 || 2 | 555-123-4567 |+--------+---------------+ -
Primary Key: Each table should have a primary key, which is a column (or a set of columns) that uniquely identifies each row in a table.
-
No Duplicated Rows: To ensure data integrity, each row in the table should be unique. This rule works hand-in-hand with the presence of a primary key to prevent duplicate entries, which can lead to data anomalies.
-
Single Value per Column: Each column must have only one value for each row in the table. This rule emphasizes that every column must hold only one value per row, and that value should be of the same kind for that column across all rows.
# Not in 1NF+--------+---------------+| user_id| phone_numbers |+--------+---------------+| 1 | 123-456-7890, 987-654-3210 || 2 | 555-123-4567 |+--------+---------------+# In 1NF+--------+---------------+| user_id| phone_number |+--------+---------------+| 1 | 123-456-7890 || 1 | 987-654-3210 || 2 | 555-123-4567 |+--------+---------------+
Second Normal Form (2NF)
A table is in the Second Normal Form (2NF) if it meets the following criteria:
- It is already in the First Normal Form (1NF).
- It has no partial dependency.
Partial Dependency: This occurs when a non-primary key attribute is dependent on part of a composite primary key, rather than on the whole primary key. In simpler terms, if your table has a primary key made up of multiple columns, a partial dependency exists if an attribute in the table is dependent only on a subset of those columns that form the primary key.
Example:
Consider a table with the composite primary key (StudentID, CourseID)
and other attributes like InstructorName
and CourseName
. If CourseName
is dependent only on CourseID
and not on the complete composite key (StudentID, CourseID)
, then CourseName
has a partial dependency on the primary key. This violates the Second Normal Form (2NF).
Before Normalization:
Enrollments+------------+------------+---------------+---------------+| student_id | course_id | course_name | instructor_name|+------------+------------+---------------+---------------+| 1 | 1 | Introduction to Programming | John Doe || 1 | 2 | Data Structures | Jane Smith|| 2 | 1 | Introduction to Programming | John Doe |+------------+------------+---------------+---------------+
In the above table, the primary key is (student_id, course_id)
. However, CourseName
and InstructorName
have a partial dependency on CourseID
because they are determined solely by the course, not by the combination of student_id
and course_id
.
After Normalization (2NF): To normalize the table to 2NF, we need to separate the attributes that have a partial dependency into a new table.
Courses+------------+---------------+---------------+| course_id | course_name | instructor_name|+------------+---------------+---------------+| 1 | Introduction to Programming | John Doe || 2 | Data Structures | Jane Smith|+------------+---------------+---------------+
Enrollments+------------+------------+| student_id | course_id |+------------+------------+| 1 | 1 || 1 | 2 || 2 | 1 |+------------+------------+
In the normalized schema, the Courses
table contains the course_id
, course_name
, and instructor_name
, eliminating the partial dependency. The Enrollments
table now only contains the student_id
and course_id
, which form the primary key.
Third Normal Form (3NF)
When a table is in the Second Normal Form (2NF), it eliminates repeating groups and redundancy, but it does not eliminate transitive partial dependency.
For a table to be in the Third Normal Form (3NF), it must:
- Be in the Second Normal Form (2NF).
- Have no transitive partial dependency.
Transitive Dependency: A transitive dependency in a relational database occurs when one non-key attribute indirectly depends on the primary key through another non-key attribute.
Example: Consider a table with the following schema:
Employees+-------------+---------------+---------------+| employee_id | employee_name | department_name |+-------------+---------------+---------------+| 1 | John Doe | Sales || 2 | Jane Smith | Marketing || 3 | Bob Johnson | Sales |+-------------+---------------+---------------+
In this table, department_name
has a transitive dependency on the primary key employee_id
. The relationship is as follows:
employee_id
determinesemployee_name
employee_name
determinesdepartment_name
Therefore, department_name
indirectly depends on employee_id
through employee_name
.
To normalize to 3NF:
To eliminate the transitive dependency, we need to separate the department_name
attribute into a new table.
Employees+-------------+---------------+---------------+| employee_id | employee_name | department_id |+-------------+---------------+---------------+| 1 | John Doe | 1 || 2 | Jane Smith | 2 || 3 | Bob Johnson | 1 |+-------------+---------------+---------------+
Departments+---------------+---------------+| department_id | department_name |+---------------+---------------+| 1 | Sales || 2 | Marketing |+---------------+---------------+
In the normalized schema, the Employees
table now has a department_id
column that directly depends on the primary key employee_id
. The Departments
table contains the department_id
and department_name
, eliminating the transitive dependency.
By following the normalization process and adhering to the Third Normal Form (3NF), you can ensure that your database design minimizes data redundancy, maintains data integrity, and reduces the risk of data anomalies.
It’s important to note that while normalization is a crucial step in database design, over-normalization can lead to excessive joins and potentially impact performance. In practice, most database designs aim to reach the Third Normal Form (3NF) or the Boyce-Codd Normal Form (BCNF), as going beyond these forms may not provide significant benefits and can introduce unnecessary complexity.