In the field of database design, normalization is an important topic that many people don’t give much, if any, consideration. Chances are, if you have designed a database, you have probably been more concerned with how everything makes sense to you. This article is not a lecture on normalization, but more of a comparison of when to use normalization (more specifically 3NF) and when to model dimensionally. Both options have their advantages and disadvantages in different situations.
Basic Overview of Normalization and Dimensional Modeling
To better understand third normal form (3NF), let’s take a brief look at the basics of first and second normal forms as well. Normalization is the removal of redundant data and separation of attributes into organized tables. Let’s take a database that is devoid of all normalization. Think of it as one large spreadsheet. There is potentially a lot of redundant data and it’s not very organized. You may even find fields containing more than one value and groups of repeating information.
In first normal form (1NF), you start to eliminate some of the redundancy. Row and column fields only have one value, and the main table is broken down into smaller, more organized tables.
In second normal form (2NF), data is that is not fully dependant on the primary key is moved into another table. This organizes similar data just a little bit more.
In third normal form (3NF), data that is not dependant on the primary key is moved into another table. This is usually the highest form of normalization you will work with, even though normalization goes all the way up seventh normal form.
In dimensional modeling, data is split into fact tables and dimension tables. The dimension tables store data that gives information about the data in the fact tables. For example, let’s say you have a database that contains product information, customer information, and customer transactions. The fact table might contain information on the customer transactions (quantity ordered, total cost, etc.) and the dimension tables would contain information about the customers, such as address, contact info, etc.
When to Normalize
A normalized database is ideal for situations in which data is being written and updated frequently, such as websites and OLTP applications. Let’s say a customer is maintaining an online account of some sort. If he makes changes to his street address, then there is only one field in the database that needs to be written to. That decreases the time needed for writing to the database and decreases the chances of inconsistent data. Just think if a system like that was set up on a dimensional model, and a customer’s address was stored in more than one location. If the data wasn’t maintained correctly, a customer could have more than one address on file as their current address. A marketing department would then have problems sending mail to this customer. A normalized database also increases data security. Again, this goes back to data being stored in one location as opposed to multiple places. The only real disadvantage to a normalized database is the decrease in database performance. This is caused by the amount of joins required when retrieving data from the database.
When you are working with a data warehouse or OLAP environment, then you would definitely want to denormalize. Data warehouses and OLAP environments are used to store historical data for reporting and analysis purposes. The database performance is generally better since there are not as many joins required when retrieving data. In many cases, there would also be less indexes as opposed to a normalized database, thus, saving on storage space. However, due to the increase of data redundancy, the data going into the database must be carefully controlled. Because of this, data modification is usually slower than a normalized database. If you have a field that is stored in three different places, you would have to make sure that all three of those values are updated.
Star and Snowflake Schemas
In a data warehouse, if you take a denormalized approach, you will use a star schema. A star schema is the simplest form of data warehouse schema. It consists of one or two fact tables and multiple dimension tables as mentioned above. In a star schema, the fact tables have a compound primary key, and the dimension tables contain a simple primary key. The diagram below demonstrates the idea behind fact and dimension tables.
A snowflake schema is more complex than the star schema. In a snowflake schema, normalization is introduced in the dimension tables. This causes a need for more foreign key joins and, thus, could potentially increase data retrieval times. The following diagram shows a basic snowflake schema. Notice the added levels to the dimension tables.
In either scenario, when designing a database, you must always keep the end users’ needs in mind. What kind of data will it store? What will the end users do with it? In the end you will hopefully end up with a database that is optimal for its intended use.