Saturday, August 18, 2007

Normalization Where to Stop?

One of the most important thing we learn in our DBMS subject is Normalization. Whether you are a data modeler, DBA, or SQL developer, normalization is one of those topics we all learn. We learn this either at work or during our formal IT degree. We are taught that there are 5 different normal forms and what goes in where etc etc. But how much do we use them?

Take a look at most production databases. The best you will find that the database has been implemented using Third normal form (3NF). Very few databases reflect higher normal forms, such as Boyce-Codd normal form (BCNF), the Fourth normal form (4NF), and the Fifth normal form (5NF). So, why don't most database designers go beyond the 3NF?

I am not going deep into the normalization and their definition. A google search will bring out all those definitions so I opted to keep the definition out of this post.

How far should you go with normalization?
To be sure, each progressive step may impact upon overall performance. I have seen normalization taken to absurd lengths. In one of the recent discussion one person came out with the idea of different financial document types. As though the world of accounting is going to change. I had to remind him that Accounting is just the recording of historical events :)

A while ago I was reading an article and there the author mentioned about the normalization. Over a period of time I learnt to ask few questions before I decide how much normalization is required.
  1. What is the nature of the system. Is it an OLTP or OLAP system?
  2. What is the nature of DB Query. Are they mostly Insert or Retrieve?
  3. For Part of DB where the inserts are more, its better to have the Data in 3rd normal form.
  4. For system where Retrieve operation is more than Inserts, 2nd Normal form is the best.
Where you draw the line in the sand is ultimately up to you, but you will be better equipped to draw it with a sound understanding of the various normal forms and the risks of not going far enough. And not to forget the business requirement, after all its "Not Your Software" its the users who will be using it not the developers who write it.

Why most designers don't go beyond the 3NF?
There are few factors which affect the level of normalization we choose for the database. Most of my decision to design the database in past were based on the answers I got back after asking these questions:
  1. What is the insert/retrieve ratio?
  2. What is the database used for Transaction Recording/Processing or Decision making?
  3. What is the response time we are looking for in case of Insert, Update, Delete and Retrieve?
  4. What is the estimated peak transaction load on the database also the off-peak hour transaction load.
  5. What is the database deployment strategy Centralized or Distributed?
  6. What is the transaction control strategy. Whether it is Single-Phase commit or Multi-Phase commit?
  7. Is there a temporary Cache implemented or required?
  8. Do we need to maintain the user session or user interaction with business layer is stateless?
Asking these questions does give an overview of how the system will look and to suit the need bes the appropriate strategy can be formed.

Sometime in coming days I will write a post on how to optimize database and what are the common mistakes database designer make. Keep watching this space for the post.

Until Next Time... :)

2 comments:

Jamie Lewis said...

I was always taught to go up to 3NF, and as I see it there shouldn't be any need to go any further since after 3 everything (in theory) should be optimised.

Samir Kumar Mishra said...

Jamie,

Yes you are right. 3NF is the maximum one should go but then more often than not we are forced to denormalize to 2 - 2.5 (a hybrid of 2NF and 3NF) to gain performance advantages.

3NF is good if it is more to do with data recording but if the retrieval is more then we are better of going for a lower Normalization.

Cheers
-Samir