Nermalisation

An introduction to the normalisation of databases that requires no prior knowledge and serves as an excuse to introduce cat content into MySQL-dump.
Simply speaking, normalization is about flexibility and avoiding redundantly storing information, that is, storing the same piece of knowledge in more than one place.
Consider some cats and their owners (or, in Cat, our "tin-openers"):

So far, so good. Now another cat strays to Louise:

It's pretty obvious that we'll be in a world of pain once we also wish to store more attributes for each cat (date of birth, vaccination, ...).
Moreover, cat would be a string in databases that do not support multi-valued fields (columns), so looking up Lara would be awkward ("show me all rows that have 'Lara' as a whole word anywhere in the field cat_name"). This is awkward, error-prone, and slow, so let's discard the idea while we can:

Now we stored "Louise" and "Los Angeles" twice. This is somewhat awkward, we'd rather only change one entry if Louise moves or changes her name:

We have eliminated the redundant city, but if Louise's name should change, we'll have to update three entries (her name in the owners table, and two references to that entry from the cats table). Since in the real world, people do move and do get married (and sometimes just change their names because, well, they wish to), maybe the name is not a very good key. We'll make up an arbitrary one that will never change because it does not rely on the actual payload data:

Now a new cat strays to Sarah. Like Louise, Sarah fancies the name Lara:

The cats are still unique — Sarah's Lara is Lara/2 in our cats table, Louise's is Lara/3. At this point however, word gets around in cat circles that Louise rocks — Ada escapes from Andrea-world and strays to Louise! (cat_name,owner_no) no longer considered stable key!

Sarah's Lara heard the news as well, and also strays to Louise:

Louise is unaware of the newcomer's name and calls her Callisto — apparently, a cat's name is not a constant. Once more, we fix our tables:

Now, a cat can have any number of names (one after the other), and each owner can feed any number of cats. We can even make overlapping entries in our latest table to account for cats sneakily eating in several places. So far, so normal. Or is it?
What happens if Louise moves, but not with all "her" cats?
What changes will we need if twin cats arrive that get the same name because Sarah can't tell them apart, anyway? (Rows must be unique!)
Denormalisation:
In some cases, it may be worthwhile to separate a table's history from its status quo (to keep the table holding the current data small, to speed up lookups in it, or to put the history table in one of MySQL's merge tables):
In this example, we'll have to insert the relevant line from the current table (using the current date for the until value) before updating the line in the current table.

If in doubt, use consecutive integers as a (primary) key (integer primary key auto_increment not null etc.); do not use strings as keys if you can avoid it, if the above didn't convince you, maybe this will: string lookups aren't exactly fast. Do not use "match-codes" or any other keys that encode payload data; keys should be opaque.
Consider some cats and their owners (or, in Cat, our "tin-openers"):
cat | owner | city |
---|---|---|
Ada | Andrea | Anchorage |
Shelley | Sarah | San Diego |
Lynn | Louise | Los Angeles |

So far, so good. Now another cat strays to Louise:
cat | owner | city |
---|---|---|
Ada | Andrea | Anchorage |
Shelley | Sarah | San Diego |
Lynn, Lara | Louise | Los Angeles |

It's pretty obvious that we'll be in a world of pain once we also wish to store more attributes for each cat (date of birth, vaccination, ...).
Moreover, cat would be a string in databases that do not support multi-valued fields (columns), so looking up Lara would be awkward ("show me all rows that have 'Lara' as a whole word anywhere in the field cat_name"). This is awkward, error-prone, and slow, so let's discard the idea while we can:
cat | owner | city |
---|---|---|
Ada | Andrea | Anchorage |
Shelley | Sarah | San Diego |
Lynn | Louise | Los Angeles |
Lara | Louise | Los Angeles |

Now we stored "Louise" and "Los Angeles" twice. This is somewhat awkward, we'd rather only change one entry if Louise moves or changes her name:
owner | city |
---|---|
Andrea | Anchorage |
Sarah | San Diego |
Louise | Los Angeles |
cat | owner (referencing other table, "foreign key") |
---|---|
Ada | Andrea |
Shelley | Sarah |
Lynn | Louise |
Lara | Louise |

We have eliminated the redundant city, but if Louise's name should change, we'll have to update three entries (her name in the owners table, and two references to that entry from the cats table). Since in the real world, people do move and do get married (and sometimes just change their names because, well, they wish to), maybe the name is not a very good key. We'll make up an arbitrary one that will never change because it does not rely on the actual payload data:
owner_no | owner | city |
---|---|---|
1 | Andrea | Anchorage |
2 | Sarah | San Diego |
3 | Louise | Los Angeles |
cat | owner_no (referencing other table, "foreign key") |
---|---|
Ada | 1 |
Shelley | 2 |
Lynn | 3 |
Lara | 3 |

Now a new cat strays to Sarah. Like Louise, Sarah fancies the name Lara:
owner_no | owner | city |
---|---|---|
1 | Andrea | Anchorage |
2 | Sarah | San Diego |
3 | Louise | Los Angeles |
cat | owner_no (referencing other table, "foreign key") |
---|---|
Ada | 1 |
Shelley | 2 |
Lynn | 3 |
Lara | 3 |
Lara | 2 |

The cats are still unique — Sarah's Lara is Lara/2 in our cats table, Louise's is Lara/3. At this point however, word gets around in cat circles that Louise rocks — Ada escapes from Andrea-world and strays to Louise! (cat_name,owner_no) no longer considered stable key!
owner_no | owner | city |
---|---|---|
1 | Andrea | Anchorage |
2 | Sarah | San Diego |
3 | Louise | Los Angeles |
cat_no | cat |
---|---|
1 | Ada |
2 | Shelley |
3 | Lynn |
4 | Lara |
5 | Lara |
cat_no | lives_with_owner_no | from | until |
---|---|---|---|
1 | 1 | 1998 | 2002 |
1 | 3 | 2002 | NULL (NULL: still living there, makes looking up current location nice and easy) |
2 | 2 | 1997 | NULL |
3 | 3 | 1850 | NULL |
4 | 3 | 1999 | NULL |
5 | 2 | 2002 | NULL |

Sarah's Lara heard the news as well, and also strays to Louise:
owner_no | owner | city |
---|---|---|
1 | Andrea | Anchorage |
2 | Sarah | San Diego |
3 | Louise | Los Angeles |
cat_no | cat |
---|---|
1 | Ada |
2 | Shelley |
3 | Lynn |
4 | Lara |
5 | Lara |
cat_no | lives_with_owner_no | from | until |
---|---|---|---|
1 | 1 | 1998 | 2002 |
1 | 3 | 2002 | NULL (NULL: still lives there, makes looking up current location nice and easy) |
2 | 2 | 1997 | NULL |
3 | 3 | 1850 | NULL |
4 | 3 | 1999 | NULL |
5 | 2 | 2002 | 2002 |
5 | 3 | 2002 | NULL |

Louise is unaware of the newcomer's name and calls her Callisto — apparently, a cat's name is not a constant. Once more, we fix our tables:
owner_no | owner | city |
---|---|---|
1 | Andrea | Anchorage |
2 | Sarah | San Diego |
3 | Louise | Los Angeles |
cat_name_no | cat_name |
---|---|
1 | Ada |
2 | Shelley |
3 | Lynn |
4 | Lara |
5 | Callisto |
lives_with_owner_no | from | until | under_the_name |
---|---|---|---|
1 | 1998 | 2002 | 1 |
3 | 2002 | NULL | 1 |
2 | 1997 | NULL | 2 |
3 | 1850 | NULL | 3 |
3 | 1999 | NULL | 4 |
2 | 2002 | 2002 | 4 |
3 | 2002 | NULL | 5 |

Now, a cat can have any number of names (one after the other), and each owner can feed any number of cats. We can even make overlapping entries in our latest table to account for cats sneakily eating in several places. So far, so normal. Or is it?
What happens if Louise moves, but not with all "her" cats?
What changes will we need if twin cats arrive that get the same name because Sarah can't tell them apart, anyway? (Rows must be unique!)
Denormalisation:
In some cases, it may be worthwhile to separate a table's history from its status quo (to keep the table holding the current data small, to speed up lookups in it, or to put the history table in one of MySQL's merge tables):
owner_no | owner | city |
---|---|---|
1 | Andrea | Anchorage |
2 | Sarah | San Diego |
3 | Louise | Los Angeles |
cat_name_no | cat_name |
---|---|
1 | Ada |
2 | Shelley |
3 | Lynn |
4 | Lara |
5 | Callisto |
lived_with_owner_no | from | until | under_the_name (history) |
---|---|---|---|
1 | 1998 | 2002 | 1 |
2 | 2002 | 2002 | 4 |
lives_with_owner_no | from | under_the_name (current) |
---|---|---|
3 | 2002 | 1 |
2 | 1997 | 2 |
3 | 1850 | 3 |
3 | 1999 | 4 |
3 | 2002 | 5 |
In this example, we'll have to insert the relevant line from the current table (using the current date for the until value) before updating the line in the current table.

If in doubt, use consecutive integers as a (primary) key (integer primary key auto_increment not null etc.); do not use strings as keys if you can avoid it, if the above didn't convince you, maybe this will: string lookups aren't exactly fast. Do not use "match-codes" or any other keys that encode payload data; keys should be opaque.
Trackbacks
Katze mit Wut on : Nermalisierung
Eine Einführung in die Normalisierung von Datenbanken, die kein Vorwissen voraussetzt und Katzencontent liefert.
ASCII version, English versionStark vereinfacht geht es bei Normalisierung um Flexibilität und Redundanzvermeidung.
Gegeben seien Katzen u
The Quiet Earth on : Databases and Normalization
Relational Database Management Systems (RDBMSs) use special forms to keep data from being kept redundant: normalized tables. If you’re like me, you haven’t memorized or completely understood normalization up to now.
If you do, however, wan...
MySQL-dump on : Annotations to "Five Common PHP database problems"
In Five Common PHP database problems, Jack Herrington writes about five common database problems that occur in PHP applications -- including database schema design, database access, and the business logic code that uses the database -- as well as their s
Comments
Display comments as Linear | Threaded
eckes on :
Azundris on :
http://dev.mysql.com/doc/mysql/en/server-system-variables.html
Azundris on :
fdask on :
I am a little unclear what the point here is though... break tables up?
Azundris on :
http://en.wikipedia.org/wiki/Database_normalization
Max Sinister on :
Kimbo Slice on :
Kimbo Slice on :