Posted: 5/9/2011
According to Kimball the date dimension should have a smart date key which is an integer formatted like so: YYYYMMDD. It supports easy partitioning and only requires 4 bytes of storage. Date Time Datatype requires 8 bytes.
However, SQL 2008 supports a Date Datatype which requires 3 bytes of storage. Some Kimball folks indicate that you should still use a INT key because it makes it easier to designate erroneous or missing date values from the source.
Has anyone tested the join performance of INT vs Date datatype keys?
What are you folks using right now?
Actually, it's R2 that supports a Date datatype, that feature was ripped out of SQL 2008. I haven't tested it yet, but for most people the INT type will still make sense until they're up on R2. I haven't tested the performance, but it seems like it would be more efficient in the long run, if only by a small amount. Really wish we could take advantage of that change, but it's just not to be for some time over here. :(
As for why you would still use an INT, I don't quite get that one. It's still a sequential set of dates, but using DATE would actually be constrained to real dates as opposed to someone chucking an extra digit into the INT range, thus making it an invalid date.
All my dimensions have INT data types for their keys. I have a key in all my dimensions (including my Date dimension) that represents an unknown value. I find this useful for a variety of reasons. How would I do this if I used the Date data type as my Date Dimension Key? This is why I think I will stick with INT.
What value are you using for your unknown value anyway? I typically use 19000101 which is also acceptable as a Date datatype value as well.
OK - quick note. I'm apparently wrong about DATE and TIME not being in SQL 2008, but I really could have sworn they were not in the RTM version. Anyone know if these were added in a service pack or CU? I'll definitely be checking these out going forward now that I know they exist for 2008. In the meantime, we're using INT values because that's our current standard and I don't know that we'll refactor this anytime soon.
I use an INT value of zero for my Unknown Date ID.
Posted: 6/13/2011
Haven't used Date Datatype Keys. I use INT for all my Keys and Zero to represent the unknown value. INT keys for date (YYYYMMDD) have worked great till now. They are also helpful when writing ad-hoc SQL queries on the DWH and trying to filter across date dimension because of the simplicity in which they are stored.
On a side note for unknown values - For some other dimensions, we have used multiple unknown rows. For example, we want to clearly distinguish between unknown values, blank values, pending values and so forth in the dim tables. For some business these values mean very different things and while reporting they help us identify the incoming data and the transformed data very easily.