I was recently asked to provide some comments on how model mailing categories. The client could understand how to create a simple dimension as a lookup table but then kept trying treat the dimension as source of measures. Here are the quick thoughts I sent his way. Afterwards, I thought that it would make a decent posting.
How to model the categories? I can see this going a couple different ways. One idea would be to have a dimension of subject line types. I could imagine that the attributes of the dimension would be categories of subject lines or categories of items found in the subject line. The dimension might look something like this:
ContentTypeDesc (“Financial”, “Urgent”, “Misspellings”, etc.)
+ intermediate bridge table(s) that join the DimSubjectContentType to the Fact tables.
Clearly, a single email could contain more than one of these categories. If you take the first approach then you would have a flattened table with a field for every possible flag that you want to track. Each record would contain a unique combination of the flags. The trick then would be to identify the correct record that represents the combination of flags for each email. I could see using a hash of the combinations to provide the single field that would become the unique business key for each record. The problem with this approach is that the table can grow and may be quite wide. The plus is that it is simpler to use and is more performant than a M2M construct. It also avoids dangers of double counting records when you use the M2M approach.
The M2M approach provides the flexibility to store as many flag types as you want. It is more complicated in that you need to use 1 or 2 intermediate tables. In your situation, you would probably only need 1 bridge table. In Analysis Services, there are actually two logical intermediate tables an group Dimension and a group Fact table. Since you were not planning on using AS right away, I will not dive into that but I did provide some good links below. The M2M approach is more complex for the ETL and risks query complications. It also can be worse for performance- especially in SSAS.
Here are some of my notes and links I have on this situation.
http://www.pythian.com/news/364/implementing-many-to-many-relationships-in-data-warehousing/ (Provides some example for modeling but is geared towards SQL queries.)
http://www.sqlbi.com/articles/many2many/ - Marco Russo and Alberto Ferrari. Large PDF (In downloads folder) of Many to Many Relationships.
The next thought I had was the idea that you might consider using a new fact table as well (or add additional measures onto an existing fact table.) These measures would provide you with the ability to count instances of emails where one of the subject types existed. In this case, I would lean towards separate measures for each type. The measures would be a 1 or a 0 where 1 indicates the existence of the detected subject item. This provides you with the ability to quickly create counts based on your dimensions. I would not simply count the emails based on the DimSubjectContentType dimension types since that would require correlated subqueries and be very bad for performance.
I think that you would want to do both of these things. The dimension would provide the slicing, reporting and filtering. The fact would provide you with counts.