Posted: 12/23/2010
I am working on a project that requires the analysis of terms in various texts. I am making use of the TFIDF functionality built into the Term Extraction Data Flow Transformation. The documentation says that TFIDF is calculated as follows:
TFIDF of a Term T = (frequency of T) * log( (#rows in Input) / (#rows having T) )
I am not able to get this formula to agree with the TFIDF results produced. Here is my sample data:
1   test house test
2   test house test
3   test house test
4   test house test
5   nobody where school
6   nobody where school
7   nobody where school
8   dog dog dog
9   dog
10  cat house dog
I run this through the Term Extraction transformation for both Frequency and Score. I set the transformation to a frequency threshold of 1 to ensure nothing is missed. I get the following results for 'school':
Frequency = 3
TFIDF = 3.61191841297781
Okay. So let's do the math to confirm the result:
TFIDF = (3) * log( (10) / (3) ) = 1.5686362358410126881149162902347
The results clearly don't match. I'd like to know if I am missing something or if there is a bug in the Transformation or the documentation.
Posted: 1/3/2011
The answer was right under my nose in BOL. The formula says log, and I typically think log10 when I see log; that is how it works in Excel and on most calculators. But the additional documentation in BOL on the log function says that it is the natural log, which is typically written as ln or log_e. That's where the confusion comes in. I'd venture to say this is a bug in the docs for not following the common notation for log?
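As a quick sanity check outside SSIS, here is a minimal Python sketch that plugs the numbers for 'school' into the documented formula with both bases. The function name tfidf and its parameters are made up for illustration; this is not how the transformation itself is implemented.

```python
import math

def tfidf(frequency, total_rows, rows_with_term, log=math.log):
    """Documented formula: frequency * log(total_rows / rows_with_term).

    `log` defaults to math.log, which is the natural logarithm.
    """
    return frequency * log(total_rows / rows_with_term)

# 'school' in the sample data: frequency 3, 10 input rows, 3 rows with the term.
natural = tfidf(3, 10, 3)              # natural log (ln)
base10 = tfidf(3, 10, 3, math.log10)   # base-10 log

print("natural log:", natural)
print("base-10 log:", base10)
```

The natural-log value agrees with the 3.6119... reported by the transformation, while the base-10 value is the 1.5686... from my hand calculation.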
Posted: 1/24/2011
Hi Cloris
To answer your question about logs: people using data mining should assume the natural log, not base 10 or any other base. Mathematically, the choice of base only rescales every score by the same constant factor, so it does not change how terms rank against each other, but it does change the value you get for a specific calculation.
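To illustrate the point about the base, here is a rough Python sketch over the sample rows from the original post. It assumes naive whitespace tokenization, which is not what Term Extraction does internally, so the counts for some terms may differ from what SSIS reports; the helper name scores is hypothetical.

```python
import math

rows = [
    "test house test", "test house test", "test house test", "test house test",
    "nobody where school", "nobody where school", "nobody where school",
    "dog dog dog", "dog", "cat house dog",
]

def scores(rows, log):
    """frequency * log(#rows / #rows containing the term), per term."""
    freq, containing = {}, {}
    for row in rows:
        words = row.split()  # assumption: plain whitespace split
        for w in words:
            freq[w] = freq.get(w, 0) + 1
        for w in set(words):
            containing[w] = containing.get(w, 0) + 1
    return {w: freq[w] * log(len(rows) / containing[w]) for w in freq}

ln_scores = scores(rows, math.log)
log10_scores = scores(rows, math.log10)

# Every ratio equals ln(10): switching base is a constant rescaling.
ratios = {w: ln_scores[w] / log10_scores[w] for w in ln_scores}

def ranking(d):
    # Sort by score, breaking ties alphabetically so the order is deterministic.
    return sorted(d, key=lambda w: (d[w], w))

print(ratios)
print(ranking(ln_scores) == ranking(log10_scores))
```

Under these assumptions the term ordering is the same for both bases; only the absolute scores differ, by a factor of ln(10).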
You asked a question about the SSIS documentation, and this third-party forum is not the best place to ask it. I have had documentation questions of my own in the past; for example, I posted on the Microsoft forums that I liked their 2008 data mining documentation, which was a great improvement over the 2005 release.
The proper forum for that question: http://social.msdn.microsoft.com/Forums/en-US/sqlintegrationservices/threads