How is TFIDF Calculated in The Term Extraction Transformation?


Posted: 12/23/2010

Jedi Youngling 4  points  Jedi Youngling
  • Joined on: 12/23/2010
  • Posts: 2

I am working on a project that requires the analysis of terms in various texts.  I am making use of the TFIDF functionality built into the Term Extraction Data Flow Transformation.  The documentation says that TFIDF is calculated as follows:

     TFIDF of a Term T = (frequency of T) * log( (#rows in Input) / (#rows having T) )

I am not able to get this formula to agree with the TFIDF results produced.  Here is my sample data:

1 test house test
2 test house test
3 test house test
4 test house test
5 nobody where school
6 nobody where school
7 nobody where school
8 dog dog dog
9 dog
10 cat house dog

I run this through the Term Extraction transformation for both Frequency and Score.  I set the transformation to a frequency threshold of 1 to ensure nothing is missed.  I get the following results for 'school'.

     Frequency = 3
     TFIDF = 3.61191841297781

Okay.  So let's do the math to confirm the result:

     TFIDF = (3) * log( (10) / (3) ) = 1.5686362358410126881149162902347

The results clearly don't match.  I'd like to know if I am missing something or if there is a bug in the Transformation or the documentation.


Posted: 1/3/2011

Jedi Youngling 4  points  Jedi Youngling
  • Joined on: 12/23/2010
  • Posts: 2

The answer was right under my nose in BOL (Books Online).  The formula says log, and I typically read log as log10; that's how it works in Excel and on most calculators.  But when you read the additional documentation in BOL on the log function, you see that it is the natural log, which is typically written ln or log_e.  That's where the confusion came from.  With the natural log, (3) * ln( (10) / (3) ) = 3.6119184..., which matches the result from the transformation.  I'd venture to say it's a bug in the docs for not following common notation for log?
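To make the comparison concrete, here is a minimal Python sketch that applies the documented formula to the sample data with both log bases. It is only an illustration: the whitespace tokenization and the `tfidf` helper are my own assumptions, not how the Term Extraction transformation works internally (the real transformation does its own term extraction), so only the 'school' figure is checked against the result quoted above.

```python
import math

# Sample rows from the original post (leading row numbers dropped).
rows = [
    "test house test",
    "test house test",
    "test house test",
    "test house test",
    "nobody where school",
    "nobody where school",
    "nobody where school",
    "dog dog dog",
    "dog",
    "cat house dog",
]

def tfidf(term, rows, log=math.log):
    """Documented formula: (frequency of T) * log(#rows / #rows having T).

    Hypothetical helper: splits on whitespace, which is an assumption
    and not the transformation's actual term extraction.
    """
    tokens_per_row = [row.split() for row in rows]
    frequency = sum(tokens.count(term) for tokens in tokens_per_row)
    rows_with_term = sum(1 for tokens in tokens_per_row if term in tokens)
    return frequency * log(len(rows) / rows_with_term)

natural = tfidf("school", rows)             # math.log is the natural log
base10 = tfidf("school", rows, math.log10)  # what I originally assumed

print("natural log:", natural)
print("log base 10:", base10)
```

The natural-log value agrees with the 3.6119... score reported by the transformation, while the base-10 value is the 1.5686... figure from my first post.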


Posted: 1/24/2011

Jedi Youngling 10  points  Jedi Youngling
  • Joined on: 1/24/2011
  • Posts: 5

Hi Cloris

To answer your question about logs:  people using data mining should assume the natural log, not base 10 or any other base.  Mathematically, changing the base only rescales every score by the same constant factor, so the relative ordering of terms is unaffected, but the base does matter when you are trying to reproduce a specific number.
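A short Python sketch of that point, using hand-counted figures for a few terms from the sample data above (the counts and the `tfidf` helper are illustrative assumptions, not output from the transformation): the natural-log and base-10 scores always differ by the same factor, ln(10).

```python
import math

def tfidf(frequency, total_rows, rows_with_term, log=math.log):
    # (frequency of T) * log(#rows in input / #rows having T)
    return frequency * log(total_rows / rows_with_term)

# Illustrative (frequency, rows containing the term) counts, out of 10 rows.
counts = {"school": (3, 3), "dog": (5, 3), "cat": (1, 1)}
total_rows = 10

ratios = {}
for term, (frequency, rows_with_term) in counts.items():
    natural = tfidf(frequency, total_rows, rows_with_term)
    base10 = tfidf(frequency, total_rows, rows_with_term, math.log10)
    ratios[term] = natural / base10
    print(term, natural, base10, ratios[term])

# Every ratio equals ln(10), roughly 2.302585, regardless of the term.
print(math.log(10))
```

Because the factor is the same for every term, sorting by either score gives the same ranking; only the absolute numbers change.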

You asked a question about the SSIS documentation, and this third-party forum is not the best place to ask that question.  In the past, I have had my own documentation questions; for example, I posted on the Microsoft forums that I liked their 2008 data mining documentation, which was a great improvement over the 2005 release.

The proper forum for that question:   http://social.msdn.microsoft.com/Forums/en-US/sqlintegrationservices/threads

