Term Frequency-Inverse Document Frequency (TF-IDF)
TF-IDF stands for Term Frequency-Inverse Document Frequency. It is a popular technique in information retrieval and natural language processing for quantifying how important a word is to a document within a corpus of documents.
Motivation
Consider the example in the section on representing words as vectors using the document dimension. Let’s add one more word to the matrix: `the`, a very frequent word that appears many times in every document.
| | As You Like It | Twelfth Night | Julius Caesar | Henry V |
|---|---|---|---|---|
| battle | 1 | 0 | 7 | 13 |
| good | 114 | 80 | 62 | 89 |
| fool | 36 | 58 | 1 | 4 |
| wit | 20 | 15 | 2 | 3 |
| the | 1000 | 1200 | 900 | 1100 |
Then, by the cosine similarity metric, many words will appear similar to the word `the`.
This skews the picture. Suppose, hypothetically, that you want to retrieve documents using only the two words `the` and `good`. Ranking documents by the raw counts in the table above returns the document that contains `the` most often, which is not what we want. This somewhat violates the idea that “similar documents tend to have similar words” mentioned in words and vectors: `the` is so common that it appears in almost all documents, so it is not a good indicator of similarity.
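To make the two observations above concrete, here is a minimal sketch that uses only the counts from the table. The names `counts`, `cosine`, `sims_to_the`, and `best_play` are illustrative, and the "retrieval score" is assumed to be nothing more than the sum of raw counts of the query words in each play.

```python
import math

plays = ["As You Like It", "Twelfth Night", "Julius Caesar", "Henry V"]

# Raw counts from the table: one row per word, one column per play.
counts = {
    "battle": [1, 0, 7, 13],
    "good": [114, 80, 62, 89],
    "fool": [36, 58, 1, 4],
    "wit": [20, 15, 2, 3],
    "the": [1000, 1200, 900, 1100],
}


def cosine(u, v):
    """Cosine similarity between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Similarity of every other word to "the".
sims_to_the = {
    word: cosine(row, counts["the"])
    for word, row in counts.items()
    if word != "the"
}
for word, sim in sims_to_the.items():
    print(f"cos({word}, the) = {sim:.3f}")

# Naive retrieval (an assumption for illustration): score each play by the
# summed raw counts of the query words and return the highest-scoring one.
query = ["the", "good"]
scores = [sum(counts[word][i] for word in query) for i in range(len(plays))]
best_play = plays[scores.index(max(scores))]
print(best_play)
```

Every word in the table ends up with a fairly high cosine similarity to `the`, and the naive query returns Twelfth Night, the play with the most occurrences of `the`, even though `good` is most frequent in As You Like It.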
TF-IDF addresses this problem. The basic idea is to give a high weight to words that appear frequently in a document but rarely in other documents, and a low weight to words that are common across all documents. The weight assigned to each word is calculated from two factors: term frequency (TF) and inverse document frequency (IDF).
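As a rough preview of this weighting, the sketch below applies one common variant to the table above. It assumes tf = log10(1 + count) and idf = log10(N / df), where N is the number of documents and df is the number of documents containing the word. These formulas are an assumption made for illustration, and other definitions of TF and IDF are in use. The helper names `tf`, `idf`, and `tfidf` are likewise hypothetical.

```python
import math

plays = ["As You Like It", "Twelfth Night", "Julius Caesar", "Henry V"]
counts = {
    "battle": [1, 0, 7, 13],
    "good": [114, 80, 62, 89],
    "fool": [36, 58, 1, 4],
    "wit": [20, 15, 2, 3],
    "the": [1000, 1200, 900, 1100],
}
n_docs = len(plays)


def tf(count):
    """Assumed term-frequency variant: log-scaled raw count."""
    return math.log10(1 + count)


def idf(row):
    """Assumed IDF variant: log10(N / df).

    df is the number of documents in which the word occurs at least once.
    Every word in this table occurs in at least one play, so df > 0.
    """
    df = sum(1 for c in row if c > 0)
    return math.log10(n_docs / df)


# TF-IDF weight of each word in each play.
tfidf = {
    word: [tf(c) * idf(row) for c in row]
    for word, row in counts.items()
}

for word, weights in tfidf.items():
    print(word, [round(w, 3) for w in weights])
```

Under these assumptions `the` receives a weight of zero in every play, because it occurs in all four documents. With only four plays, the same happens to `good`, `fool`, and `wit`, and only `battle` keeps a nonzero weight; a larger corpus would be needed to separate those words.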