Concept
TF-IDF is used in the setting where documents serve as the dimensions, i.e., the term-document matrix: each entry of the matrix is weighted by the TF-IDF score of the corresponding term in the corresponding document.
Note that this section’s notation is not consistent with the notation defined elsewhere in these notes.
We will denote the vocabulary as a lexicographically ordered set of words \(\mathcal{V} = \{v_1, v_2, \dots, v_T\}\), where \(T\) is the size of the vocabulary.
We will denote the set of documents as \(\mathcal{S} = \{s_1, s_2, \dots, s_D\}\), where \(D\) is the number of documents.
Motivation
Suppose we have a corpus consisting of \(5\) documents as follows:
corpus = [
"The sun is the largest celestial body in the solar system",
"The solar system consists of the sun and eight revolving planets",
"Ra was the Egyptian Sun God",
"The Pyramids were the pinnacle of Egyptian architecture",
"The quick brown fox jumps over the lazy dog",
]
corpus_names = ["doc_1", "doc_2", "doc_3", "doc_4", "doc_5"]
where each element in `corpus` corresponds to a document.
The unique vocabulary terms in lexicographical order are as follows:
vocabulary = ['and', 'architecture', 'body', 'brown', ...]
In other words, `corpus` is our set \(\mathcal{S}\) and `vocabulary` is our set \(\mathcal{V}\).
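As a quick sketch, the lexicographically ordered vocabulary can be built from the `corpus` defined above by collecting the unique tokens. The lowercase-and-split-on-whitespace tokenization here is an assumption for illustration, not the only possible choice:

```python
# Build a lexicographically ordered vocabulary from the corpus defined above.
# Tokenization (lowercase + split on whitespace) is an assumption for illustration;
# a real pipeline may strip punctuation or use a proper tokenizer.
vocabulary = sorted({token for document in corpus for token in document.lower().split()})
print(vocabulary[:4])  # ['and', 'architecture', 'body', 'brown']
```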
Term Frequency (TF)
Suppose now we have a new document `query` as follows:
query = "Why did the chicken cross the road?"
and we want to rank the documents in `corpus` according to their relevance to `query`.
A simple way to start out is by eliminating documents that do not contain any of the words “why”, “did”, “the”, “chicken”, “cross”, and “road”. However, none of the documents will be eliminated, since all of them contain the common word “the”.
To further distinguish them, we can count the number of times each word from `query` occurs in each document. The number of times a word occurs in a document is called its term frequency. For example, in `doc_4` the word “the” occurred \(2\) times, while in `doc_1` it occurred \(3\) times. Hence we can intuitively rank `doc_1` over `doc_4` as being more relevant to `query`.
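To make the counting concrete, here is a minimal sketch, continuing from the `corpus` and `corpus_names` defined above and using the same simple lowercase-and-split tokenization (an assumption for illustration):

```python
from collections import Counter

query = "Why did the chicken cross the road?"
query_words = [word.strip("?") for word in query.lower().split()]

# Raw term frequency of each query word in each document.
for name, document in zip(corpus_names, corpus):
    counts = Counter(document.lower().split())
    print(name, {word: counts[word] for word in query_words})
# e.g. doc_1 -> {'why': 0, 'did': 0, 'the': 3, 'chicken': 0, 'cross': 0, 'road': 0}
```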
Inverse Document Frequency (IDF)
The word “the” is very common and can skew the results by giving too much emphasis to documents that happen to use it more often, instead of focusing on more meaningful terms like “chicken”, “cross”, and “road”. To address this issue, an inverse document frequency factor is added to reduce the weight of frequently occurring terms and increase the weight of rarely occurring terms, making them more significant in distinguishing relevant and non-relevant documents and terms.
Term Frequency (TF)
Term frequency, \(\operatorname{tf}_{t, d}\), is the relative frequency of term \(t \in \mathcal{V}\) in document \(d \in \mathcal{S}\), i.e., the number of times that term \(t\) occurs in document \(d\) divided by the total number of terms in document \(d\):

\[
\operatorname{tf}_{t, d} = \frac{f_{t, d}}{\sum_{t^{\prime} \in d} f_{t^{\prime}, d}}
\]

where \(f_{t, d}\) is the raw count of a term in a document, i.e., the number of times that term \(t\) occurs in document \(d\). Note that the denominator is simply the total number of terms in document \(d\) (counting each occurrence of the same term separately).
There are various other ways to define term frequency; for example, scikit-learn’s `TfidfVectorizer` uses the raw count itself as the term frequency, i.e., \(f_{t, d}\).
For our purposes, we will follow the definition in [Jurafsky and Martin, 2022] and use the logarithmically scaled version of the raw count:

\[
\operatorname{tf}_{t, d} = \log_{10}\left(1 + f_{t, d}\right)
\]
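As a sketch, both term-frequency variants mentioned above (the relative frequency and the log-scaled raw count) can be written as follows; the simple lowercase-and-split tokenization and the reuse of `corpus` from earlier are assumptions for illustration:

```python
import math

def raw_count(term: str, document: str) -> int:
    """Raw count f_{t,d}: number of times `term` occurs in `document`."""
    return document.lower().split().count(term.lower())

def tf_relative(term: str, document: str) -> float:
    """Relative frequency: raw count divided by the total number of tokens."""
    tokens = document.lower().split()
    return tokens.count(term.lower()) / len(tokens)

def tf_log(term: str, document: str) -> float:
    """Log-scaled term frequency: log10(1 + f_{t,d})."""
    return math.log10(1 + raw_count(term, document))

print(tf_relative("the", corpus[0]))  # 3 / 11 ≈ 0.273
print(tf_log("the", corpus[0]))       # log10(1 + 3) ≈ 0.602
```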
Inverse Document Frequency (IDF)
Define the document frequency \(\operatorname{df}_{t}\) as the number of documents in the corpus that contain the term \(t\):

\[
\operatorname{df}_{t} = \left|\{d \in \mathcal{S} \quad \text{s.t.} \quad t \in d\}\right|
\]
The inverse document frequency is a measure of how much information the word provides, i.e., whether it is common or rare across all documents. It is the logarithmically scaled inverse fraction of the documents that contain the word (obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient):

\[
\operatorname{idf}_{t} = \log_{10}\left(\frac{D}{\operatorname{df}_{t}}\right)
\]
with

- \(D = \left|\mathcal{S}\right|\) : the total number of documents in the corpus;
- \(\operatorname{df}_{t} = |\{d \in \mathcal{S} \quad \text{s.t.} \quad t \in d\}|\) : the number of documents in which the term \(t\) appears.

Note, however, that the term \(t\) comes from the query and hence may or may not appear in the existing training corpus. If it does not, then \(\operatorname{df}_{t}=0\), causing a division by zero. Hence, a smoothed variant of the inverse document frequency is used:

\[
\operatorname{idf}_{t} = \log\left(\frac{1 + D}{1 + \operatorname{df}_{t}}\right) + 1
\]
which is the implementation used by scikit-learn’s `TfidfVectorizer` (with its default `smooth_idf=True`). Note also that scikit-learn uses the natural logarithm \(\log_{e}\) rather than \(\log_{10}\).
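Below is a minimal sketch of the document frequency and the two idf variants discussed above, reusing the `corpus` from earlier and the same simple tokenization (both assumptions for illustration):

```python
import math

def doc_frequency(term: str, documents: list[str]) -> int:
    """df_t: number of documents that contain `term`."""
    return sum(1 for document in documents if term.lower() in document.lower().split())

def idf_plain(term: str, documents: list[str]) -> float:
    """idf_t = log10(D / df_t); breaks (division by zero) when df_t == 0."""
    return math.log10(len(documents) / doc_frequency(term, documents))

def idf_smoothed(term: str, documents: list[str]) -> float:
    """Smoothed variant: log((1 + D) / (1 + df_t)) + 1, natural logarithm."""
    return math.log((1 + len(documents)) / (1 + doc_frequency(term, documents))) + 1

print(doc_frequency("the", corpus))     # 5 (appears in every document)
print(idf_plain("the", corpus))         # log10(5 / 5) = 0.0
print(idf_smoothed("chicken", corpus))  # log(6 / 1) + 1 ≈ 2.79
```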
Term Frequency-Inverse Document Frequency (TF-IDF)
The tf-idf weighted value of a term \(t\) in a document \(d\) is simply the product of the term frequency and the inverse document frequency, which we denote as \(w_{t, d}\):

\[
w_{t, d} = \operatorname{tf}_{t, d} \times \operatorname{idf}_{t}
\]
Intuitively, if \(D=10\) (say, \(10\) documents) and the word “the” appears in all \(10\) documents, then \(\operatorname{df}_{t} = 10\) and \(\operatorname{idf}_{t} = \log(10/10) = \log(1) = 0\). This leads to a weight \(w_{t, d} = \operatorname{tf}_{t, d} \times 0 = 0\) for every document: the word “the” is too common, its idf weight is \(0\), and the word is essentially ignored.
On the other hand, if the word “chicken” appears in only \(1\) document, then \(\operatorname{df}_{t} = 1\) and \(\operatorname{idf}_{t} = \log(10/1) = \log(10) \approx 2.3\) (using the natural logarithm). If its raw count in that document is \(1\), it carries the much higher weight \(w_{t, d} = 1 \times 2.3 = 2.3\).
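To tie things together, here is a hedged sketch that uses scikit-learn’s `TfidfVectorizer` to vectorize the `corpus` from earlier and rank the documents by cosine similarity against the `query`; the exact scores depend on the vectorizer’s defaults (smoothed idf, natural logarithm, L2 normalization), so treat the numbers as illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Fit TF-IDF on the corpus, then project the query into the same vector space.
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(corpus)  # shape: (5, vocabulary size)
query_vector = vectorizer.transform([query])    # shape: (1, vocabulary size)

# Rank documents by cosine similarity to the query.
scores = cosine_similarity(query_vector, doc_vectors).ravel()
for name, score in sorted(zip(corpus_names, scores), key=lambda pair: pair[1], reverse=True):
    print(f"{name}: {score:.3f}")
```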
Justification and Link with Information Theory
See Wikipedia.
Conceptual Questions
Why do we use logarithmically scaled term frequency?
The log transformation “squashes” the term frequencies so that very high-frequency terms do not disproportionately dominate the weighting of a document. Under the logarithm, differences between large counts matter less, while the distinction between zero and non-zero term frequency is preserved. This compresses the range of values, giving a better-distributed set of weights across terms and preventing any single term from dominating the analysis.
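As a quick numerical illustration of this squashing effect, using the \(\log_{10}(1 + f)\) scaling from above:

```python
import math

# Raw counts that differ by orders of magnitude are compressed into a small range.
for f in [0, 1, 2, 10, 100, 1000]:
    print(f, round(math.log10(1 + f), 3))
# 0 -> 0.0, 1 -> 0.301, 2 -> 0.477, 10 -> 1.041, 100 -> 2.004, 1000 -> 3.0
```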