Implementation#
Let’s implement TF-IDF from scratch.
from collections import OrderedDict
import numpy as np
import pandas as pd
from rich.pretty import pprint
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
Defining the Corpus#
We use a simple corpus with 5 documents, defined under the variable corpus.
corpus = [
    "The sun is the largest celestial body in the solar system",
    "The solar system consists of the sun and eight revolving planets",
    "Ra was the Egyptian Sun God",
    "The Pyramids were the pinnacle of Egyptian architecture",
    "The quick brown fox jumps over the lazy dog",
]
corpus_names = ["doc_1", "doc_2", "doc_3", "doc_4", "doc_5"]
num_documents = len(corpus)
vocabulary = " ".join(corpus).split() # split the corpus into words
vocabulary = [word.lower() for word in vocabulary] # lower case
vocabulary = list(set(vocabulary)) # unique vocabulary
vocabulary.sort() # sort alphabetically
num_vocabs = len(vocabulary)
print(f"Number of documents: {num_documents}")
print(f"Number of unique words: {num_vocabs}")
Number of documents: 5
Number of unique words: 30
So far, we have performed lowercasing as the only preprocessing step.
The variables num_documents and num_vocabs hold the number of documents and the number of unique words in the corpus, respectively.
Scikit-Learn’s Implementation#
Let’s see how scikit-learn implements TF-IDF by running it on the corpus we defined above.
pipe = Pipeline(
    [
        ("count", CountVectorizer(vocabulary=vocabulary)),
        ("tfid", TfidfTransformer(norm=None, smooth_idf=False, sublinear_tf=False)),
    ]
).fit(corpus)
tfidf_matrix = pipe.transform(corpus)
print(f"Shape of tfidf matrix: {tfidf_matrix.shape}")
print(tfidf_matrix.toarray().T)
Shape of tfidf matrix: (5, 30)
[[0. 2.60943791 0. 0. 0. ]
[0. 0. 0. 2.60943791 0. ]
[2.60943791 0. 0. 0. 0. ]
[0. 0. 0. 0. 2.60943791]
[2.60943791 0. 0. 0. 0. ]
[0. 2.60943791 0. 0. 0. ]
[0. 0. 0. 0. 2.60943791]
[0. 0. 1.91629073 1.91629073 0. ]
[0. 2.60943791 0. 0. 0. ]
[0. 0. 0. 0. 2.60943791]
[0. 0. 2.60943791 0. 0. ]
[2.60943791 0. 0. 0. 0. ]
[2.60943791 0. 0. 0. 0. ]
[0. 0. 0. 0. 2.60943791]
[2.60943791 0. 0. 0. 0. ]
[0. 0. 0. 0. 2.60943791]
[0. 1.91629073 0. 1.91629073 0. ]
[0. 0. 0. 0. 2.60943791]
[0. 0. 0. 2.60943791 0. ]
[0. 2.60943791 0. 0. 0. ]
[0. 0. 0. 2.60943791 0. ]
[0. 0. 0. 0. 2.60943791]
[0. 0. 2.60943791 0. 0. ]
[0. 2.60943791 0. 0. 0. ]
[1.91629073 1.91629073 0. 0. 0. ]
[1.51082562 1.51082562 1.51082562 0. 0. ]
[1.91629073 1.91629073 0. 0. 0. ]
[3. 2. 1. 2. 2. ]
[0. 0. 2.60943791 0. 0. ]
[0. 0. 0. 2.60943791 0. ]]
This code defines a pipeline that applies two scikit-learn transformers to a corpus of
text documents, and then fits the pipeline to the corpus. The first
transformer is CountVectorizer, which converts the text into a matrix
of token counts. The vocabulary argument specifies the set of tokens to
use. The second transformer is TfidfTransformer, which applies term
frequency-inverse document frequency (TF-IDF) weighting to the count
matrix.
The norm, smooth_idf, and sublinear_tf parameters of
TfidfTransformer control the normalization of the document vectors, the
smoothing of the IDF values, and the sublinear scaling of the term
frequencies. Setting norm to None means the document vectors are not
rescaled to unit length, so the raw TF-IDF weights are kept as-is.
Setting smooth_idf and sublinear_tf to False disables IDF smoothing and
the logarithmic scaling of the term frequencies, which makes the output
straightforward to reproduce by hand.
The fitted pipeline is then applied to the corpus to obtain a TF-IDF
matrix, which is stored in tfidf_matrix. The shape attribute of
tfidf_matrix gives the dimensions of the matrix, which is a sparse
matrix with the number of rows equal to the number of documents in the
corpus and the number of columns equal to the number of unique tokens in
the vocabulary. The toarray() method is used to convert the sparse
matrix to a dense matrix, and the T attribute is used to transpose the
matrix so that the rows correspond to tokens and the columns correspond
to documents. The resulting matrix is printed to the console.
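As a side note, had we kept scikit-learn's defaults (norm="l2" and smooth_idf=True), each document vector would additionally be rescaled to unit length. The short sketch below, reusing the corpus and vocabulary defined above, illustrates that default normalization (default_pipe and normalized are just illustrative names):
# Illustrative sketch, not part of the original pipeline: with the defaults
# norm="l2" and smooth_idf=True, every document vector is rescaled to unit length.
default_pipe = Pipeline(
    [
        ("count", CountVectorizer(vocabulary=vocabulary)),
        ("tfid", TfidfTransformer()),  # defaults: norm="l2", smooth_idf=True, sublinear_tf=False
    ]
).fit(corpus)
normalized = default_pipe.transform(corpus).toarray()
print(np.linalg.norm(normalized, axis=1))  # every row has L2 norm 1.0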
count = pipe["count"].transform(corpus)
print(f"Count Matrix: {count.shape}")
print(count.toarray().T)
Count Matrix: (5, 30)
[[0 1 0 0 0]
[0 0 0 1 0]
[1 0 0 0 0]
[0 0 0 0 1]
[1 0 0 0 0]
[0 1 0 0 0]
[0 0 0 0 1]
[0 0 1 1 0]
[0 1 0 0 0]
[0 0 0 0 1]
[0 0 1 0 0]
[1 0 0 0 0]
[1 0 0 0 0]
[0 0 0 0 1]
[1 0 0 0 0]
[0 0 0 0 1]
[0 1 0 1 0]
[0 0 0 0 1]
[0 0 0 1 0]
[0 1 0 0 0]
[0 0 0 1 0]
[0 0 0 0 1]
[0 0 1 0 0]
[0 1 0 0 0]
[1 1 0 0 0]
[1 1 1 0 0]
[1 1 0 0 0]
[3 2 1 2 2]
[0 0 1 0 0]
[0 0 0 1 0]]
This code retrieves the count matrix and the IDF vector that were generated by the pipeline for the corpus.
The first part applies the CountVectorizer transformer to the corpus
using the transform() method of the count step in the pipeline, which
returns a sparse matrix of token counts. The sparse matrix is assigned to
the variable count, and toarray() is called only when printing it as a
dense array.
The second part retrieves the IDF values that were computed by the
TfidfTransformer transformer through its idf_ attribute, which returns
a one-dimensional array with one IDF value per token in the vocabulary.
The IDF vector is assigned to the variable idf_vec.
idf_vec = pipe["tfid"].idf_
print(f"IDF Vector: {idf_vec.shape}")
print(idf_vec.T)
IDF Vector: (30,)
[2.60943791 2.60943791 2.60943791 2.60943791 2.60943791 2.60943791
2.60943791 1.91629073 2.60943791 2.60943791 2.60943791 2.60943791
2.60943791 2.60943791 2.60943791 2.60943791 1.91629073 2.60943791
2.60943791 2.60943791 2.60943791 2.60943791 2.60943791 2.60943791
1.91629073 1.51082562 1.91629073 1. 2.60943791 2.60943791]
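Notice that the IDF vector takes only a handful of distinct values, one per document frequency. Assuming scikit-learn's smooth_idf=False formula, \(\text{idf}_t = \ln(n / \text{df}_t) + 1\), we can reproduce these values directly (this small check is not part of the original pipeline):
# Words appearing in 1, 2, 3, or all 5 documents map to distinct IDF values.
# With smooth_idf=False, scikit-learn computes idf = ln(n / df) + 1.
print(np.log(num_documents / np.array([1, 2, 3, 5])) + 1)
# approximately [2.609, 1.916, 1.511, 1.0], matching the values in idf_vec above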
Let’s try to implement it ourselves.
Implementing TF-IDF#
Getting the Count Matrix#
The code below creates an ordered dictionary word_freq to store the frequency of each word in each document in the corpus. The structure of the dictionary is {word: {doc_1: freq, doc_2: freq, ...}, ...}.
- The code loops through each word in the vocabulary and, if the word is not already in the word_freq dictionary, adds it.
- It then loops through each document and, if the document name is not already in the word_freq[word] dictionary, adds it with a count of zero.
- Finally, it counts the number of times the word appears in the document using the count() method on the list of lowercased tokens, and adds this count to the corresponding entry in the word_freq dictionary.
- The pprint() function from rich.pretty is then used to print the dictionary word_freq in a readable format.
# Create an ordered dictionary to store the frequency of each word in each document in the corpus
word_freq = OrderedDict() # {word: {doc_1: freq, doc_2: freq, ...}, ...}
# Loop through each word in the vocabulary
for word in vocabulary:
    # If the word is not in the word_freq dictionary, add it
    if word not in word_freq:
        word_freq[word] = OrderedDict()
    # Loop through each document and count the number of times the word appears in the document
    for doc, doc_name in zip(corpus, corpus_names):
        # If the document name is not in the word_freq[word] dictionary, add it
        if doc_name not in word_freq[word]:
            word_freq[word][doc_name] = 0
        # Count the number of times the word appears in the document using the count() method
        word_freq[word][doc_name] += doc.lower().split().count(word)
# Print the word_freq dictionary using the pprint function for pretty printing
pprint(dict(word_freq.items()))
{
│   'and': OrderedDict([('doc_1', 0), ('doc_2', 1), ('doc_3', 0), ('doc_4', 0), ('doc_5', 0)]),
│   'architecture': OrderedDict([('doc_1', 0), ('doc_2', 0), ('doc_3', 0), ('doc_4', 1), ('doc_5', 0)]),
│   'body': OrderedDict([('doc_1', 1), ('doc_2', 0), ('doc_3', 0), ('doc_4', 0), ('doc_5', 0)]),
│   'brown': OrderedDict([('doc_1', 0), ('doc_2', 0), ('doc_3', 0), ('doc_4', 0), ('doc_5', 1)]),
│   'celestial': OrderedDict([('doc_1', 1), ('doc_2', 0), ('doc_3', 0), ('doc_4', 0), ('doc_5', 0)]),
│   'consists': OrderedDict([('doc_1', 0), ('doc_2', 1), ('doc_3', 0), ('doc_4', 0), ('doc_5', 0)]),
│   'dog': OrderedDict([('doc_1', 0), ('doc_2', 0), ('doc_3', 0), ('doc_4', 0), ('doc_5', 1)]),
│   'egyptian': OrderedDict([('doc_1', 0), ('doc_2', 0), ('doc_3', 1), ('doc_4', 1), ('doc_5', 0)]),
│   'eight': OrderedDict([('doc_1', 0), ('doc_2', 1), ('doc_3', 0), ('doc_4', 0), ('doc_5', 0)]),
│   'fox': OrderedDict([('doc_1', 0), ('doc_2', 0), ('doc_3', 0), ('doc_4', 0), ('doc_5', 1)]),
│   'god': OrderedDict([('doc_1', 0), ('doc_2', 0), ('doc_3', 1), ('doc_4', 0), ('doc_5', 0)]),
│   'in': OrderedDict([('doc_1', 1), ('doc_2', 0), ('doc_3', 0), ('doc_4', 0), ('doc_5', 0)]),
│   'is': OrderedDict([('doc_1', 1), ('doc_2', 0), ('doc_3', 0), ('doc_4', 0), ('doc_5', 0)]),
│   'jumps': OrderedDict([('doc_1', 0), ('doc_2', 0), ('doc_3', 0), ('doc_4', 0), ('doc_5', 1)]),
│   'largest': OrderedDict([('doc_1', 1), ('doc_2', 0), ('doc_3', 0), ('doc_4', 0), ('doc_5', 0)]),
│   'lazy': OrderedDict([('doc_1', 0), ('doc_2', 0), ('doc_3', 0), ('doc_4', 0), ('doc_5', 1)]),
│   'of': OrderedDict([('doc_1', 0), ('doc_2', 1), ('doc_3', 0), ('doc_4', 1), ('doc_5', 0)]),
│   'over': OrderedDict([('doc_1', 0), ('doc_2', 0), ('doc_3', 0), ('doc_4', 0), ('doc_5', 1)]),
│   'pinnacle': OrderedDict([('doc_1', 0), ('doc_2', 0), ('doc_3', 0), ('doc_4', 1), ('doc_5', 0)]),
│   'planets': OrderedDict([('doc_1', 0), ('doc_2', 1), ('doc_3', 0), ('doc_4', 0), ('doc_5', 0)]),
│   'pyramids': OrderedDict([('doc_1', 0), ('doc_2', 0), ('doc_3', 0), ('doc_4', 1), ('doc_5', 0)]),
│   'quick': OrderedDict([('doc_1', 0), ('doc_2', 0), ('doc_3', 0), ('doc_4', 0), ('doc_5', 1)]),
│   'ra': OrderedDict([('doc_1', 0), ('doc_2', 0), ('doc_3', 1), ('doc_4', 0), ('doc_5', 0)]),
│   'revolving': OrderedDict([('doc_1', 0), ('doc_2', 1), ('doc_3', 0), ('doc_4', 0), ('doc_5', 0)]),
│   'solar': OrderedDict([('doc_1', 1), ('doc_2', 1), ('doc_3', 0), ('doc_4', 0), ('doc_5', 0)]),
│   'sun': OrderedDict([('doc_1', 1), ('doc_2', 1), ('doc_3', 1), ('doc_4', 0), ('doc_5', 0)]),
│   'system': OrderedDict([('doc_1', 1), ('doc_2', 1), ('doc_3', 0), ('doc_4', 0), ('doc_5', 0)]),
│   'the': OrderedDict([('doc_1', 3), ('doc_2', 2), ('doc_3', 1), ('doc_4', 2), ('doc_5', 2)]),
│   'was': OrderedDict([('doc_1', 0), ('doc_2', 0), ('doc_3', 1), ('doc_4', 0), ('doc_5', 0)]),
│   'were': OrderedDict([('doc_1', 0), ('doc_2', 0), ('doc_3', 0), ('doc_4', 1), ('doc_5', 0)])
}
Convert the dictionary word_freq to a Pandas DataFrame. The T attribute is used to transpose the DataFrame, so that the document names become the column names and the words become the row names.
df = pd.DataFrame(word_freq).fillna(0).astype(int).T
df
| doc_1 | doc_2 | doc_3 | doc_4 | doc_5 | |
|---|---|---|---|---|---|
| and | 0 | 1 | 0 | 0 | 0 |
| architecture | 0 | 0 | 0 | 1 | 0 |
| body | 1 | 0 | 0 | 0 | 0 |
| brown | 0 | 0 | 0 | 0 | 1 |
| celestial | 1 | 0 | 0 | 0 | 0 |
| consists | 0 | 1 | 0 | 0 | 0 |
| dog | 0 | 0 | 0 | 0 | 1 |
| egyptian | 0 | 0 | 1 | 1 | 0 |
| eight | 0 | 1 | 0 | 0 | 0 |
| fox | 0 | 0 | 0 | 0 | 1 |
| god | 0 | 0 | 1 | 0 | 0 |
| in | 1 | 0 | 0 | 0 | 0 |
| is | 1 | 0 | 0 | 0 | 0 |
| jumps | 0 | 0 | 0 | 0 | 1 |
| largest | 1 | 0 | 0 | 0 | 0 |
| lazy | 0 | 0 | 0 | 0 | 1 |
| of | 0 | 1 | 0 | 1 | 0 |
| over | 0 | 0 | 0 | 0 | 1 |
| pinnacle | 0 | 0 | 0 | 1 | 0 |
| planets | 0 | 1 | 0 | 0 | 0 |
| pyramids | 0 | 0 | 0 | 1 | 0 |
| quick | 0 | 0 | 0 | 0 | 1 |
| ra | 0 | 0 | 1 | 0 | 0 |
| revolving | 0 | 1 | 0 | 0 | 0 |
| solar | 1 | 1 | 0 | 0 | 0 |
| sun | 1 | 1 | 1 | 0 | 0 |
| system | 1 | 1 | 0 | 0 | 0 |
| the | 3 | 2 | 1 | 2 | 2 |
| was | 0 | 0 | 1 | 0 | 0 |
| were | 0 | 0 | 0 | 1 | 0 |
Convert the DataFrame to a NumPy array, essentially turning the corpus into a count matrix of shape \(n_{words} \times n_{docs}\).
X = df.values
print(f"Word frequency array shape: {X.shape}")
Word frequency array shape: (30, 5)
Sanity Check: We check our implementation by comparing the result to the output of the CountVectorizer class from the sklearn.feature_extraction.text module. Note that CountVectorizer returns a sparse matrix, so we need to convert it to a dense matrix using the toarray() method. Furthermore, that matrix has shape \(n_{docs} \times n_{words}\), so we transpose it using the T attribute.
np.testing.assert_array_equal(
X, count.toarray().T, err_msg="Our count vectorizer is wrong...?")
We have successfully implemented the count matrix. While this method is not particularly efficient, it is what comes to mind first. In practice, one could skip the dictionary step entirely and fill a matrix of shape \(n_{words} \times n_{docs}\) with the counts directly, as sketched below.
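A minimal sketch of that more direct route (word_to_index and X_direct are helpers introduced here purely for illustration):
# Illustrative alternative: fill a (num_vocabs, num_documents) matrix with counts directly.
word_to_index = {word: index for index, word in enumerate(vocabulary)}
X_direct = np.zeros((num_vocabs, num_documents), dtype=int)
for doc_index, doc in enumerate(corpus):
    for token in doc.lower().split():
        X_direct[word_to_index[token], doc_index] += 1
np.testing.assert_array_equal(X_direct, X)  # agrees with the dictionary-based count matrix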
Getting the TF Matrix#
Create X_tf, which holds the term frequency value for each word in each document:

\[ \operatorname{tf}_{t, d} = \log_{10}(\text{count}_{t, d} + 1) \]

- Create a new numpy array X_tf with the same shape and data type as X, but filled with uninitialized floats.
- Loop through each row and its index in X using the enumerate() function.
- Calculate the term frequency of each word using tf = log10(each_row + 1), where each_row holds the counts of word \(t\) across all documents \(d\), so we can apply np.log10() directly.
- Assign the calculated term frequency values to the corresponding row in X_tf using row_index.
X_tf = np.empty_like(X, dtype=float)
for row_index, each_row in enumerate(X):
    tf = np.log10(each_row + 1)
    X_tf[row_index] = tf
print(f"Term frequency array shape: {X_tf.shape}")
print(X_tf)
Term frequency array shape: (30, 5)
[[0. 0.30103 0. 0. 0. ]
[0. 0. 0. 0.30103 0. ]
[0.30103 0. 0. 0. 0. ]
[0. 0. 0. 0. 0.30103 ]
[0.30103 0. 0. 0. 0. ]
[0. 0.30103 0. 0. 0. ]
[0. 0. 0. 0. 0.30103 ]
[0. 0. 0.30103 0.30103 0. ]
[0. 0.30103 0. 0. 0. ]
[0. 0. 0. 0. 0.30103 ]
[0. 0. 0.30103 0. 0. ]
[0.30103 0. 0. 0. 0. ]
[0.30103 0. 0. 0. 0. ]
[0. 0. 0. 0. 0.30103 ]
[0.30103 0. 0. 0. 0. ]
[0. 0. 0. 0. 0.30103 ]
[0. 0.30103 0. 0.30103 0. ]
[0. 0. 0. 0. 0.30103 ]
[0. 0. 0. 0.30103 0. ]
[0. 0.30103 0. 0. 0. ]
[0. 0. 0. 0.30103 0. ]
[0. 0. 0. 0. 0.30103 ]
[0. 0. 0.30103 0. 0. ]
[0. 0.30103 0. 0. 0. ]
[0.30103 0.30103 0. 0. 0. ]
[0.30103 0.30103 0.30103 0. 0. ]
[0.30103 0.30103 0. 0. 0. ]
[0.60205999 0.47712125 0.30103 0.47712125 0.47712125]
[0. 0. 0.30103 0. 0. ]
[0. 0. 0. 0.30103 0. ]]
Creating the IDF vector#
Create X_idf, which holds the inverse document frequency value for each word in the corpus. To match scikit-learn with smooth_idf=False, we use the natural logarithm and add 1 so that words appearing in every document do not end up with an IDF of zero:

\[ \text{idf}_{t} = \ln\left(\frac{n_{docs}}{\text{df}_{t}}\right) + 1 \]

- Create a new numpy array X_idf of length num_vocabs, filled with zeros.
- Loop through each row and its index in X using the enumerate() function.
- Calculate the document frequency df of the word as the number of documents with a non-zero count in that row, then compute idf = log(num_documents / df) + 1.
- Assign the calculated inverse document frequency value to the corresponding entry in X_idf using row_index.
X_idf = np.zeros(num_vocabs, dtype=float)
for row_index, each_row in enumerate(X):
    df = np.count_nonzero(each_row)  # document frequency: how many documents contain this word
    idf = np.log(num_documents / df) + 1  # e.g. a word in 2 of 5 documents gets ln(5/2) + 1 ≈ 1.916
    X_idf[row_index] = idf
print(f"IDF array shape: {X_idf.shape}")
print(X_idf)
IDF array shape: (30,)
[2.60943791 2.60943791 2.60943791 2.60943791 2.60943791 2.60943791
2.60943791 1.91629073 2.60943791 2.60943791 2.60943791 2.60943791
2.60943791 2.60943791 2.60943791 2.60943791 1.91629073 2.60943791
2.60943791 2.60943791 2.60943791 2.60943791 2.60943791 2.60943791
1.91629073 1.51082562 1.91629073 1. 2.60943791 2.60943791]
Sanity check!
np.testing.assert_allclose(
X_idf, idf_vec, rtol=1e-3, err_msg="Our idf_vec is not correct...?"
)
TF-IDF Matrix#
We broadcast X_idf to the shape of X_tf and multiply the two matrices element-wise to get the TF-IDF matrix.
X_tfidf = X_tf * X_idf.reshape(-1, 1)
X_tfidf.shape, X_tfidf
((30, 5),
array([[0. , 0.78551908, 0. , 0. , 0. ],
[0. , 0. , 0. , 0.78551908, 0. ],
[0.78551908, 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0.78551908],
[0.78551908, 0. , 0. , 0. , 0. ],
[0. , 0.78551908, 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0.78551908],
[0. , 0. , 0.57686099, 0.57686099, 0. ],
[0. , 0.78551908, 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0.78551908],
[0. , 0. , 0.78551908, 0. , 0. ],
[0.78551908, 0. , 0. , 0. , 0. ],
[0.78551908, 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0.78551908],
[0.78551908, 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0.78551908],
[0. , 0.57686099, 0. , 0.57686099, 0. ],
[0. , 0. , 0. , 0. , 0.78551908],
[0. , 0. , 0. , 0.78551908, 0. ],
[0. , 0.78551908, 0. , 0. , 0. ],
[0. , 0. , 0. , 0.78551908, 0. ],
[0. , 0. , 0. , 0. , 0.78551908],
[0. , 0. , 0.78551908, 0. , 0. ],
[0. , 0.78551908, 0. , 0. , 0. ],
[0.57686099, 0.57686099, 0. , 0. , 0. ],
[0.45480383, 0.45480383, 0.45480383, 0. , 0. ],
[0.57686099, 0.57686099, 0. , 0. , 0. ],
[0.60205999, 0.47712125, 0.30103 , 0.47712125, 0.47712125],
[0. , 0. , 0.78551908, 0. , 0. ],
[0. , 0. , 0. , 0.78551908, 0. ]]))
If you run the sanity check against tfidf_matrix now, it will fail.
The outcome differs because scikit-learn uses a different formula for the term frequency: with sublinear_tf=False, it uses the raw count matrix directly instead of compressing the counts to a log scale.
Let’s replace X_tf with X and see.
X_tfidf = X * X_idf.reshape(-1, 1)
X_tfidf.shape, X_tfidf
((30, 5),
array([[0. , 2.60943791, 0. , 0. , 0. ],
[0. , 0. , 0. , 2.60943791, 0. ],
[2.60943791, 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 2.60943791],
[2.60943791, 0. , 0. , 0. , 0. ],
[0. , 2.60943791, 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 2.60943791],
[0. , 0. , 1.91629073, 1.91629073, 0. ],
[0. , 2.60943791, 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 2.60943791],
[0. , 0. , 2.60943791, 0. , 0. ],
[2.60943791, 0. , 0. , 0. , 0. ],
[2.60943791, 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 2.60943791],
[2.60943791, 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 2.60943791],
[0. , 1.91629073, 0. , 1.91629073, 0. ],
[0. , 0. , 0. , 0. , 2.60943791],
[0. , 0. , 0. , 2.60943791, 0. ],
[0. , 2.60943791, 0. , 0. , 0. ],
[0. , 0. , 0. , 2.60943791, 0. ],
[0. , 0. , 0. , 0. , 2.60943791],
[0. , 0. , 2.60943791, 0. , 0. ],
[0. , 2.60943791, 0. , 0. , 0. ],
[1.91629073, 1.91629073, 0. , 0. , 0. ],
[1.51082562, 1.51082562, 1.51082562, 0. , 0. ],
[1.91629073, 1.91629073, 0. , 0. , 0. ],
[3. , 2. , 1. , 2. , 2. ],
[0. , 0. , 2.60943791, 0. , 0. ],
[0. , 0. , 0. , 2.60943791, 0. ]]))
np.testing.assert_allclose(
X_tfidf,
tfidf_matrix.toarray().T,
rtol=1e-3,
err_msg="Our implementation of tfidf is not correct...?"
)
The results now match.
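As a final aside, scikit-learn’s sublinear_tf=True option also log-scales the term frequencies, but it replaces each non-zero count with 1 + ln(count), which is not the same as the log10(count + 1) used in our X_tf, so even that variant would not reproduce our earlier log-scaled matrix exactly. A quick illustrative sketch (sublinear_pipe is just an illustrative name):
# Illustrative only: sublinear_tf=True replaces non-zero counts with 1 + ln(count).
sublinear_pipe = Pipeline(
    [
        ("count", CountVectorizer(vocabulary=vocabulary)),
        ("tfid", TfidfTransformer(norm=None, smooth_idf=False, sublinear_tf=True)),
    ]
).fit(corpus)
print(sublinear_pipe.transform(corpus).toarray().T[vocabulary.index("the")])  # row for "the"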