Implementation#
Let’s implement TF-IDF from scratch.
from collections import OrderedDict
import numpy as np
import pandas as pd
from rich.pretty import pprint
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
Defining the Corpus#
We use a simple corpus of 5 documents, defined in the variable `corpus`.
corpus = [
"The sun is the largest celestial body in the solar system",
"The solar system consists of the sun and eight revolving planets",
"Ra was the Egyptian Sun God",
"The Pyramids were the pinnacle of Egyptian architecture",
"The quick brown fox jumps over the lazy dog",
]
corpus_names = ["doc_1", "doc_2", "doc_3", "doc_4", "doc_5"]
num_documents = len(corpus)
vocabulary = " ".join(corpus).split() # split the corpus into words
vocabulary = [word.lower() for word in vocabulary] # lower case
vocabulary = list(set(vocabulary)) # unique vocabulary
vocabulary.sort() # sort alphabetically
num_vocabs = len(vocabulary)
print(f"Number of documents: {num_documents}")
print(f"Number of unique words: {num_vocabs}")
Number of documents: 5
Number of unique words: 30
So far, we have performed lowercasing as the only preprocessing step.
The variables `num_documents` and `num_vocabs` hold the number of documents and the number of unique words in the corpus, respectively.
Scikit-Learn’s Implementation#
Let’s see how `scikit-learn` implements TF-IDF by running it on the corpus we defined above.
pipe = Pipeline(
[
("count", CountVectorizer(vocabulary=vocabulary)),
("tfid", TfidfTransformer(norm=None, smooth_idf=False, sublinear_tf=False)),
]
).fit(corpus)
tfidf_matrix = pipe.transform(corpus)
print(f"Shape of tfidf matrix: {tfidf_matrix.shape}")
print(tfidf_matrix.toarray().T)
Shape of tfidf matrix: (5, 30)
[[0. 2.60943791 0. 0. 0. ]
[0. 0. 0. 2.60943791 0. ]
[2.60943791 0. 0. 0. 0. ]
[0. 0. 0. 0. 2.60943791]
[2.60943791 0. 0. 0. 0. ]
[0. 2.60943791 0. 0. 0. ]
[0. 0. 0. 0. 2.60943791]
[0. 0. 1.91629073 1.91629073 0. ]
[0. 2.60943791 0. 0. 0. ]
[0. 0. 0. 0. 2.60943791]
[0. 0. 2.60943791 0. 0. ]
[2.60943791 0. 0. 0. 0. ]
[2.60943791 0. 0. 0. 0. ]
[0. 0. 0. 0. 2.60943791]
[2.60943791 0. 0. 0. 0. ]
[0. 0. 0. 0. 2.60943791]
[0. 1.91629073 0. 1.91629073 0. ]
[0. 0. 0. 0. 2.60943791]
[0. 0. 0. 2.60943791 0. ]
[0. 2.60943791 0. 0. 0. ]
[0. 0. 0. 2.60943791 0. ]
[0. 0. 0. 0. 2.60943791]
[0. 0. 2.60943791 0. 0. ]
[0. 2.60943791 0. 0. 0. ]
[1.91629073 1.91629073 0. 0. 0. ]
[1.51082562 1.51082562 1.51082562 0. 0. ]
[1.91629073 1.91629073 0. 0. 0. ]
[3. 2. 1. 2. 2. ]
[0. 0. 2.60943791 0. 0. ]
[0. 0. 0. 2.60943791 0. ]]
This code defines a pipeline that applies two `scikit-learn` transformers to a corpus of text documents and then fits the pipeline to the corpus. The first transformer is `CountVectorizer`, which converts the text into a matrix of token counts; its `vocabulary` argument specifies the set of tokens to use. The second transformer is `TfidfTransformer`, which applies term frequency-inverse document frequency (TF-IDF) weighting to the count matrix.
The `norm`, `smooth_idf`, and `sublinear_tf` parameters of `TfidfTransformer` control the normalization of the output vectors, the smoothing of the IDF values, and the sublinear scaling of the term frequencies. By setting `norm` to `None`, the transformer does not L2-normalize the document vectors, so the raw TF-IDF weights are kept. By setting `smooth_idf` and `sublinear_tf` to `False`, the transformer applies no smoothing to the IDF values and uses the raw counts as term frequencies.
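With these settings, each entry of the TF-IDF matrix is simply the raw count multiplied by an IDF term that uses the natural logarithm. As a rough check of this formula (a sketch of the documented behaviour, not scikit-learn's internals), take the word "sun", which appears once in doc_1 and occurs in 3 of the 5 documents:

# With norm=None, smooth_idf=False, sublinear_tf=False:
#   tfidf(t, d) = count(t, d) * (ln(n_docs / df(t)) + 1)
n_docs, df_sun, count_sun_in_doc1 = 5, 3, 1
idf_sun = np.log(n_docs / df_sun) + 1
print(count_sun_in_doc1 * idf_sun)  # ~1.5108, matching the "sun" row for doc_1 in the matrix above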
The fitted pipeline is then applied to the `corpus` to obtain a TF-IDF matrix, which is stored in `tfidf_matrix`. The `shape` attribute of `tfidf_matrix` gives the dimensions of the matrix: a sparse matrix with one row per document in the corpus and one column per unique token in the vocabulary. The `toarray()` method converts the sparse matrix to a dense matrix, and the `T` attribute transposes it so that rows correspond to tokens and columns correspond to documents. The resulting matrix is printed to the console.
count = pipe["count"].transform(corpus)
print(f"Count Matrix: {count.shape}")
print(count.toarray().T)
Count Matrix: (5, 30)
[[0 1 0 0 0]
[0 0 0 1 0]
[1 0 0 0 0]
[0 0 0 0 1]
[1 0 0 0 0]
[0 1 0 0 0]
[0 0 0 0 1]
[0 0 1 1 0]
[0 1 0 0 0]
[0 0 0 0 1]
[0 0 1 0 0]
[1 0 0 0 0]
[1 0 0 0 0]
[0 0 0 0 1]
[1 0 0 0 0]
[0 0 0 0 1]
[0 1 0 1 0]
[0 0 0 0 1]
[0 0 0 1 0]
[0 1 0 0 0]
[0 0 0 1 0]
[0 0 0 0 1]
[0 0 1 0 0]
[0 1 0 0 0]
[1 1 0 0 0]
[1 1 1 0 0]
[1 1 0 0 0]
[3 2 1 2 2]
[0 0 1 0 0]
[0 0 0 1 0]]
This code retrieves the count matrix and the IDF vector that the pipeline computed for the corpus.

The first part applies the `CountVectorizer` step to the corpus via the `transform()` method of the `count` step in the pipeline, which returns a sparse matrix of token counts. This sparse matrix is assigned to the variable `count`; `toarray()` is only called to convert it to a dense array for printing.

The second part retrieves the IDF values computed by the `TfidfTransformer` step from its `idf_` attribute, a one-dimensional array with one IDF value per token in the vocabulary. This vector is assigned to the variable `idf_vec`.
idf_vec = pipe["tfid"].idf_
print(f"IDF Vector: {idf_vec.shape}")
print(idf_vec.T)
IDF Vector: (30,)
[2.60943791 2.60943791 2.60943791 2.60943791 2.60943791 2.60943791
2.60943791 1.91629073 2.60943791 2.60943791 2.60943791 2.60943791
2.60943791 2.60943791 2.60943791 2.60943791 1.91629073 2.60943791
2.60943791 2.60943791 2.60943791 2.60943791 2.60943791 2.60943791
1.91629073 1.51082562 1.91629073 1. 2.60943791 2.60943791]
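To see which word each IDF value corresponds to, we can pair `idf_vec` with the sorted vocabulary (a small inspection sketch using the objects defined above):

idf_per_word = pd.Series(idf_vec, index=vocabulary)  # one IDF value per vocabulary word
print(idf_per_word.sort_values().head())  # "the" gets the lowest IDF (1.0) since it appears in every document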
Let’s try to implement it ourselves.
Implementing TF-IDF#
Getting the Count Matrix#
The code below creates an ordered dictionary `word_freq` to store the frequency of each word in each document of the corpus. The structure of the dictionary is `{word: {doc_1: freq, doc_2: freq, ...}, ...}`.

It loops through each word in the vocabulary and, if the word is not yet in `word_freq`, adds it. For each word it then loops through the documents and, if the document name is not yet in `word_freq[word]`, adds it. Finally, it counts how many times the word appears in the lowercased, split document using the `count()` method, and adds this count to the corresponding entry in `word_freq`.

The `pprint()` function from `rich.pretty` is then used to print `word_freq` in a readable format.
# Create an ordered dictionary to store the frequency of each word in each document in the corpus
word_freq = OrderedDict() # {word: {doc_1: freq, doc_2: freq, ...}, ...}
# Loop through each word in the vocabulary
for word in vocabulary:
# If the word is not in the word_freq dictionary, add it
if word not in word_freq:
word_freq[word] = OrderedDict()
# Loop through each document and count the number of times the word appears in the document
for doc, doc_name in zip(corpus, corpus_names):
# If the document name is not in the word_freq[word] dictionary, add it
if doc_name not in word_freq[word]:
word_freq[word][doc_name] = 0
# Count the number of times the word appears in the document using the count() method
word_freq[word][doc_name] += doc.lower().split().count(word)
# Print the word_freq dictionary using the pprint function for pretty printing
pprint(dict(word_freq.items()))
{
│   'and': OrderedDict([('doc_1', 0), ('doc_2', 1), ('doc_3', 0), ('doc_4', 0), ('doc_5', 0)]),
│   'architecture': OrderedDict([('doc_1', 0), ('doc_2', 0), ('doc_3', 0), ('doc_4', 1), ('doc_5', 0)]),
│   'body': OrderedDict([('doc_1', 1), ('doc_2', 0), ('doc_3', 0), ('doc_4', 0), ('doc_5', 0)]),
│   'brown': OrderedDict([('doc_1', 0), ('doc_2', 0), ('doc_3', 0), ('doc_4', 0), ('doc_5', 1)]),
│   'celestial': OrderedDict([('doc_1', 1), ('doc_2', 0), ('doc_3', 0), ('doc_4', 0), ('doc_5', 0)]),
│   'consists': OrderedDict([('doc_1', 0), ('doc_2', 1), ('doc_3', 0), ('doc_4', 0), ('doc_5', 0)]),
│   'dog': OrderedDict([('doc_1', 0), ('doc_2', 0), ('doc_3', 0), ('doc_4', 0), ('doc_5', 1)]),
│   'egyptian': OrderedDict([('doc_1', 0), ('doc_2', 0), ('doc_3', 1), ('doc_4', 1), ('doc_5', 0)]),
│   'eight': OrderedDict([('doc_1', 0), ('doc_2', 1), ('doc_3', 0), ('doc_4', 0), ('doc_5', 0)]),
│   'fox': OrderedDict([('doc_1', 0), ('doc_2', 0), ('doc_3', 0), ('doc_4', 0), ('doc_5', 1)]),
│   'god': OrderedDict([('doc_1', 0), ('doc_2', 0), ('doc_3', 1), ('doc_4', 0), ('doc_5', 0)]),
│   'in': OrderedDict([('doc_1', 1), ('doc_2', 0), ('doc_3', 0), ('doc_4', 0), ('doc_5', 0)]),
│   'is': OrderedDict([('doc_1', 1), ('doc_2', 0), ('doc_3', 0), ('doc_4', 0), ('doc_5', 0)]),
│   'jumps': OrderedDict([('doc_1', 0), ('doc_2', 0), ('doc_3', 0), ('doc_4', 0), ('doc_5', 1)]),
│   'largest': OrderedDict([('doc_1', 1), ('doc_2', 0), ('doc_3', 0), ('doc_4', 0), ('doc_5', 0)]),
│   'lazy': OrderedDict([('doc_1', 0), ('doc_2', 0), ('doc_3', 0), ('doc_4', 0), ('doc_5', 1)]),
│   'of': OrderedDict([('doc_1', 0), ('doc_2', 1), ('doc_3', 0), ('doc_4', 1), ('doc_5', 0)]),
│   'over': OrderedDict([('doc_1', 0), ('doc_2', 0), ('doc_3', 0), ('doc_4', 0), ('doc_5', 1)]),
│   'pinnacle': OrderedDict([('doc_1', 0), ('doc_2', 0), ('doc_3', 0), ('doc_4', 1), ('doc_5', 0)]),
│   'planets': OrderedDict([('doc_1', 0), ('doc_2', 1), ('doc_3', 0), ('doc_4', 0), ('doc_5', 0)]),
│   'pyramids': OrderedDict([('doc_1', 0), ('doc_2', 0), ('doc_3', 0), ('doc_4', 1), ('doc_5', 0)]),
│   'quick': OrderedDict([('doc_1', 0), ('doc_2', 0), ('doc_3', 0), ('doc_4', 0), ('doc_5', 1)]),
│   'ra': OrderedDict([('doc_1', 0), ('doc_2', 0), ('doc_3', 1), ('doc_4', 0), ('doc_5', 0)]),
│   'revolving': OrderedDict([('doc_1', 0), ('doc_2', 1), ('doc_3', 0), ('doc_4', 0), ('doc_5', 0)]),
│   'solar': OrderedDict([('doc_1', 1), ('doc_2', 1), ('doc_3', 0), ('doc_4', 0), ('doc_5', 0)]),
│   'sun': OrderedDict([('doc_1', 1), ('doc_2', 1), ('doc_3', 1), ('doc_4', 0), ('doc_5', 0)]),
│   'system': OrderedDict([('doc_1', 1), ('doc_2', 1), ('doc_3', 0), ('doc_4', 0), ('doc_5', 0)]),
│   'the': OrderedDict([('doc_1', 3), ('doc_2', 2), ('doc_3', 1), ('doc_4', 2), ('doc_5', 2)]),
│   'was': OrderedDict([('doc_1', 0), ('doc_2', 0), ('doc_3', 1), ('doc_4', 0), ('doc_5', 0)]),
│   'were': OrderedDict([('doc_1', 0), ('doc_2', 0), ('doc_3', 0), ('doc_4', 1), ('doc_5', 0)])
}
Convert the dictionary `word_freq` to a Pandas DataFrame. The `T` attribute is used to transpose the DataFrame, so that the document names become the column names and the words become the row names.
df = pd.DataFrame(word_freq).fillna(0).astype(int).T
df
 | doc_1 | doc_2 | doc_3 | doc_4 | doc_5 |
---|---|---|---|---|---|
and | 0 | 1 | 0 | 0 | 0 |
architecture | 0 | 0 | 0 | 1 | 0 |
body | 1 | 0 | 0 | 0 | 0 |
brown | 0 | 0 | 0 | 0 | 1 |
celestial | 1 | 0 | 0 | 0 | 0 |
consists | 0 | 1 | 0 | 0 | 0 |
dog | 0 | 0 | 0 | 0 | 1 |
egyptian | 0 | 0 | 1 | 1 | 0 |
eight | 0 | 1 | 0 | 0 | 0 |
fox | 0 | 0 | 0 | 0 | 1 |
god | 0 | 0 | 1 | 0 | 0 |
in | 1 | 0 | 0 | 0 | 0 |
is | 1 | 0 | 0 | 0 | 0 |
jumps | 0 | 0 | 0 | 0 | 1 |
largest | 1 | 0 | 0 | 0 | 0 |
lazy | 0 | 0 | 0 | 0 | 1 |
of | 0 | 1 | 0 | 1 | 0 |
over | 0 | 0 | 0 | 0 | 1 |
pinnacle | 0 | 0 | 0 | 1 | 0 |
planets | 0 | 1 | 0 | 0 | 0 |
pyramids | 0 | 0 | 0 | 1 | 0 |
quick | 0 | 0 | 0 | 0 | 1 |
ra | 0 | 0 | 1 | 0 | 0 |
revolving | 0 | 1 | 0 | 0 | 0 |
solar | 1 | 1 | 0 | 0 | 0 |
sun | 1 | 1 | 1 | 0 | 0 |
system | 1 | 1 | 0 | 0 | 0 |
the | 3 | 2 | 1 | 2 | 2 |
was | 0 | 0 | 1 | 0 | 0 |
were | 0 | 0 | 0 | 1 | 0 |
Convert the DataFrame to a NumPy array, turning the corpus into a matrix of shape \(n_{words} \times n_{docs}\).
X = df.values
print(f"Word frequency array shape: {X.shape}")
Word frequency array shape: (30, 5)
Sanity Check: We verify our implementation by comparing the result to the output of the `CountVectorizer` class from `sklearn.feature_extraction.text`. Note that `CountVectorizer` returns a sparse matrix, so we convert it to a dense matrix using the `toarray()` method. Furthermore, that matrix has shape \(n_{docs} \times n_{words}\), so we transpose it using the `T` attribute.
np.testing.assert_array_equal(
X, count.toarray().T, err_msg="Our count vectorizer is wrong...?")
We have successfully implemented the count matrix. The method above is not very efficient, but it is what comes to mind first. In practice, one could skip the dictionary entirely and directly fill a matrix of shape \(n_{words} \times n_{docs}\) with the counts, as sketched below.
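Here is that more direct approach (a rough sketch; it produces the same counts as the dictionary-based version):

# Fill a (num_words, num_docs) matrix of counts directly, without the intermediate dictionary
X_direct = np.zeros((num_vocabs, num_documents), dtype=int)
for col, doc in enumerate(corpus):
    tokens = doc.lower().split()
    for row, word in enumerate(vocabulary):
        X_direct[row, col] = tokens.count(word)
np.testing.assert_array_equal(X_direct, X)  # same counts as before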
Getting the TF Matrix#
Create `X_tf`, which holds the term frequency value for each word in each document:

\[ \operatorname{tf}_{t, d} = \log_{10}(\text{count}_{t, d} + 1) \]

The code below first creates a new NumPy array `X_tf` with the same shape as `X` but with float dtype, left uninitialized. It then loops through each row of `X` and its index using `enumerate()`. Since `each_row` already holds the frequency of word \(t\) in every document \(d\), we can apply `np.log10()` directly to compute `tf = np.log10(each_row + 1)`, and assign the result to the corresponding row of `X_tf` via `row_index`.
X_tf = np.empty_like(X, dtype=float)
for row_index, each_row in enumerate(X):
tf = np.log10(each_row + 1)
X_tf[row_index] = tf
print(f"Term frequency array shape: {X_tf.shape}")
print(X_tf)
Term frequency array shape: (30, 5)
[[0. 0.30103 0. 0. 0. ]
[0. 0. 0. 0.30103 0. ]
[0.30103 0. 0. 0. 0. ]
[0. 0. 0. 0. 0.30103 ]
[0.30103 0. 0. 0. 0. ]
[0. 0.30103 0. 0. 0. ]
[0. 0. 0. 0. 0.30103 ]
[0. 0. 0.30103 0.30103 0. ]
[0. 0.30103 0. 0. 0. ]
[0. 0. 0. 0. 0.30103 ]
[0. 0. 0.30103 0. 0. ]
[0.30103 0. 0. 0. 0. ]
[0.30103 0. 0. 0. 0. ]
[0. 0. 0. 0. 0.30103 ]
[0.30103 0. 0. 0. 0. ]
[0. 0. 0. 0. 0.30103 ]
[0. 0.30103 0. 0.30103 0. ]
[0. 0. 0. 0. 0.30103 ]
[0. 0. 0. 0.30103 0. ]
[0. 0.30103 0. 0. 0. ]
[0. 0. 0. 0.30103 0. ]
[0. 0. 0. 0. 0.30103 ]
[0. 0. 0.30103 0. 0. ]
[0. 0.30103 0. 0. 0. ]
[0.30103 0.30103 0. 0. 0. ]
[0.30103 0.30103 0.30103 0. 0. ]
[0.30103 0.30103 0. 0. 0. ]
[0.60205999 0.47712125 0.30103 0.47712125 0.47712125]
[0. 0. 0.30103 0. 0. ]
[0. 0. 0. 0.30103 0. ]]
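Since `np.log10` is applied element-wise, the loop above can also be written as a single vectorized expression (an equivalent sketch):

X_tf_vectorized = np.log10(X + 1)  # element-wise log10(count + 1) over the whole count matrix
np.testing.assert_allclose(X_tf_vectorized, X_tf)  # identical to the loop-based result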
Creating the IDF vector#
Create `X_idf`, which holds the inverse document frequency value for each word in the corpus. To match scikit-learn's behaviour with `smooth_idf=False`, we use the natural logarithm and add one:

\[ \text{idf}_{t} = \ln\left(\frac{n_{docs}}{\text{df}_{t}}\right) + 1 \]

The code below creates a new NumPy array `X_idf` of length `num_vocabs`, filled with zeros. It then loops through each row of `X` (one row per word) and its index using `enumerate()`. The document frequency \(\text{df}_{t}\) is the number of documents that contain word \(t\), obtained with `np.count_nonzero(each_row)`. The IDF value is then `idf = np.log(num_documents / df) + 1`, which is assigned to the corresponding entry of `X_idf` via `row_index`.
X_idf = np.zeros(num_vocabs, dtype=float)
for row_index, each_row in enumerate(X):
    df = np.count_nonzero(each_row)  # document frequency: how many documents contain this word
    idf = np.log(num_documents / df) + 1  # e.g. a word in 2 of our 5 documents gets idf = ln(5/2) + 1 ≈ 1.916
    X_idf[row_index] = idf
print(f"IDF array shape: {X_idf.shape}")
print(X_idf)
IDF array shape: (30,)
[2.60943791 2.60943791 2.60943791 2.60943791 2.60943791 2.60943791
2.60943791 1.91629073 2.60943791 2.60943791 2.60943791 2.60943791
2.60943791 2.60943791 2.60943791 2.60943791 1.91629073 2.60943791
2.60943791 2.60943791 2.60943791 2.60943791 2.60943791 2.60943791
1.91629073 1.51082562 1.91629073 1. 2.60943791 2.60943791]
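As with the TF matrix, the loop can be vectorized by counting the non-zero entries of each row (an equivalent sketch under the same `smooth_idf=False` convention):

df_per_word = np.count_nonzero(X, axis=1)  # number of documents containing each word
X_idf_vectorized = np.log(num_documents / df_per_word) + 1
np.testing.assert_allclose(X_idf_vectorized, X_idf)  # identical to the loop-based result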
Sanity check!
np.testing.assert_allclose(
X_idf, idf_vec, rtol=1e-3, err_msg="Our idf_vec is not correct...?"
)
TF-IDF Matrix#
We broadcast `X_idf` to the shape of `X_tf` and multiply the two element-wise to get the TF-IDF matrix.
X_tfidf = X_tf * X_idf.reshape(-1, 1)
X_tfidf.shape, X_tfidf
((30, 5),
array([[0. , 0.78551908, 0. , 0. , 0. ],
[0. , 0. , 0. , 0.78551908, 0. ],
[0.78551908, 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0.78551908],
[0.78551908, 0. , 0. , 0. , 0. ],
[0. , 0.78551908, 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0.78551908],
[0. , 0. , 0.57686099, 0.57686099, 0. ],
[0. , 0.78551908, 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0.78551908],
[0. , 0. , 0.78551908, 0. , 0. ],
[0.78551908, 0. , 0. , 0. , 0. ],
[0.78551908, 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0.78551908],
[0.78551908, 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0.78551908],
[0. , 0.57686099, 0. , 0.57686099, 0. ],
[0. , 0. , 0. , 0. , 0.78551908],
[0. , 0. , 0. , 0.78551908, 0. ],
[0. , 0.78551908, 0. , 0. , 0. ],
[0. , 0. , 0. , 0.78551908, 0. ],
[0. , 0. , 0. , 0. , 0.78551908],
[0. , 0. , 0.78551908, 0. , 0. ],
[0. , 0.78551908, 0. , 0. , 0. ],
[0.57686099, 0.57686099, 0. , 0. , 0. ],
[0.45480383, 0.45480383, 0.45480383, 0. , 0. ],
[0.57686099, 0.57686099, 0. , 0. , 0. ],
[0.60205999, 0.47712125, 0.30103 , 0.47712125, 0.47712125],
[0. , 0. , 0.78551908, 0. , 0. ],
[0. , 0. , 0. , 0.78551908, 0. ]]))
If you run the sanity check now, it will fail!

The result differs from `scikit-learn` because `scikit-learn` uses a different formula for the term frequency: since we set `sublinear_tf=False`, it uses the raw counts instead of compressing them to a log scale.

Let's replace `X_tf` with `X` and see.
X_tfidf = X * X_idf.reshape(-1, 1)
X_tfidf.shape, X_tfidf
((30, 5),
array([[0. , 2.60943791, 0. , 0. , 0. ],
[0. , 0. , 0. , 2.60943791, 0. ],
[2.60943791, 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 2.60943791],
[2.60943791, 0. , 0. , 0. , 0. ],
[0. , 2.60943791, 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 2.60943791],
[0. , 0. , 1.91629073, 1.91629073, 0. ],
[0. , 2.60943791, 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 2.60943791],
[0. , 0. , 2.60943791, 0. , 0. ],
[2.60943791, 0. , 0. , 0. , 0. ],
[2.60943791, 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 2.60943791],
[2.60943791, 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 2.60943791],
[0. , 1.91629073, 0. , 1.91629073, 0. ],
[0. , 0. , 0. , 0. , 2.60943791],
[0. , 0. , 0. , 2.60943791, 0. ],
[0. , 2.60943791, 0. , 0. , 0. ],
[0. , 0. , 0. , 2.60943791, 0. ],
[0. , 0. , 0. , 0. , 2.60943791],
[0. , 0. , 2.60943791, 0. , 0. ],
[0. , 2.60943791, 0. , 0. , 0. ],
[1.91629073, 1.91629073, 0. , 0. , 0. ],
[1.51082562, 1.51082562, 1.51082562, 0. , 0. ],
[1.91629073, 1.91629073, 0. , 0. , 0. ],
[3. , 2. , 1. , 2. , 2. ],
[0. , 0. , 2.60943791, 0. , 0. ],
[0. , 0. , 0. , 2.60943791, 0. ]]))
np.testing.assert_allclose(
X_tfidf,
tfidf_matrix.toarray().T,
rtol=1e-3,
err_msg="Our implementation of tfidf is not correct...?"
)
The results now match.
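To wrap up, here is a minimal sketch that gathers the steps above into a single function, using raw-count term frequencies and the `smooth_idf=False` IDF convention so that it reproduces scikit-learn's unnormalized output:

def tfidf_from_scratch(corpus):
    """Return the sorted vocabulary and a (num_words, num_docs) TF-IDF matrix."""
    vocab = sorted(set(" ".join(corpus).lower().split()))
    counts = np.array([[doc.lower().split().count(word) for doc in corpus] for word in vocab])
    idf = np.log(len(corpus) / np.count_nonzero(counts, axis=1)) + 1
    return vocab, counts * idf.reshape(-1, 1)

_, our_tfidf = tfidf_from_scratch(corpus)
np.testing.assert_allclose(our_tfidf, tfidf_matrix.toarray().T, rtol=1e-3)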