Implementation#
Let’s implement TF-IDF from scratch.
from collections import OrderedDict
import numpy as np
import pandas as pd
from rich.pretty import pprint
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
Defining the Corpus#
We use a simple corpus with 5 documents, defined under the variable corpus.
corpus = [
    "The sun is the largest celestial body in the solar system",
    "The solar system consists of the sun and eight revolving planets",
    "Ra was the Egyptian Sun God",
    "The Pyramids were the pinnacle of Egyptian architecture",
    "The quick brown fox jumps over the lazy dog",
]
corpus_names = ["doc_1", "doc_2", "doc_3", "doc_4", "doc_5"]
num_documents = len(corpus)
vocabulary = " ".join(corpus).split() # split the corpus into words
vocabulary = [word.lower() for word in vocabulary] # lower case
vocabulary = list(set(vocabulary)) # unique vocabulary
vocabulary.sort() # sort alphabetically
num_vocabs = len(vocabulary)
print(f"Number of documents: {num_documents}")
print(f"Number of unique words: {num_vocabs}")
Number of documents: 5
Number of unique words: 30
So far, we have performed lowercasing as the only preprocessing step.
The variables num_documents and num_vocabs hold the number of documents and the number of unique words in the corpus, respectively.
Scikit-Learn’s Implementation#
Let’s see how scikit-learn implements TF-IDF by running it on the corpus we defined above.
pipe = Pipeline(
    [
        ("count", CountVectorizer(vocabulary=vocabulary)),
        ("tfid", TfidfTransformer(norm=None, smooth_idf=False, sublinear_tf=False)),
    ]
).fit(corpus)
tfidf_matrix = pipe.transform(corpus)
print(f"Shape of tfidf matrix: {tfidf_matrix.shape}")
print(tfidf_matrix.toarray().T)
Shape of tfidf matrix: (5, 30)
[[0. 2.60943791 0. 0. 0. ]
[0. 0. 0. 2.60943791 0. ]
[2.60943791 0. 0. 0. 0. ]
[0. 0. 0. 0. 2.60943791]
[2.60943791 0. 0. 0. 0. ]
[0. 2.60943791 0. 0. 0. ]
[0. 0. 0. 0. 2.60943791]
[0. 0. 1.91629073 1.91629073 0. ]
[0. 2.60943791 0. 0. 0. ]
[0. 0. 0. 0. 2.60943791]
[0. 0. 2.60943791 0. 0. ]
[2.60943791 0. 0. 0. 0. ]
[2.60943791 0. 0. 0. 0. ]
[0. 0. 0. 0. 2.60943791]
[2.60943791 0. 0. 0. 0. ]
[0. 0. 0. 0. 2.60943791]
[0. 1.91629073 0. 1.91629073 0. ]
[0. 0. 0. 0. 2.60943791]
[0. 0. 0. 2.60943791 0. ]
[0. 2.60943791 0. 0. 0. ]
[0. 0. 0. 2.60943791 0. ]
[0. 0. 0. 0. 2.60943791]
[0. 0. 2.60943791 0. 0. ]
[0. 2.60943791 0. 0. 0. ]
[1.91629073 1.91629073 0. 0. 0. ]
[1.51082562 1.51082562 1.51082562 0. 0. ]
[1.91629073 1.91629073 0. 0. 0. ]
[3. 2. 1. 2. 2. ]
[0. 0. 2.60943791 0. 0. ]
[0. 0. 0. 2.60943791 0. ]]
This code defines a pipeline that applies two scikit-learn transformers to a corpus of
text documents, and then fits the pipeline to the corpus. The first
transformer is CountVectorizer, which converts the text into a matrix
of token counts. The vocabulary argument specifies the set of tokens to
use. The second transformer is TfidfTransformer, which applies term
frequency-inverse document frequency (TF-IDF) weighting to the count
matrix.
The norm, smooth_idf, and sublinear_tf parameters of
TfidfTransformer control the normalization of the document vectors, the
smoothing of the IDF values, and the sublinear scaling of the term
frequencies. Setting norm to None means the document vectors are not
rescaled to unit length, so the raw TF-IDF weights are kept as-is.
Setting smooth_idf and sublinear_tf to False disables IDF smoothing and
the logarithmic scaling of the term frequencies, which makes the output
straightforward to reproduce by hand.
The fitted pipeline is then applied to the corpus to obtain a TF-IDF
matrix, which is stored in tfidf_matrix. The shape attribute of
tfidf_matrix gives the dimensions of the matrix, which is a sparse
matrix with the number of rows equal to the number of documents in the
corpus and the number of columns equal to the number of unique tokens in
the vocabulary. The toarray() method is used to convert the sparse
matrix to a dense matrix, and the T attribute is used to transpose the
matrix so that the rows correspond to tokens and the columns correspond
to documents. The resulting matrix is printed to the console.
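As a side note, had we kept scikit-learn's defaults (norm="l2" and smooth_idf=True), each document vector would additionally be rescaled to unit length. The short sketch below, reusing the corpus and vocabulary defined above, illustrates that default normalization (default_pipe and normalized are just illustrative names):
# Illustrative sketch, not part of the original pipeline: with the defaults
# norm="l2" and smooth_idf=True, every document vector is rescaled to unit length.
default_pipe = Pipeline(
    [
        ("count", CountVectorizer(vocabulary=vocabulary)),
        ("tfid", TfidfTransformer()),  # defaults: norm="l2", smooth_idf=True, sublinear_tf=False
    ]
).fit(corpus)
normalized = default_pipe.transform(corpus).toarray()
print(np.linalg.norm(normalized, axis=1))  # every row has L2 norm 1.0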
count = pipe["count"].transform(corpus)
print(f"Count Matrix: {count.shape}")
print(count.toarray().T)
Count Matrix: (5, 30)
[[0 1 0 0 0]
[0 0 0 1 0]
[1 0 0 0 0]
[0 0 0 0 1]
[1 0 0 0 0]
[0 1 0 0 0]
[0 0 0 0 1]
[0 0 1 1 0]
[0 1 0 0 0]
[0 0 0 0 1]
[0 0 1 0 0]
[1 0 0 0 0]
[1 0 0 0 0]
[0 0 0 0 1]
[1 0 0 0 0]
[0 0 0 0 1]
[0 1 0 1 0]
[0 0 0 0 1]
[0 0 0 1 0]
[0 1 0 0 0]
[0 0 0 1 0]
[0 0 0 0 1]
[0 0 1 0 0]
[0 1 0 0 0]
[1 1 0 0 0]
[1 1 1 0 0]
[1 1 0 0 0]
[3 2 1 2 2]
[0 0 1 0 0]
[0 0 0 1 0]]
This code retrieves the count matrix and the IDF vector that were generated by the pipeline for the corpus.
The first part applies the CountVectorizer transformer to the corpus
using the transform() method of the count step in the pipeline, which
returns a sparse matrix of token counts. The sparse matrix is assigned to
the variable count, and toarray() is called only when printing it as a
dense array.
The second part retrieves the IDF values that were computed by the
TfidfTransformer transformer through its idf_ attribute, which returns
a one-dimensional array with one IDF value per token in the vocabulary.
The IDF vector is assigned to the variable idf_vec.
idf_vec = pipe["tfid"].idf_
print(f"IDF Vector: {idf_vec.shape}")
print(idf_vec.T)
IDF Vector: (30,)
[2.60943791 2.60943791 2.60943791 2.60943791 2.60943791 2.60943791
2.60943791 1.91629073 2.60943791 2.60943791 2.60943791 2.60943791
2.60943791 2.60943791 2.60943791 2.60943791 1.91629073 2.60943791
2.60943791 2.60943791 2.60943791 2.60943791 2.60943791 2.60943791
1.91629073 1.51082562 1.91629073 1. 2.60943791 2.60943791]
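Notice that the IDF vector takes only a handful of distinct values, one per document frequency. Assuming scikit-learn's smooth_idf=False formula, \(\text{idf}_t = \ln(n / \text{df}_t) + 1\), we can reproduce these values directly (this small check is not part of the original pipeline):
# Words appearing in 1, 2, 3, or all 5 documents map to distinct IDF values.
# With smooth_idf=False, scikit-learn computes idf = ln(n / df) + 1.
print(np.log(num_documents / np.array([1, 2, 3, 5])) + 1)
# approximately [2.609, 1.916, 1.511, 1.0], matching the values in idf_vec above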
Let’s try to implement it ourselves.
Implementing TF-IDF#
Getting the Count Matrix#
The code below creates an ordered dictionary word_freq to store the frequency of each word in each document in the corpus. The structure of the dictionary is {word: {doc_1: freq, doc_2: freq, ...}, ...}.
- The code loops through each word in the vocabulary and, if the word is not already in the word_freq dictionary, adds it.
- It then loops through each document and, if the document name is not already in the word_freq[word] dictionary, adds it with a count of zero.
- Finally, it counts the number of times the word appears in the document using the count() method on the list of lowercased tokens, and adds this count to the corresponding entry in the word_freq dictionary.
- The pprint() function from rich.pretty is then used to print the dictionary word_freq in a readable format.
# Create an ordered dictionary to store the frequency of each word in each document in the corpus
word_freq = OrderedDict() # {word: {doc_1: freq, doc_2: freq, ...}, ...}
# Loop through each word in the vocabulary
for word in vocabulary:
    # If the word is not in the word_freq dictionary, add it
    if word not in word_freq:
        word_freq[word] = OrderedDict()
    # Loop through each document and count the number of times the word appears in the document
    for doc, doc_name in zip(corpus, corpus_names):
        # If the document name is not in the word_freq[word] dictionary, add it
        if doc_name not in word_freq[word]:
            word_freq[word][doc_name] = 0
        # Count the number of times the word appears in the document using the count() method
        word_freq[word][doc_name] += doc.lower().split().count(word)
# Print the word_freq dictionary using the pprint function for pretty printing
pprint(dict(word_freq.items()))
{
│   'and': OrderedDict([('doc_1', 0), ('doc_2', 1), ('doc_3', 0), ('doc_4', 0), ('doc_5', 0)]),
│   'architecture': OrderedDict([('doc_1', 0), ('doc_2', 0), ('doc_3', 0), ('doc_4', 1), ('doc_5', 0)]),
│   'body': OrderedDict([('doc_1', 1), ('doc_2', 0), ('doc_3', 0), ('doc_4', 0), ('doc_5', 0)]),
│   'brown': OrderedDict([('doc_1', 0), ('doc_2', 0), ('doc_3', 0), ('doc_4', 0), ('doc_5', 1)]),
│   'celestial': OrderedDict([('doc_1', 1), ('doc_2', 0), ('doc_3', 0), ('doc_4', 0), ('doc_5', 0)]),
│   'consists': OrderedDict([('doc_1', 0), ('doc_2', 1), ('doc_3', 0), ('doc_4', 0), ('doc_5', 0)]),
│   'dog': OrderedDict([('doc_1', 0), ('doc_2', 0), ('doc_3', 0), ('doc_4', 0), ('doc_5', 1)]),
│   'egyptian': OrderedDict([('doc_1', 0), ('doc_2', 0), ('doc_3', 1), ('doc_4', 1), ('doc_5', 0)]),
│   'eight': OrderedDict([('doc_1', 0), ('doc_2', 1), ('doc_3', 0), ('doc_4', 0), ('doc_5', 0)]),
│   'fox': OrderedDict([('doc_1', 0), ('doc_2', 0), ('doc_3', 0), ('doc_4', 0), ('doc_5', 1)]),
│   'god': OrderedDict([('doc_1', 0), ('doc_2', 0), ('doc_3', 1), ('doc_4', 0), ('doc_5', 0)]),
│   'in': OrderedDict([('doc_1', 1), ('doc_2', 0), ('doc_3', 0), ('doc_4', 0), ('doc_5', 0)]),
│   'is': OrderedDict([('doc_1', 1), ('doc_2', 0), ('doc_3', 0), ('doc_4', 0), ('doc_5', 0)]),
│   'jumps': OrderedDict([('doc_1', 0), ('doc_2', 0), ('doc_3', 0), ('doc_4', 0), ('doc_5', 1)]),
│   'largest': OrderedDict([('doc_1', 1), ('doc_2', 0), ('doc_3', 0), ('doc_4', 0), ('doc_5', 0)]),
│   'lazy': OrderedDict([('doc_1', 0), ('doc_2', 0), ('doc_3', 0), ('doc_4', 0), ('doc_5', 1)]),
│   'of': OrderedDict([('doc_1', 0), ('doc_2', 1), ('doc_3', 0), ('doc_4', 1), ('doc_5', 0)]),
│   'over': OrderedDict([('doc_1', 0), ('doc_2', 0), ('doc_3', 0), ('doc_4', 0), ('doc_5', 1)]),
│   'pinnacle': OrderedDict([('doc_1', 0), ('doc_2', 0), ('doc_3', 0), ('doc_4', 1), ('doc_5', 0)]),
│   'planets': OrderedDict([('doc_1', 0), ('doc_2', 1), ('doc_3', 0), ('doc_4', 0), ('doc_5', 0)]),
│   'pyramids': OrderedDict([('doc_1', 0), ('doc_2', 0), ('doc_3', 0), ('doc_4', 1), ('doc_5', 0)]),
│   'quick': OrderedDict([('doc_1', 0), ('doc_2', 0), ('doc_3', 0), ('doc_4', 0), ('doc_5', 1)]),
│   'ra': OrderedDict([('doc_1', 0), ('doc_2', 0), ('doc_3', 1), ('doc_4', 0), ('doc_5', 0)]),
│   'revolving': OrderedDict([('doc_1', 0), ('doc_2', 1), ('doc_3', 0), ('doc_4', 0), ('doc_5', 0)]),
│   'solar': OrderedDict([('doc_1', 1), ('doc_2', 1), ('doc_3', 0), ('doc_4', 0), ('doc_5', 0)]),
│   'sun': OrderedDict([('doc_1', 1), ('doc_2', 1), ('doc_3', 1), ('doc_4', 0), ('doc_5', 0)]),
│   'system': OrderedDict([('doc_1', 1), ('doc_2', 1), ('doc_3', 0), ('doc_4', 0), ('doc_5', 0)]),
│   'the': OrderedDict([('doc_1', 3), ('doc_2', 2), ('doc_3', 1), ('doc_4', 2), ('doc_5', 2)]),
│   'was': OrderedDict([('doc_1', 0), ('doc_2', 0), ('doc_3', 1), ('doc_4', 0), ('doc_5', 0)]),
│   'were': OrderedDict([('doc_1', 0), ('doc_2', 0), ('doc_3', 0), ('doc_4', 1), ('doc_5', 0)])
}
Convert the dictionary word_freq to a Pandas DataFrame. The T attribute is used to transpose the DataFrame, so that the document names become the column names and the words become the row names.
df = pd.DataFrame(word_freq).fillna(0).astype(int).T
df
| doc_1 | doc_2 | doc_3 | doc_4 | doc_5 | |
|---|---|---|---|---|---|
| and | 0 | 1 | 0 | 0 | 0 |
| architecture | 0 | 0 | 0 | 1 | 0 |
| body | 1 | 0 | 0 | 0 | 0 |
| brown | 0 | 0 | 0 | 0 | 1 |
| celestial | 1 | 0 | 0 | 0 | 0 |
| consists | 0 | 1 | 0 | 0 | 0 |
| dog | 0 | 0 | 0 | 0 | 1 |
| egyptian | 0 | 0 | 1 | 1 | 0 |
| eight | 0 | 1 | 0 | 0 | 0 |
| fox | 0 | 0 | 0 | 0 | 1 |
| god | 0 | 0 | 1 | 0 | 0 |
| in | 1 | 0 | 0 | 0 | 0 |
| is | 1 | 0 | 0 | 0 | 0 |
| jumps | 0 | 0 | 0 | 0 | 1 |
| largest | 1 | 0 | 0 | 0 | 0 |
| lazy | 0 | 0 | 0 | 0 | 1 |
| of | 0 | 1 | 0 | 1 | 0 |
| over | 0 | 0 | 0 | 0 | 1 |
| pinnacle | 0 | 0 | 0 | 1 | 0 |
| planets | 0 | 1 | 0 | 0 | 0 |
| pyramids | 0 | 0 | 0 | 1 | 0 |
| quick | 0 | 0 | 0 | 0 | 1 |
| ra | 0 | 0 | 1 | 0 | 0 |
| revolving | 0 | 1 | 0 | 0 | 0 |
| solar | 1 | 1 | 0 | 0 | 0 |
| sun | 1 | 1 | 1 | 0 | 0 |
| system | 1 | 1 | 0 | 0 | 0 |
| the | 3 | 2 | 1 | 2 | 2 |
| was | 0 | 0 | 1 | 0 | 0 |
| were | 0 | 0 | 0 | 1 | 0 |
Convert the DataFrame to a NumPy array, essentially turning the corpus into a count matrix of shape \(n_{words} \times n_{docs}\).
X = df.values
print(f"Word frequency array shape: {X.shape}")
Word frequency array shape: (30, 5)
Sanity Check: We check our implementation by comparing the result to the output of the CountVectorizer class from the sklearn.feature_extraction.text module. Note that CountVectorizer returns a sparse matrix, so we need to convert it to a dense matrix using the toarray() method. Furthermore, that matrix has shape \(n_{docs} \times n_{words}\), so we transpose it using the T attribute.
np.testing.assert_array_equal(
X, count.toarray().T, err_msg="Our count vectorizer is wrong...?")
We have successfully implemented the count matrix. While this method is not particularly efficient, it is what comes to mind first. In practice, one could skip the dictionary step entirely and fill a matrix of shape \(n_{words} \times n_{docs}\) with the counts directly, as sketched below.
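A minimal sketch of that more direct route (word_to_index and X_direct are helpers introduced here purely for illustration):
# Illustrative alternative: fill a (num_vocabs, num_documents) matrix with counts directly.
word_to_index = {word: index for index, word in enumerate(vocabulary)}
X_direct = np.zeros((num_vocabs, num_documents), dtype=int)
for doc_index, doc in enumerate(corpus):
    for token in doc.lower().split():
        X_direct[word_to_index[token], doc_index] += 1
np.testing.assert_array_equal(X_direct, X)  # agrees with the dictionary-based count matrix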
Getting the TF Matrix#
Create X_tf, which holds the term frequency value for each word in each document:

\[ \operatorname{tf}_{t, d} = \log_{10}(\text{count}_{t, d} + 1) \]

- Create a new numpy array X_tf with the same shape and data type as X, but filled with uninitialized floats.
- Loop through each row and its index in X using the enumerate() function.
- Calculate the term frequency of each word using tf = log10(each_row + 1), where each_row holds the counts of word \(t\) across all documents \(d\), so we can apply np.log10() directly.
- Assign the calculated term frequency values to the corresponding row in X_tf using row_index.
X_tf = np.empty_like(X, dtype=float)
for row_index, each_row in enumerate(X):
    tf = np.log10(each_row + 1)
    X_tf[row_index] = tf
print(f"Term frequency array shape: {X_tf.shape}")
print(X_tf)
Term frequency array shape: (30, 5)
[[0. 0.30103 0. 0. 0. ]
[0. 0. 0. 0.30103 0. ]
[0.30103 0. 0. 0. 0. ]
[0. 0. 0. 0. 0.30103 ]
[0.30103 0. 0. 0. 0. ]
[0. 0.30103 0. 0. 0. ]
[0. 0. 0. 0. 0.30103 ]
[0. 0. 0.30103 0.30103 0. ]
[0. 0.30103 0. 0. 0. ]
[0. 0. 0. 0. 0.30103 ]
[0. 0. 0.30103 0. 0. ]
[0.30103 0. 0. 0. 0. ]
[0.30103 0. 0. 0. 0. ]
[0. 0. 0. 0. 0.30103 ]
[0.30103 0. 0. 0. 0. ]
[0. 0. 0. 0. 0.30103 ]
[0. 0.30103 0. 0.30103 0. ]
[0. 0. 0. 0. 0.30103 ]
[0. 0. 0. 0.30103 0. ]
[0. 0.30103 0. 0. 0. ]
[0. 0. 0. 0.30103 0. ]
[0. 0. 0. 0. 0.30103 ]
[0. 0. 0.30103 0. 0. ]
[0. 0.30103 0. 0. 0. ]
[0.30103 0.30103 0. 0. 0. ]
[0.30103 0.30103 0.30103 0. 0. ]
[0.30103 0.30103 0. 0. 0. ]
[0.60205999 0.47712125 0.30103 0.47712125 0.47712125]
[0. 0. 0.30103 0. 0. ]
[0. 0. 0. 0.30103 0. ]]
Creating the IDF vector#
Create X_idf, which holds the inverse document frequency value for each word in the corpus. To match scikit-learn with smooth_idf=False, we use the natural logarithm and add 1 so that words appearing in every document do not end up with an IDF of zero:

\[ \text{idf}_{t} = \ln\left(\frac{n_{docs}}{\text{df}_{t}}\right) + 1 \]

- Create a new numpy array X_idf of length num_vocabs, filled with zeros.
- Loop through each row and its index in X using the enumerate() function.
- Calculate the document frequency df of the word as the number of documents with a non-zero count in that row, then compute idf = log(num_documents / df) + 1.
- Assign the calculated inverse document frequency value to the corresponding entry in X_idf using row_index.
X_idf = np.zeros(num_vocabs, dtype=float)
for row_index, each_row in enumerate(X):
    df = np.count_nonzero(each_row)  # document frequency: how many documents contain this word
    idf = np.log(num_documents / df) + 1  # e.g. a word in 2 of 5 documents gets ln(5/2) + 1 ≈ 1.916
    X_idf[row_index] = idf
print(f"IDF array shape: {X_idf.shape}")
print(X_idf)
IDF array shape: (30,)
[2.60943791 2.60943791 2.60943791 2.60943791 2.60943791 2.60943791
2.60943791 1.91629073 2.60943791 2.60943791 2.60943791 2.60943791
2.60943791 2.60943791 2.60943791 2.60943791 1.91629073 2.60943791
2.60943791 2.60943791 2.60943791 2.60943791 2.60943791 2.60943791
1.91629073 1.51082562 1.91629073 1. 2.60943791 2.60943791]
Sanity check!
np.testing.assert_allclose(
X_idf, idf_vec, rtol=1e-3, err_msg="Our idf_vec is not correct...?"
)
TF-IDF Matrix#
We broadcast X_idf to the shape of X_tf and multiply the two matrices element-wise to get the TF-IDF matrix.
X_tfidf = X_tf * X_idf.reshape(-1, 1)
X_tfidf.shape, X_tfidf
((30, 5),
array([[0. , 0.78551908, 0. , 0. , 0. ],
[0. , 0. , 0. , 0.78551908, 0. ],
[0.78551908, 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0.78551908],
[0.78551908, 0. , 0. , 0. , 0. ],
[0. , 0.78551908, 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0.78551908],
[0. , 0. , 0.57686099, 0.57686099, 0. ],
[0. , 0.78551908, 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0.78551908],
[0. , 0. , 0.78551908, 0. , 0. ],
[0.78551908, 0. , 0. , 0. , 0. ],
[0.78551908, 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0.78551908],
[0.78551908, 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0.78551908],
[0. , 0.57686099, 0. , 0.57686099, 0. ],
[0. , 0. , 0. , 0. , 0.78551908],
[0. , 0. , 0. , 0.78551908, 0. ],
[0. , 0.78551908, 0. , 0. , 0. ],
[0. , 0. , 0. , 0.78551908, 0. ],
[0. , 0. , 0. , 0. , 0.78551908],
[0. , 0. , 0.78551908, 0. , 0. ],
[0. , 0.78551908, 0. , 0. , 0. ],
[0.57686099, 0.57686099, 0. , 0. , 0. ],
[0.45480383, 0.45480383, 0.45480383, 0. , 0. ],
[0.57686099, 0.57686099, 0. , 0. , 0. ],
[0.60205999, 0.47712125, 0.30103 , 0.47712125, 0.47712125],
[0. , 0. , 0.78551908, 0. , 0. ],
[0. , 0. , 0. , 0.78551908, 0. ]]))
If you run the sanity check against tfidf_matrix now, it will fail.
The outcome differs because scikit-learn uses a different formula for the term frequency: with sublinear_tf=False, it uses the raw count matrix directly instead of compressing the counts to a log scale.
Let’s replace X_tf with X and see.
X_tfidf = X * X_idf.reshape(-1, 1)
X_tfidf.shape, X_tfidf
((30, 5),
array([[0. , 2.60943791, 0. , 0. , 0. ],
[0. , 0. , 0. , 2.60943791, 0. ],
[2.60943791, 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 2.60943791],
[2.60943791, 0. , 0. , 0. , 0. ],
[0. , 2.60943791, 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 2.60943791],
[0. , 0. , 1.91629073, 1.91629073, 0. ],
[0. , 2.60943791, 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 2.60943791],
[0. , 0. , 2.60943791, 0. , 0. ],
[2.60943791, 0. , 0. , 0. , 0. ],
[2.60943791, 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 2.60943791],
[2.60943791, 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 2.60943791],
[0. , 1.91629073, 0. , 1.91629073, 0. ],
[0. , 0. , 0. , 0. , 2.60943791],
[0. , 0. , 0. , 2.60943791, 0. ],
[0. , 2.60943791, 0. , 0. , 0. ],
[0. , 0. , 0. , 2.60943791, 0. ],
[0. , 0. , 0. , 0. , 2.60943791],
[0. , 0. , 2.60943791, 0. , 0. ],
[0. , 2.60943791, 0. , 0. , 0. ],
[1.91629073, 1.91629073, 0. , 0. , 0. ],
[1.51082562, 1.51082562, 1.51082562, 0. , 0. ],
[1.91629073, 1.91629073, 0. , 0. , 0. ],
[3. , 2. , 1. , 2. , 2. ],
[0. , 0. , 2.60943791, 0. , 0. ],
[0. , 0. , 0. , 2.60943791, 0. ]]))
np.testing.assert_allclose(
X_tfidf,
tfidf_matrix.toarray().T,
rtol=1e-3,
err_msg="Our implementation of tfidf is not correct...?"
)
The results now match.
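As a final aside, scikit-learn’s sublinear_tf=True option also log-scales the term frequencies, but it replaces each non-zero count with 1 + ln(count), which is not the same as the log10(count + 1) used in our X_tf, so even that variant would not reproduce our earlier log-scaled matrix exactly. A quick illustrative sketch (sublinear_pipe is just an illustrative name):
# Illustrative only: sublinear_tf=True replaces non-zero counts with 1 + ln(count).
sublinear_pipe = Pipeline(
    [
        ("count", CountVectorizer(vocabulary=vocabulary)),
        ("tfid", TfidfTransformer(norm=None, smooth_idf=False, sublinear_tf=True)),
    ]
).fit(corpus)
print(sublinear_pipe.transform(corpus).toarray().T[vocabulary.index("the")])  # row for "the"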