Concept#
Definition (Cosine Similarity)#
Given two vectors \(\mathbf{u}\) and \(\mathbf{v}\) defined as

\[
\mathbf{u} = \left[u_1, u_2, \ldots, u_n\right]^T, \qquad \mathbf{v} = \left[v_1, v_2, \ldots, v_n\right]^T,
\]

the cosine of two non-zero vectors can be derived using the Euclidean dot product formula

\[
\mathbf{u} \cdot \mathbf{v} = \|\mathbf{u}\| \|\mathbf{v}\| \cos(\theta),
\]

where \(\theta\) is the angle between the two vectors.

Given two \(n\)-dimensional vectors of attributes, \(\mathbf{u}\) and \(\mathbf{v}\), the cosine similarity, \(\cos(\theta)\), is represented using a dot product and magnitude as

\[
\cos(\theta) = \frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\| \|\mathbf{v}\|} = \frac{\sum_{i=1}^{n} u_i v_i}{\sqrt{\sum_{i=1}^{n} u_i^2} \, \sqrt{\sum_{i=1}^{n} v_i^2}},
\]

where \(u_i\) and \(v_i\) are the \(i\)th components of vectors \(\mathbf{u}\) and \(\mathbf{v}\), respectively.
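As a minimal sketch, the formula above can be implemented directly with NumPy; the helper name `cosine_similarity` below is chosen for illustration only:

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Compute cos(theta) = (u . v) / (||u|| ||v||) for non-zero vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

u = np.array([1.0, 2.0, 3.0])
v = np.array([2.0, 4.0, 6.0])
print(cosine_similarity(u, v))  # ≈ 1.0 (up to floating-point error), since v = 2 * u
```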
Cosine Distance#
The term cosine distance is commonly used for the complement of cosine similarity in positive space, that is,

\[
D_C(\mathbf{u}, \mathbf{v}) = 1 - \cos(\theta).
\]
It is important to note that the cosine distance is not a true distance metric: it does not satisfy the triangle inequality (more formally, the Schwarz inequality), and it violates the coincidence axiom. One way to see this is to note that the cosine distance is half of the squared Euclidean distance of the \(L_2\)-normalized vectors, and squared Euclidean distance does not satisfy the triangle inequality either. To repair the triangle inequality property while maintaining the same ordering, one can convert to angular distance or Euclidean distance. Alternatively, a triangle inequality that does hold for angular distances can be expressed directly in terms of the cosines.
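To make the triangle-inequality claim concrete, here is a small sketch (the three unit vectors are chosen arbitrarily for illustration) in which the cosine distance between two vectors exceeds the sum of their cosine distances to an intermediate vector:

```python
import numpy as np
from scipy.spatial.distance import cosine

# Three unit vectors in the plane: y sits at 45 degrees between x and z.
x = np.array([1.0, 0.0])
y = np.array([1.0, 1.0]) / np.sqrt(2)
z = np.array([0.0, 1.0])

d_xz = cosine(x, z)                          # 1 - cos(90°) = 1.0
d_xy_plus_yz = cosine(x, y) + cosine(y, z)   # 2 * (1 - cos(45°)) ≈ 0.586

print(d_xz > d_xy_plus_yz)  # True: the triangle inequality is violated
```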
Properties#
Interpretation of Angles#
The following properties are adapted from Wikipedia: Cosine Similarity.
The resulting similarity ranges from \(-1\), meaning exactly opposite, to \(1\), meaning exactly the same, with \(0\) indicating orthogonality or decorrelation; in-between values indicate intermediate similarity or dissimilarity.

To interpret the cosine similarity, we have the following:
- If the angle between the two vectors is \(0^{\circ}\), then the cosine similarity is \(1\). This means that the two vectors are identical.
- If the angle between the two vectors is \(90^{\circ}\), then the cosine similarity is \(0\). This means that the two vectors are orthogonal.
- If the angle between the two vectors is \(180^{\circ}\), then the cosine similarity is \(-1\). This means that the two vectors are exactly opposite.
- If the angle between the two vectors is anywhere between \(0^{\circ}\) and \(90^{\circ}\), then the cosine similarity is positive. This means that the two vectors are similar to the extent of the angle between them.
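As a quick numerical check of these reference angles:

```python
import numpy as np

# Cosine of the reference angles discussed above.
angles_deg = np.array([0, 45, 90, 180])
print(np.cos(np.deg2rad(angles_deg)))  # approximately [1, 0.707, 0, -1]
```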
For text matching, the attribute vectors \(\mathbf{u}\) and \(\mathbf{v}\) are usually the term frequency vectors of the documents. Cosine similarity can be seen as a method of normalizing document length during comparison. In the case of information retrieval, the cosine similarity of two documents will range from \(0\) to \(1\), since the term frequencies cannot be negative. This remains true when using TF-IDF weights. The angle between two term frequency vectors cannot be greater than \(90^{\circ}\).
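As a small sketch with made-up term-frequency counts, the similarity indeed stays in \([0, 1]\):

```python
import numpy as np
from scipy.spatial.distance import cosine

# Toy term-frequency vectors over the vocabulary ["cat", "dog", "fish"].
doc1 = np.array([2, 1, 0])
doc2 = np.array([1, 3, 1])

sim = 1 - cosine(doc1, doc2)
print(f"{sim:.2f}")  # ≈ 0.67, between 0 and 1 since counts are non-negative
```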
If the attribute vectors are normalized by subtracting the vector means (e.g., \(u-\bar{u}\)), the measure is called the centered cosine similarity and is equivalent to the Pearson correlation coefficient. For an example of centering, if \(u=\left[u_1, u_2\right]^T\), then \(\bar{u}=\left[\frac{\left(u_1+u_2\right)}{2}, \frac{\left(u_1+u_2\right)}{2}\right]^T\), so \(u-\bar{u}=\left[\frac{\left(u_1-u_2\right)}{2}, \frac{\left(-u_1+u_2\right)}{2}\right]^T\).
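A short sketch (with arbitrarily chosen example vectors) verifying this equivalence numerically:

```python
import numpy as np
from scipy.spatial.distance import cosine

u = np.array([1.0, 3.0, 2.0, 5.0])
v = np.array([2.0, 1.0, 4.0, 6.0])

# Centered cosine similarity: subtract each vector's mean first.
centered_cosine = 1 - cosine(u - u.mean(), v - v.mean())

# Pearson correlation coefficient via NumPy.
pearson = np.corrcoef(u, v)[0, 1]

print(np.isclose(centered_cosine, pearson))  # True
```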
Relationship to Euclidean Distance#
The most noteworthy property of cosine similarity is that it reflects a relative, rather than absolute, comparison of the individual vector dimensions. For any constant \(a\) and vector \(V\), the vectors \(V\) and \(a V\) are maximally similar. The measure is thus most appropriate for data where frequency is more important than absolute values; notably, term frequency in documents. However, more recent metrics with a grounding in information theory, such as Jensen-Shannon divergence, SED, and triangular divergence, have been shown to have improved semantics in at least some contexts.
Cosine similarity is related to Euclidean distance as follows. Denote Euclidean distance by the usual \(\|A-B\|\), and observe that

\[
\|A-B\|^2 = (A - B) \cdot (A - B) = \|A\|^2 + \|B\|^2 - 2 A \cdot B
\]

by expansion. When \(A\) and \(B\) are normalized to unit length, \(\|A\|^2 = \|B\|^2 = 1\), so this expression is equal to

\[
2 \left(1 - \cos(\theta)\right).
\]

In short, the cosine distance can be expressed in terms of Euclidean distance as

\[
D_C(A, B) = 1 - \cos(\theta) = \frac{\|A-B\|^2}{2} \quad \text{when} \quad \|A\| = \|B\| = 1.
\]
The Euclidean distance computed in this way is called the chord distance (because it is the length of the chord on the unit circle); it is the Euclidean distance between the vectors after they have been normalized to unit length (unit sum of squared values).
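A brief sketch (with arbitrary example vectors) checking this identity after normalizing to unit length:

```python
import numpy as np
from scipy.spatial.distance import cosine

a = np.array([1.0, 2.0, 3.0])
b = np.array([3.0, 4.0, 5.0])

# Normalize to unit length so that ||A||^2 = ||B||^2 = 1.
a_hat = a / np.linalg.norm(a)
b_hat = b / np.linalg.norm(b)

cosine_distance = cosine(a_hat, b_hat)                   # 1 - cos(theta)
half_sq_euclidean = np.linalg.norm(a_hat - b_hat) ** 2 / 2

print(np.isclose(cosine_distance, half_sq_euclidean))    # True
```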
Visualizations#
Consider the examples below:
import numpy as np
from scipy.spatial.distance import cosine
from rich import print
In this case, the two vectors `a` and `b` are parallel and pointing in the same direction. The cosine of the angle between them is 1, which means the cosine similarity is also 1.
a = np.array([1, 2, 3])
b = np.array([2, 4, 6])
cos_sim = 1 - cosine(a, b)
print(f"The two vectors are parallel: {cos_sim:.2f}.")
The two vectors are parallel: 1.00.
In this case, the two vectors a and b are parallel but pointing in opposite directions. The cosine of the angle between them is -1, which means the cosine similarity is also -1.
a = np.array([1, 2, 3])
b = np.array([-1, -2, -3])
cos_sim = 1 - cosine(a, b)
print(f"The two vectors are parallel but point in opposite directions: {cos_sim:.2f}.")
The two vectors are parallel but point in opposite directions: -1.00.
When two vectors point in similar directions, the cosine similarity is high.
a = np.array([1, 2, 3])
b = np.array([3, 4, 5])
cos_sim = 1 - cosine(a, b)
print(f"The two vectors are not parallel but similar: {cos_sim:.2f}.")
The two vectors are not parallel but similar: 0.98.
When two vectors are orthogonal, the cosine similarity is 0.
a = np.array([1, 0, 0])
b = np.array([0, 1, 0])
cos_sim = 1 - cosine(a, b)
print(f"The two vectors are orthogonal: {cos_sim:.2f}.")
The two vectors are orthogonal: 0.00.
Conceptual Questions#
Why do you divide by the magnitude of the vectors in the cosine similarity formula?#
Recall that the dot product of two vectors \(\mathbf{u}\) and \(\mathbf{v}\) is defined as:

\[
\mathbf{u} \cdot \mathbf{v} = \sum_{i=1}^{n} u_i v_i = u_1 v_1 + u_2 v_2 + \cdots + u_n v_n.
\]

The dot product by itself roughly tells you how two vectors relate to each other.
Consider the examples earlier:
a = np.array([1, 2, 3])
b = np.array([-1, -2, -3])
These two vectors are pointing in opposite directions, and therefore the dot product is negative, indicating dissimilarity.
If they were pointing in the same direction, the dot product would be positive, indicating similarity. In general, the dot product is a measure of similarity.
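For instance, a quick check of the sign of the dot product:

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([-1, -2, -3])

print(np.dot(a, b))  # -14: negative, since the vectors point in opposite directions
print(np.dot(a, a))  #  14: positive, since a points in the same direction as itself
```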
However, this is a problem because a long vector \(\mathbf{u}\) will naturally have a higher dot product, especially if the vector contains large values.
The description “long” can be made more precise by defining the length (magnitude) of a vector \(\mathbf{u}\) as:

\[
\|\mathbf{u}\| = \sqrt{\sum_{i=1}^{n} u_i^2}.
\]

So a long vector means a vector with a large magnitude.
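For example (with an arbitrary vector), the magnitude can be computed either from the formula directly or with `np.linalg.norm`:

```python
import numpy as np

u = np.array([1, 2, 3])

magnitude_manual = np.sqrt(np.sum(u ** 2))  # sqrt(1 + 4 + 9)
magnitude_numpy = np.linalg.norm(u)

print(magnitude_manual, magnitude_numpy)  # both ≈ 3.7417
```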
Consider a frequent word such as `the`: it co-occurs with other words often and appears in many documents, and hence it has many large values in its vector. Therefore, the word vector for `the` will be dense and have large values (large frequencies).
We do not want this property since we want to know how similar two words (vectors) are, regardless of the frequency of the words.
Thus, if we normalize the dot product by the magnitude of the vectors, we can get a measure of similarity that is independent of the magnitude of the vectors.
Let’s define 3 words:

- `a`: represents the word “cat”;
- `b`: represents the word “the”;
- `c`: represents the word “he”.
a = np.array([100, 200, 300]) # word = cat
b = np.array([3000, 4000, 5000]) # word = the
c = np.array([2500, 4000, 6000]) # word = he
Each vector is a 3-dimensional vector where each dimension corresponds to the frequency of the word across \(3\) documents, `doc1`, `doc2`, and `doc3`.

We see that the words `the` and `he` are long because they have large magnitudes, and the word `cat` is short because it has a small magnitude.
Let’s compute the dot product between each pair.
dot_ab = np.dot(a, b)
dot_ac = np.dot(a, c)
dot_bc = np.dot(b, c)
print(dot_ab) # Output: 2600000
print(dot_ac) # Output: 2850000
print(dot_bc) # Output: 53500000
2600000
2850000
53500000
So the words `cat` and `the` are similar, but `the` and `he` appear even more similar, by a much larger scale, as can be seen. This is purely because `the` is a very frequent word and hence has a very large magnitude (it is long), and so is `he`; but this should not diminish the importance of the similarity between `the` and `cat`, for example. The dot product does not take this into account.
Let’s use the cosine similarity formula to measure the similarity between the words. Now the scale difference is not as large, since we are looking at the angle between two vectors, regardless of the magnitude of the vectors.
cosine_ab = 1 - cosine(a, b)
cosine_ac = 1 - cosine(a, c)
cosine_bc = 1 - cosine(b, c)
print(cosine_ab) # Output: 0.9827076298239907
print(cosine_ac) # Output: 0.9980053681713515
print(cosine_bc) # Output: 0.99133585686985
0.9827076298239907
0.9980053681713515
0.99133585686985
To summarize in one sentence: cosine similarity takes the relative angle between two vectors into account, regardless of the magnitude of the vectors.
Why can’t we use the Euclidean distance as a similarity measure?#
First, recall that the Euclidean distance between two vectors \(\mathbf{u}\) and \(\mathbf{v}\) is defined as:

\[
d(\mathbf{u}, \mathbf{v}) = \|\mathbf{u} - \mathbf{v}\| = \sqrt{\sum_{i=1}^{n} \left(u_i - v_i\right)^2}.
\]
Let’s use the same example.
a = np.array([100, 200, 300]) # word = cat
b = np.array([3000, 4000, 5000]) # word = the
c = np.array([2500, 4000, 6000]) # word = he
euclidean_ab = np.linalg.norm(a - b)
euclidean_ac = np.linalg.norm(a - c)
euclidean_bc = np.linalg.norm(b - c)
print(euclidean_ab)
print(euclidean_ac)
print(euclidean_bc)
6703.730304837748
7258.787777583802
1118.033988749895
It shows that the word `cat` appears very dissimilar to `the` and `he` when compared to the pair `the` and `he`. This is because the Euclidean distance is large between `cat` and each of `the` and `he`, and small between `the` and `he`.
Once again, the magnitude of the vectors is a problem: when a vector has a large magnitude, the Euclidean distance is naturally dominated by that magnitude rather than by the direction of the vectors.
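As a minimal sketch reusing the word vectors from above (the `unit` helper is introduced here only for illustration), normalizing each vector to unit length before taking the Euclidean distance yields the chord distance mentioned earlier, which preserves the same ordering as the cosine distance:

```python
import numpy as np

a = np.array([100, 200, 300])     # word = cat
b = np.array([3000, 4000, 5000])  # word = the
c = np.array([2500, 4000, 6000])  # word = he

def unit(x):
    """Scale a vector to unit length."""
    return x / np.linalg.norm(x)

# Chord distances between the unit-normalized vectors.
print(np.linalg.norm(unit(a) - unit(b)))  # ≈ 0.186 (cat vs the)
print(np.linalg.norm(unit(a) - unit(c)))  # ≈ 0.063 (cat vs he)
print(np.linalg.norm(unit(b) - unit(c)))  # ≈ 0.132 (the vs he)
```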
References and Further Readings#
Wikipedia Contributors. “Cosine Similarity.” Wikipedia. Wikimedia Foundation, February 26, 2023. https://en.wikipedia.org/wiki/Cosine_similarity.
Jurafsky, Dan, and James H. Martin. “Chapter 6.4. Cosine for Measuring Similarity.” In Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Pearson, 2022.