Concept#

Definition#

Definition 147 (Cosine Similarity)

The cosine of the angle between two non-zero vectors can be derived from the Euclidean dot product formula:

\[ \mathbf{u} \cdot \mathbf{v}=\|\mathbf{u}\|\|\mathbf{v}\| \cos \theta \]

Given two \(n\)-dimensional vectors of attributes, \(\mathbf{u}\) and \(\mathbf{v}\), the cosine similarity, \(\cos (\theta)\), is represented using a dot product and magnitudes as

\[ \text { cosine similarity }=S_C(\mathbf{u} , \mathbf{v}):=\cos (\theta)=\frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\|\|\mathbf{v}\|}=\frac{\sum_{i=1}^n u_i v_i}{\sqrt{\sum_{i=1}^n u_i^2} \sqrt{\sum_{i=1}^n v_i^2}}, \]

where \(u_i\) and \(v_i\) are the \(i\) th components of vectors \(\mathbf{u}\) and \(\mathbf{v}\), respectively.
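As a quick sketch, the formula translates directly into NumPy; the helper name cosine_similarity below is our own, not a library function.

import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    # S_C(u, v) = (u . v) / (||u|| ||v||); assumes both vectors are non-zero
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine_similarity(np.array([1, 2, 3]), np.array([2, 4, 6])))  # parallel vectors give ~1.0 (up to rounding)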

Properties#

The following is extracted from Wikipedia:

The resulting similarity ranges from \(-1\), meaning exactly opposite, to \(1\), meaning exactly the same, with \(0\) indicating orthogonality or decorrelation; in-between values indicate intermediate similarity or dissimilarity.

For text matching, the attribute vectors \(\mathbf{u}\) and \(\mathbf{v}\) are usually the term frequency vectors of the documents. Cosine similarity can be seen as a method of normalizing document length during comparison. In the case of information retrieval, the cosine similarity of two documents will range from \(0\) to \(1\), since term frequencies cannot be negative. This remains true when using TF-IDF weights: the angle between two term frequency vectors cannot be greater than \(90^{\circ}\).
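As a small illustration with made-up term counts, the cosine similarity of two non-negative term frequency vectors indeed lands between 0 and 1:

import numpy as np

doc1 = np.array([2, 1, 0, 3])  # hypothetical counts of four terms in document 1
doc2 = np.array([1, 0, 1, 2])  # hypothetical counts of the same terms in document 2
sim = np.dot(doc1, doc2) / (np.linalg.norm(doc1) * np.linalg.norm(doc2))
print(sim)  # a value between 0 and 1, since no entry is negative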

If the attribute vectors are normalized by subtracting the vector means (e.g., \(u-\bar{u}\) ), the measure is called the centered cosine similarity and is equivalent to the Pearson correlation coefficient. For an example of centering, if \(u=\left[u_1, u_2\right]^T\), then \(\bar{u}=\left[\frac{\left(u_1+u_2\right)}{2}, \frac{\left(u_1+u_2\right)}{2}\right]^T\), so \(u-\bar{u}=\left[\frac{\left(u_1-u_2\right)}{2}, \frac{\left(-u_1+u_2\right)}{2}\right]^T\).
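A small numerical check of this equivalence, using np.corrcoef for the Pearson correlation coefficient (the example vectors are arbitrary):

import numpy as np

u = np.array([1.0, 3.0, 5.0, 2.0])
v = np.array([2.0, 8.0, 9.0, 1.0])
u_c, v_c = u - u.mean(), v - v.mean()  # center each vector by subtracting its mean
centered_cosine = np.dot(u_c, v_c) / (np.linalg.norm(u_c) * np.linalg.norm(v_c))
pearson = np.corrcoef(u, v)[0, 1]
print(centered_cosine, pearson)  # the two values agree up to floating-point error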

The term cosine distance is commonly used for the complement of cosine similarity in positive space, that is

\[ \text { cosine distance }=D_C(u, v):=1-S_C(u, v) . \]

It is important to note that the cosine distance is not a true distance metric: it does not satisfy the triangle inequality (or, more formally, the Schwarz inequality), and it violates the coincidence axiom. One way to see this is to note that the cosine distance is half of the squared Euclidean distance of the \(L_2\)-normalized vectors, and squared Euclidean distance does not satisfy the triangle inequality either. To repair the triangle inequality property while maintaining the same ordering, one can convert to angular distance or Euclidean distance. Alternatively, the triangle inequality that does hold for angular distances can be expressed directly in terms of the cosines.
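To make the failure of the triangle inequality concrete, here is a minimal counterexample using our own choice of 2-D vectors at \(0^{\circ}\), \(45^{\circ}\), and \(90^{\circ}\):

import numpy as np
from scipy.spatial.distance import cosine  # scipy's cosine() returns the cosine *distance* 1 - S_C

A, B, C = np.array([1.0, 0.0]), np.array([1.0, 1.0]), np.array([0.0, 1.0])
print(cosine(A, C))                 # 1.0
print(cosine(A, B) + cosine(B, C))  # about 0.586, which is smaller: the triangle inequality fails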

The most noteworthy property of cosine similarity is that it reflects a relative, rather than absolute, comparison of the individual vector dimensions. For any positive constant \(a\) and vector \(V\), the vectors \(V\) and \(aV\) are maximally similar. The measure is thus most appropriate for data where frequency matters more than absolute values, notably term frequency in documents. However, more recent metrics with a grounding in information theory, such as the Jensen-Shannon divergence, SED, and triangular divergence, have been shown to have improved semantics in at least some contexts.
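A quick check of this scale invariance (the constant 7.5 is an arbitrary choice):

import numpy as np

V = np.array([1.0, 2.0, 3.0])
a = 7.5  # any positive constant
sim = np.dot(V, a * V) / (np.linalg.norm(V) * np.linalg.norm(a * V))
print(sim)  # 1.0 up to floating-point rounding: V and a*V are maximally similar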

Cosine similarity is related to Euclidean distance as follows. Denote Euclidean distance by the usual \(\|A-B\|\), and observe that

\[ \|A-B\|^2=(A-B) \cdot(A-B)=\|A\|^2+\|B\|^2-2(A \cdot B) \text { (polarization identity) } \]

by expansion. When \(A\) and \(B\) are normalized to unit length, \(\|A\|^2=\|B\|^2=1\) so this expression is equal to

\[ 2(1-\cos (A, B)) \text {. } \]

In short, the cosine distance can be expressed in terms of Euclidean distance as

\[ D_C(A, B)=\frac{\|A-B\|^2}{2} \quad \text { when } \quad\|A\|^2=\|B\|^2=1 . \]
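A numerical sanity check of this identity, with two arbitrary vectors normalized to unit length:

import numpy as np
from scipy.spatial.distance import cosine

A = np.array([1.0, 2.0, 3.0])
B = np.array([3.0, 4.0, 5.0])
A, B = A / np.linalg.norm(A), B / np.linalg.norm(B)  # normalize to unit length
print(cosine(A, B))                    # cosine distance D_C(A, B)
print(np.linalg.norm(A - B) ** 2 / 2)  # ||A - B||^2 / 2 gives the same value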

This Euclidean distance is called the chord distance (because it is the length of the chord on the unit circle): it is the Euclidean distance between the vectors after they have been normalized to unit sum of squared values.

Null distribution: for data which can be negative as well as positive, the null distribution for cosine similarity is the distribution of the dot product of two independent random unit vectors. This distribution has a mean of zero and a variance of \(1/n\) (where \(n\) is the number of dimensions), and although the distribution is bounded between \(-1\) and \(+1\), as \(n\) grows large the distribution is increasingly well approximated by the normal distribution. For other types of data, such as bitstreams which take only the values 0 or 1, the null distribution takes a different form and may have a nonzero mean.
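A rough Monte Carlo sketch of this null distribution (the dimension and sample size are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
n, trials = 50, 100_000
u = rng.standard_normal((trials, n))
v = rng.standard_normal((trials, n))
u /= np.linalg.norm(u, axis=1, keepdims=True)  # random unit vectors
v /= np.linalg.norm(v, axis=1, keepdims=True)
sims = np.sum(u * v, axis=1)    # cosine similarity = dot product of unit vectors
print(sims.mean(), sims.var())  # approximately 0 and 1/n = 0.02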

Consider the worked examples below:

import numpy as np
from scipy.spatial.distance import cosine
from rich import print

In this case, the two vectors a and b are parallel and pointing in the same direction. The cosine of the angle between them is 1, which means the cosine similarity is also 1.

a = np.array([1, 2, 3])
b = np.array([2, 4, 6])
cos_sim = 1 - cosine(a, b)
print(cos_sim)  # Output: 1.0
1

In this case, the two vectors a and b are parallel but pointing in opposite directions. The cosine of the angle between them is -1, which means the cosine similarity is also -1.

a = np.array([1, 2, 3])
b = np.array([-1, -2, -3])
cos_sim =  1 - cosine(a, b)
print(cos_sim)  # Output: -1.0
-1.0

When two vectors point in similar directions, the cosine similarity is high (close to 1).

a = np.array([1, 2, 3])
b = np.array([3,4,5])
cos_sim = 1 - cosine(a, b)
print(cos_sim)  # Output: 0.9827076298239907
0.9827076298239907

When two vectors are orthogonal, the cosine similarity is 0.

a = np.array([1, 0, 0])
b = np.array([0, 1, 0])
cos_sim = 1 - cosine(a, b)
print(cos_sim)  # Output: 0.0
0.0

Conceptual Questions#

Why do you divide by the magnitude of the vectors in the cosine similarity formula?#


Recall that the dot product of two vectors \(\mathbf{u}\) and \(\mathbf{v}\) is defined as:

\[\begin{split} \begin{aligned} \mathbf{u} \cdot \mathbf{v}&=\sum_{i=1}^n u_i v_i \\ &= u_1 v_1 + u_2 v_2 + \cdots + u_n v_n \end{aligned} \end{split}\]

The dot product by itself roughly tells you how two vectors relate to each other.

Consider the examples earlier:

a = np.array([1, 2, 3])
b = np.array([-1, -2, -3])

These two vectors are pointing in opposite directions, and therefore the dot product is negative, indicating dissimilarity.

If they were pointing in the same direction, the dot product would be positive, indicating similarity. In general, the dot product is a measure of similarity.
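For instance, adding a hypothetical third vector w that points in the same direction as a:

w = np.array([2, 4, 6])  # hypothetical vector in the same direction as a
print(np.dot(a, b))      # -14: opposite directions give a negative dot product
print(np.dot(a, w))      # 28: the same direction gives a positive dot product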

However, this is a problem: a long vector \(\mathbf{u}\) will naturally have a larger dot product, especially if the vector contains large values.

The description “long” can be made more precise by defining the length of a vector \(\mathbf{u}\) as:

\[\begin{split} \begin{aligned} \|\mathbf{u}\|&=\sqrt{\mathbf{u} \cdot \mathbf{u}} \\ &=\sqrt{\sum_{i=1}^n u_i^2} \end{aligned} \end{split}\]

So a long vector simply means a vector with a large magnitude.
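For example, np.linalg.norm computes exactly this quantity:

u = np.array([3.0, 4.0])
print(np.linalg.norm(u))        # 5.0
print(np.sqrt(np.sum(u ** 2)))  # 5.0, the same value from the formula above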

Consider a frequent word such as good: it co-occurs with many other words and appears in many documents, so its vector has many large entries. The word vector for good will therefore be dense and contain large values (high frequencies).

We do not want this property since we want to know how similar two words (vectors) are, regardless of the frequency of the words.

Thus, if we normalize the dot product by the magnitude of the vectors, we can get a measure of similarity that is independent of the magnitude of the vectors.

a = np.array([100, 200, 300]) # word = cat
b = np.array([3000,4000,5000]) # word = the
c = np.array([2500, 4000, 6000]) # word = he
dot_ab = np.dot(a, b)
dot_ac = np.dot(a, c)
dot_bc = np.dot(b, c)
print(dot_ab)  # Output: 2600000
print(dot_ac)  # Output: 2850000
print(dot_bc)  # Output: 53500000
2600000
2850000
53500000

The dot products suggest that cat and the are similar, but that the and he are even more similar, and by a much larger margin. This is purely because the is a very frequent word and therefore has a very large magnitude (it is a long vector), as does he; it does not make the similarity between the and cat any less important. The raw dot product simply does not account for this.

Let's use the cosine similarity formula to measure the similarity between the words. Now the difference in scale is not as large, since we are comparing the angles between the vectors, regardless of their magnitudes.

cosine_ab = 1 - cosine(a, b)
cosine_ac = 1 - cosine(a, c)
cosine_bc = 1 - cosine(b, c)
print(cosine_ab)  # Output: 0.9827076298239907
print(cosine_ac)  # Output: 0.9980053681713515
print(cosine_bc)  # Output: 0.99133585686985
0.9827076298239907
0.9980053681713515
0.99133585686985

To summarize in one sentence: cosine similarity takes the relative angle between two vectors into account, regardless of the magnitudes of the vectors.

Why can’t we use the Euclidean distance as a similarity measure?#

Euclidean distance is dominated by the magnitudes of the vectors: two term frequency vectors pointing in nearly the same direction (i.e., describing the same content) can still be far apart in Euclidean distance simply because one document is much longer than the other. As shown in the Properties section, Euclidean distance becomes monotonically related to cosine distance only after the vectors are normalized to unit length, so without that normalization it conflates document length with dissimilarity.
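As a minimal sketch of this effect (with made-up counts), compare a short document against a copy of itself repeated 100 times:

from scipy.spatial.distance import euclidean

short_doc = np.array([1.0, 2.0, 3.0])   # hypothetical term counts for a short document
long_doc = 100 * short_doc              # the same content, 100 times longer
print(euclidean(short_doc, long_doc))   # a large distance, driven purely by document length
print(1 - cosine(short_doc, long_doc))  # 1.0: cosine similarity sees identical content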

Further Readings#