Concept#
Problem Statement#
(Remark)
Although K-Means does not explicitly model the underlying distribution \(\mathcal{D}\), we can still apply the learning theory framework to K-Means.
Given a set of \(N\) data points (samples) \(\mathcal{S} = \left\{\mathbf{x}^{(1)}, \mathbf{x}^{(2)}, \dots, \mathbf{x}^{(N)}\right\} \subset \mathbb{R}^{D}\),
where the \(n\)-th sample \(\mathbf{x}^{(n)}\) is a vector with \(D\) features.
We can write \(\mathcal{S}\) as a disjoint union 1 of \(K\) clusters, \(\mathcal{S} = C_1 \sqcup C_2 \sqcup \cdots \sqcup C_K\),
where \(C_k = \left\{\mathbf{x}^{(n)} \in \mathcal{S} \mid y^{(n)} = k\right\}\).
In other words, we can decompose the \(N\) data points into \(K\) clusters, where \(K\) is a priori, a pre-defined number.
\(y^{(n)} \in \{1, 2, \dots, K\}\) in the above equation (122) refers to the cluster (ground truth) label of data point \(\mathbf{x}^{(n)}\).
Note that in an unsupervised problem, this \(y^{(n)}\) is generally not known to us, but we can have a mental model that for each data point, there is an underlying cluster label \(y^{(n)}\) that it should belong to.
We further define \(C\) as the collection of clusters2, \(C = \left\{C_1, C_2, \ldots, C_K\right\}\).
To this end, the goal of such an unsupervised problem is to find the clusters \(\hat{C} = \left\{\hat{C}_1, \hat{C}_2, \ldots, \hat{C}_K\right\}\),
where
\(\hat{C}\) is the set of predicted clusters that best approximates the ground truth clusters \(C\).
In other words, we want to find the clusters \(\hat{C}\) that are closest to the ground truth clusters \(C\); we will make the notion of “close” precise later.
Note that \(\mathcal{A}(\cdot)\) is the assignment map that “predicts” and “classifies” each data point into its respective cluster.
Sometimes, for ease of notation, we also write \(\hat{C}_k\) as \(\hat{C}_k = \left\{\mathbf{x}^{(n)} \in \mathcal{S} \mid \hat{y}^{(n)} = k\right\}\), the set of data points assigned to cluster \(k\).
K-Means clustering’s goal is to find the clusters \(\hat{C}\) that are the closest to the ground truth clusters \(C\) (hard clustering). In other words, we aim to partition \(N\) data points \(\mathcal{S}\) into \(K\) clusters \(\hat{C}\). The problem in itself seems manageable, since we can simply partition the data points into \(K\) clusters and minimize the intra-cluster distance (variances). However, it is computationally challenging to solve the problem (NP-hard).
Consequently, there are many heuristics that are used to solve the problem. We will talk about one of the most popular heuristics, the Lloyd’s algorithm in this section.
In this algorithm, each cluster \(C_k\) has a centroid (center) \(\boldsymbol{v}_k \in \mathbb{R}^{D}\),
where \(\boldsymbol{v}_k\) is the centroid of cluster \(C_k\). Each centroid \(\boldsymbol{v}_k\) is a vector that has the same dimension as the data points \(\mathbf{x}^{(n)}\) and is the representative vector of the cluster \(C_k\). By representative vector, we mean that \(\boldsymbol{v}_k\) is a vector that can “describe” the cluster \(C_k\). By construction, the centroids can be defined as any vector \(\boldsymbol{v}_k\) that has the same dimension as the data points \(\mathbf{x}^{(n)}\). However, an intuitive choice is to use the mean of the data points \(\boldsymbol{\mu}_k\) in the cluster \(\hat{C}_k\) as the representative vector \(\boldsymbol{v}_k\).
Furthermore, given the representative vectors \(\left\{\boldsymbol{v}_k\right\}_{k=1}^K\), we need an assignment function \(\mathcal{A}(n) = \hat{y}^{(n)}\) that assigns each data point \(\mathbf{x}^{(n)}\) to the cluster \(\hat{C}_k\). An intuitive choice is to compare “closeness” of each \(\mathbf{x}^{(n)}\) to the representative vectors \(\left\{\boldsymbol{v}_k\right\}_{k=1}^K\) and assign it to the cluster \(\hat{C}_k\) that is closest to the representative vector \(\boldsymbol{v}_k\).
We will make this intuition more precise later by proving it.
To this end, we can informally define an empirical cost function that measures the quality of the requirements listed earlier.
Note that the clustering error \(\widehat{\mathcal{J}}\) depends on both the cluster assignments \(\hat{y}^{(n)}\), which define the clusters \(\hat{C}_k\), and the cluster representatives \(\boldsymbol{v}_k\), for \(k=1, \ldots, K\). As mentioned earlier, finding the optimal cluster means \(\left\{\boldsymbol{v}_k\right\}_{k=1}^K\) and cluster assignments \(\left\{\hat{y}^{(n)}\right\}_{n=1}^N\) that minimize the clustering error \(\widehat{\mathcal{J}}\) is an NP-hard problem. The difficulty stems from the fact that the clustering error \(\widehat{\mathcal{J}}\) is a non-convex function of the cluster means and assignments. In other words, the clustering error \(\widehat{\mathcal{J}}\) has many local minima, and finding the global minimum is hard.
While jointly optimizing the cluster means and assignments is hard3, separately optimizing either the cluster means for given assignments or vice-versa is easy. In what follows, we present simple closed-form solutions for these sub-problems. The \(k\)-means method simply combines these solutions in an alternating fashion [Jung, 2023].
More concretely, we want to show:
For fixed cluster assignments \(\mathcal{A}(n) = \hat{y}^{(n)}\), the clustering error \(\widehat{\mathcal{J}}\) is minimized by setting the cluster representatives \(\boldsymbol{v}_k\) equal to the cluster means; that is, the mean vector is the optimal choice for the cluster center:
\[ \boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \dots, \boldsymbol{\mu}_K \in \mathbb{R}^{D} \]where each \(\boldsymbol{\mu}_k\) is the mean vector of the data points in cluster \(C_k\).
Furthermore, once we obtain the cluster means \(\boldsymbol{\mu}_k\) (now we fix \(\boldsymbol{\mu}_k\)), we can assign each data point \(\mathbf{x}^{(n)}\) to the cluster \(\hat{C}_k\) whose cluster mean \(\boldsymbol{\mu}_k\) is closest. This assignment action is carried out by the assignment function \(\mathcal{A}\), a function that assigns data points to clusters. We will show later that the clustering error \(\widehat{\mathcal{J}}\) is minimized when the assignment function is the nearest neighbor assignment function \(\mathcal{A}^{*}\),
\[ \mathcal{A}^{*}(n) = \underset{k}{\operatorname{argmin}} \left\|\mathbf{x}^{(n)} - \boldsymbol{\mu}_k \right\|^2 \]where it assigns data points \(\mathbf{x}^{(n)}\) to the cluster \(k\) whose center \(\boldsymbol{\mu}_k\) is closest.
We see that instead of jointly optimizing the cluster means and assignments in one step, we alternate between the two steps. We first fix the cluster assignments and optimize the cluster means, and then we fix the cluster means and optimize the cluster assignments.
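To make these two sub-problems concrete, here is a minimal NumPy sketch of the two closed-form solutions; the array names X, labels, and centers are assumptions for illustration, not code from this book.

import numpy as np

def update_centers(X, labels, K):
    # For fixed assignments, the optimal representative of each cluster is the
    # mean of the points assigned to it (assumes every cluster is non-empty).
    return np.stack([X[labels == k].mean(axis=0) for k in range(K)])

def assign_labels(X, centers):
    # For fixed centers, each point is assigned to its nearest center
    # under the squared Euclidean distance.
    dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)  # shape (N, K)
    return dists.argmin(axis=1)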
Now we are left with phrasing the problem as an optimization problem. The goal is to find the optimal cluster centers and cluster assignments that minimize the clustering error.
Intuition#
Some intuition on choosing the cost function \(\widehat{\mathcal{J}}\).
In supervised learning, we have typical loss functions such as the cross-entropy loss for classification and the mean squared error for regression. We also have metrics like accuracy, precision, recall, etc. to measure the performance of the model.
This means, given a hypothesis \(\hat{y}:=h(\mathbf{x})\), we ask how close it is to the true label \(y\). In unsupervised learning, we do not have such a ground truth label \(y\) to compare with, but the notion of closeness is still there.
Example#
Let’s first look at an example by randomly generating data points4 that can be partitioned into 3 distinct clusters.
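The exact data-generation code is not reproduced here, but a minimal sketch of one way to create such a toy dataset, assuming scikit-learn is available, is:

from sklearn.datasets import make_blobs

# 500 two-dimensional points drawn from 3 well-separated blobs (illustrative
# parameters; the actual example in the text may use different settings).
X, y_true = make_blobs(n_samples=500, n_features=2, centers=3, random_state=42)
print(X.shape)  # (500, 2)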
Visually, we can literally just circle out the 3 clusters. The luxury of such simplicity is because we are working with only 2 features, i.e. \(\mathbf{x} \in \mathbb{R}^{2}\). In the real world, we often work with \(D\) features in \(\mathbb{R}^{D}\), where \(D\) can be very large. Furthermore, even with such a simple dataset, how do we tell the machine to find the 3 clusters?
The Notion of Similarity and Closeness#
To define such a metric for unsupervised learning, we can fall back on our intuition. The purpose of clustering is to group similar data points together. So we seek to find a metric that measures the similarity between data points in a dataset.
A very simple idea is to use intra-cluster variance. For example, within a cluster, the data points are close to each other if the variance is small.
Consequently, to make our intuition precise, we need to define a metric rule and an assignment \(\mathcal{A}\) to assign data points to clusters. We also need to define the notion of closeness and similarity between data points.
Lastly, such algorithms require an initial guess of the cluster centers, so that eventually the algorithm can converge to the optimal cluster centers, since we have no way of knowing the optimal cluster centers beforehand, especially in high dimensional space.
More formally, the optimization problem requires us to minimize the sum of squared distances between each data point and its cluster center. This is equivalent to minimizing the variance within each cluster.
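To spell out this equivalence: for a cluster \(\hat{C}_k\) with \(N_k\) points and mean \(\boldsymbol{\mu}_k\), the sum of squared distances to the mean is exactly \(N_k\) times the (biased) within-cluster variance,
\[ \begin{aligned} \sum_{\mathbf{x}^{(n)} \in \hat{C}_k} \left\|\mathbf{x}^{(n)} - \boldsymbol{\mu}_k\right\|^2 = N_k \cdot \frac{1}{N_k} \sum_{\mathbf{x}^{(n)} \in \hat{C}_k} \left\|\mathbf{x}^{(n)} - \boldsymbol{\mu}_k\right\|^2 , \end{aligned} \]
where the second factor is precisely the within-cluster variance, so minimizing the total sum of squared distances over all clusters is the same as minimizing the size-weighted within-cluster variances.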
Let’s look at some definitions first that will gradually lead us to the objective function.
Partition and Voronoi Regions#
We sometimes define \(C_k \in C\) as a partition element of the data set \(\mathcal{S}\), i.e. a subset of \(\mathcal{S}\), where each subset is a cluster. Each cluster \(C_k\) has a representative \(\boldsymbol{v}_k\), and these representatives induce a Voronoi partition of \(\mathbb{R}^D\). More formally, we define the Voronoi partition as follows:
(K-Means Voronoi Partition)
Let \(C = \{C_1, C_2, \ldots, C_K\}\) be a partition of \(\mathcal{S}\), where \(C_k \in C\) is a subset of \(\mathcal{S}\) with representative \(\boldsymbol{v}_k\). Then \(C\) induces a Voronoi partition (Definition 117) of \(\mathbb{R}^D\), which decomposes \(\mathbb{R}^D\) into \(K\) convex cells, each corresponding to some \(C_k \in C\) and containing the region of space whose nearest representative is \(\boldsymbol{v}_k\).
More concretely, the Voronoi region of \(C_k\) contains all points \(\mathbf{x} \in \mathbb{R}^D\) such that
\[ \left\|\mathbf{x} - \boldsymbol{v}_k\right\| \leq \left\|\mathbf{x} - \boldsymbol{v}_j\right\| \quad \text{for all } j \neq k , \]
which means that the distance between \(\mathbf{x}\) and \(\boldsymbol{v}_k\) is less than or equal to the distance between \(\mathbf{x}\) and any other cluster center \(\boldsymbol{v}_j\).
Note there is an abuse of notation here, since \(C\) could be made more precise by writing it as \(\hat{C}\).
Assignment#
(Assignment)
An assignment is a surjective map
\[ \mathcal{A}: \{1, 2, \ldots, N\} \to \{1, 2, \ldots, K\}, \quad n \mapsto \mathcal{A}(n) . \]
In this case, \(\mathcal{A}(n) = k\) means that data point \(\mathbf{x}^{(n)}\) is assigned to cluster \(k\).
One should see that the assignment function \(\mathcal{A}\) gives rise to the prediction \(\hat{y}^{(n)}\) for each data point \(\mathbf{x}^{(n)}\).
(Assignment)
For example, if we have 4 data points \(\mathbf{x}^{(1)}\), \(\mathbf{x}^{(2)}\), \(\mathbf{x}^{(3)}\), and \(\mathbf{x}^{(4)}\), and we want to partition them into 3 clusters \(\hat{C}_1\), \(\hat{C}_2\), and \(\hat{C}_3\), we can define an assignment as follows:
Assign \(\mathbf{x}^{(1)}\) to \(\hat{C}_1\), \(\hat{C}_1 = \{\mathbf{x}^{(1)}\}\).
Assign \(\mathbf{x}^{(3)}\) to \(\hat{C}_2\), \(\hat{C}_2 = \{\mathbf{x}^{(3)}\}\).
Assign \(\mathbf{x}^{(2)}\) and \(\mathbf{x}^{(4)}\) to \(\hat{C}_3\), \(\hat{C}_3 = \{\mathbf{x}^{(2)}, \mathbf{x}^{(4)}\}\).
We can make this more precise by defining an assignment function \(\mathcal{A}: \{1, 2, 3, 4\} \to \{1, 2, 3\}\) as follows:
where
\(\mathcal{A}(1) = 1\),
\(\mathcal{A}(2) = 3\),
\(\mathcal{A}(3) = 2\), and
\(\mathcal{A}(4) = 3\).
Note in this case we did not explicitly define the assignment function \(\mathcal{A}\) as we will derive it later.
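In code, this toy assignment could be stored, for example, as a plain dictionary mapping sample indices to cluster indices; the variable names below are purely illustrative.

# Hypothetical encoding of the assignment A for the 4-point example above.
A = {1: 1, 2: 3, 3: 2, 4: 3}  # sample index n -> cluster index k = A(n)

# Inverting the assignment recovers the clusters.
clusters = {k: [n for n, label in A.items() if label == k] for k in (1, 2, 3)}
print(clusters)  # {1: [1], 2: [3], 3: [2, 4]}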
Centroids (Representatives)#
(Centroids)
The centroids of a partition \(\hat{C}\) are the representatives of each cluster \(C_k \in \hat{C}\). In other words, for each cluster \(C_k \in \hat{C}\), there exists a centroid \(\boldsymbol{v}_k \in \mathbb{R}^D\) that represents it.
Cost Function#
We make precise the notion of closeness and similarity between data points by defining a cost function utilizing the Euclidean distance. In practice, we can use other distance metrics such as Manhattan distance that suits one’s needs.
(K-Means Cost Function)
For any assignment \(\mathcal{A}\) that maps the set \(\{1, 2, \ldots, N\}\) to \(\{1, 2, \ldots, K\}\) and any centroids \(\boldsymbol{v}_1, \boldsymbol{v}_2, \dots, \boldsymbol{v}_K \in \mathbb{R}^{D}\), we construct the cost function as follows:
\[\begin{split} \begin{aligned} \hat{\mathcal{J}}_{\mathcal{S}}\left(\mathcal{A}, \left\{\boldsymbol{v}_k\right\}_{k=1}^K\right) &= \sum_{n=1}^{N} \left\|\mathbf{x}^{(n)} - \boldsymbol{v}_{\mathcal{A}(n)}\right\|^2 \\ &= \sum_{n=1}^{N} \sum_{k=1}^{K} r^{(n)}_k \left\|\mathbf{x}^{(n)} - \boldsymbol{v}_k\right\|^2 \\ &= \sum_{k=1}^{K} \sum_{\mathbf{x}^{(n)} \in \hat{C}_k} \left\|\mathbf{x}^{(n)} - \boldsymbol{v}_k\right\|^2 , \end{aligned} \end{split}\]
where
\(\mathcal{A}(n) = k\) means that data point \(\mathbf{x}^{(n)}\) is assigned to cluster \(k\).
\(r^{(n)}_k\) is an indicator function that is equal to 1 if \(\mathcal{A}(n) = k\) and 0 otherwise.
\[\begin{split} \begin{aligned} r^{(n)}_k &= \begin{cases} 1 & \text{if } \mathcal{A}(n) = k \\ 0 & \text{otherwise} \end{cases} \end{aligned} \end{split}\]\(\hat{C}_k\) is the set of data points that are assigned to cluster \(k\).
\(\left\|\cdot\right\|\) is the Euclidean norm.
\[\begin{split} \begin{aligned} \left\|\mathbf{x} - \boldsymbol{v}\right\|^2 &= \left(\mathbf{x} - \boldsymbol{v}\right)^{\top} \left(\mathbf{x} - \boldsymbol{v}\right) \\ \end{aligned} \end{split}\]All of these forms are equivalent.
It is worth noting that we have not formally defined what the assignment \(\mathcal{A}\) is, as well as the vectors \(\boldsymbol{v}_1, \boldsymbol{v}_2, \dots, \boldsymbol{v}_K\). We will show later that \(\boldsymbol{v}_k\) is the mean of the data points in cluster \(k\) and that \(\mathcal{A}(n)=\underset{k}{\operatorname{argmin}} \left\|\mathbf{x}^{(n)} - \boldsymbol{\mu}_k \right\|^2\) is the assignment that minimizes the cost function \(\hat{\mathcal{J}}_{\mathcal{S}}\).
(Cost Function is a Function of Assignment and Centroids)
The cost function is a function of the assignment \(\mathcal{A}\) and the cluster centers \(\boldsymbol{v}_1, \boldsymbol{v}_2, \dots, \boldsymbol{v}_K\), which adds up the squared Euclidean distance between each data point and its assigned cluster center. The total cost is what we are minimizing. Note that the problem is equivalent to minimizing each cluster’s cost individually.
We also call this loss the SSE (sum of squared errors); up to a scaling factor, it is the intra-cluster variance, a measure of how spread out the data points are within a cluster.
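As a sanity check of the definition, a minimal NumPy sketch (with assumed variable names, not the book's reference implementation) that evaluates this cost for a given assignment and set of centroids could look like:

import numpy as np

def kmeans_cost(X, labels, centers):
    # Sum of squared Euclidean distances between each point and the
    # centroid of the cluster it is assigned to.
    return float(((X - centers[labels]) ** 2).sum())

# Tiny illustrative example with N=4 points, D=2 features, K=2 clusters.
X = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [11.0, 0.0]])
labels = np.array([0, 0, 1, 1])
centers = np.array([[0.5, 0.0], [10.5, 0.0]])
print(kmeans_cost(X, labels, centers))  # 1.0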
Objective Function#
Finally, we define the objective function as the cost function \(\hat{\mathcal{J}}_{\mathcal{S}}\) that we are minimizing.
(K-Means Objective Function)
The objective function is to minimize the above expression in (124):
\[ \mathcal{A}^*, \boldsymbol{v}_1^*, \ldots, \boldsymbol{v}_K^* = \underset{\mathcal{A}, \boldsymbol{v}_1, \ldots, \boldsymbol{v}_K}{\operatorname{argmin}} \hat{\mathcal{J}}_{\mathcal{S}}\left(\mathcal{A}, \left\{\boldsymbol{v}_k\right\}_{k=1}^K\right) . \]
This just means, for all possible assignments \(\mathcal{A}\) and cluster centers \(\boldsymbol{v}_1, \boldsymbol{v}_2, \ldots, \boldsymbol{v}_K\), we want to find the assignment \(\mathcal{A}\) and cluster centers \(\boldsymbol{v}_1, \boldsymbol{v}_2, \ldots, \boldsymbol{v}_K\) that minimize the cost function \(\hat{\mathcal{J}}_{\mathcal{S}}\).
In other words, of all possible sets \(\hat{C} = \left\{\hat{C}_1, \hat{C}_2, \ldots, \hat{C}_K\right\}\), we want to find the set \(\hat{C}\) that minimizes the cost function \(\hat{\mathcal{J}}_{\mathcal{S}}\).
(Minimizing Individual Cluster’s Cost is Equivalent to Minimizing the Objective Function)
Minimizing the objective function is equivalent to minimizing each cluster’s cost individually.
Recall we mentioned that optimizing the objective function \(\hat{\mathcal{J}}_{\mathcal{S}}\) means we are finding the optimal assignment \(\mathcal{A}^*(\cdot)\) and the optimal cluster centers \(\boldsymbol{v}_1^*, \boldsymbol{v}_2^*, \ldots, \boldsymbol{v}_K^*\) at the same time. This is challenging as \(\hat{\mathcal{J}}_{\mathcal{S}}\) is a non-convex function of \(\mathcal{A}\) and \(\boldsymbol{v}_1, \boldsymbol{v}_2, \ldots, \boldsymbol{v}_K\). We will now fall back on heuristics to find the local optimum. In what follows, we will list the necessary conditions to minimize the objective function \(\hat{\mathcal{J}}_{\mathcal{S}}\).
The Necessary Conditions to Minimize the Objective Function#
With all the definitions in place, we can now formally state the necessary conditions to minimize the objective function. Note a necessary condition only guarantees that if a solution is optimal, then the conditions must be satisfied. However, if a solution does satisfy the conditions, it does not necessarily mean that it is optimal. In short, we may land ourselves with a local minimum that is not globally optimal.
Condition 1: The Optimal Assignment#
(K-Means Optimal Assignment)
Fixing the cluster centers \(\boldsymbol{v}_1, \boldsymbol{v}_2, \ldots, \boldsymbol{v}_K\), we seek the optimal assignment \(\mathcal{A}^*(\cdot)\) that minimizes the cost function \(\hat{\mathcal{J}}_{\mathcal{S}}(\cdot)\).
We claim that the optimal assignment \(\mathcal{A}^*(\cdot)\) follows the nearest neighbor rule, which means that
\[ \mathcal{A}^{*}(n) = \underset{k \in \{1, 2, \ldots, K\}}{\operatorname{argmin}} \left\|\mathbf{x}^{(n)} - \boldsymbol{v}_k \right\|^2 . \]
Then the assignment \(\mathcal{A}^*\) is the optimal assignment that minimizes the cost function \(\hat{\mathcal{J}}_{\mathcal{S}}\).
This is quite intuitive as we are merely assigning each data point \(\mathbf{x}^{(n)}\) to cluster \(k\) whose center \(\boldsymbol{v}_k\) is closest to \(\mathbf{x}^{(n)}\).
We rephrase the claim by saying that for any assignment \(\mathcal{A}\), we have
\[ \hat{\mathcal{J}}_{\mathcal{S}}\left(\mathcal{A}^{*}, \left\{\boldsymbol{v}_k\right\}_{k=1}^K\right) \leq \hat{\mathcal{J}}_{\mathcal{S}}\left(\mathcal{A}, \left\{\boldsymbol{v}_k\right\}_{k=1}^K\right) . \]
Let’s prove this claim.
Proof. In (127), we have that \(\boldsymbol{v}_1, \boldsymbol{v}_2, \ldots, \boldsymbol{v}_K\) are fixed.
This is just a proof by definition of \(\mathcal{A}^*\), since for each data point \(\mathbf{x}^{(n)}\),
\[ \left\|\mathbf{x}^{(n)} - \boldsymbol{v}_{\mathcal{A}^{*}(n)}\right\|^2 \leq \left\|\mathbf{x}^{(n)} - \boldsymbol{v}_{\mathcal{A}(n)}\right\|^2 , \]
and summing this inequality over \(n = 1, \ldots, N\) yields the claim.
Condition 2: The Optimal Cluster Centers (Centroids)#
(K-Means Optimal Cluster Centers)
Fixing the assignment \(\mathcal{A}^*\), we seek the optimal cluster centers \(\boldsymbol{v}_1^*, \boldsymbol{v}_2^*, \ldots, \boldsymbol{v}_K^*\) that minimize the cost function \(\hat{\mathcal{J}}_{\mathcal{S}}\).
We claim that the optimal cluster center for each cluster is the mean of the data points assigned to that cluster,
\[ \boldsymbol{v}_k^* = \boldsymbol{\mu}_k = \frac{1}{\left|\hat{C}_k^*\right|} \sum_{\mathbf{x}^{(n)} \in \hat{C}_k^*} \mathbf{x}^{(n)} , \]
where \(\left|\hat{C}_k^*\right|\) is the number of data points assigned to cluster \(k\). We can denote it as \(N_k\) for convenience.
We can also rephrase this claim by saying that for any cluster centers \(\boldsymbol{v}_1, \boldsymbol{v}_2, \ldots, \boldsymbol{v}_K\), fixing the assignment \(\mathcal{A}^*\), we have
\[ \hat{\mathcal{J}}_{\mathcal{S}}\left(\mathcal{A}^{*}, \left\{\boldsymbol{\mu}_k\right\}_{k=1}^K\right) \leq \hat{\mathcal{J}}_{\mathcal{S}}\left(\mathcal{A}^{*}, \left\{\boldsymbol{v}_k\right\}_{k=1}^K\right) . \]
Proof. This proof in short just says that the mean minimizes the sum of squared distances.
Since we established (Theorem 66) that minimizing each individual cluster \(\hat{C}_k\) is equivalent to minimizing the cost function \(\hat{\mathcal{J}}_{\mathcal{S}}\), we can now fix any cluster \(\hat{C}_k\) (i.e. also fixing the assignment \(\mathcal{A}^*\)) and seek the optimal cluster center \(\boldsymbol{v}_k^*\) that minimizes the cost function \(\hat{\mathcal{J}}_{\hat{C}_k}\).
Note that after fixing the assignment \(\mathcal{A}^*\), \(\hat{\mathcal{J}}_{\hat{C}_k}\) is now just a function of \(\boldsymbol{v}_k\) and is the cost for that cluster,
\[ \hat{\mathcal{J}}_{\hat{C}_k}\left(\boldsymbol{v}_k\right) = \sum_{\mathbf{x}^{(n)} \in \hat{C}_k} \left\|\mathbf{x}^{(n)} - \boldsymbol{v}_k\right\|^2 . \]
We can now take the derivative of \(\hat{\mathcal{J}}_{\hat{C}_k}\) with respect to \(\boldsymbol{v}_k\) and set it to zero to find the optimal cluster center \(\boldsymbol{v}_k^*\),
\[ \nabla_{\boldsymbol{v}_k} \hat{\mathcal{J}}_{\hat{C}_k}\left(\boldsymbol{v}_k\right) = -2 \sum_{\mathbf{x}^{(n)} \in \hat{C}_k} \left(\mathbf{x}^{(n)} - \boldsymbol{v}_k\right) = -2 \left( \sum_{\mathbf{x}^{(n)} \in \hat{C}_k} \mathbf{x}^{(n)} - N_k \boldsymbol{v}_k \right) , \]
where \(N_k\) is the number of data points assigned to cluster \(k\).
Now, to minimize (130), we set this derivative to zero and solve for \(\boldsymbol{v}_k\),
\[ \boldsymbol{v}_k^* = \frac{1}{N_k} \sum_{\mathbf{x}^{(n)} \in \hat{C}_k} \mathbf{x}^{(n)} = \boldsymbol{\mu}_k , \]
recovering Criterion 2.
There are other variants of proof.
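For readers who prefer a numerical sanity check over the calculus, the small sketch below (with illustrative names and synthetic data) compares the cost of a single cluster evaluated at its mean against the cost at randomly perturbed centers; the mean is never beaten.

import numpy as np

rng = np.random.default_rng(0)
cluster = rng.normal(size=(50, 3))          # 50 points in a single cluster, D = 3
mu = cluster.mean(axis=0)                   # candidate optimal center

def cluster_cost(points, v):
    return float(((points - v) ** 2).sum())

best = cluster_cost(cluster, mu)
for _ in range(1000):
    v = mu + rng.normal(scale=0.5, size=3)  # random perturbation of the mean
    assert cluster_cost(cluster, v) >= best  # the mean always wins
print("cost at the mean:", best)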
(Notation)
Notation wise, \(\boldsymbol{v}_k^*\) will be denoted \(\boldsymbol{\mu}_k\) in the following sections.
Objective Function Re-defined#
We can now re-define the objective function (125) in terms of the optimal cluster centers and assignments.
(Cost Function is a function of assignments and cluster centers)
Reminder!
The cost function (131) is a function of the cluster assignments and the cluster centers, and therefore we are minimizing the cost function with respect to both. However, jointly optimizing the cluster assignments and cluster centers is computationally challenging, so we split the optimization into two steps, first optimizing the cluster assignments and then optimizing the cluster centers, in a greedy manner.
Algorithm#
We are now ready to define the full Lloyd’s algorithm for K-Means.
(Lloyd’s Algorithm (K-Means))
Given a set of data points (samples) \(\mathcal{S} = \left\{\mathbf{x}^{(1)}, \mathbf{x}^{(2)}, \dots, \mathbf{x}^{(N)}\right\}\),
the K-Means algorithm aims to group the data points into \(K\) clusters \(\hat{C} = \left\{\hat{C}_1, \hat{C}_2, \ldots, \hat{C}_K\right\}\)
such that the sum of squared distances between each data point and its cluster center is minimized.
In code, \(\hat{C}\) can be treated as a dictionary/hash map, where the key is the cluster number and the value is the set of data points assigned to that cluster.
Initialize \(K\) cluster centers \(\boldsymbol{\mu}_1^{(0)}, \boldsymbol{\mu}_2^{(0)}, \dots, \boldsymbol{\mu}_K^{(0)}\) randomly (best to be far apart) where the superscript \((0)\) denotes the iteration number \(t=0\).
In the very first iteration, there are no data points in any cluster \(\hat{C}_k^{(0)} = \emptyset\). Therefore, the cluster centers are just randomly chosen for simplicity.
By random, we mean that the cluster centers are randomly chosen from the data points \(\mathcal{S} = \left\{\mathbf{x}^{(1)}, \mathbf{x}^{(2)}, \dots, \mathbf{x}^{(N)}\right\}\) and not randomly chosen from the feature space \(\mathbb{R}^D\).
Subsequent iterations will have data points in the clusters \(\hat{C}_k^{(t)} \neq \emptyset\) and thus \(\boldsymbol{\mu}_k^{(t)}\) will be the mean of the data points in cluster \(k\).
Each \(\boldsymbol{\mu}_k^{(0)}\) is a \(D\)-dimensional vector, where \(D\) is the number of features, and represents the mean vector of all the data points in cluster \(k\).
\[ \boldsymbol{\mu}_k^{(0)} = \begin{bmatrix} \mu_{1k}^{(0)} & \mu_{2k}^{(0)} & \cdots & \mu_{Dk}^{(0)} \end{bmatrix}^{\mathrm{T}} \]and \(\mu_{dk}^{(0)}\) is the mean value of the \(d\)-th feature in cluster \(k\).
We denote \(\boldsymbol{\mu}\) to be the collection of all \(\boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \dots, \boldsymbol{\mu}_K\).
\[\begin{split} \boldsymbol{\mu} = \begin{bmatrix} \boldsymbol{\mu}_1 \\ \boldsymbol{\mu}_2 \\ \vdots \\ \boldsymbol{\mu}_K \end{bmatrix}_{K \times D} \end{split}\]
For \(t=0, 1, 2, \dots\)
Assignment Step (E): Assign each data point \(\mathbf{x}^{(n)}\) to the closest cluster center \(\boldsymbol{\mu}_k^{(t)}\),
(132)#\[\begin{split} \begin{aligned} \hat{y}^{(n) t} := \mathcal{A}^{* (t)}(n) &= \underset{k \in \{1, 2, \ldots, K\}}{\operatorname{argmin}} \left\| \mathbf{x}^{(n)} - \boldsymbol{\mu}_k^{(t)} \right\|^2 \\ \end{aligned} \end{split}\]In other words, \(\hat{y}^{(n) t}\) is the output of the optimal assignment rule at the \(t\)-th iteration and is the index of the cluster center \(\boldsymbol{\mu}_k^{(t)}\) that is closest to \(\mathbf{x}^{(n)}\).
For instance, if \(K = 3\), and for the first sample point \(n=1\), assume the closest cluster center is \(\boldsymbol{\mu}_2^{(t)}\), then the assignment \(\mathcal{A}^{*}\) will assign this point to cluster \(k=2\), \(\hat{y}^{(1)} = 2\). Note that \(\hat{y}^{(n)}\) is a scalar and has the same superscript as \(\mathbf{x}^{(n)}\), indicating they belong to the same sample.
For notational convenience, we can also denote \(\hat{C}_k^{(t)}\) as the set of data points that are assigned to cluster \(k\):
\[ \begin{aligned} \hat{C}_k^{(t)} &= \left\{ \mathbf{x}^{(n)} \mid \hat{y}^{(n)} = k \right\} \end{aligned} \]Mathematically, this means partitioning the data points using Voronoi Diagram, as mentioned in the previous section Definition 142.
Update Step (M): Update the cluster centers for the next iteration.
(133)#\[\begin{split} \begin{aligned} \boldsymbol{\mu}_k^{(t+1)} &= \frac{1}{|\hat{C}_k^{(t)}|} \sum_{\mathbf{x}^{(n)} \in \hat{C}_k^{(t)}} \mathbf{x}^{(n)} \\ \end{aligned} \end{split}\]Notice that the cluster center \(\boldsymbol{\mu}_k^{(t+1)}\) is the mean of all data points that are assigned to cluster \(k\).
Repeat steps 2 and 3 until the centroids stop changing.
(134)#\[ \begin{aligned} \boldsymbol{\mu}_k^{(t+1)} = \boldsymbol{\mu}_k^{(t)} \end{aligned} \]In other words,
\[ \begin{aligned} \hat{\mathcal{J}}_{\mathcal{S}}^{(t+1)}\left(\mathcal{A}^{* (t+1)}, \boldsymbol{\mu}^{(t+1)} \right) = \hat{\mathcal{J}}_{\mathcal{S}}^{(t)}\left(\mathcal{A}^{* (t)}, \boldsymbol{\mu}^{(t)} \right) \end{aligned} \]This is the convergence condition.
(K-Means is a Greedy Algorithm)
It is important to recognize that the K-Means (Lloyd’s) Algorithm optimizes two objectives in an alternating fashion. It alternatively changes both the assignment step \(\mathcal{A}^{*}\) and the update step \(\boldsymbol{\mu}_k^{(t+1)}\) to greedily minimize the cost function \(\hat{\mathcal{J}}(\mathcal{A}, \boldsymbol{\mu})\).
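Putting the initialization, the assignment step, the update step, and the stopping criterion together, a minimal NumPy sketch of Lloyd's algorithm (an illustrative implementation under the conventions above, not the book's reference code) might look like this:

import numpy as np

def lloyd_kmeans(X, K, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    # Initialize the K centers by sampling K distinct data points.
    centers = X[rng.choice(N, size=K, replace=False)].copy()
    for _ in range(max_iter):
        # Assignment step (E): nearest-center rule.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step (M): each center becomes the mean of its cluster
        # (an empty cluster keeps its previous center).
        new_centers = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(K)
        ])
        # Stop once the centroids no longer change.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers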
Convergence#
Lemma 1: Stirling Numbers of the Second Kind#
(Stirling Numbers of the Second Kind)
The Stirling Numbers of the Second Kind \(S(n, k)\) are defined as the number of ways to partition a set of \(n\) elements into \(k\) non-empty subsets.
There are at most \(k^n\) ways to partition a set of \(n\) elements into \(k\) non-empty subsets.
In our case, since there are \(N\) data points, and we want to partition them into \(K\) clusters, there are at most \(K^N\) ways to partition the data points into \(K\) clusters.
In other words, the assignment step \(\mathcal{A}\) has at most \(K^N\) possible mappings. The same applies to the update step \(\boldsymbol{\mu}_k\) since \(\boldsymbol{\mu}_k\) is dependent on the assignment step \(\mathcal{A}\).
Lemma 2: Cost Function of K-Means Monotonically Decreases#
(Cost Function of K-Means Monotonically Decreases)
The cost function \(\hat{\mathcal{J}}\) of K-Means monotonically decreases. This means
\[ \hat{\mathcal{J}}_{\mathcal{S}}\left(\mathcal{A}^{* (t+1)}, \boldsymbol{\mu}^{(t+1)} \right) \leq \hat{\mathcal{J}}_{\mathcal{S}}\left(\mathcal{A}^{* (t)}, \boldsymbol{\mu}^{(t)} \right) \]
for each iteration \(t\).
Proof. This is a consequence of Criterion 1 and Criterion 2.
In particular, the objective function \(\hat{\mathcal{J}}\) is made up of two steps, the assignment step and the update step. We minimize the assignment step by finding the optimal assignment \(\mathcal{A}^{*}\), and we minimize the update step by finding the optimal cluster centers \(\boldsymbol{\mu}_k^{*}\) based on the optimal assignment \(\mathcal{A}^{*}\) at each iteration.
Equation (127) shows that for each iteration \(t\), fixing the cluster centers (mean) \(\boldsymbol{\mu}_k^{(t)}\), the assignment step \(\mathcal{A}^{*(t)}\) is optimal.
This means
\[ \hat{\mathcal{J}}_{\mathcal{S}}\left(\mathcal{A}^{* (t+1)}, \boldsymbol{\mu}^{(t)} \right) \leq \hat{\mathcal{J}}_{\mathcal{S}}\left(\mathcal{A}^{* (t)}, \boldsymbol{\mu}^{(t)} \right) . \]
What this inequality means is that at iteration \(t\), at the assignment step (E), the cost function \(\hat{\mathcal{J}}\) is at least as large as the cost function \(\hat{\mathcal{J}}\) at the next iteration \(t + 1\). This implies that the cost function \(\hat{\mathcal{J}}\) monotonically decreases at this step.
Similarly, at the update step (M), fixing the assignment \(\mathcal{A}^{*(t + 1)}\), the cost function \(\hat{\mathcal{J}}\) at the next iteration \(t + 1\) can again only be as small or smaller,
\[ \hat{\mathcal{J}}_{\mathcal{S}}\left(\mathcal{A}^{* (t+1)}, \boldsymbol{\mu}^{(t+1)} \right) \leq \hat{\mathcal{J}}_{\mathcal{S}}\left(\mathcal{A}^{* (t+1)}, \boldsymbol{\mu}^{(t)} \right) . \]
Chaining the two inequalities shows that the cost function \(\hat{\mathcal{J}}\) does not increase at either step. So the cost function \(\hat{\mathcal{J}}\) monotonically decreases (more precisely, is non-increasing) at each iteration.
Lemma 3: Monotone Convergence Theorem#
(Monotone Convergence Theorem)
The Monotone Convergence Theorem states that a monotone (non-increasing or non-decreasing) sequence of real numbers that is bounded converges to a limit.
In our case, the sequence is the sequence of cost values \(\{\hat{\mathcal{J}}^{(t)}\}\), which is non-increasing (Lemma 2) and bounded below by \(0\).
So it is guaranteed that the sequence of cost values \(\{\hat{\mathcal{J}}^{(t)}\}\) converges to a limit \(\hat{\mathcal{J}}^{*}\) (a local optimum).
K-Means Converges in Finite Steps#
We are now left to show that the sequence of cost values \(\{\hat{\mathcal{J}}^{(t)}\}\) can take only finitely many distinct values, so that \(\{\hat{\mathcal{J}}^{(t)}\}\) converges in a finite number of steps.
Since Lemma 1 states that there are at most \(K^N\) possible assignments \(\mathcal{A}\), and the cluster centers \(\boldsymbol{\mu}_k\) are determined by the assignment, there are at most \(K^N\) possible values of the cost function \(\hat{\mathcal{J}}\). Then,
At each iteration \(t\), the cost function \(\hat{\mathcal{J}}^{(t)}\) decreases monotonically.
This means that if the cost function \(\hat{\mathcal{J}}^{(t + 1)}\) strictly decreases at iteration \(t+1\), then the assignments \(\mathcal{A}^{*(t + 1)}\) must be different from the assignments \(\mathcal{A}^{*(t)}\). Consequently, the partition never repeats while the cost function \(\hat{\mathcal{J}}\) is strictly decreasing.
Since there are only finitely many (at most \(K^N\)) possible assignments, the cost cannot keep strictly decreasing forever, and the algorithm eventually converges to a solution \(\mathcal{A}^{*}\).
For that specific initialization, the algorithm has a unique solution, and it is guaranteed to converge to that solution.
Local Minima#
It is known that K-Means converges in finite steps but does not guarantee convergence to the global minimum. This means that for different initializations, K-Means can converge to different local minima.
We can initialize the algorithm with different initializations, and run the algorithm multiple times. Then, we can choose the best solution among the different local minima.
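For instance, with scikit-learn (assuming it is installed, and reusing a hypothetical toy dataset X), one can either run several single-initialization fits manually and keep the one with the lowest inertia_ (scikit-learn's name for the cost \(\hat{\mathcal{J}}_{\mathcal{S}}\)), or simply rely on the built-in n_init parameter:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=3, random_state=42)

# Manual restarts: keep the fit with the smallest cost (inertia).
fits = [KMeans(n_clusters=3, n_init=1, random_state=seed).fit(X) for seed in range(10)]
best = min(fits, key=lambda km: km.inertia_)

# Equivalent built-in behaviour: n_init restarts, the best one is kept.
best_builtin = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(best.inertia_, best_builtin.inertia_)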
Hypothesis Space#
For completeness sake, let’s define the hypothesis space \(\mathcal{H}\) for K-Means.
Intuitively, the hypothesis space \(\mathcal{H}\) is the set of all possible clusterings of the data.
Formally, given a set of \(N\) data points \(\{\mathbf{x}^{(n)}\}_{n=1}^N\), let \(C_k\) be the Voronoi cell of the \(k\)-th cluster center \(\boldsymbol{\mu}_k\).
Then, we can write the class of functions \(\mathcal{H}\) as:
This means the hypothesis space \(\mathcal{H}\) is finite with cardinality \(K^N\).
Time and Space Complexity#
Brute Force Search and Global Minimum#
The hypothesis space \(\mathcal{H}\) is finite, implying that if we do a brute force search over all possible clusterings, we can find the global minimum.
Quoting from CMU 10-315, we consider the brute-force search to be the following:
(Brute Force Search for K-Means)
For each possible clustering \(\hat{C} = \left\{\hat{C}_1, \hat{C}_2, \ldots, \hat{C}_K\right\}\)
induced by an assignment \(\mathcal{A}\) in \(\mathcal{H}\), compute the optimal centroids \(\hat{\boldsymbol{\mu}}_1, \hat{\boldsymbol{\mu}}_2, \ldots, \hat{\boldsymbol{\mu}}_K\),
where
\[ \hat{\boldsymbol{\mu}}_k = \frac{1}{\left|\hat{C}_k\right|} \sum_{\mathbf{x}^{(n)} \in \hat{C}_k} \mathbf{x}^{(n)} \]
is the mean of the points in the \(k\)-th cluster.
Then, compute the cost function \(\hat{\mathcal{J}}\) using these centroids \(\hat{\boldsymbol{\mu}}\).
Repeat this for all possible clusterings \(\mathcal{A}\) in \(\mathcal{H}\) and finally return the clustering \(\hat{C}\) that gives the minimum cost function \(\hat{\mathcal{J}}\).
Then the time complexity of the brute force search is exponential in the number of inputs, since there are \(K^N\) possible clusterings and we are looping over each possible clustering to find the global minimum. Indeed, this has time complexity \(\mathcal{O}\left(K^N\right)\) (ignoring the polynomial cost of evaluating each clustering).
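A direct way to realize this brute-force search, feasible only for very small \(N\), is to enumerate all \(K^N\) assignments with itertools.product; the function below is an illustrative sketch, not an implementation referenced by the text:

import itertools
import numpy as np

def brute_force_kmeans(X, K):
    N = len(X)
    best_cost, best_labels = np.inf, None
    # Enumerate all K^N assignments of N points to K clusters.
    for assignment in itertools.product(range(K), repeat=N):
        labels = np.array(assignment)
        # The optimal centroids for this assignment are the cluster means.
        centers = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else np.zeros(X.shape[1])
            for k in range(K)
        ])
        cost = float(((X - centers[labels]) ** 2).sum())
        if cost < best_cost:
            best_cost, best_labels = cost, labels
    return best_labels, best_cost

# Only practical for tiny problems, e.g. N = 8, K = 2 gives 2^8 = 256 clusterings.
X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2], [5.3], [0.3]])
print(brute_force_kmeans(X, K=2))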
Lloyd’s Algorithm#
Let \(T\) denote the number of iterations of Lloyd’s algorithm.
Then, the average time complexity of Lloyd’s algorithm is \(\mathcal{O}(T N K D)\),
where \(N\) is the number of data points, \(K\) is the number of clusters, and \(D\) is the number of features.
This can be easily seen in the Python implementation written here. We are essentially looping like this:
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features, n_clusters, max_iter = 100, 2, 3, 10
X = rng.normal(size=(n_samples, n_features))
centers = X[rng.choice(n_samples, n_clusters, replace=False)].copy()
labels = np.zeros(n_samples, dtype=int)

for t in range(max_iter):
    # E step: assign each data point to the closest cluster center.
    for n in range(n_samples):
        # Distance from x^(n) to each of the K centers: O(KD) work,
        # since each center has D features.
        distances = ((centers - X[n]) ** 2).sum(axis=1)
        # Argmin over the K distances: O(K). Over all N samples the
        # E step therefore costs O(NKD) per iteration.
        labels[n] = int(distances.argmin())
    # M step: update the cluster centers.
    for k in range(n_clusters):
        # Mean of the points in the k-th cluster over D features; summed over
        # all K clusters this touches each of the N samples once, i.e. O(ND).
        members = X[labels == k]
        if len(members) > 0:
            centers[k] = members.mean(axis=0)
Hence the total time complexity is approximately \(\mathcal{O}(T N K D)\).
The worst case complexity is given by \(\mathcal{O}(N^{(K+2/D)})\)5.
| Train | Inference |
|---|---|
| \(\mathcal{O}(NKD)\) | \(\mathcal{O}(KD)\) |
For space complexity, we need to store the cluster centers \(\boldsymbol{\mu}_k\) and the cluster assignments \(\mathcal{A}(n)\), where the former is a \(K \times D\) matrix and the latter is an \(N\)-dimensional vector. We typically do not include the input data \(\left\{\mathbf{x}^{(n)}\right\}_{n=1}^N\) in the space complexity since it is given; if included, that adds \(\mathcal{O}(ND)\), totalling \(\mathcal{O}(N + KD + ND)\).
Inference wise, even for a single data point, we need to compute its distance to all cluster centers \(\boldsymbol{\mu}_k\), so the cost is roughly \(\mathcal{O}(KD)\) per query point.
| Train | Inference |
|---|---|
| \(\mathcal{O}(KD + N)\) | \(\mathcal{O}(KD)\) |
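Concretely, the \(\mathcal{O}(KD)\) inference cost for a single query point can be sketched as follows, assuming the fitted centers are available as a \(K \times D\) array:

import numpy as np

def predict_cluster(x, centers):
    # One pass over the K centers, each of dimension D: O(KD) work.
    return int(((centers - x) ** 2).sum(axis=1).argmin())

centers = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])  # K = 3, D = 2
print(predict_cluster(np.array([4.2, 4.9]), centers))  # 1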
Summary#
Advantages#
See Google’s article.
Relatively simple to implement.
Scales to large data sets.
Guarantees local convergence.
Can warm-start the positions of centroids.
Easily adapts to new examples.
Generalizes to clusters of different shapes and sizes, such as elliptical clusters.
Disadvantages#
The number of clusters (\(K\)) is specified a priori, which means we need to specify the number of clusters before running the algorithm. Choosing \(K\) may not be straightforward, especially in the case of high-dimensional data.
The Lloyd’s algorithm is sensitive to the initial cluster centers. This means that the algorithm may converge to a local minimum instead of the global minimum. To remedy this, we can run the algorithm multiple times with different initial cluster centers.
K-Means assumes spherical clusters. This is not obvious at first glance, so let us elaborate.
K-Means assumes spherical cluster shapes because the algorithm uses the Euclidean distance metric to measure the similarity between observations and centroids. Euclidean distance is a measure of straight-line distance between two points in a Euclidean space, and it implicitly assumes that the data is isotropic, meaning that the variance along all dimensions is equal. Now imagine a cluster with an elliptical shape whose principal axis is quite long: two points at the extreme ends of the cluster will have a large Euclidean distance, so the cluster may be split into two clusters by K-Means, which is not what we want. On the contrary, if the cluster is spherical, points at its boundary are roughly equidistant from the centroid, and the Euclidean distance reflects cluster membership well.
Further quoting the answer here, K-means is a special case of Gaussian Mixture Models (GMM). GMM assumes that the data comes from a mixture of \(K\) Gaussian distributions. In other words, there is a certain probability that the data comes from one of \(K\) of the Gaussian distributions.
If we make the probability of being in each of the \(K\) Gaussians equal, make the covariance matrices \(\sigma^2 \mathbf{I}\), where \(\sigma^2\) is the same fixed constant for each of the \(K\) Gaussians, and take the limit \(\sigma^2 \rightarrow 0\), then we recover K-means.
So, what does this tell us about the drawbacks of K-means?
K-means leads to clusters that look like multivariate Gaussians.
Since the variance across the variables is the same, K-means leads to clusters that look spherical.
Not only do clusters look spherical, but since the covariance matrix is the same across the \(K\) groups, K-means leads to clusters that look like the same sphere.
K-means tends towards equal sized groups.
Overall, if we interpret K-Means from the perspective of probabilistic modeling, we can see that K-Means is a special case of GMM. Recall that in the geometry of the multivariate Gaussian, the shape of the distribution is determined by the covariance matrix. Since we have deduced that the covariance matrix is \(\sigma^2 \mathbf{I}\), a diagonal matrix with equal variance across all features, the shape is a sphere, since all axes have equal length.
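The spherical assumption can be illustrated empirically. The sketch below (assuming scikit-learn is available) stretches isotropic blobs with a linear map so the clusters become elongated, a setting in which K-Means tends to cut clusters across their long axes:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, y_true = make_blobs(n_samples=600, centers=3, random_state=170)
# Stretch the data with a linear map to make the clusters elliptical (anisotropic).
transformation = np.array([[0.6, -0.6], [-0.4, 0.8]])
X_aniso = X @ transformation

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_aniso)
# Comparing `labels` with `y_true` (visually or with a clustering metric)
# typically shows the elongated clusters being split incorrectly.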
Scaling with number of dimensions. As the number of dimensions increases, a distance-based similarity measure converges to a constant value between any given examples. Reduce dimensionality either by using PCA on the feature data, or by using “spectral clustering” to modify the clustering algorithm.
It is best to feature scale if we use Euclidean distance as the distance metric, because features with a larger scale will dominate the distance metric; see the sketch after this list.
For categorical features, we no longer use mean as the cluster center. Instead, we use the mode.
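A minimal sketch of the feature-scaling advice above, assuming scikit-learn is available, is to standardize the features before clustering:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=500, centers=3, random_state=42)
# Scale each feature to zero mean and unit variance so that no single
# feature dominates the Euclidean distance.
X_scaled = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)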
- 1
Disjoint union indicates that each data point \(\mathbf{x}^{(n)}\) can be assigned to one and only one cluster \(C_k\).
- 2
Note \(C\) is not the same as \(\mathcal{S}\) even though they both represent all samples. The cardinality of \(C\) is \(K\), while the cardinality of \(\mathcal{S}\) is \(N\).
- 3
This means that we are jointly optimizing the assignments \(\mathcal{A}\) and the cluster centers \(\boldsymbol{\mu}_k\).
- 4
There are actually no ground truth target labels in unsupervised learning; this is for educational purposes only.
- 5
Refer to “How slow is the k-means method?” by D. Arthur and S. Vassilvitskii, SoCG 2006, for more details.