Last updated on May 7, 2022, 1:27 a.m.
K-Means is one of the most popular clustering algorithms and falls under unsupervised learning. Before we dive into the details of the K-Means algorithm, let's understand the basics:
Given a collection of data cases $x_{i} \in \mathbb{R}^{D}$, partition them into groups such that the data cases within each partition are more similar to each other than they are to data cases in other partitions.
In simpler terms: suppose we have $N$ data cases $D = \{x_{i}\}_{i=1:N}$.
A clustering of the $N$ cases into $K$ clusters is a partitioning of $D$ into $K$ mutually disjoint subsets $C = \{C_{1}, \dots, C_{K}\}$ such that $C_{1} \cup \dots \cup C_{K} = D$.
The K-Means algorithm is an iterative optimization algorithm for clustering that alternates between two steps. The algorithm maintains a set of K cluster centroids or prototypes $μ_{k}$ that represent the average (mean) feature vector of the data cases in each cluster. Suppose we let $z_{i}$ indicate which cluster $x_{i}$ belongs to and $μ_{k} \in \mathbb{R}^{D}$ be the cluster centroid/prototype for cluster $k$. The two main steps of the algorithm can then be expressed as follows:
Step 1: The distance between each data case and each prototype is computed, and each data case is assigned to the nearest prototype.
$$ z_{i} = \arg\min_{k} \|μ_{k} − x_{i}\|_{2}^{2} $$
Step 2: The prototypes are updated to the mean of the data cases assigned to them.
$$ μ_{k} = \frac{\sum_{i=1}^{N}[z_{i}=k]x_{i}}{\sum_{i=1}^{N}[z_{i}=k]} $$
This iterative two-step process repeats until the cluster assignments stop changing. The objective of K-Means is to minimize the sum of the within-cluster variation over all clusters (also called the within-cluster sum of squares):
$$ C^{*} = \arg\min_{C} \sum_{k=1}^{K} \frac{1}{|C_{k}|} \sum_{x_{i}, x_{j} \in C_{k}} \|x_{i} - x_{j}\|_{2}^{2} $$
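The two steps above can be sketched in NumPy. This is a minimal illustration, not a production implementation; the stopping rule (stop when the centroids no longer move) and the handling of empty clusters are typical choices, not part of the formal definition:

```python
import numpy as np

def kmeans(X, K, max_iters=100, seed=0):
    """Minimal K-Means: alternate assignment (Step 1) and update (Step 2)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialization: K randomly selected data cases as the centroids
    mu = X[rng.choice(len(X), size=K, replace=False)].copy()
    for _ in range(max_iters):
        # Step 1: assign each case to its nearest centroid (squared L2 distance)
        z = np.argmin(((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1), axis=1)
        # Step 2: move each centroid to the mean of its assigned cases
        # (an empty cluster keeps its old centroid — a common convention)
        new_mu = np.array([X[z == k].mean(0) if (z == k).any() else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):  # centroids stable => assignments stable
            break
        mu = new_mu
    return z, mu
```

On well-separated data this recovers the obvious grouping; on harder data the result depends on the initialization, as discussed below.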
Note:
The above objective function of K-Means has many local optima in general, each corresponding to a different clustering of the data. Therefore, K-Means is only guaranteed to converge to some local optimum; finding the global optimum is not computationally tractable (the minimization problem is NP-hard).
One of the best approaches is to run a grid search over a range of $k$ values and optimize some clustering metric score. If we have access to ground-truth labels, we can use the homogeneity score to judge the quality of the clusters; if the data is fully unsupervised, we can use the silhouette score, which compares each point's intra-cluster distance to its distance from the nearest other cluster. For more info, check out the clustering metrics of scikit-learn.
One of the key steps of K-means is cluster initialization. As K-means finds a local optimum, it can be highly sensitive to initialization. Commonly used initializations are:
1. Set the initial centers to be randomly selected data cases.
2. Set the initial partition to a random partition.
3. Select centers using a "furthest first"-style heuristic. K-Means++ is a closely related randomized variant that samples each new center with probability proportional to its squared distance from the nearest center chosen so far.
It is also common to perform multiple random re-starts of the algorithm and take the clustering with the lowest total within-cluster variation.
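The seeding heuristic from option 3 can be sketched as follows. This is an illustrative sketch of K-Means++ style seeding; as the comment notes, replacing the sampling step with a deterministic argmax gives the plain "furthest first" rule:

```python
import numpy as np

def kmeanspp_init(X, K, seed=0):
    """K-Means++ seeding: spread the initial centers out across the data."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # first center: a uniformly random data case
    centers = [X[rng.integers(len(X))]]
    for _ in range(K - 1):
        # squared distance from each case to its nearest chosen center
        d2 = np.min([((X - c) ** 2).sum(-1) for c in centers], axis=0)
        # K-Means++ samples the next center with probability proportional to d2;
        # taking argmax(d2) instead gives the deterministic "furthest first" rule
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)
```

The resulting centers can then seed the usual two-step iteration in place of purely random initialization.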
Distance Function: K-Means is tied to Euclidean distance, because the mean is the minimizer of squared Euclidean distance. An alternative based on Manhattan distance, which updates each center to the per-coordinate median, is called the K-Medians algorithm.
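A single K-Medians iteration differs from K-Means only in the distance used for assignment and the statistic used for the update. A minimal sketch of one such iteration (empty-cluster handling is simplified to keeping the old center):

```python
import numpy as np

def kmedians_step(X, mu):
    """One K-Medians iteration: L1 assignment, per-coordinate median update."""
    X = np.asarray(X, dtype=float)
    mu = np.asarray(mu, dtype=float)
    # assignment uses Manhattan (L1) distance instead of Euclidean
    z = np.argmin(np.abs(X[:, None] - mu[None]).sum(-1), axis=1)
    # update uses the per-coordinate median, the minimizer of L1 distance
    new_mu = np.array([np.median(X[z == k], axis=0) if (z == k).any() else mu[k]
                       for k in range(len(mu))])
    return z, new_mu
```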
Data preprocessing requirements: pre-processing steps like re-scaling/normalizing features can completely change the results, because features with larger scales dominate the Euclidean distance.
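For example, z-scoring each feature before clustering keeps a feature measured on a large scale from dominating the distance computation. A minimal sketch (scikit-learn's `StandardScaler` performs the same transformation):

```python
import numpy as np

def standardize(X):
    # z-score each feature: zero mean, unit variance per column
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)
```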
Search for Optimal K value: K-Means requires a well-chosen $k$ in order to cluster on salient differences between data cases rather than on noise.
Time Complexity: The run time is O(NKDT), where D is the dimensionality of the data and T is the number of iterations until the total variation converges. T is often small in practice, but examples can be constructed that require an exponential number of steps to converge.
Sensitive to Outliers: because each centroid is a mean, a few extreme data cases can pull a centroid far from the bulk of its cluster.