Clustering
K-Means Algorithm
For a given dataset $x^{(1)}, \dots, x^{(n)}$ without any labels $y^{(i)}$, clustering is the task of finding a partition of the data into subsets (clusters) such that data points in the same cluster are more similar to each other than to those in other clusters. For this purpose, we usually use the k-means algorithm.
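As a concrete illustration of such an unlabeled dataset, the sketch below draws two-dimensional points from three Gaussian blobs; the blob centers and spread are arbitrary choices for illustration. The hidden blob structure is exactly what a clustering algorithm tries to recover, since no labels $y^{(i)}$ are provided.

```python
# A minimal sketch of the setting: an unlabeled dataset x^(1), ..., x^(n).
# The three Gaussian blobs below are an arbitrary, hypothetical choice.
import numpy as np

rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])  # hidden structure
X = np.vstack([rng.normal(loc=c, scale=0.8, size=(100, 2)) for c in centers])

print(X.shape)  # (300, 2): n = 300 points in the plane, no labels y^(i)
```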
The algorithm proceeds as follows:

Initialize cluster centroids $\mu_1, \dots, \mu_k$ randomly.
Repeat until convergence {
&nbsp;&nbsp;&nbsp;&nbsp;For each $i$, set $c^{(i)} \leftarrow \arg\min_j \lVert x^{(i)} - \mu_j \rVert^2$
&nbsp;&nbsp;&nbsp;&nbsp;For each $j$, set $\mu_j \leftarrow \dfrac{\sum_{i=1}^{n} 1\{c^{(i)} = j\}\, x^{(i)}}{\sum_{i=1}^{n} 1\{c^{(i)} = j\}}$
}

The first loop assigns to each data point $x^{(i)}$ the index $c^{(i)}$ of its closest centroid. The second loop updates each centroid $\mu_j$ as the mean of all data points $x^{(i)}$ currently assigned to cluster $j$.
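Below is a minimal NumPy sketch of these two steps. The function name `kmeans`, the choice to initialize centroids as $k$ randomly chosen data points, and the stopping test (centroids no longer move) are illustrative assumptions, not part of the algorithm's definition.

```python
# A minimal NumPy sketch of the two k-means update steps described above.
# Initialization and stopping criterion are assumptions made for illustration.
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize centroids mu_1, ..., mu_k as k randomly chosen data points.
    mu = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: c^(i) <- argmin_j ||x^(i) - mu_j||^2
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        c = dists.argmin(axis=1)
        # Update step: mu_j <- mean of the points currently assigned to cluster j
        # (a centroid with no assigned points is simply left where it is)
        new_mu = np.array([X[c == j].mean(axis=0) if np.any(c == j) else mu[j]
                           for j in range(k)])
        if np.allclose(new_mu, mu):  # stop once the centroids no longer move
            break
        mu = new_mu
    return c, mu
```

Calling `c, mu = kmeans(X, k=3)` on the blobs generated earlier returns the cluster index $c^{(i)}$ for each point and the $k$ centroids $\mu_j$.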
The distortion function, which measures the sum of squared distances between each data point and its assigned centroid, can be written as:

$$J(c, \mu) = \sum_{i=1}^{n} \left\lVert x^{(i)} - \mu_{c^{(i)}} \right\rVert^2$$

In each step of the k-means algorithm, $J$ either decreases or stays the same. Since $J$ is also bounded below by zero, the algorithm is guaranteed to converge. However, convergence to a global minimum is not guaranteed, since $J$ is non-convex; in practice, k-means is often run several times with different random initializations, keeping the solution with the lowest distortion.
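As a usage sketch, assuming scikit-learn is available: its `KMeans` estimator implements the multiple-restart strategy via the `n_init` parameter, and its `inertia_` attribute is the distortion $J$ evaluated at the returned assignments and centroids.

```python
# Sketch using scikit-learn (assumes the library is installed).
import numpy as np
from sklearn.cluster import KMeans

# Placeholder unlabeled data: three Gaussian blobs, as in the earlier sketch.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([rng.normal(loc=c, scale=0.8, size=(100, 2)) for c in centers])

# n_init=10 runs k-means from 10 random initializations and keeps the run
# with the lowest inertia_ (scikit-learn's name for the distortion J).
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.inertia_)          # distortion J of the best of the 10 runs
print(km.cluster_centers_)  # the learned centroids mu_j
```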