Unsupervised Learning:

Clustering with the k-means algorithm


This page can be downloaded as an interactive Jupyter notebook


The k-means algorithm is a technique for clustering, which is a task of unsupervised learning. It was first introduced by MacQueen, J. in Some methods for classification and analysis of multivariate observations (1967). K-means is an algorithm for centroid-based clustering of a dataset. Centroid-based means that each cluster is described by a single centroid. The goal of the algorithm is thus to determine the centroid of each cluster from the samples of the dataset.

The procedure of the algorithm is as follows. After an Initialization, two steps are iteratively repeated until a stopping criterion is fulfilled. These two steps are here called Assignment-Step and Update-Step.

The Initialization of the k-means algorithm is quite simple. It only requires placing the centroids randomly in the feature space. It is important to note that the initialization can have a large influence on the convergence speed of the algorithm. Furthermore, the algorithm is not guaranteed to converge to the global optimum, and which optimum is reached also depends on the initialization.

During the Assignment-Step, all samples are assigned to the most similar centroid. The similarity is usually measured by a distance metric in feature space, such as the L1 or L2 norm of the difference vector.

The Update-Step consists of updating the centroids: each centroid is moved to the mean of all samples assigned to the corresponding cluster.
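Before building this up step by step below, the following minimal sketch summarizes the whole procedure in code. The function and variable names here are only illustrative and differ from the implementation developed in the rest of this page:

import numpy as np

def k_means_sketch(X, k, max_iter=100, seed=0):
    # X: array of shape (num_samples, num_features), k: number of clusters
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct samples as initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment-Step: assign each sample to the closest centroid (squared L2 distance)
        sq_dists = np.sum((X[None, :, :] - centroids[:, None, :]) ** 2, axis=2)
        labels = np.argmin(sq_dists, axis=0)
        # Update-Step: move each centroid to the mean of its assigned samples
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        # Stopping criterion: the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels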


Implementation

In order to implement the algorithm, we first import the required Python modules:

import numpy as np                          # Used for numerical computations
import matplotlib.pyplot as plt             # Plotting library  

# This is to set the size of the plots in the notebook
plt.rcParams['figure.figsize'] = [6, 6]    

Creating a Toy Dataset

To visualize the clustering results we use a toy dataset. It will contain samples which are drawn from 3 normal distributions. Each distribution represents a cluster in 2D space.

# Store number of clusters
num_c = 3 

# Generate the samples (3 clusters); we set a fixed seed to make the results reproducible
np.random.seed(0)
c1_samples = [(2, 2) + np.random.randn(2) for i in range(100)]
c2_samples = [(7, 3) + np.random.randn(2) for i in range(100)]
c3_samples = [(3, 7) + np.random.randn(2) for i in range(100)]

# Plot the samples, colored by (true) cluster
plt.scatter(*zip(*c1_samples), c='red', marker='.', label='Samples of cluster 1')
plt.scatter(*zip(*c2_samples), c='green', marker='.', label='Samples of cluster 2')
plt.scatter(*zip(*c3_samples), c='blue', marker='.', label='Samples of cluster 3')
plt.legend()
plt.show()

[Figure: Scatter plot of the generated samples, colored by their true cluster]

Since our goal is to cluster a list of ungrouped samples, we put all samples in one list.

# Concatenate all samples to one list
all_samples = np.array(c1_samples + c2_samples + c3_samples)

Step 1: Initialization

Now we start the k-means algorithm by randomly initializing the centroids. Since we initialize one centroid per cluster, the number of clusters has to be known in advance!

# Initialize the centroids (here 3 2d-points); we set a fixed seed to make the results reproducible
np.random.seed(0)
centroids = [(4, 4) + np.random.randn(2) for i in range(num_c)]

Next, we plot the ungrouped samples and the initialized centroids:

plt.scatter(*zip(*all_samples), c='black', marker='.', label='Samples')
for i, c in enumerate(['red', 'green', 'blue']):
    plt.scatter(centroids[i][0], centroids[i][1], c='black', marker='o', s=150)
    plt.scatter(centroids[i][0], centroids[i][1], c=c, marker='o', s=50, label='Centroid '+str(i))
plt.legend()
plt.show()

[Figure: Ungrouped samples (black) together with the three randomly initialized centroids]

Step 2: Assignment

Now we assign each sample to the closest centroid. As a distance metric we choose the Euclidean distance. Note that we never compute the true Euclidean distance, but only the squared Euclidean distance. This is allowed since we are only interested in the argmin of the distances, and for non-negative values

$argmin(a, b, c) = argmin(a^2, b^2, c^2)$.
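For example, $argmin(3, 1, 2)$ picks the second entry, and so does $argmin(9, 1, 4)$. For possibly negative values the identity would not hold in general ($-3 < 1$, but $(-3)^2 > 1^2$); distances, however, are never negative.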

When operating with large datasets and/or in high-dimensional feature spaces, this trick can speed up the algorithm a lot. The code in the next cell computes all squared distances. Based on these distances, each sample is assigned to the closest centroid. The results are plotted.

# Compute difference vectors and squared distances
diffs = np.array([all_samples - centroid for centroid in centroids])
sq_dists = np.sum(np.square(diffs), axis=2)

# Calculate the ID of the closest centroid for each sample 
assignments = np.argmin(sq_dists, axis=0)

# Create a list of assigned samples per centroid
assigned_samples = [all_samples[assignments==i] for i in range(3)]

# Plot assigned samples and centroids
for i, c in enumerate(['red', 'green', 'blue']):
    plt.scatter(*zip(*assigned_samples[i]), c=c, marker='.', label='Assigned to Cl. ' + str(i))
    plt.scatter(centroids[i][0], centroids[i][1], c='black', marker='o', s=150)
    plt.scatter(centroids[i][0], centroids[i][1], c=c, marker='o', s=50, label='Cl. ' + str(i))
plt.legend(); plt.show()

[Figure: Samples colored by their assigned centroid after the first Assignment-Step]

Step 3: Update

Next, the centroids are updated. To do so, the mean of the corresponding list of assigned samples is calculated for each centroid.

# Compute mean of assigned samples per centroid
centroids = [np.mean(samples, axis=0) for samples in assigned_samples]

# Plot assigned samples and centroids
for i, c in enumerate(['red', 'green', 'blue']):
    plt.scatter(*zip(*assigned_samples[i]), c=c, marker='.')
    plt.scatter(centroids[i][0], centroids[i][1], c='black', marker='o', s=150)
    plt.scatter(centroids[i][0], centroids[i][1], c=c, marker='o', s=50, label='Cl. ' + str(i))
plt.legend(); plt.show()

[Figure: Assigned samples with the centroids moved to the cluster means after the first Update-Step]

Repeat steps 2 & 3

We now repeat steps 2 and 3 for a fixed number of iterations or until the assignments no longer change.

Iteration 2:

# Repeat step 2
sq_dists = np.sum(np.square(np.array([all_samples - centroid for centroid in centroids])), axis=2)
assignments = np.argmin(sq_dists, axis=0)
assigned_samples = [all_samples[assignments==i] for i in range(3)]

# Repeat step 3
centroids = [np.mean(samples, axis=0) for samples in assigned_samples]
for i, c in enumerate(['red', 'green', 'blue']):
    plt.scatter(*zip(*assigned_samples[i]), c=c, marker='.')
    plt.scatter(centroids[i][0], centroids[i][1], c='black', marker='o', s=150)
    plt.scatter(centroids[i][0], centroids[i][1], c=c, marker='o', s=50, label='Cl. '+str(i))
plt.legend(); plt.show()

[Figure: Assignments and centroids after iteration 2]

Iteration 3:

# Repeat step 2
sq_dists = np.sum(np.square(np.array([all_samples - centroid for centroid in centroids])), axis=2)
assignments = np.argmin(sq_dists, axis=0)
assigned_samples = [all_samples[assignments==i] for i in range(3)]

# Repeat step 3
centroids = [np.mean(samples, axis=0) for samples in assigned_samples]
for i, c in enumerate(['red', 'green', 'blue']):
    plt.scatter(*zip(*assigned_samples[i]), c=c, marker='.')
    plt.scatter(centroids[i][0], centroids[i][1], c='black', marker='o', s=150)
    plt.scatter(centroids[i][0], centroids[i][1], c=c, marker='o', s=50, label='Cl. '+str(i))
plt.legend(); plt.show()

[Figure: Assignments and centroids after iteration 3]

Iteration 4:

# Repeat step 2
sq_dists = np.sum(np.square(np.array([all_samples - centroid for centroid in centroids])), axis=2)
assignments = np.argmin(sq_dists, axis=0)
assigned_samples = [all_samples[assignments==i] for i in range(3)]

# Repeat step 3
centroids = [np.mean(samples, axis=0) for samples in assigned_samples]
for i, c in enumerate(['red', 'green', 'blue']):
    plt.scatter(*zip(*assigned_samples[i]), c=c, marker='.')
    plt.scatter(centroids[i][0], centroids[i][1], c='black', marker='o', s=150)
    plt.scatter(centroids[i][0], centroids[i][1], c=c, marker='o', s=50, label='Cl. '+str(i))
plt.legend(); plt.show()

[Figure: Assignments and centroids after iteration 4]

We can see that the assignments only changed slightly. Furthermore, we can observe that the final clusters represent the true clusters quite well!

Additional experiments

In order to run several experiments, we write a function that performs the k-means algorithm. The code is mostly the same as above; we just initialize the centroids by picking random samples from the dataset. Furthermore, a stopping criterion is added (the algorithm stops when the centroids no longer change).

def k_means(samples, num_clusters, max_num_iter=100):
    # Initialization
    centroids = [samples[np.random.randint(len(samples))] for i in range(num_clusters)]
    for it in range(max_num_iter):
        # Assignment step
        sq_dists = np.sum(np.square(np.array([samples - centroid for centroid in centroids])), axis=2)
        assignments = np.argmin(sq_dists, axis=0)
        assigned_samples = [samples[assignments==i] for i in range(num_clusters)]
        # Update step
        new_centroids = []
        for i in range(num_clusters):
            if len(assigned_samples[i]) > 0:
                new_centroids.append(np.mean(assigned_samples[i], axis=0))
            else:
                new_centroids.append(centroids[i])
        # Stopping condition
        if np.array_equal(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, assigned_samples, it

First, we want to check if the function works:

np.random.seed(0)
centroids, assigned_samples, it = k_means(all_samples, num_c)
print('Algorithm finished after {} iterations.'.format(it))

for i, c in enumerate(['red', 'green', 'blue']):
    plt.scatter(*zip(*assigned_samples[i]), c=c, marker='.')
    plt.scatter(centroids[i][0], centroids[i][1], c='black', marker='o', s=150)
    plt.scatter(centroids[i][0], centroids[i][1], c=c, marker='o', s=50, label='C_'+str(i))
plt.legend(); plt.show()
Algorithm finished after 4 iterations.

[Figure: Clustering result of the k_means function]

Influence of random initialization

In order to see the influence of the random initialization on the convergence, we run the above experiment 1000 times and keep track of the number of iterations the algorithm needs to converge.

np.random.seed(0)
counters = dict() 
for i in range(1000):
    _, _, it = k_means(all_samples, num_c)
    if it in counters:
        counters[it]+=1
    else:
        counters[it]=1

plt.bar(*zip(*list(counters.items())))
plt.show()

[Figure: Bar chart of the number of iterations needed to converge over 1000 runs]

We observe that most runs took about 2 to 4 iterations. Of course, this number highly depends on the dataset!

Influence of the number of clusters

In the above experiments, we knew the correct number of clusters in advance. In real applications this is often not the case. Let’s therefore visualize what happens if we choose a different (wrong) number of clusters.

Assuming 2 clusters

First, we will assume that we can approximate our data with two clusters.

np.random.seed(0)
centroids, assigned_samples, it = k_means(all_samples, 2)

print('Algorithm finished after {} iterations.'.format(it))

for i, c in enumerate(['red', 'green']):
    plt.scatter(*zip(*assigned_samples[i]), c=c, marker='.')
    plt.scatter(centroids[i][0], centroids[i][1], c='black', marker='o', s=150)
    plt.scatter(centroids[i][0], centroids[i][1], c=c, marker='o', s=50, label='C_'+str(i))
plt.legend(); plt.show()
Algorithm finished after 3 iterations.

[Figure: Clustering result with 2 clusters]

The algorithm still finds a way to group the data. In this setting, during most runs one of the found clusters represents a true cluster, while the other two true clusters are merged into the second found cluster.

Assuming 4 clusters

Now we will set the number of clusters to 4.

np.random.seed(0)
centroids, assigned_samples, it = k_means(all_samples, 4)

print('Algorithm finished after {} iterations.'.format(it))

for i, c in enumerate(['red', 'green', 'blue', 'orange']):
    plt.scatter(*zip(*assigned_samples[i]), c=c, marker='.')
    plt.scatter(centroids[i][0], centroids[i][1], c='black', marker='o', s=150)
    plt.scatter(centroids[i][0], centroids[i][1], c=c, marker='o', s=50, label='C_'+str(i))
plt.legend(); plt.show()
Algorithm finished after 9 iterations.

[Figure: Clustering result with 4 clusters]

Here, two of the true clusters are usually found ‘correctly’, while the last one is further divided into two groups.

Assuming 6 clusters

In the last experiment we set the number of clusters to 6.

np.random.seed(0)
centroids, assigned_samples, it = k_means(all_samples, 6)

print('Algorithm finished after {} iterations.'.format(it))

for i, c in enumerate(['red', 'green', 'blue', 'orange', 'cyan', 'magenta']):
    plt.scatter(*zip(*assigned_samples[i]), c=c, marker='.')
    plt.scatter(centroids[i][0], centroids[i][1], c='black', marker='o', s=150)
    plt.scatter(centroids[i][0], centroids[i][1], c=c, marker='o', s=50, label='C_'+str(i))
plt.legend(); plt.show()
Algorithm finished after 15 iterations.

[Figure: Clustering result with 6 clusters]

You may have expected that each of the three ‘true’ clusters would be divided into two groups, but here one cluster is found correctly while the other two are divided into two and three groups, respectively.

We can conclude that even when using a ‘wrong’ number of clusters, the algorithm still converges. Whether the result is usable or not may depend on the application.
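One simple way to quantify such results is the value that k-means implicitly minimizes: the sum of squared distances between all samples and their assigned centroids. Below is a minimal sketch of such a comparison, reusing the k_means function and the dataset from above (the helper within_cluster_sq_dist is only an illustrative addition, not part of the experiments above):

def within_cluster_sq_dist(centroids, assigned_samples):
    # Sum of squared distances of all samples to their assigned centroid
    total = 0.0
    for centroid, samples in zip(centroids, assigned_samples):
        if len(samples) > 0:
            total += np.sum(np.square(samples - centroid))
    return total

np.random.seed(0)
for k in [2, 3, 4, 6]:
    cents, assigned, _ = k_means(all_samples, k)
    print('k = {}: total squared distance = {:.1f}'.format(k, within_cluster_sq_dist(cents, assigned)))

Note that this value typically keeps decreasing as more clusters are used, so it cannot be used on its own to pick the ‘correct’ number of clusters.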


Author: Dennis Wittich
Last modified: 02.05.2019