# Unsupervised Learning

## Clustering with the Expectation-Maximization algorithm

The Expectation-Maximization (EM) algorithm is a technique for data clustering, a task of unsupervised learning. EM belongs to the distribution-based clustering algorithms, meaning that a probability distribution is used to model the dataset and its clusters. Usually a Gaussian mixture model (GMM) is used for this, where every Gaussian corresponds to one cluster. The target of the algorithm is thus to determine the most likely distribution given the $N$ samples $\mathbf{x}_n$, $n \in \{1, \dots, N\}$, of a dataset.
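
A GMM models the overall density of the data as a weighted sum of $K$ Gaussian components, where the mixture coefficients $\pi_k$ act as cluster weights:

$p(\mathbf{x}) = \sum\limits_{k=1}^K \pi_k \cdot N(\mathbf{x} \mid \mu_k, \Sigma_k), \qquad \pi_k \geq 0, \quad \sum\limits_{k=1}^K \pi_k = 1$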

The procedure of the algorithm is as follows: after an initialization, two steps are repeated iteratively until a stopping criterion is fulfilled. These two steps are called the Expectation-Step and the Maximization-Step.

The initialization consists of randomly initializing the parameters of the GMM. In particular, the first and second moments of each Gaussian of the GMM and the mixture coefficients have to be set. It is important to note that the initialization can have a large influence on the convergence speed of the algorithm. Furthermore, the algorithm is not guaranteed to converge to the global optimum, which also depends on the initialization. To obtain a good initialization, one usually first performs a k-means clustering, which is much faster. The centroids determined by the k-means algorithm are then used to initialize the means of the GMM.
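
Such a k-means initialization could be sketched as follows. This is a minimal, self-contained example using `scipy.cluster.vq.kmeans2` on hypothetical toy data; it is not part of this notebook's pipeline:

```python
import numpy as np
from scipy.cluster.vq import kmeans2

# Hypothetical 2D toy data: two blobs around (0, 0) and (5, 5)
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])

# Run k-means with k=2; minit='++' uses the k-means++ seeding strategy
np.random.seed(0)
centroids, labels = kmeans2(data, 2, minit='++')

# The resulting centroids would then serve as initial means of the GMM
init_means = [c for c in centroids]
```

The covariance matrices and mixture coefficients would still be initialized as described above.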

During the Expectation-Step, each sample is assigned relevance factors (often called responsibilities) $\gamma_{nk}$, which represent how much sample $\mathbf{x}_n$ belongs to cluster $k \in \{1, \dots, K\}$. When using a GMM to model the clusters, the relevance factors are computed from the probability density functions $N(\mathbf{x}_n \mid \mu_k, \Sigma_k)$ (here normal distributions) as well as the mixture coefficients $\pi_k$:

$\gamma_{nk} = \dfrac{\pi_k \cdot N(\mathbf{x}_n | \mu_k,\Sigma_k)}{\sum\limits_{j=1}^K \pi_j \cdot N(\mathbf{x}_n | \mu_j,\Sigma_j)}$
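
As a quick numeric illustration of this formula, here is a standalone sketch with made-up parameters (two components, one sample), independent of the notebook's variables:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Two hypothetical 2D Gaussians and their mixture coefficients
mus = [np.array([0.0, 0.0]), np.array([4.0, 4.0])]
sigmas = [np.eye(2), np.eye(2)]
pis = [0.5, 0.5]

x = np.array([1.0, 1.0])  # a single sample

# Numerator: pi_k * N(x | mu_k, Sigma_k) for each component k
weighted = np.array([pis[k] * multivariate_normal(mus[k], sigmas[k]).pdf(x)
                     for k in range(2)])
# Normalize so the responsibilities sum to 1 over the components
gamma = weighted / weighted.sum()
print(gamma)  # the sample lies much closer to the first Gaussian, so gamma[0] > gamma[1]
```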

The Maximization-Step consists of updating the parameters of the GMM according to the samples and their current relevance factors. The update equations are:

$N_k = \sum\limits_{n=1}^N \gamma_{nk}\hspace{3cm}$$\mu_k = \dfrac{1}{N_k}\cdot\sum\limits_{n=1}^N \gamma_{nk} \cdot \mathbf{x}_n$

$\Sigma_k = \dfrac{1}{N_k}\cdot\sum\limits_{n=1}^N \gamma_{nk} \cdot (\mathbf{x}_n - \mu_k) \cdot (\mathbf{x}_n - \mu_k)^T\hspace{1cm}$ $\pi_k = \dfrac{N_k}{N}$

## Implementation

In order to implement the algorithm, we first import the required python modules:

import numpy as np                          # Used for numerical computations
import matplotlib.pyplot as plt             # Plotting library
from matplotlib.patches import Ellipse      # Used to visualize Gaussians in 2D
from scipy.stats import multivariate_normal # Used to evaluate the P.D.F.

# This is to set the size of the plots in the notebook
plt.rcParams['figure.figsize'] = [18, 5]


To plot a Gaussian in 2D, we use a function that draws its 1- and 2-sigma ellipses:

def draw_confidence_ellipsoid(ax, mean, cov, c):
    # Eigen-decomposition yields the axis lengths and orientation of the ellipse
    L, v = np.linalg.eigh(cov)
    L = np.sqrt(L)
    angle = np.degrees(np.arctan2(v[1, 0], v[0, 0]))
    for s in range(1, 3):  # 1-sigma and 2-sigma ellipses
        ell = Ellipse(xy=mean, width=L[0]*s*2, height=L[1]*s*2, angle=angle,
                      edgecolor=c, facecolor='none', linewidth=2)
        ax.add_patch(ell)


### Creating a Toy Dataset

To visualize the clustering results we will use a toy dataset. It will contain samples which are drawn from 3 multivariate normal distributions. Each distribution represents a cluster in 2D space.

# Store number of clusters
num_c = 3

true_means = [[3,3], [7,4.5], [4,7]]
true_covars = [[[1.2, 1], [1, 1.2]], [[1.5, 0], [0, 1.5]], [[0.5, -0.25], [-0.25, 0.5]]]
num_samples = [200, 500, 100]

# Generate the samples (3 clusters); we set a fixed seed to make the results reproducible
np.random.seed(11)
c1_samples = [np.random.multivariate_normal(true_means[0], true_covars[0])  for i in range(num_samples[0])]
c2_samples = [np.random.multivariate_normal(true_means[1], true_covars[1])  for i in range(num_samples[1])]
c3_samples = [np.random.multivariate_normal(true_means[2], true_covars[2])  for i in range(num_samples[2])]

# Plot the samples, colored by (true) cluster
plt.subplot(1,3,1)
plt.scatter(*zip(*c1_samples), c='red', marker='.', label='Samples of cluster 1')
plt.scatter(*zip(*c2_samples), c='green', marker='.', label='Samples of cluster 2')
plt.scatter(*zip(*c3_samples), c='blue', marker='.', label='Samples of cluster 3')

plt.legend(); plt.xlim(0, 10); plt.ylim(0, 10); plt.show()


Since our goal is to cluster a list of ungrouped samples, we put all samples into a single array.

# Concatenate all samples to one list
all_samples = np.array(c1_samples + c2_samples + c3_samples)


### Step 1: Initialization

Now we start the EM algorithm by randomly initializing the mean values of all Gaussians. The covariance matrices are usually set to diagonal matrices with a pre-defined standard deviation. The mixture coefficients are initialized uniformly, i.e. each is set to $1/K$. Since we initialize one Gaussian per cluster, the number of clusters has to be known in advance!

np.random.seed(11)
# Initialize the means of each Gaussian (here 3 2d-points)
means = [(6, 6) + np.random.randn(2) for i in range(num_c)]
# Initialize the covariance matrices of each Gaussian (here 3 2x2 diagonal matrices with std. dev. = 1)
covars = [np.eye(2) for i in range(num_c)]
# Initialize the mixture coefficients (uniformly, i.e. 1/K each)
mix_coefs = [1/num_c for i in range(num_c)]


Next, we plot the ungrouped samples and initialized Gaussians.

plt.subplot(1,3,1)
plt.scatter(*zip(*all_samples), c='black', marker='.', s=10, label='Samples')
for i, c in enumerate(['red', 'green', 'blue']):
    plt.scatter(means[i][0], means[i][1], c='black', marker='o', s=150)
    plt.scatter(means[i][0], means[i][1], c=c, marker='o', s=50, label='Centroid '+str(i))
    draw_confidence_ellipsoid(plt.gca(), means[i], covars[i], c)
plt.legend()
plt.show()


### Step 2: Expectation-Step

Now we compute the relevance factors for each sample and cluster.

mvns = [multivariate_normal(means[i], covars[i]) for i in range(num_c)]
weighted_pdfs = np.array([mvns[i].pdf(all_samples) * mix_coefs[i] for i in range(num_c)])
rel_facs = weighted_pdfs / np.sum(weighted_pdfs, axis=0, keepdims=True)


Let’s visualize the relevance factors for each cluster by plotting the samples with their marker sizes set according to the factors.

plt.suptitle('Relevance factors')
for c in range(num_c):
    plt.subplot(1, 3, c+1)
    plt.scatter(*zip(*all_samples), c='black', marker=".", s=rel_facs[c]*15,
                label='Relevances for cluster ' + str(c))
    for i, col in enumerate(['red', 'green', 'blue']):
        plt.scatter(means[i][0], means[i][1], c='black', marker='o', s=150)
        plt.scatter(means[i][0], means[i][1], c=col, marker='o', s=50, label='Mean ' + str(i))
    plt.legend()
plt.show()


### Step 3: Maximization-Step

Next, the parameters of the distribution are updated.

Ns = np.sum(rel_facs, axis=1)

weighted_samples = [all_samples*np.repeat(rel_facs[i][:, None], 2, 1) for i in range(num_c)]
# Update the means of each Gaussian
means = [np.sum(weighted_samples[i], axis=0) / Ns[i] for i in range(num_c)]
# Update the covariance matrices of each Gaussian
# (note: np.cov applies a bias correction, so the normalization slightly
# differs from dividing by exactly N_k)
covars = [np.cov(all_samples.T, aweights=rel_facs[i]) for i in range(num_c)]
# Update the mixture coefficients
mix_coefs = Ns/np.sum(Ns)

plt.subplot(1,3,1)
plt.scatter(*zip(*all_samples), c='black', marker='.', s=10, label='Samples')
for i, c in enumerate(['red', 'green', 'blue']):
    plt.scatter(means[i][0], means[i][1], c='black', marker='o', s=150)
    plt.scatter(means[i][0], means[i][1], c=c, marker='o', s=50, label='Mean '+str(i))
    draw_confidence_ellipsoid(plt.gca(), means[i], covars[i], c)
plt.legend()
plt.show()


### Second iteration

Now we will repeat steps 2 and 3.

mvns = [multivariate_normal(means[i], covars[i]) for i in range(num_c)]
weighted_pdfs = np.array([mvns[i].pdf(all_samples) * mix_coefs[i] for i in range(num_c)])
rel_facs = weighted_pdfs / np.sum(weighted_pdfs, axis=0, keepdims=True)
plt.suptitle('Relevance factors')
for c in range(num_c):
    plt.subplot(1, 3, c+1)
    plt.scatter(*zip(*all_samples), c='black', marker=".", s=rel_facs[c]*15,
                label='Relevances for cluster ' + str(c))
    for i, col in enumerate(['red', 'green', 'blue']):
        plt.scatter(means[i][0], means[i][1], c='black', marker='o', s=150)
        plt.scatter(means[i][0], means[i][1], c=col, marker='o', s=50, label='Mean ' + str(i))
    plt.legend()
plt.show()


Ns = np.sum(rel_facs, axis=1)
weighted_samples = [all_samples*np.repeat(rel_facs[i][:, None], 2, 1) for i in range(num_c)]
means = [np.sum(weighted_samples[i], axis=0) / Ns[i] for i in range(num_c)]
covars = [np.cov(all_samples.T, aweights=rel_facs[i])  for i in range(num_c)]
mix_coefs = Ns/np.sum(Ns)

plt.subplot(1,3,1)
plt.scatter(*zip(*all_samples), c='black', marker='.', s=10, label='Samples')
for i, c in enumerate(['red', 'green', 'blue']):
    plt.scatter(means[i][0], means[i][1], c='black', marker='o', s=150)
    plt.scatter(means[i][0], means[i][1], c=c, marker='o', s=50, label='Mean '+str(i))
    draw_confidence_ellipsoid(plt.gca(), means[i], covars[i], c)
plt.legend()
plt.show()


We can see that the clusters moved, although the changes are quite small.

### Further iterating

For the further iterations, we use a loop and introduce a stopping criterion based on the mean absolute change of the weighted pdfs:

mean_abs_changes = []
tracked_means = []
for i in range(10000):
    mvns = [multivariate_normal(means[i], covars[i]) for i in range(num_c)]
    weighted_pdfs = np.array([mvns[i].pdf(all_samples) * mix_coefs[i] for i in range(num_c)])
    rel_facs = weighted_pdfs / np.sum(weighted_pdfs, axis=0, keepdims=True)

    Ns = np.sum(rel_facs, axis=1)
    weighted_samples = [all_samples*np.repeat(rel_facs[i][:, None], 2, 1) for i in range(num_c)]
    means = [np.sum(weighted_samples[i], axis=0) / Ns[i] for i in range(num_c)]
    covars = [np.cov(all_samples.T, aweights=rel_facs[i]) for i in range(num_c)]
    mix_coefs = Ns/np.sum(Ns)

    tracked_means.append(means)

    if i > 0:
        mean_abs_change = np.mean(np.abs(prev_weighted_pdfs - weighted_pdfs))
        mean_abs_changes.append(mean_abs_change)
        if mean_abs_change < 0.00001:
            break
    prev_weighted_pdfs = weighted_pdfs.copy()
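
A common alternative stopping criterion, not used above but shown here as a minimal sketch, monitors the change of the data log-likelihood, which EM is guaranteed never to decrease. Given a `weighted_pdfs` array of shape `(K, N)` as computed in the E-step, it could look like this:

```python
import numpy as np

def log_likelihood(weighted_pdfs):
    """Total log-likelihood: sum over samples of log(sum_k pi_k * N(x_n | mu_k, Sigma_k))."""
    return np.sum(np.log(np.sum(weighted_pdfs, axis=0)))

# Hypothetical usage inside the EM loop:
# ll = log_likelihood(weighted_pdfs)
# if i > 0 and abs(ll - prev_ll) < 1e-6:
#     break
# prev_ll = ll
```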


In addition to the final clustering result, we want to see the tracked ‘change’ and the movement of the cluster means.

plt.subplot(1,3,1)
plt.title('Mean absolute change of weighted pdfs')
plt.plot(mean_abs_changes)

plt.subplot(1,3,2)
plt.title("'Path' of means")
plt.scatter(*zip(*all_samples), c='black', marker='.', s=10, label='Samples')
for i, c in enumerate(['red', 'green', 'blue']):
    xvs = [m[i][0] for m in tracked_means]
    yvs = [m[i][1] for m in tracked_means]
    plt.plot(xvs, yvs, c=c, linewidth=5)

plt.subplot(1,3,3)
plt.title('Final clustering result')
plt.scatter(*zip(*all_samples), c='black', marker='.', s=10, label='Samples')
for i, c in enumerate(['red', 'green', 'blue']):
    plt.scatter(means[i][0], means[i][1], c='black', marker='o', s=150)
    plt.scatter(means[i][0], means[i][1], c=c, marker='o', s=50, label='Mean '+str(i))
    draw_confidence_ellipsoid(plt.gca(), means[i], covars[i], c)
plt.legend(); plt.show()


### Comparison of the results with the true distribution

Finally, we want to see how well the algorithm performed. Therefore, we first visually compare the resulting distribution with the distributions from which the samples were actually drawn.

plt.subplot(1,3,1)
plt.title('Clustering result')
plt.scatter(*zip(*all_samples), c='black', marker='.', s=10, label='Samples')
for i, c in enumerate(['red', 'green', 'blue']):
    plt.scatter(means[i][0], means[i][1], c='black', marker='o', s=150)
    plt.scatter(means[i][0], means[i][1], c=c, marker='o', s=50, label='Mean '+str(i))
    draw_confidence_ellipsoid(plt.gca(), means[i], covars[i], c)
plt.legend()

plt.subplot(1,3,2)
plt.title('True distribution')
plt.scatter(*zip(*all_samples), c='black', marker='.', s=10, label='Samples')
for i, c in enumerate(['red', 'green', 'blue']):
    plt.scatter(true_means[i][0], true_means[i][1], c='black', marker='o', s=150)
    plt.scatter(true_means[i][0], true_means[i][1], c=c, marker='o', s=50, label='Mean '+str(i))
    draw_confidence_ellipsoid(plt.gca(), true_means[i], true_covars[i], c)
plt.legend(); plt.show()


We can see that the estimated Gaussians are quite similar to the actual ones. It is also observable that two cluster IDs are swapped. This is not surprising, since the algorithm doesn’t care about the IDs. For an easier comparison, we will bring the predicted clusters into the correct order.

# Swap parameters of first and second Gaussian
means[0], means[1] = means[1], means[0]
covars[0], covars[1] = covars[1], covars[0]
mix_coefs[0], mix_coefs[1] = mix_coefs[1], mix_coefs[0]
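
The manual swap above could also be automated by matching each true mean to the nearest estimated mean. This is only a minimal sketch; with few, well-separated clusters a greedy nearest-neighbour match like this is usually sufficient:

```python
import numpy as np

def match_clusters(est_means, ref_means):
    """Return, for each reference mean, the index of the nearest estimated mean."""
    est = np.asarray(est_means, dtype=float)
    ref = np.asarray(ref_means, dtype=float)
    # Pairwise Euclidean distances, shape (num_ref, num_est)
    dists = np.linalg.norm(ref[:, None, :] - est[None, :, :], axis=2)
    return np.argmin(dists, axis=1)

# Hypothetical usage: reorder the estimated parameters to match the true ones
# order = match_clusters(means, true_means)
# means = [means[j] for j in order]
```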


Let’s next take a look at the mixture coefficients:

print('Cluster:                   1  /   2   /   3   ')
print('Mixture coefficients:   {:.3f} / {:.3f} / {:.3f}'.format(*mix_coefs))
print('True ratio of clusters: {:.3f} / {:.3f} / {:.3f}'.format(*num_samples/np.sum(num_samples)))

Cluster:                   1  /   2   /   3
Mixture coefficients:   0.250 / 0.620 / 0.130
True ratio of clusters: 0.250 / 0.625 / 0.125


We can observe that the algorithm managed to determine the cluster priors (relative amount of samples per cluster) quite well! In the last step we compare the final parameters of the GMM to the parameters of the true distributions:

for i in range(num_c):
    print('Cluster', i)
    print('  Mean:')
    print('    GMM:  {:.3f}, {:.3f}'.format(*means[i]))
    print('    True: {:.3f}, {:.3f}'.format(*true_means[i]))
    print('  Covariance matrix:')
    print('    GMM:  [{:.3f}, {:.3f}'.format(*covars[i][0]))
    print('           {:.3f}, {:.3f}]'.format(*covars[i][1]))
    print('    True: [{:.3f}, {:.3f}'.format(*true_covars[i][0]))
    print('           {:.3f}, {:.3f}]'.format(*true_covars[i][1]))

Cluster 0
  Mean:
    GMM:  3.011, 3.007
    True: 3.000, 3.000
  Covariance matrix:
    GMM:  [1.115, 0.936
           0.936, 1.135]
    True: [1.200, 1.000
           1.000, 1.200]
Cluster 1
  Mean:
    GMM:  6.981, 4.535
    True: 7.000, 4.500
  Covariance matrix:
    GMM:  [1.580, -0.012
           -0.012, 1.518]
    True: [1.500, 0.000
           0.000, 1.500]
Cluster 2
  Mean:
    GMM:  4.017, 7.037
    True: 4.000, 7.000
  Covariance matrix:
    GMM:  [0.393, -0.209
           -0.209, 0.551]
    True: [0.500, -0.250
           -0.250, 0.500]


We see that the resulting parameters of the GMM match the true parameters of each normal distribution quite closely! Please keep in mind that we used a GMM to model the clusters here, which works very well because the clusters are actually normally distributed.
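
As a final remark, the fitted GMM only yields soft assignments (the relevance factors). To obtain hard cluster labels, each sample can be assigned to the component with the highest responsibility. A minimal sketch, assuming an array of relevance factors with shape `(K, N)` as computed in the E-step:

```python
import numpy as np

def hard_labels(rel_facs):
    """Assign each sample to the cluster with the largest responsibility."""
    return np.argmax(rel_facs, axis=0)

# Hypothetical example: 2 clusters, 3 samples
rel = np.array([[0.9, 0.2, 0.5],
                [0.1, 0.8, 0.5]])
print(hard_labels(rel))  # -> [0 1 0]  (ties go to the lower cluster index)
```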