The EM Algorithm

Probability & Statistics Series 8 / 13

The Latent Variable Problem

Maximum Likelihood Estimation works beautifully when we observe all the variables in our model. But many real-world problems involve latent (hidden) variables — quantities we can’t observe directly but that explain the structure of our data.

Consider clustering: we observe data points but not which cluster each point belongs to. The cluster assignments are latent variables. We can’t directly maximize the likelihood because it involves summing over all possible assignments of these hidden variables — an intractable computation.

The Expectation-Maximization (EM) algorithm solves this by iterating between two steps: inferring the latent variables (E-step) and updating the parameters (M-step).

The Setup

We have:

Observed data: $\mathbf{X} = \{x_1, \ldots, x_n\}$
Latent variables: $\mathbf{Z} = \{z_1, \ldots, z_n\}$
Parameters: $\theta$

The complete-data log-likelihood — what we’d maximize if we could see everything — is:

\log P(\mathbf{X}, \mathbf{Z} \mid \theta)

But we only observe $\mathbf{X}$ , so we need the marginal (incomplete) log-likelihood:

\log P(\mathbf{X} \mid \theta) = \log \sum_{\mathbf{Z}} P(\mathbf{X}, \mathbf{Z} \mid \theta)

The sum (or integral) inside the log makes direct optimization difficult. EM sidesteps this.

The Algorithm

Starting from an initial guess $\theta^{(0)}$ , iterate:

E-Step (Expectation)

Compute the expected complete-data log-likelihood, using the current parameters to “fill in” the latent variables:

Q(\theta \mid \theta^{(t)}) = \mathbb{E}_{\mathbf{Z} \mid \mathbf{X}, \theta^{(t)}}\!\left[\log P(\mathbf{X}, \mathbf{Z} \mid \theta)\right]

This computes the posterior distribution over latent variables $P(\mathbf{Z} \mid \mathbf{X}, \theta^{(t)})$ and uses it to take an expectation.

M-Step (Maximization)

Maximize $Q$ with respect to the parameters:

\theta^{(t+1)} = \arg\max_{\theta} \, Q(\theta \mid \theta^{(t)})

This is usually a standard MLE problem on the “completed” data, which is often much easier than the original problem.

Convergence Guarantee

EM never decreases the marginal log-likelihood:

\log P(\mathbf{X} \mid \theta^{(t+1)}) \geq \log P(\mathbf{X} \mid \theta^{(t)})

This monotonic improvement guarantees convergence to a local maximum (or saddle point) of the likelihood. However, EM does not guarantee finding the global maximum — the result depends on initialization.

Gaussian Mixture Models: The Classic Application

A Gaussian Mixture Model (GMM) assumes data is generated from $K$ Gaussian components:

P(x \mid \theta) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)

where:

$\pi_k$ — mixing weights ( $\sum_k \pi_k = 1$ )
$\boldsymbol{\mu}_k$ — mean of component $k$
$\boldsymbol{\Sigma}_k$ — covariance of component $k$

The latent variable $z_i \in \{1, \ldots, K\}$ indicates which component generated data point $x_i$ .

E-Step: Compute Responsibilities

For each data point $x_i$ and component $k$ , compute the responsibility — the posterior probability that component $k$ generated $x_i$ :

\gamma_{ik} = P(z_i = k \mid x_i, \theta^{(t)}) = \frac{\pi_k \, \mathcal{N}(x_i \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}

This is Bayes’ theorem in action: the prior $\pi_k$ is updated by the likelihood $\mathcal{N}(x_i \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$ .

M-Step: Update Parameters

Using the responsibilities as soft assignments, update each component’s parameters:

N_k = \sum_{i=1}^{n} \gamma_{ik} \quad \text{(effective number of points in component } k\text{)}

\boldsymbol{\mu}_k^{\text{new}} = \frac{1}{N_k} \sum_{i=1}^{n} \gamma_{ik} \, x_i

\boldsymbol{\Sigma}_k^{\text{new}} = \frac{1}{N_k} \sum_{i=1}^{n} \gamma_{ik} (x_i - \boldsymbol{\mu}_k^{\text{new}})(x_i - \boldsymbol{\mu}_k^{\text{new}})^\top

\pi_k^{\text{new}} = \frac{N_k}{n}

These are weighted versions of the standard MLE formulas — each data point contributes to each component proportionally to its responsibility.

Implementation

import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, max_iter=100, tol=1e-6):
    n, d = X.shape

    # Initialize
    means = X[np.random.choice(n, K, replace=False)]
    covs = [np.eye(d)] * K
    weights = np.ones(K) / K

    log_likelihood = -np.inf

    for iteration in range(max_iter):
        # E-step: compute responsibilities
        resp = np.zeros((n, K))
        for k in range(K):
            resp[:, k] = weights[k] * multivariate_normal.pdf(X, means[k], covs[k])
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step: update parameters
        Nk = resp.sum(axis=0)
        for k in range(K):
            means[k] = (resp[:, k] @ X) / Nk[k]
            diff = X - means[k]
            covs[k] = (resp[:, k:k+1] * diff).T @ diff / Nk[k]
        weights = Nk / n

        # Check convergence
        new_ll = np.sum(np.log(sum(
            weights[k] * multivariate_normal.pdf(X, means[k], covs[k])
            for k in range(K)
        )))
        if abs(new_ll - log_likelihood) < tol:
            break
        log_likelihood = new_ll

    return means, covs, weights, resp

Why EM Works: The ELBO

The theoretical foundation of EM relies on the Evidence Lower Bound (ELBO). For any distribution $q(\mathbf{Z})$ over the latent variables:

\log P(\mathbf{X} \mid \theta) = \underbrace{\mathbb{E}_q[\log P(\mathbf{X}, \mathbf{Z} \mid \theta)] - \mathbb{E}_q[\log q(\mathbf{Z})]}_{\text{ELBO}} + \text{KL}(q \| P(\mathbf{Z} \mid \mathbf{X}, \theta))

Since KL divergence is non-negative, the ELBO is always a lower bound on the log-likelihood.

E-step: Set $q(\mathbf{Z}) = P(\mathbf{Z} \mid \mathbf{X}, \theta^{(t)})$ , which makes KL = 0 and the ELBO touches the log-likelihood
M-step: Maximize the ELBO with respect to $\theta$ , which raises the log-likelihood

This perspective connects EM directly to variational inference, which generalizes the E-step to cases where the exact posterior over $\mathbf{Z}$ is intractable.

EM for Other Models

Missing Data

If some entries in a dataset are missing, EM treats them as latent variables:

E-step: Impute missing values using current parameter estimates
M-step: Fit the model on the completed data

This is the principled foundation behind many imputation methods.

Hidden Markov Models (HMMs)

For sequential data with hidden states:

E-step: The forward-backward algorithm computes posterior probabilities over hidden state sequences
M-step: Update transition and emission probabilities

HMMs are used in speech recognition, bioinformatics, and time series analysis.

Latent Dirichlet Allocation (LDA)

For topic modeling:

E-step: Infer document-topic and word-topic distributions
M-step: Update global topic-word distributions

In practice, the E-step for LDA is intractable, so variational EM or Gibbs sampling is used instead.

Practical Considerations

Initialization Matters

EM finds local optima, so initialization is critical:

K-means++: Run K-means first and use the result to initialize GMM parameters
Multiple restarts: Run EM several times with random initializations and keep the best result
Hierarchical: Start with fewer components and split

Choosing $K$

The number of components must be chosen. Common approaches:

Method	Idea
BIC (Bayesian Information Criterion)	$-2 \log L + k \log n$ — penalizes model complexity
AIC (Akaike Information Criterion)	$-2 \log L + 2k$ — less penalty than BIC
Cross-validation	Evaluate held-out log-likelihood
Bayesian nonparametrics	Dirichlet Process mixtures learn $K$ from data

Singularities

If a Gaussian component collapses onto a single data point, its variance goes to zero and the likelihood goes to infinity. Prevent this by:

Adding a small regularization term to the covariance: $\boldsymbol{\Sigma}_k + \epsilon \mathbf{I}$
Using Bayesian priors on the covariance (MAP-EM)
Monitoring for degenerate components and removing them

Convergence Speed

EM can converge slowly, especially near the optimum. Acceleration techniques:

Aitken acceleration: Extrapolate the convergence trajectory
ECME (Expectation-Conditional Maximization Either): Partial M-steps
Variational EM: Approximate E-step for faster computation

EM vs Gradient-Based Optimization

Why not just use gradient descent on the marginal log-likelihood?

Aspect	EM	Gradient Descent
Constraints	Naturally handles probability constraints	Needs projections or reparameterization
Convergence	Monotonic, guaranteed to not decrease	Can oscillate
Speed	Can be slow near optimum	Adaptive methods (Adam) often faster
Implementation	Closed-form M-step for many models	Automatic differentiation
Local optima	Yes	Yes

In modern deep learning, gradient-based methods with the reparameterization trick (as in VAEs) have largely replaced EM for complex latent variable models. But EM remains the method of choice for classical mixture models and missing data problems.

Summary

EM iterates between inferring latent variables (E-step) and optimizing parameters (M-step)
Each iteration is guaranteed to improve the log-likelihood
GMMs are the canonical application: soft clustering with probabilistic assignments
EM is founded on the ELBO, connecting it to variational inference
Initialization is critical — use K-means++ or multiple restarts
EM handles missing data, HMMs, and topic models
EM converges to local optima; for global optimization, use multiple restarts or Bayesian approaches

References

Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). “Maximum Likelihood from Incomplete Data via the EM Algorithm.” Journal of the Royal Statistical Society B, 39(1), 1—38.
Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. Chapter 9.
Murphy, K. P. (2022). Probabilistic Machine Learning: An Introduction. MIT Press. Chapter 8.
McLachlan, G. J., & Krishnan, T. (2008). The EM Algorithm and Extensions (2nd ed.). Wiley.
Neal, R. M., & Hinton, G. E. (1998). “A View of the EM Algorithm that Justifies Incremental, Sparse, and Other Variants.” In Learning in Graphical Models. Springer.