The EM Algorithm

Expectation-Maximization: how to fit models with latent variables, from Gaussian Mixture Models to missing data problems.

Probability & Statistics March 6, 2026 8 min read

The Latent Variable Problem

Maximum Likelihood Estimation works beautifully when we observe all the variables in our model. But many real-world problems involve latent (hidden) variables — quantities we can’t observe directly but that explain the structure of our data.

Consider clustering: we observe data points but not which cluster each point belongs to. The cluster assignments are latent variables. We can’t directly maximize the likelihood because it involves summing over all possible assignments of these hidden variables — an intractable computation.

The Expectation-Maximization (EM) algorithm solves this by iterating between two steps: inferring the latent variables (E-step) and updating the parameters (M-step).

The Setup

We have:

  • Observed data: X={x1,,xn}\mathbf{X} = \{x_1, \ldots, x_n\}
  • Latent variables: Z={z1,,zn}\mathbf{Z} = \{z_1, \ldots, z_n\}
  • Parameters: θ\theta

The complete-data log-likelihood — what we’d maximize if we could see everything — is:

logP(X,Zθ)\log P(\mathbf{X}, \mathbf{Z} \mid \theta)

But we only observe X\mathbf{X}, so we need the marginal (incomplete) log-likelihood:

logP(Xθ)=logZP(X,Zθ)\log P(\mathbf{X} \mid \theta) = \log \sum_{\mathbf{Z}} P(\mathbf{X}, \mathbf{Z} \mid \theta)

The sum (or integral) inside the log makes direct optimization difficult. EM sidesteps this.

The Algorithm

Starting from an initial guess θ(0)\theta^{(0)}, iterate:

E-Step (Expectation)

Compute the expected complete-data log-likelihood, using the current parameters to “fill in” the latent variables:

Q(θθ(t))=EZX,θ(t) ⁣[logP(X,Zθ)]Q(\theta \mid \theta^{(t)}) = \mathbb{E}_{\mathbf{Z} \mid \mathbf{X}, \theta^{(t)}}\!\left[\log P(\mathbf{X}, \mathbf{Z} \mid \theta)\right]

This computes the posterior distribution over latent variables P(ZX,θ(t))P(\mathbf{Z} \mid \mathbf{X}, \theta^{(t)}) and uses it to take an expectation.

M-Step (Maximization)

Maximize QQ with respect to the parameters:

θ(t+1)=argmaxθQ(θθ(t))\theta^{(t+1)} = \arg\max_{\theta} \, Q(\theta \mid \theta^{(t)})

This is usually a standard MLE problem on the “completed” data, which is often much easier than the original problem.

Convergence Guarantee

EM never decreases the marginal log-likelihood:

logP(Xθ(t+1))logP(Xθ(t))\log P(\mathbf{X} \mid \theta^{(t+1)}) \geq \log P(\mathbf{X} \mid \theta^{(t)})

This monotonic improvement guarantees convergence to a local maximum (or saddle point) of the likelihood. However, EM does not guarantee finding the global maximum — the result depends on initialization.

Gaussian Mixture Models: The Classic Application

A Gaussian Mixture Model (GMM) assumes data is generated from KK Gaussian components:

P(xθ)=k=1KπkN(xμk,Σk)P(x \mid \theta) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)

where:

  • πk\pi_k — mixing weights (kπk=1\sum_k \pi_k = 1)
  • μk\boldsymbol{\mu}_k — mean of component kk
  • Σk\boldsymbol{\Sigma}_k — covariance of component kk

The latent variable zi{1,,K}z_i \in \{1, \ldots, K\} indicates which component generated data point xix_i.

E-Step: Compute Responsibilities

For each data point xix_i and component kk, compute the responsibility — the posterior probability that component kk generated xix_i:

γik=P(zi=kxi,θ(t))=πkN(xiμk,Σk)j=1KπjN(xiμj,Σj)\gamma_{ik} = P(z_i = k \mid x_i, \theta^{(t)}) = \frac{\pi_k \, \mathcal{N}(x_i \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}

This is Bayes’ theorem in action: the prior πk\pi_k is updated by the likelihood N(xiμk,Σk)\mathcal{N}(x_i \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k).

M-Step: Update Parameters

Using the responsibilities as soft assignments, update each component’s parameters:

Nk=i=1nγik(effective number of points in component k)N_k = \sum_{i=1}^{n} \gamma_{ik} \quad \text{(effective number of points in component } k\text{)} μknew=1Nki=1nγikxi\boldsymbol{\mu}_k^{\text{new}} = \frac{1}{N_k} \sum_{i=1}^{n} \gamma_{ik} \, x_i Σknew=1Nki=1nγik(xiμknew)(xiμknew)\boldsymbol{\Sigma}_k^{\text{new}} = \frac{1}{N_k} \sum_{i=1}^{n} \gamma_{ik} (x_i - \boldsymbol{\mu}_k^{\text{new}})(x_i - \boldsymbol{\mu}_k^{\text{new}})^\top πknew=Nkn\pi_k^{\text{new}} = \frac{N_k}{n}

These are weighted versions of the standard MLE formulas — each data point contributes to each component proportionally to its responsibility.

Implementation

import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, max_iter=100, tol=1e-6):
    n, d = X.shape

    # Initialize
    means = X[np.random.choice(n, K, replace=False)]
    covs = [np.eye(d)] * K
    weights = np.ones(K) / K

    log_likelihood = -np.inf

    for iteration in range(max_iter):
        # E-step: compute responsibilities
        resp = np.zeros((n, K))
        for k in range(K):
            resp[:, k] = weights[k] * multivariate_normal.pdf(X, means[k], covs[k])
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step: update parameters
        Nk = resp.sum(axis=0)
        for k in range(K):
            means[k] = (resp[:, k] @ X) / Nk[k]
            diff = X - means[k]
            covs[k] = (resp[:, k:k+1] * diff).T @ diff / Nk[k]
        weights = Nk / n

        # Check convergence
        new_ll = np.sum(np.log(sum(
            weights[k] * multivariate_normal.pdf(X, means[k], covs[k])
            for k in range(K)
        )))
        if abs(new_ll - log_likelihood) < tol:
            break
        log_likelihood = new_ll

    return means, covs, weights, resp

Why EM Works: The ELBO

The theoretical foundation of EM relies on the Evidence Lower Bound (ELBO). For any distribution q(Z)q(\mathbf{Z}) over the latent variables:

logP(Xθ)=Eq[logP(X,Zθ)]Eq[logq(Z)]ELBO+KL(qP(ZX,θ))\log P(\mathbf{X} \mid \theta) = \underbrace{\mathbb{E}_q[\log P(\mathbf{X}, \mathbf{Z} \mid \theta)] - \mathbb{E}_q[\log q(\mathbf{Z})]}_{\text{ELBO}} + \text{KL}(q \| P(\mathbf{Z} \mid \mathbf{X}, \theta))

Since KL divergence is non-negative, the ELBO is always a lower bound on the log-likelihood.

  • E-step: Set q(Z)=P(ZX,θ(t))q(\mathbf{Z}) = P(\mathbf{Z} \mid \mathbf{X}, \theta^{(t)}), which makes KL = 0 and the ELBO touches the log-likelihood
  • M-step: Maximize the ELBO with respect to θ\theta, which raises the log-likelihood

This perspective connects EM directly to variational inference, which generalizes the E-step to cases where the exact posterior over Z\mathbf{Z} is intractable.

EM for Other Models

Missing Data

If some entries in a dataset are missing, EM treats them as latent variables:

  • E-step: Impute missing values using current parameter estimates
  • M-step: Fit the model on the completed data

This is the principled foundation behind many imputation methods.

Hidden Markov Models (HMMs)

For sequential data with hidden states:

  • E-step: The forward-backward algorithm computes posterior probabilities over hidden state sequences
  • M-step: Update transition and emission probabilities

HMMs are used in speech recognition, bioinformatics, and time series analysis.

Latent Dirichlet Allocation (LDA)

For topic modeling:

  • E-step: Infer document-topic and word-topic distributions
  • M-step: Update global topic-word distributions

In practice, the E-step for LDA is intractable, so variational EM or Gibbs sampling is used instead.

Practical Considerations

Initialization Matters

EM finds local optima, so initialization is critical:

  • K-means++: Run K-means first and use the result to initialize GMM parameters
  • Multiple restarts: Run EM several times with random initializations and keep the best result
  • Hierarchical: Start with fewer components and split

Choosing KK

The number of components must be chosen. Common approaches:

MethodIdea
BIC (Bayesian Information Criterion)2logL+klogn-2 \log L + k \log n — penalizes model complexity
AIC (Akaike Information Criterion)2logL+2k-2 \log L + 2k — less penalty than BIC
Cross-validationEvaluate held-out log-likelihood
Bayesian nonparametricsDirichlet Process mixtures learn KK from data

Singularities

If a Gaussian component collapses onto a single data point, its variance goes to zero and the likelihood goes to infinity. Prevent this by:

  • Adding a small regularization term to the covariance: Σk+ϵI\boldsymbol{\Sigma}_k + \epsilon \mathbf{I}
  • Using Bayesian priors on the covariance (MAP-EM)
  • Monitoring for degenerate components and removing them

Convergence Speed

EM can converge slowly, especially near the optimum. Acceleration techniques:

  • Aitken acceleration: Extrapolate the convergence trajectory
  • ECME (Expectation-Conditional Maximization Either): Partial M-steps
  • Variational EM: Approximate E-step for faster computation

EM vs Gradient-Based Optimization

Why not just use gradient descent on the marginal log-likelihood?

AspectEMGradient Descent
ConstraintsNaturally handles probability constraintsNeeds projections or reparameterization
ConvergenceMonotonic, guaranteed to not decreaseCan oscillate
SpeedCan be slow near optimumAdaptive methods (Adam) often faster
ImplementationClosed-form M-step for many modelsAutomatic differentiation
Local optimaYesYes

In modern deep learning, gradient-based methods with the reparameterization trick (as in VAEs) have largely replaced EM for complex latent variable models. But EM remains the method of choice for classical mixture models and missing data problems.

Summary

  • EM iterates between inferring latent variables (E-step) and optimizing parameters (M-step)
  • Each iteration is guaranteed to improve the log-likelihood
  • GMMs are the canonical application: soft clustering with probabilistic assignments
  • EM is founded on the ELBO, connecting it to variational inference
  • Initialization is critical — use K-means++ or multiple restarts
  • EM handles missing data, HMMs, and topic models
  • EM converges to local optima; for global optimization, use multiple restarts or Bayesian approaches

References

  • Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). “Maximum Likelihood from Incomplete Data via the EM Algorithm.” Journal of the Royal Statistical Society B, 39(1), 1—38.
  • Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. Chapter 9.
  • Murphy, K. P. (2022). Probabilistic Machine Learning: An Introduction. MIT Press. Chapter 8.
  • McLachlan, G. J., & Krishnan, T. (2008). The EM Algorithm and Extensions (2nd ed.). Wiley.
  • Neal, R. M., & Hinton, G. E. (1998). “A View of the EM Algorithm that Justifies Incremental, Sparse, and Other Variants.” In Learning in Graphical Models. Springer.

Keyboard Shortcuts

Navigation
j
Next heading
k
Previous heading
n
Next article in series
p
Previous article in series
t
Scroll to top
Actions
r
Toggle reading mode
Ctrl K
Search articles
?
Toggle this help
Esc
Close overlay