- 01 Probability Fundamentals 02 Random Variables and Expectation 03 Probability Distributions 04 The Exponential Family 05 Convergence and the Central Limit Theorem 06 Maximum Likelihood Estimation 07 MAP Estimation 08 The EM Algorithm 09 Hypothesis Testing 10 Nonparametric Statistics 11 Bayesian Inference 12 Probabilistic Graphical Models 13 Sampling Methods
The Latent Variable Problem
Maximum Likelihood Estimation works beautifully when we observe all the variables in our model. But many real-world problems involve latent (hidden) variables — quantities we can’t observe directly but that explain the structure of our data.
Consider clustering: we observe data points but not which cluster each point belongs to. The cluster assignments are latent variables. We can’t directly maximize the likelihood because it involves summing over all possible assignments of these hidden variables — an intractable computation.
The Expectation-Maximization (EM) algorithm solves this by iterating between two steps: inferring the latent variables (E-step) and updating the parameters (M-step).
The Setup
We have:
- Observed data:
- Latent variables:
- Parameters:
The complete-data log-likelihood — what we’d maximize if we could see everything — is:
But we only observe , so we need the marginal (incomplete) log-likelihood:
The sum (or integral) inside the log makes direct optimization difficult. EM sidesteps this.
The Algorithm
Starting from an initial guess , iterate:
E-Step (Expectation)
Compute the expected complete-data log-likelihood, using the current parameters to “fill in” the latent variables:
This computes the posterior distribution over latent variables and uses it to take an expectation.
M-Step (Maximization)
Maximize with respect to the parameters:
This is usually a standard MLE problem on the “completed” data, which is often much easier than the original problem.
Convergence Guarantee
EM never decreases the marginal log-likelihood:
This monotonic improvement guarantees convergence to a local maximum (or saddle point) of the likelihood. However, EM does not guarantee finding the global maximum — the result depends on initialization.
Gaussian Mixture Models: The Classic Application
A Gaussian Mixture Model (GMM) assumes data is generated from Gaussian components:
where:
- — mixing weights ()
- — mean of component
- — covariance of component
The latent variable indicates which component generated data point .
E-Step: Compute Responsibilities
For each data point and component , compute the responsibility — the posterior probability that component generated :
This is Bayes’ theorem in action: the prior is updated by the likelihood .
M-Step: Update Parameters
Using the responsibilities as soft assignments, update each component’s parameters:
These are weighted versions of the standard MLE formulas — each data point contributes to each component proportionally to its responsibility.
Implementation
import numpy as np
from scipy.stats import multivariate_normal
def em_gmm(X, K, max_iter=100, tol=1e-6):
n, d = X.shape
# Initialize
means = X[np.random.choice(n, K, replace=False)]
covs = [np.eye(d)] * K
weights = np.ones(K) / K
log_likelihood = -np.inf
for iteration in range(max_iter):
# E-step: compute responsibilities
resp = np.zeros((n, K))
for k in range(K):
resp[:, k] = weights[k] * multivariate_normal.pdf(X, means[k], covs[k])
resp /= resp.sum(axis=1, keepdims=True)
# M-step: update parameters
Nk = resp.sum(axis=0)
for k in range(K):
means[k] = (resp[:, k] @ X) / Nk[k]
diff = X - means[k]
covs[k] = (resp[:, k:k+1] * diff).T @ diff / Nk[k]
weights = Nk / n
# Check convergence
new_ll = np.sum(np.log(sum(
weights[k] * multivariate_normal.pdf(X, means[k], covs[k])
for k in range(K)
)))
if abs(new_ll - log_likelihood) < tol:
break
log_likelihood = new_ll
return means, covs, weights, resp
Why EM Works: The ELBO
The theoretical foundation of EM relies on the Evidence Lower Bound (ELBO). For any distribution over the latent variables:
Since KL divergence is non-negative, the ELBO is always a lower bound on the log-likelihood.
- E-step: Set , which makes KL = 0 and the ELBO touches the log-likelihood
- M-step: Maximize the ELBO with respect to , which raises the log-likelihood
This perspective connects EM directly to variational inference, which generalizes the E-step to cases where the exact posterior over is intractable.
EM for Other Models
Missing Data
If some entries in a dataset are missing, EM treats them as latent variables:
- E-step: Impute missing values using current parameter estimates
- M-step: Fit the model on the completed data
This is the principled foundation behind many imputation methods.
Hidden Markov Models (HMMs)
For sequential data with hidden states:
- E-step: The forward-backward algorithm computes posterior probabilities over hidden state sequences
- M-step: Update transition and emission probabilities
HMMs are used in speech recognition, bioinformatics, and time series analysis.
Latent Dirichlet Allocation (LDA)
For topic modeling:
- E-step: Infer document-topic and word-topic distributions
- M-step: Update global topic-word distributions
In practice, the E-step for LDA is intractable, so variational EM or Gibbs sampling is used instead.
Practical Considerations
Initialization Matters
EM finds local optima, so initialization is critical:
- K-means++: Run K-means first and use the result to initialize GMM parameters
- Multiple restarts: Run EM several times with random initializations and keep the best result
- Hierarchical: Start with fewer components and split
Choosing
The number of components must be chosen. Common approaches:
| Method | Idea |
|---|---|
| BIC (Bayesian Information Criterion) | — penalizes model complexity |
| AIC (Akaike Information Criterion) | — less penalty than BIC |
| Cross-validation | Evaluate held-out log-likelihood |
| Bayesian nonparametrics | Dirichlet Process mixtures learn from data |
Singularities
If a Gaussian component collapses onto a single data point, its variance goes to zero and the likelihood goes to infinity. Prevent this by:
- Adding a small regularization term to the covariance:
- Using Bayesian priors on the covariance (MAP-EM)
- Monitoring for degenerate components and removing them
Convergence Speed
EM can converge slowly, especially near the optimum. Acceleration techniques:
- Aitken acceleration: Extrapolate the convergence trajectory
- ECME (Expectation-Conditional Maximization Either): Partial M-steps
- Variational EM: Approximate E-step for faster computation
EM vs Gradient-Based Optimization
Why not just use gradient descent on the marginal log-likelihood?
| Aspect | EM | Gradient Descent |
|---|---|---|
| Constraints | Naturally handles probability constraints | Needs projections or reparameterization |
| Convergence | Monotonic, guaranteed to not decrease | Can oscillate |
| Speed | Can be slow near optimum | Adaptive methods (Adam) often faster |
| Implementation | Closed-form M-step for many models | Automatic differentiation |
| Local optima | Yes | Yes |
In modern deep learning, gradient-based methods with the reparameterization trick (as in VAEs) have largely replaced EM for complex latent variable models. But EM remains the method of choice for classical mixture models and missing data problems.
Summary
- EM iterates between inferring latent variables (E-step) and optimizing parameters (M-step)
- Each iteration is guaranteed to improve the log-likelihood
- GMMs are the canonical application: soft clustering with probabilistic assignments
- EM is founded on the ELBO, connecting it to variational inference
- Initialization is critical — use K-means++ or multiple restarts
- EM handles missing data, HMMs, and topic models
- EM converges to local optima; for global optimization, use multiple restarts or Bayesian approaches
References
- Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). “Maximum Likelihood from Incomplete Data via the EM Algorithm.” Journal of the Royal Statistical Society B, 39(1), 1—38.
- Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. Chapter 9.
- Murphy, K. P. (2022). Probabilistic Machine Learning: An Introduction. MIT Press. Chapter 8.
- McLachlan, G. J., & Krishnan, T. (2008). The EM Algorithm and Extensions (2nd ed.). Wiley.
- Neal, R. M., & Hinton, G. E. (1998). “A View of the EM Algorithm that Justifies Incremental, Sparse, and Other Variants.” In Learning in Graphical Models. Springer.