- 01 Probability Fundamentals 02 Random Variables and Expectation 03 Probability Distributions 04 The Exponential Family 05 Convergence and the Central Limit Theorem 06 Maximum Likelihood Estimation 07 MAP Estimation 08 The EM Algorithm 09 Hypothesis Testing 10 Nonparametric Statistics 11 Bayesian Inference 12 Probabilistic Graphical Models 13 Sampling Methods
The Computational Challenge
Bayesian inference requires computing integrals like:
In most real-world models, these integrals are analytically intractable. The posterior doesn’t have a nice closed form, and the parameter space is high-dimensional.
Sampling methods solve this by drawing samples from the posterior and approximating expectations with averages:
By the Law of Large Numbers, this converges to the true expectation as .
Monte Carlo Estimation
The simplest idea: if you can sample from a distribution , you can estimate any expectation by averaging:
Properties
- Unbiased: The estimator is unbiased for any
- Convergence rate: Standard error decreases as , regardless of dimensionality
- Dimension-independent: Unlike grid-based numerical integration (which scales exponentially with dimension), Monte Carlo’s rate doesn’t depend on dimension
Key insight: In 10 dimensions, a grid with 10 points per dimension needs evaluations. Monte Carlo needs the same samples regardless of dimension. This is why Monte Carlo dominates in high-dimensional problems.
Example: Estimating
A classic demonstration. Sample points uniformly in and count how many fall inside the quarter-circle :
import numpy as np
N = 1_000_000
x = np.random.uniform(0, 1, N)
y = np.random.uniform(0, 1, N)
inside = (x**2 + y**2) <= 1
pi_estimate = 4 * np.mean(inside)
print(f"Estimated pi: {pi_estimate:.4f}") # ≈ 3.1416
Rejection Sampling
What if we can’t sample directly from but can evaluate it (up to a normalizing constant)?
Idea: Sample from a simpler proposal distribution and accept/reject samples to correct for the mismatch.
Algorithm
- Choose a proposal and a constant such that for all
- Sample
- Sample
- Accept if , otherwise reject and go to step 2
Accepted samples are distributed according to .
Limitations
- Acceptance rate: — finding a tight is critical
- High dimensions: The acceptance rate drops exponentially with dimension. A well-fitting proposal in 1D might have , but in 100D, could be
- Practical: Rarely used for more than a few dimensions
Importance Sampling
Instead of rejecting samples, reweight them to correct for using the wrong distribution.
To estimate using samples from :
where are the importance weights.
Estimator
Self-Normalized Importance Sampling
When is known only up to a normalizing constant (common in Bayesian inference), use:
Choosing the Proposal
The variance of the importance sampling estimator depends critically on how well matches :
- If is too narrow: some regions of get huge weights, causing high variance
- If is too broad: most weights are tiny, samples are wasted
- Optimal — but this requires knowing what we’re trying to compute
Warning: If where , the estimator is biased. The proposal must have heavier tails than the target.
In ML: Importance sampling is used in off-policy reinforcement learning, rare event simulation, and the training of variational autoencoders.
Markov Chain Monte Carlo (MCMC)
The most important family of sampling methods. MCMC constructs a Markov chain whose stationary distribution is the target distribution .
Core idea: We can’t sample directly from the posterior, but we can design a random walk that, after enough steps, explores the posterior landscape faithfully.
Key Concepts
- Markov chain: A sequence where each state depends only on the previous one:
- Stationary distribution: The distribution such that if , then
- Ergodicity: The chain eventually reaches and stays in the stationary distribution regardless of starting point
- Burn-in: Initial samples before the chain converges — these are discarded
- Mixing: How quickly the chain explores the full distribution
Metropolis-Hastings Algorithm
The foundational MCMC algorithm.
Algorithm
Starting from some :
- Propose: Sample a candidate from a proposal distribution
- Compute acceptance ratio:
- Accept or reject: Sample . If , set . Otherwise,
Key insight: We only need the posterior up to a proportionality constant — the normalizing constant cancels in the ratio. This is what makes MCMC practical for Bayesian inference.
Random Walk Metropolis
The most common variant uses a symmetric proposal . Since is symmetric, the acceptance ratio simplifies to:
The algorithm always moves to higher-probability regions and sometimes moves to lower-probability regions — allowing it to explore the full posterior.
Tuning the Proposal
The step size critically affects performance:
| Step Size | Behavior |
|---|---|
| Too small | High acceptance rate but slow exploration (chain stays local) |
| Too large | Low acceptance rate (most proposals rejected, chain gets stuck) |
| Just right | Acceptance rate ~23-44% (optimal for high-dimensional Gaussians) |
import numpy as np
def metropolis_hastings(log_posterior, theta_init, sigma, n_samples, burn_in=1000):
theta = theta_init.copy()
samples = []
accepted = 0
for i in range(n_samples + burn_in):
theta_star = theta + sigma * np.random.randn(len(theta))
log_alpha = log_posterior(theta_star) - log_posterior(theta)
if np.log(np.random.rand()) < log_alpha:
theta = theta_star
accepted += 1
if i >= burn_in:
samples.append(theta.copy())
print(f"Acceptance rate: {accepted / (n_samples + burn_in):.2%}")
return np.array(samples)
Gibbs Sampling
A special case of MCMC that avoids rejection entirely by sampling each parameter one at a time from its conditional distribution.
Algorithm
For parameters :
At each iteration:
- Sample
- Sample
- …continue for all parameters
Each step samples from a full conditional — the distribution of one parameter given all the others.
When Gibbs Works Well
Gibbs sampling is efficient when:
- The full conditionals have known forms (common with conjugate priors)
- Parameters are not highly correlated (otherwise the chain mixes slowly)
In ML: Gibbs sampling is the standard algorithm for Latent Dirichlet Allocation (LDA) topic models, Gaussian Mixture Models, and many hierarchical Bayesian models. It is natural for inference in graphical models since each variable’s full conditional depends only on its Markov blanket.
When Gibbs Struggles
When parameters are strongly correlated, Gibbs takes tiny steps along the correlation axis and mixes very slowly. The chain needs thousands of iterations to move between the two modes of a banana-shaped posterior, even though a well-tuned Metropolis-Hastings could jump there directly.
Hamiltonian Monte Carlo (HMC)
HMC uses gradient information to make intelligent proposals that explore the posterior much more efficiently than random walks.
The Physics Analogy
Imagine the negative log-posterior as a landscape. HMC simulates a ball rolling on this landscape with conserved energy:
- Position (the parameter)
- Momentum (auxiliary variable)
- Potential energy
- Kinetic energy
The ball follows Hamilton’s equations, which conserve total energy. This creates proposals that are far from the current point but still in high-probability regions.
Why HMC is Better
- No random walk: Moves are directed by the gradient, avoiding the aimless wandering of Metropolis-Hastings
- High acceptance rate: Energy conservation means proposals are usually accepted
- Scales to high dimensions: Much better than random walk MCMC in 100+ dimensions
NUTS (No-U-Turn Sampler)
The main practical challenge with HMC is choosing the number of leapfrog steps. NUTS (Hoffman & Gelman, 2014) automates this by detecting when the trajectory starts turning back on itself.
NUTS is the default algorithm in modern probabilistic programming frameworks like Stan and PyMC.
import pymc as pm
import numpy as np
# Bayesian linear regression with NUTS
data_x = np.random.randn(100)
data_y = 2.5 * data_x + 1.0 + np.random.randn(100) * 0.5
with pm.Model() as model:
w = pm.Normal("w", mu=0, sigma=10)
b = pm.Normal("b", mu=0, sigma=10)
sigma = pm.HalfNormal("sigma", sigma=1)
mu = w * data_x + b
y_obs = pm.Normal("y_obs", mu=mu, sigma=sigma, observed=data_y)
trace = pm.sample(2000, tune=1000) # NUTS by default
print(pm.summary(trace))
Diagnostics: Is My Chain Healthy?
MCMC gives approximate samples. How do you know if the approximation is good?
Trace Plots
Plot parameter values over iterations. A healthy chain looks like “white noise” — no trends, no getting stuck.
(R-hat) Statistic
Run multiple independent chains and compare within-chain variance to between-chain variance:
- : Chains have converged (good)
- : Chains haven’t converged (run longer or fix the model)
Effective Sample Size (ESS)
MCMC samples are autocorrelated — consecutive samples are not independent. ESS estimates how many independent samples the chain is equivalent to:
where is the autocorrelation at lag . Aim for ESS > 400 per parameter.
Divergences (HMC/NUTS)
In HMC, divergent transitions indicate the algorithm is struggling — the numerical integrator can’t follow the posterior geometry. Divergences often signal:
- A need for reparameterization
- A model with pathological geometry (funnels, multimodality)
Comparison of Methods
| Method | Pros | Cons | Dimensions |
|---|---|---|---|
| Rejection sampling | Exact samples, simple | Exponential waste in high-D | 1-5 |
| Importance sampling | No rejection, reweighting | High variance if proposal is poor | 1-20 |
| Metropolis-Hastings | General, flexible | Slow random walk in high-D | 1-50 |
| Gibbs sampling | No rejection, uses conditionals | Slow with correlations | 10-1000 |
| HMC / NUTS | Gradient-guided, efficient | Requires differentiable posterior | 10-10000+ |
| Variational inference | Fast, scalable | Approximate, can underestimate variance | Any |
Summary
- Monte Carlo estimation approximates expectations by averaging over samples
- Rejection sampling draws exact samples but fails in high dimensions
- Importance sampling reweights samples from a proposal distribution
- MCMC constructs Markov chains whose stationary distribution is the target posterior
- Metropolis-Hastings is the foundational MCMC algorithm; tune the step size for ~30% acceptance
- Gibbs sampling updates one parameter at a time from full conditionals
- HMC uses gradients for efficient exploration; NUTS automates its tuning
- Always check diagnostics: trace plots, , ESS, and divergences
- These methods are the computational engine behind modern Bayesian ML
References
- Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. Chapter 11.
- Murphy, K. P. (2023). Probabilistic Machine Learning: Advanced Topics. MIT Press. Chapters 11-12.
- Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). Bayesian Data Analysis (3rd ed.). Chapman & Hall/CRC. Chapters 11-12.
- Brooks, S., Gelman, A., Jones, G. L., & Meng, X.-L. (2011). Handbook of Markov Chain Monte Carlo. Chapman & Hall/CRC.
- Hoffman, M. D., & Gelman, A. (2014). “The No-U-Turn Sampler.” Journal of Machine Learning Research, 15, 1593—1623. arXiv:1111.4246
- Neal, R. M. (2011). “MCMC Using Hamiltonian Dynamics.” In Handbook of Markov Chain Monte Carlo. Chapman & Hall/CRC. arXiv:1206.1901
- Robert, C. P., & Casella, G. (2004). Monte Carlo Statistical Methods (2nd ed.). Springer.