Sampling Methods

Monte Carlo, rejection sampling, importance sampling, MCMC, Metropolis-Hastings, and Gibbs sampling for computational inference.

Probability & Statistics March 6, 2026 10 min read

The Computational Challenge

Bayesian inference requires computing integrals like:

E[f(θ)D]=f(θ)P(θD)dθ\mathbb{E}[f(\theta) \mid \mathcal{D}] = \int f(\theta) \, P(\theta \mid \mathcal{D}) \, d\theta

In most real-world models, these integrals are analytically intractable. The posterior P(θD)P(\theta \mid \mathcal{D}) doesn’t have a nice closed form, and the parameter space is high-dimensional.

Sampling methods solve this by drawing samples θ(1),θ(2),,θ(N)\theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(N)} from the posterior and approximating expectations with averages:

E[f(θ)D]1Ni=1Nf(θ(i))\mathbb{E}[f(\theta) \mid \mathcal{D}] \approx \frac{1}{N} \sum_{i=1}^{N} f(\theta^{(i)})

By the Law of Large Numbers, this converges to the true expectation as NN \to \infty.

Monte Carlo Estimation

The simplest idea: if you can sample from a distribution P(x)P(x), you can estimate any expectation by averaging:

E[f(X)]=f(x)P(x)dx1Ni=1Nf(x(i)),x(i)P(x)\mathbb{E}[f(X)] = \int f(x) \, P(x) \, dx \approx \frac{1}{N} \sum_{i=1}^{N} f(x^{(i)}), \quad x^{(i)} \sim P(x)

Properties

  • Unbiased: The estimator is unbiased for any NN
  • Convergence rate: Standard error decreases as O(1/N)O(1/\sqrt{N}), regardless of dimensionality
  • Dimension-independent: Unlike grid-based numerical integration (which scales exponentially with dimension), Monte Carlo’s rate doesn’t depend on dimension

Key insight: In 10 dimensions, a grid with 10 points per dimension needs 101010^{10} evaluations. Monte Carlo needs the same NN samples regardless of dimension. This is why Monte Carlo dominates in high-dimensional problems.

Example: Estimating π\pi

A classic demonstration. Sample points uniformly in [0,1]2[0, 1]^2 and count how many fall inside the quarter-circle x2+y21x^2 + y^2 \leq 1:

import numpy as np

N = 1_000_000
x = np.random.uniform(0, 1, N)
y = np.random.uniform(0, 1, N)
inside = (x**2 + y**2) <= 1
pi_estimate = 4 * np.mean(inside)

print(f"Estimated pi: {pi_estimate:.4f}")  # ≈ 3.1416

Rejection Sampling

What if we can’t sample directly from P(x)P(x) but can evaluate it (up to a normalizing constant)?

Idea: Sample from a simpler proposal distribution q(x)q(x) and accept/reject samples to correct for the mismatch.

Algorithm

  1. Choose a proposal q(x)q(x) and a constant MM such that Mq(x)P(x)M \cdot q(x) \geq P(x) for all xx
  2. Sample xq(x)x \sim q(x)
  3. Sample uUniform(0,1)u \sim \text{Uniform}(0, 1)
  4. Accept xx if uP(x)Mq(x)u \leq \frac{P(x)}{M \cdot q(x)}, otherwise reject and go to step 2

Accepted samples are distributed according to P(x)P(x).

Limitations

  • Acceptance rate: 1/M1/M — finding a tight MM is critical
  • High dimensions: The acceptance rate drops exponentially with dimension. A well-fitting proposal in 1D might have M=2M = 2, but in 100D, MM could be 21002^{100}
  • Practical: Rarely used for more than a few dimensions

Importance Sampling

Instead of rejecting samples, reweight them to correct for using the wrong distribution.

To estimate EP[f(X)]\mathbb{E}_P[f(X)] using samples from q(x)q(x):

EP[f(X)]=f(x)P(x)q(x)q(x)dx=Eq[f(X)w(X)]\mathbb{E}_P[f(X)] = \int f(x) \frac{P(x)}{q(x)} q(x) \, dx = \mathbb{E}_q\left[f(X) \cdot w(X)\right]

where w(x)=P(x)/q(x)w(x) = P(x) / q(x) are the importance weights.

Estimator

μ^=1Ni=1Nf(x(i))w(x(i)),x(i)q(x)\hat{\mu} = \frac{1}{N} \sum_{i=1}^{N} f(x^{(i)}) \cdot w(x^{(i)}), \quad x^{(i)} \sim q(x)

Self-Normalized Importance Sampling

When P(x)P(x) is known only up to a normalizing constant (common in Bayesian inference), use:

μ^=i=1Nf(x(i))w(x(i))i=1Nw(x(i))\hat{\mu} = \frac{\sum_{i=1}^{N} f(x^{(i)}) \cdot w(x^{(i)})}{\sum_{i=1}^{N} w(x^{(i)})}

Choosing the Proposal

The variance of the importance sampling estimator depends critically on how well qq matches PP:

  • If qq is too narrow: some regions of PP get huge weights, causing high variance
  • If qq is too broad: most weights are tiny, samples are wasted
  • Optimal qf(x)P(x)q \propto |f(x)| \cdot P(x) — but this requires knowing what we’re trying to compute

Warning: If q(x)=0q(x) = 0 where P(x)>0P(x) > 0, the estimator is biased. The proposal must have heavier tails than the target.

In ML: Importance sampling is used in off-policy reinforcement learning, rare event simulation, and the training of variational autoencoders.

Markov Chain Monte Carlo (MCMC)

The most important family of sampling methods. MCMC constructs a Markov chain whose stationary distribution is the target distribution P(θD)P(\theta \mid \mathcal{D}).

Core idea: We can’t sample directly from the posterior, but we can design a random walk that, after enough steps, explores the posterior landscape faithfully.

Key Concepts

  • Markov chain: A sequence where each state depends only on the previous one: P(θ(t+1)θ(t),θ(t1),)=P(θ(t+1)θ(t))P(\theta^{(t+1)} \mid \theta^{(t)}, \theta^{(t-1)}, \ldots) = P(\theta^{(t+1)} \mid \theta^{(t)})
  • Stationary distribution: The distribution π\pi such that if θ(t)π\theta^{(t)} \sim \pi, then θ(t+1)π\theta^{(t+1)} \sim \pi
  • Ergodicity: The chain eventually reaches and stays in the stationary distribution regardless of starting point
  • Burn-in: Initial samples before the chain converges — these are discarded
  • Mixing: How quickly the chain explores the full distribution

Metropolis-Hastings Algorithm

The foundational MCMC algorithm.

Algorithm

Starting from some θ(0)\theta^{(0)}:

  1. Propose: Sample a candidate θq(θθ(t))\theta^* \sim q(\theta^* \mid \theta^{(t)}) from a proposal distribution
  2. Compute acceptance ratio:
α=min(1,P(θD)q(θ(t)θ)P(θ(t)D)q(θθ(t)))\alpha = \min\left(1, \frac{P(\theta^* \mid \mathcal{D}) \cdot q(\theta^{(t)} \mid \theta^*)}{P(\theta^{(t)} \mid \mathcal{D}) \cdot q(\theta^* \mid \theta^{(t)})}\right)
  1. Accept or reject: Sample uUniform(0,1)u \sim \text{Uniform}(0, 1). If uαu \leq \alpha, set θ(t+1)=θ\theta^{(t+1)} = \theta^*. Otherwise, θ(t+1)=θ(t)\theta^{(t+1)} = \theta^{(t)}

Key insight: We only need the posterior up to a proportionality constant — the normalizing constant P(D)P(\mathcal{D}) cancels in the ratio. This is what makes MCMC practical for Bayesian inference.

Random Walk Metropolis

The most common variant uses a symmetric proposal q(θθ(t))=N(θ(t),σ2I)q(\theta^* \mid \theta^{(t)}) = \mathcal{N}(\theta^{(t)}, \sigma^2 I). Since qq is symmetric, the acceptance ratio simplifies to:

α=min(1,P(θD)P(θ(t)D))\alpha = \min\left(1, \frac{P(\theta^* \mid \mathcal{D})}{P(\theta^{(t)} \mid \mathcal{D})}\right)

The algorithm always moves to higher-probability regions and sometimes moves to lower-probability regions — allowing it to explore the full posterior.

Tuning the Proposal

The step size σ\sigma critically affects performance:

Step SizeBehavior
Too smallHigh acceptance rate but slow exploration (chain stays local)
Too largeLow acceptance rate (most proposals rejected, chain gets stuck)
Just rightAcceptance rate ~23-44% (optimal for high-dimensional Gaussians)
import numpy as np

def metropolis_hastings(log_posterior, theta_init, sigma, n_samples, burn_in=1000):
    theta = theta_init.copy()
    samples = []
    accepted = 0

    for i in range(n_samples + burn_in):
        theta_star = theta + sigma * np.random.randn(len(theta))
        log_alpha = log_posterior(theta_star) - log_posterior(theta)

        if np.log(np.random.rand()) < log_alpha:
            theta = theta_star
            accepted += 1

        if i >= burn_in:
            samples.append(theta.copy())

    print(f"Acceptance rate: {accepted / (n_samples + burn_in):.2%}")
    return np.array(samples)

Gibbs Sampling

A special case of MCMC that avoids rejection entirely by sampling each parameter one at a time from its conditional distribution.

Algorithm

For parameters θ=(θ1,θ2,,θd)\theta = (\theta_1, \theta_2, \ldots, \theta_d):

At each iteration:

  1. Sample θ1(t+1)P(θ1θ2(t),θ3(t),,θd(t),D)\theta_1^{(t+1)} \sim P(\theta_1 \mid \theta_2^{(t)}, \theta_3^{(t)}, \ldots, \theta_d^{(t)}, \mathcal{D})
  2. Sample θ2(t+1)P(θ2θ1(t+1),θ3(t),,θd(t),D)\theta_2^{(t+1)} \sim P(\theta_2 \mid \theta_1^{(t+1)}, \theta_3^{(t)}, \ldots, \theta_d^{(t)}, \mathcal{D})
  3. …continue for all parameters

Each step samples from a full conditional — the distribution of one parameter given all the others.

When Gibbs Works Well

Gibbs sampling is efficient when:

  • The full conditionals have known forms (common with conjugate priors)
  • Parameters are not highly correlated (otherwise the chain mixes slowly)

In ML: Gibbs sampling is the standard algorithm for Latent Dirichlet Allocation (LDA) topic models, Gaussian Mixture Models, and many hierarchical Bayesian models. It is natural for inference in graphical models since each variable’s full conditional depends only on its Markov blanket.

When Gibbs Struggles

When parameters are strongly correlated, Gibbs takes tiny steps along the correlation axis and mixes very slowly. The chain needs thousands of iterations to move between the two modes of a banana-shaped posterior, even though a well-tuned Metropolis-Hastings could jump there directly.

Hamiltonian Monte Carlo (HMC)

HMC uses gradient information to make intelligent proposals that explore the posterior much more efficiently than random walks.

The Physics Analogy

Imagine the negative log-posterior as a landscape. HMC simulates a ball rolling on this landscape with conserved energy:

  • Position =θ= \theta (the parameter)
  • Momentum =r= r (auxiliary variable)
  • Potential energy =logP(θD)= -\log P(\theta \mid \mathcal{D})
  • Kinetic energy =12rr= \frac{1}{2} r^\top r

The ball follows Hamilton’s equations, which conserve total energy. This creates proposals that are far from the current point but still in high-probability regions.

Why HMC is Better

  • No random walk: Moves are directed by the gradient, avoiding the aimless wandering of Metropolis-Hastings
  • High acceptance rate: Energy conservation means proposals are usually accepted
  • Scales to high dimensions: Much better than random walk MCMC in 100+ dimensions

NUTS (No-U-Turn Sampler)

The main practical challenge with HMC is choosing the number of leapfrog steps. NUTS (Hoffman & Gelman, 2014) automates this by detecting when the trajectory starts turning back on itself.

NUTS is the default algorithm in modern probabilistic programming frameworks like Stan and PyMC.

import pymc as pm
import numpy as np

# Bayesian linear regression with NUTS
data_x = np.random.randn(100)
data_y = 2.5 * data_x + 1.0 + np.random.randn(100) * 0.5

with pm.Model() as model:
    w = pm.Normal("w", mu=0, sigma=10)
    b = pm.Normal("b", mu=0, sigma=10)
    sigma = pm.HalfNormal("sigma", sigma=1)

    mu = w * data_x + b
    y_obs = pm.Normal("y_obs", mu=mu, sigma=sigma, observed=data_y)

    trace = pm.sample(2000, tune=1000)  # NUTS by default

print(pm.summary(trace))

Diagnostics: Is My Chain Healthy?

MCMC gives approximate samples. How do you know if the approximation is good?

Trace Plots

Plot parameter values over iterations. A healthy chain looks like “white noise” — no trends, no getting stuck.

R^\hat{R} (R-hat) Statistic

Run multiple independent chains and compare within-chain variance to between-chain variance:

R^=between-chain variancewithin-chain variance\hat{R} = \sqrt{\frac{\text{between-chain variance}}{\text{within-chain variance}}}
  • R^1.0\hat{R} \approx 1.0: Chains have converged (good)
  • R^>1.01\hat{R} > 1.01: Chains haven’t converged (run longer or fix the model)

Effective Sample Size (ESS)

MCMC samples are autocorrelated — consecutive samples are not independent. ESS estimates how many independent samples the chain is equivalent to:

ESS=N1+2k=1ρk\text{ESS} = \frac{N}{1 + 2\sum_{k=1}^{\infty} \rho_k}

where ρk\rho_k is the autocorrelation at lag kk. Aim for ESS > 400 per parameter.

Divergences (HMC/NUTS)

In HMC, divergent transitions indicate the algorithm is struggling — the numerical integrator can’t follow the posterior geometry. Divergences often signal:

  • A need for reparameterization
  • A model with pathological geometry (funnels, multimodality)

Comparison of Methods

MethodProsConsDimensions
Rejection samplingExact samples, simpleExponential waste in high-D1-5
Importance samplingNo rejection, reweightingHigh variance if proposal is poor1-20
Metropolis-HastingsGeneral, flexibleSlow random walk in high-D1-50
Gibbs samplingNo rejection, uses conditionalsSlow with correlations10-1000
HMC / NUTSGradient-guided, efficientRequires differentiable posterior10-10000+
Variational inferenceFast, scalableApproximate, can underestimate varianceAny

Summary

  • Monte Carlo estimation approximates expectations by averaging over samples
  • Rejection sampling draws exact samples but fails in high dimensions
  • Importance sampling reweights samples from a proposal distribution
  • MCMC constructs Markov chains whose stationary distribution is the target posterior
  • Metropolis-Hastings is the foundational MCMC algorithm; tune the step size for ~30% acceptance
  • Gibbs sampling updates one parameter at a time from full conditionals
  • HMC uses gradients for efficient exploration; NUTS automates its tuning
  • Always check diagnostics: trace plots, R^\hat{R}, ESS, and divergences
  • These methods are the computational engine behind modern Bayesian ML

References

  • Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. Chapter 11.
  • Murphy, K. P. (2023). Probabilistic Machine Learning: Advanced Topics. MIT Press. Chapters 11-12.
  • Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). Bayesian Data Analysis (3rd ed.). Chapman & Hall/CRC. Chapters 11-12.
  • Brooks, S., Gelman, A., Jones, G. L., & Meng, X.-L. (2011). Handbook of Markov Chain Monte Carlo. Chapman & Hall/CRC.
  • Hoffman, M. D., & Gelman, A. (2014). “The No-U-Turn Sampler.” Journal of Machine Learning Research, 15, 1593—1623. arXiv:1111.4246
  • Neal, R. M. (2011). “MCMC Using Hamiltonian Dynamics.” In Handbook of Markov Chain Monte Carlo. Chapman & Hall/CRC. arXiv:1206.1901
  • Robert, C. P., & Casella, G. (2004). Monte Carlo Statistical Methods (2nd ed.). Springer.

Keyboard Shortcuts

Navigation
j
Next heading
k
Previous heading
n
Next article in series
p
Previous article in series
t
Scroll to top
Actions
r
Toggle reading mode
Ctrl K
Search articles
?
Toggle this help
Esc
Close overlay