Sampling Methods

Probability & Statistics Series 13 / 13

The Computational Challenge

Bayesian inference requires computing integrals like:

\mathbb{E}[f(\theta) \mid \mathcal{D}] = \int f(\theta) \, P(\theta \mid \mathcal{D}) \, d\theta

In most real-world models, these integrals are analytically intractable. The posterior $P(\theta \mid \mathcal{D})$ doesn’t have a nice closed form, and the parameter space is high-dimensional.

Sampling methods solve this by drawing samples $\theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(N)}$ from the posterior and approximating expectations with averages:

\mathbb{E}[f(\theta) \mid \mathcal{D}] \approx \frac{1}{N} \sum_{i=1}^{N} f(\theta^{(i)})

By the Law of Large Numbers, this converges to the true expectation as $N \to \infty$ .

Monte Carlo Estimation

The simplest idea: if you can sample from a distribution $P(x)$ , you can estimate any expectation by averaging:

\mathbb{E}[f(X)] = \int f(x) \, P(x) \, dx \approx \frac{1}{N} \sum_{i=1}^{N} f(x^{(i)}), \quad x^{(i)} \sim P(x)

Properties

Unbiased: The estimator is unbiased for any $N$
Convergence rate: Standard error decreases as $O(1/\sqrt{N})$ , regardless of dimensionality
Dimension-independent: Unlike grid-based numerical integration (which scales exponentially with dimension), Monte Carlo’s rate doesn’t depend on dimension

Key insight: In 10 dimensions, a grid with 10 points per dimension needs $10^{10}$ evaluations. Monte Carlo needs the same $N$ samples regardless of dimension. This is why Monte Carlo dominates in high-dimensional problems.

Example: Estimating $\pi$

A classic demonstration. Sample points uniformly in $[0, 1]^2$ and count how many fall inside the quarter-circle $x^2 + y^2 \leq 1$ :

import numpy as np

N = 1_000_000
x = np.random.uniform(0, 1, N)
y = np.random.uniform(0, 1, N)
inside = (x**2 + y**2) <= 1
pi_estimate = 4 * np.mean(inside)

print(f"Estimated pi: {pi_estimate:.4f}")  # ≈ 3.1416

Rejection Sampling

What if we can’t sample directly from $P(x)$ but can evaluate it (up to a normalizing constant)?

Idea: Sample from a simpler proposal distribution $q(x)$ and accept/reject samples to correct for the mismatch.

Algorithm

Choose a proposal $q(x)$ and a constant $M$ such that $M \cdot q(x) \geq P(x)$ for all $x$
Sample $x \sim q(x)$
Sample $u \sim \text{Uniform}(0, 1)$
Accept $x$ if $u \leq \frac{P(x)}{M \cdot q(x)}$ , otherwise reject and go to step 2

Accepted samples are distributed according to $P(x)$ .

Limitations

Acceptance rate: $1/M$ — finding a tight $M$ is critical
High dimensions: The acceptance rate drops exponentially with dimension. A well-fitting proposal in 1D might have $M = 2$ , but in 100D, $M$ could be $2^{100}$
Practical: Rarely used for more than a few dimensions

Importance Sampling

Instead of rejecting samples, reweight them to correct for using the wrong distribution.

To estimate $\mathbb{E}_P[f(X)]$ using samples from $q(x)$ :

\mathbb{E}_P[f(X)] = \int f(x) \frac{P(x)}{q(x)} q(x) \, dx = \mathbb{E}_q\left[f(X) \cdot w(X)\right]

where $w(x) = P(x) / q(x)$ are the importance weights.

Estimator

\hat{\mu} = \frac{1}{N} \sum_{i=1}^{N} f(x^{(i)}) \cdot w(x^{(i)}), \quad x^{(i)} \sim q(x)

Self-Normalized Importance Sampling

When $P(x)$ is known only up to a normalizing constant (common in Bayesian inference), use:

\hat{\mu} = \frac{\sum_{i=1}^{N} f(x^{(i)}) \cdot w(x^{(i)})}{\sum_{i=1}^{N} w(x^{(i)})}

Choosing the Proposal

The variance of the importance sampling estimator depends critically on how well $q$ matches $P$ :

If $q$ is too narrow: some regions of $P$ get huge weights, causing high variance
If $q$ is too broad: most weights are tiny, samples are wasted
Optimal $q \propto |f(x)| \cdot P(x)$ — but this requires knowing what we’re trying to compute

Warning: If $q(x) = 0$ where $P(x) > 0$ , the estimator is biased. The proposal must have heavier tails than the target.

In ML: Importance sampling is used in off-policy reinforcement learning, rare event simulation, and the training of variational autoencoders.

Markov Chain Monte Carlo (MCMC)

The most important family of sampling methods. MCMC constructs a Markov chain whose stationary distribution is the target distribution $P(\theta \mid \mathcal{D})$ .

Core idea: We can’t sample directly from the posterior, but we can design a random walk that, after enough steps, explores the posterior landscape faithfully.

Key Concepts

Markov chain: A sequence where each state depends only on the previous one: $P(\theta^{(t+1)} \mid \theta^{(t)}, \theta^{(t-1)}, \ldots) = P(\theta^{(t+1)} \mid \theta^{(t)})$
Stationary distribution: The distribution $\pi$ such that if $\theta^{(t)} \sim \pi$ , then $\theta^{(t+1)} \sim \pi$
Ergodicity: The chain eventually reaches and stays in the stationary distribution regardless of starting point
Burn-in: Initial samples before the chain converges — these are discarded
Mixing: How quickly the chain explores the full distribution

Metropolis-Hastings Algorithm

The foundational MCMC algorithm.

Algorithm

Starting from some $\theta^{(0)}$ :

Propose: Sample a candidate $\theta^* \sim q(\theta^* \mid \theta^{(t)})$ from a proposal distribution
Compute acceptance ratio:

\alpha = \min\left(1, \frac{P(\theta^* \mid \mathcal{D}) \cdot q(\theta^{(t)} \mid \theta^*)}{P(\theta^{(t)} \mid \mathcal{D}) \cdot q(\theta^* \mid \theta^{(t)})}\right)

Accept or reject: Sample $u \sim \text{Uniform}(0, 1)$ . If $u \leq \alpha$ , set $\theta^{(t+1)} = \theta^*$ . Otherwise, $\theta^{(t+1)} = \theta^{(t)}$

Key insight: We only need the posterior up to a proportionality constant — the normalizing constant $P(\mathcal{D})$ cancels in the ratio. This is what makes MCMC practical for Bayesian inference.

Random Walk Metropolis

The most common variant uses a symmetric proposal $q(\theta^* \mid \theta^{(t)}) = \mathcal{N}(\theta^{(t)}, \sigma^2 I)$ . Since $q$ is symmetric, the acceptance ratio simplifies to:

\alpha = \min\left(1, \frac{P(\theta^* \mid \mathcal{D})}{P(\theta^{(t)} \mid \mathcal{D})}\right)

The algorithm always moves to higher-probability regions and sometimes moves to lower-probability regions — allowing it to explore the full posterior.

Tuning the Proposal

The step size $\sigma$ critically affects performance:

Step Size	Behavior
Too small	High acceptance rate but slow exploration (chain stays local)
Too large	Low acceptance rate (most proposals rejected, chain gets stuck)
Just right	Acceptance rate ~23-44% (optimal for high-dimensional Gaussians)

import numpy as np

def metropolis_hastings(log_posterior, theta_init, sigma, n_samples, burn_in=1000):
    theta = theta_init.copy()
    samples = []
    accepted = 0

    for i in range(n_samples + burn_in):
        theta_star = theta + sigma * np.random.randn(len(theta))
        log_alpha = log_posterior(theta_star) - log_posterior(theta)

        if np.log(np.random.rand()) < log_alpha:
            theta = theta_star
            accepted += 1

        if i >= burn_in:
            samples.append(theta.copy())

    print(f"Acceptance rate: {accepted / (n_samples + burn_in):.2%}")
    return np.array(samples)

Gibbs Sampling

A special case of MCMC that avoids rejection entirely by sampling each parameter one at a time from its conditional distribution.

Algorithm

For parameters $\theta = (\theta_1, \theta_2, \ldots, \theta_d)$ :

At each iteration:

Sample $\theta_1^{(t+1)} \sim P(\theta_1 \mid \theta_2^{(t)}, \theta_3^{(t)}, \ldots, \theta_d^{(t)}, \mathcal{D})$
Sample $\theta_2^{(t+1)} \sim P(\theta_2 \mid \theta_1^{(t+1)}, \theta_3^{(t)}, \ldots, \theta_d^{(t)}, \mathcal{D})$
…continue for all parameters

Each step samples from a full conditional — the distribution of one parameter given all the others.

When Gibbs Works Well

Gibbs sampling is efficient when:

The full conditionals have known forms (common with conjugate priors)
Parameters are not highly correlated (otherwise the chain mixes slowly)

In ML: Gibbs sampling is the standard algorithm for Latent Dirichlet Allocation (LDA) topic models, Gaussian Mixture Models, and many hierarchical Bayesian models. It is natural for inference in graphical models since each variable’s full conditional depends only on its Markov blanket.

When Gibbs Struggles

When parameters are strongly correlated, Gibbs takes tiny steps along the correlation axis and mixes very slowly. The chain needs thousands of iterations to move between the two modes of a banana-shaped posterior, even though a well-tuned Metropolis-Hastings could jump there directly.

Hamiltonian Monte Carlo (HMC)

HMC uses gradient information to make intelligent proposals that explore the posterior much more efficiently than random walks.

The Physics Analogy

Imagine the negative log-posterior as a landscape. HMC simulates a ball rolling on this landscape with conserved energy:

Position $= \theta$ (the parameter)
Momentum $= r$ (auxiliary variable)
Potential energy $= -\log P(\theta \mid \mathcal{D})$
Kinetic energy $= \frac{1}{2} r^\top r$

The ball follows Hamilton’s equations, which conserve total energy. This creates proposals that are far from the current point but still in high-probability regions.

Why HMC is Better

No random walk: Moves are directed by the gradient, avoiding the aimless wandering of Metropolis-Hastings
High acceptance rate: Energy conservation means proposals are usually accepted
Scales to high dimensions: Much better than random walk MCMC in 100+ dimensions

NUTS (No-U-Turn Sampler)

The main practical challenge with HMC is choosing the number of leapfrog steps. NUTS (Hoffman & Gelman, 2014) automates this by detecting when the trajectory starts turning back on itself.

NUTS is the default algorithm in modern probabilistic programming frameworks like Stan and PyMC.

import pymc as pm
import numpy as np

# Bayesian linear regression with NUTS
data_x = np.random.randn(100)
data_y = 2.5 * data_x + 1.0 + np.random.randn(100) * 0.5

with pm.Model() as model:
    w = pm.Normal("w", mu=0, sigma=10)
    b = pm.Normal("b", mu=0, sigma=10)
    sigma = pm.HalfNormal("sigma", sigma=1)

    mu = w * data_x + b
    y_obs = pm.Normal("y_obs", mu=mu, sigma=sigma, observed=data_y)

    trace = pm.sample(2000, tune=1000)  # NUTS by default

print(pm.summary(trace))

Diagnostics: Is My Chain Healthy?

MCMC gives approximate samples. How do you know if the approximation is good?

Trace Plots

Plot parameter values over iterations. A healthy chain looks like “white noise” — no trends, no getting stuck.

$\hat{R}$ (R-hat) Statistic

Run multiple independent chains and compare within-chain variance to between-chain variance:

\hat{R} = \sqrt{\frac{\text{between-chain variance}}{\text{within-chain variance}}}

$\hat{R} \approx 1.0$ : Chains have converged (good)
$\hat{R} > 1.01$ : Chains haven’t converged (run longer or fix the model)

Effective Sample Size (ESS)

MCMC samples are autocorrelated — consecutive samples are not independent. ESS estimates how many independent samples the chain is equivalent to:

\text{ESS} = \frac{N}{1 + 2\sum_{k=1}^{\infty} \rho_k}

where $\rho_k$ is the autocorrelation at lag $k$ . Aim for ESS > 400 per parameter.

Divergences (HMC/NUTS)

In HMC, divergent transitions indicate the algorithm is struggling — the numerical integrator can’t follow the posterior geometry. Divergences often signal:

A need for reparameterization
A model with pathological geometry (funnels, multimodality)

Comparison of Methods

Method	Pros	Cons	Dimensions
Rejection sampling	Exact samples, simple	Exponential waste in high-D	1-5
Importance sampling	No rejection, reweighting	High variance if proposal is poor	1-20
Metropolis-Hastings	General, flexible	Slow random walk in high-D	1-50
Gibbs sampling	No rejection, uses conditionals	Slow with correlations	10-1000
HMC / NUTS	Gradient-guided, efficient	Requires differentiable posterior	10-10000+
Variational inference	Fast, scalable	Approximate, can underestimate variance	Any

Summary

Monte Carlo estimation approximates expectations by averaging over samples
Rejection sampling draws exact samples but fails in high dimensions
Importance sampling reweights samples from a proposal distribution
MCMC constructs Markov chains whose stationary distribution is the target posterior
Metropolis-Hastings is the foundational MCMC algorithm; tune the step size for ~30% acceptance
Gibbs sampling updates one parameter at a time from full conditionals
HMC uses gradients for efficient exploration; NUTS automates its tuning
Always check diagnostics: trace plots, $\hat{R}$ , ESS, and divergences
These methods are the computational engine behind modern Bayesian ML

References

Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. Chapter 11.
Murphy, K. P. (2023). Probabilistic Machine Learning: Advanced Topics. MIT Press. Chapters 11-12.
Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). Bayesian Data Analysis (3rd ed.). Chapman & Hall/CRC. Chapters 11-12.
Brooks, S., Gelman, A., Jones, G. L., & Meng, X.-L. (2011). Handbook of Markov Chain Monte Carlo. Chapman & Hall/CRC.
Hoffman, M. D., & Gelman, A. (2014). “The No-U-Turn Sampler.” Journal of Machine Learning Research, 15, 1593—1623. arXiv:1111.4246
Neal, R. M. (2011). “MCMC Using Hamiltonian Dynamics.” In Handbook of Markov Chain Monte Carlo. Chapman & Hall/CRC. arXiv:1206.1901
Robert, C. P., & Casella, G. (2004). Monte Carlo Statistical Methods (2nd ed.). Springer.

Sampling Methods

The Computational Challenge

Monte Carlo Estimation

Properties

Example: Estimating π\piπ

Rejection Sampling

Algorithm

Limitations

Importance Sampling

Estimator

Self-Normalized Importance Sampling

Choosing the Proposal

Markov Chain Monte Carlo (MCMC)

Key Concepts

Metropolis-Hastings Algorithm

Algorithm

Random Walk Metropolis

Tuning the Proposal

Gibbs Sampling

Algorithm

When Gibbs Works Well

When Gibbs Struggles

Hamiltonian Monte Carlo (HMC)

The Physics Analogy

Why HMC is Better

NUTS (No-U-Turn Sampler)

Diagnostics: Is My Chain Healthy?

Trace Plots

R^\hat{R}R^ (R-hat) Statistic

Effective Sample Size (ESS)

Divergences (HMC/NUTS)

Comparison of Methods

Summary

References

Keyboard Shortcuts

Example: Estimating $\pi$

$\hat{R}$ (R-hat) Statistic