Bayesian Inference

Probability & Statistics Series 11 / 13

Beyond Point Estimates

MLE and MAP give us single best-guess parameter values. But a point estimate throws away valuable information: how uncertain are we?

Full Bayesian inference keeps the entire posterior distribution over parameters. Instead of saying “the coin’s bias is 0.7,” Bayesian inference says “the bias is probably between 0.6 and 0.8, with peak probability around 0.7.” This distinction matters enormously when data is scarce or decisions are high-stakes.

The Bayesian Framework

Start with Bayes’ theorem applied to model parameters $\theta$ and observed data $\mathcal{D}$:

$$P(\theta \mid \mathcal{D}) = \frac{P(\mathcal{D} \mid \theta) \cdot P(\theta)}{P(\mathcal{D})}$$

Each term has a specific role:

Term	Name	Role
$P(\theta \mid \mathcal{D})$	Posterior	Updated belief after seeing data
$P(\mathcal{D} \mid \theta)$	Likelihood	How probable the data is for each $\theta$
$P(\theta)$	Prior	Belief before seeing data
$P(\mathcal{D})$	Evidence (marginal likelihood)	Normalizing constant

The evidence term ensures the posterior integrates to 1:

$$P(\mathcal{D}) = \int P(\mathcal{D} \mid \theta) \cdot P(\theta) \, d\theta$$

This integral is the main computational obstacle in Bayesian inference: it rarely has a closed form, and numerical integration becomes intractable as the number of parameters grows.
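In one dimension the evidence can simply be approximated on a grid. The following is a minimal sketch, assuming a coin-flip likelihood for a specific sequence of flips and a uniform prior; the function name `grid_posterior` and the grid size are illustrative choices, not something the article prescribes.

```python
def grid_posterior(k, n, prior, num_points=1001):
    """Approximate the posterior over a coin's bias on an evenly spaced grid.

    k, n  : observed heads and total flips
    prior : function mapping theta in [0, 1] to a prior density value
    """
    step = 1.0 / (num_points - 1)
    thetas = [i * step for i in range(num_points)]
    # Unnormalized posterior: likelihood times prior at each grid point.
    unnorm = [(t ** k) * ((1 - t) ** (n - k)) * prior(t) for t in thetas]
    # Trapezoidal rule approximates the evidence integral P(D).
    evidence = step * (sum(unnorm) - 0.5 * (unnorm[0] + unnorm[-1]))
    posterior = [u / evidence for u in unnorm]
    return thetas, posterior, evidence


# 7 heads in 10 flips with a uniform prior on the bias.
thetas, posterior, evidence = grid_posterior(k=7, n=10, prior=lambda t: 1.0)
```

A grid with 1,001 points per parameter would need 1,001^d likelihood evaluations for d parameters, which is why this brute-force approach stops being practical after a handful of dimensions.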

Conjugate Priors

A prior is conjugate to a likelihood if the posterior has the same distributional form as the prior. This gives closed-form updates, avoiding the need for numerical integration.

Likelihood	Conjugate Prior	Posterior
Bernoulli/Binomial	$\text{Beta}(\alpha, \beta)$	$\text{Beta}(\alpha + k, \beta + n - k)$
Poisson	$\text{Gamma}(\alpha, \beta)$, rate $\beta$	$\text{Gamma}(\alpha + \sum x_i, \beta + n)$
Gaussian (known $\sigma$)	$\mathcal{N}(\mu_0, \sigma_0^2)$ on $\mu$	$\mathcal{N}(\mu_n, \sigma_n^2)$
Gaussian (known $\mu$)	$\text{Inv-Gamma}(\alpha, \beta)$ on $\sigma^2$	$\text{Inv-Gamma}(\alpha + n/2, \beta + \tfrac{1}{2}\sum (x_i - \mu)^2)$
Multinomial	$\text{Dirichlet}(\boldsymbol{\alpha})$	$\text{Dirichlet}(\boldsymbol{\alpha} + \mathbf{c})$
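The updates in the table amount to simple arithmetic on the prior's parameters. As a rough sketch, here are two of the rows in code, assuming the rate parameterization of the Gamma distribution; the function names and the data are hypothetical.

```python
def update_gamma_poisson(alpha, beta, counts):
    """Gamma(alpha, beta) prior (beta is a rate) plus Poisson counts -> Gamma posterior."""
    return alpha + sum(counts), beta + len(counts)


def update_dirichlet_multinomial(alphas, category_counts):
    """Dirichlet(alphas) prior plus per-category counts -> Dirichlet posterior."""
    return [a + c for a, c in zip(alphas, category_counts)]


# Hypothetical data: event counts observed in five intervals.
post_alpha, post_beta = update_gamma_poisson(2.0, 1.0, [3, 5, 4, 2, 6])
rate_mean = post_alpha / post_beta  # posterior mean of the Poisson rate

# Hypothetical three-category counts with a uniform Dirichlet(1, 1, 1) prior.
post_alphas = update_dirichlet_multinomial([1.0, 1.0, 1.0], [10, 3, 7])
```

Because the posterior has the same form as the prior, it can serve as the prior for the next batch of data, and updating in several steps gives the same result as updating once on all the data.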

See the distributions article for details on each of these distribution families. The exponential family article explains why conjugate priors exist naturally for this family.

Example: Beta-Binomial

Suppose we want to estimate a coin’s bias $p$.

Prior: $p \sim \text{Beta}(2, 2)$, a mild belief that the coin is roughly fair.

Data: 7 heads in 10 flips ($k = 7$, $n = 10$).

Posterior:

$$p \mid \text{data} \sim \text{Beta}(2 + 7, 2 + 3) = \text{Beta}(9, 5)$$

The posterior mean:

$$\mathbb{E}[p \mid \text{data}] = \frac{9}{9 + 5} \approx 0.643$$

Compare this to:

MLE: $\hat{p} = 7/10 = 0.700$ (purely data-driven)
MAP: $\hat{p} = (7 + 2 - 1)/(10 + 2 + 2 - 2) = 8/12 \approx 0.667$ (mode of the posterior)
Posterior mean: $0.643$ (mean of posterior, a common Bayesian point estimate)

The posterior mean is shrunk toward the prior mean (0.5) more than MAP, reflecting the full distribution’s shape rather than just its peak.
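The three estimates can be reproduced in a few lines. This is a small sketch of the arithmetic in this example; the function name `beta_binomial_summary` is made up for illustration.

```python
def beta_binomial_summary(alpha, beta, k, n):
    """Posterior parameters and three point estimates for a coin's bias."""
    post_a = alpha + k          # prior pseudo-heads plus observed heads
    post_b = beta + (n - k)     # prior pseudo-tails plus observed tails
    mle = k / n
    # Mode of Beta(a, b); this formula is valid when a > 1 and b > 1.
    map_estimate = (post_a - 1) / (post_a + post_b - 2)
    post_mean = post_a / (post_a + post_b)
    return post_a, post_b, mle, map_estimate, post_mean


post_a, post_b, mle, map_estimate, post_mean = beta_binomial_summary(2, 2, k=7, n=10)
print(f"Beta({post_a}, {post_b}): MLE={mle:.3f}, MAP={map_estimate:.3f}, mean={post_mean:.3f}")
# prints: Beta(9, 5): MLE=0.700, MAP=0.667, mean=0.643
```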

Example: Gaussian-Gaussian

For Gaussian data with known variance $\sigma^2$, and a Gaussian prior on the mean:

Prior: $\mu \sim \mathcal{N}(\mu_0, \sigma_0^2)$

Posterior after observing $n$ data points with sample mean $\bar{x}$:

$$\mu \mid \text{data} \sim \mathcal{N}(\mu_n, \sigma_n^2)$$

where:

$$\mu_n = \frac{\sigma^2 \mu_0 + n \sigma_0^2 \bar{x}}{\sigma^2 + n \sigma_0^2} \qquad \sigma_n^2 = \frac{\sigma^2 \sigma_0^2}{\sigma^2 + n \sigma_0^2}$$

The posterior mean is a precision-weighted average of the prior mean and the sample mean. The more data you have (larger $n$), the closer $\mu_n$ gets to $\bar{x}$.

Key insight: The posterior precision (inverse variance) is the sum of prior precision and data precision: $1/\sigma_n^2 = 1/\sigma_0^2 + n/\sigma^2$. Information from independent sources adds.
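The update formulas translate directly into code. The sketch below uses hypothetical measurements and an arbitrary prior, and checks numerically that the precisions add.

```python
def gaussian_posterior(mu0, sigma0_sq, sigma_sq, xs):
    """Posterior N(mu_n, sigma_n_sq) for a Gaussian mean with known variance sigma_sq."""
    n = len(xs)
    x_bar = sum(xs) / n
    denom = sigma_sq + n * sigma0_sq
    mu_n = (sigma_sq * mu0 + n * sigma0_sq * x_bar) / denom
    sigma_n_sq = (sigma_sq * sigma0_sq) / denom
    return mu_n, sigma_n_sq


# Hypothetical measurements; prior N(0, 4), known noise variance 1.
data = [2.1, 1.9, 2.4, 2.0]
mu_n, sigma_n_sq = gaussian_posterior(mu0=0.0, sigma0_sq=4.0, sigma_sq=1.0, xs=data)

# Precisions add: 1/sigma_n^2 = 1/sigma_0^2 + n/sigma^2
posterior_precision = 1 / sigma_n_sq
expected_precision = 1 / 4.0 + len(data) / 1.0
```

With only four observations the posterior mean (about 1.98) already sits close to the sample mean of 2.1, because the prior variance of 4 is large relative to the noise variance of 1.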

Posterior Predictive Distribution

Once we have the posterior over parameters, we can make predictions that integrate over our uncertainty:

$$P(x_{\text{new}} \mid \mathcal{D}) = \int P(x_{\text{new}} \mid \theta) \cdot P(\theta \mid \mathcal{D}) \, d\theta$$

This is the posterior predictive distribution. Instead of plugging in a single $\hat{\theta}$ , we average predictions over all plausible parameter values, weighted by their posterior probability.

Why This Matters

With a point estimate:

$$P(x_{\text{new}} \mid \hat{\theta}) = \text{single prediction}$$

With full Bayesian inference:

$$P(x_{\text{new}} \mid \mathcal{D}) = \text{prediction that accounts for parameter uncertainty}$$

The Bayesian prediction is generally better calibrated — it’s wider when we’re uncertain and narrower when we’re confident.

Example: After 3 coin flips (all heads), MLE predicts the next flip is heads with probability 1.0. The Bayesian predictive (with Beta(1,1) prior) gives $P(\text{heads}) = 4/5 = 0.8$ — more reasonable.
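For the coin example the predictive integral has a closed form: integrating $P(\text{heads} \mid p) = p$ against a Beta posterior gives the posterior mean. A brief sketch, with `predictive_heads` as an illustrative name:

```python
def predictive_heads(alpha, beta, k, n):
    """P(next flip is heads | data) under a Beta(alpha, beta) prior.

    Integrating P(heads | p) = p against the Beta posterior gives its mean.
    """
    return (alpha + k) / (alpha + beta + n)


k, n = 3, 3                           # three flips, all heads
plug_in = k / n                       # MLE plug-in prediction
bayes = predictive_heads(1, 1, k, n)  # uniform Beta(1, 1) prior
print(plug_in, bayes)  # prints: 1.0 0.8
```

As more flips arrive, the influence of the prior pseudo-counts fades and the two predictions converge.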

Bayesian Model Comparison

The evidence $P(\mathcal{D})$ — often dismissed as “just a normalizing constant” — is actually the key to Bayesian model comparison.

The Bayes factor compares two models:

$$\text{BF}_{12} = \frac{P(\mathcal{D} \mid M_1)}{P(\mathcal{D} \mid M_2)} = \frac{\int P(\mathcal{D} \mid \theta_1, M_1) P(\theta_1 \mid M_1) \, d\theta_1}{\int P(\mathcal{D} \mid \theta_2, M_2) P(\theta_2 \mid M_2) \, d\theta_2}$$

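Bayes factor $\text{BF}_{12}$	Strength of evidence for $M_1$
1 to 3	Barely worth mentioning
3 to 10	Moderate
10 to 30	Strong
30 to 100	Very strong
> 100	Decisive

These thresholds are conventional rules of thumb, not sharp cutoffs. Values below 1 favor $M_2$ by the reciprocal amount.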

The Bayes factor naturally penalizes model complexity — a model with more parameters must spread its prior over a larger space, reducing the marginal likelihood unless the data strongly supports it. This is called the Bayesian Occam’s razor.
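To make the Occam effect concrete, here is a sketch comparing two coin models: $M_1$ fixes the bias at 0.5 and has no free parameters, while $M_2$ places a Beta prior on the bias. Both models and the function names are assumptions chosen for illustration. The likelihood is that of a specific sequence of flips, so the binomial coefficient would cancel in the ratio anyway.

```python
import math


def log_beta(a, b):
    """Log of the Beta function B(a, b), via log-gamma for numerical stability."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def log_evidence_fair(k, n):
    """M1: the bias is fixed at 0.5, so there is nothing to integrate over."""
    return n * math.log(0.5)


def log_evidence_beta(k, n, alpha=1.0, beta=1.0):
    """M2: bias ~ Beta(alpha, beta); the evidence integral has a closed form."""
    return log_beta(alpha + k, beta + n - k) - log_beta(alpha, beta)


def bayes_factor_12(k, n):
    return math.exp(log_evidence_fair(k, n) - log_evidence_beta(k, n))


bf_mild = bayes_factor_12(k=7, n=10)       # about 1.29
bf_extreme = bayes_factor_12(k=70, n=100)  # well below 1
```

With 7 heads in 10 flips the simpler model is slightly favored even though the flexible model can fit the data better, because $M_2$ spreads its prior mass over every possible bias. With 70 heads in 100 flips the data are strong enough to overcome that penalty and the Bayes factor swings toward $M_2$.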

Bayesian vs Frequentist

These are two fundamentally different philosophies of probability:

Aspect	Frequentist	Bayesian
Probability	Long-run frequency of events	Degree of belief
Parameters	Fixed but unknown	Random variables
Inference	Point estimates + confidence intervals	Posterior distributions
Prior knowledge	Not used	Explicitly encoded
Uncertainty	Via sampling distributions	Via posterior width
Computation	Often analytical or optimization-based	Often requires MCMC or variational approximations

When to Use Each

Frequentist (MLE, hypothesis tests):

Large datasets where the prior doesn’t matter
When stakeholders expect p-values and confidence intervals
When computational resources are limited

Bayesian:

Small datasets where prior knowledge helps
When you need uncertainty quantification
Sequential decision-making (updating beliefs as data arrives)
When you want to compare models naturally (Bayes factors)

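In practice, the distinction blurs. Regularized MLE corresponds to MAP estimation under a matching prior; for example, an L2 penalty corresponds to a zero-mean Gaussian prior. Neural network ensembles can be read as rough approximations to Bayesian posteriors. The best practitioners use both frameworks where appropriate.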
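The link between regularization and MAP estimation can be written out in one line. The derivation below assumes a zero-mean isotropic Gaussian prior with variance $\tau^2$ (a symbol introduced here for illustration); other priors give other penalties.

```latex
\begin{aligned}
\hat{\theta}_{\text{MAP}}
  &= \arg\max_{\theta} \left[ \log P(\mathcal{D} \mid \theta) + \log P(\theta) \right] \\
  &= \arg\min_{\theta} \left[ -\log P(\mathcal{D} \mid \theta) + \frac{1}{2\tau^2} \|\theta\|_2^2 \right]
  \quad \text{when } \theta \sim \mathcal{N}(0, \tau^2 I)
\end{aligned}
```

The second line is the negative log-likelihood plus an L2 penalty with weight $1/(2\tau^2)$: a tighter prior (smaller $\tau^2$) means stronger regularization.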

Hierarchical Bayesian Models

When you have groups of related parameters, hierarchical (multilevel) models share information across groups:

$$\begin{aligned} \mu_j &\sim \mathcal{N}(\mu_0, \tau^2) \quad \text{(group means drawn from population)} \\ X_{ij} &\sim \mathcal{N}(\mu_j, \sigma^2) \quad \text{(observations within groups)} \end{aligned}$$

The hyperparameters $\mu_0$ and $\tau^2$ are also given priors and inferred from data. This creates partial pooling: small groups borrow strength from larger groups.
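Partial pooling is easiest to see in a simplified setting. The sketch below holds $\mu_0$, $\tau^2$ and $\sigma^2$ fixed at made-up values rather than inferring them, which a full hierarchical model would do. Under that assumption each group mean has the Gaussian-Gaussian posterior from earlier, with the population distribution acting as the prior.

```python
def partial_pooling_mean(group_values, mu0, tau_sq, sigma_sq):
    """Posterior mean of one group's mu_j when mu0, tau_sq and sigma_sq are held fixed."""
    n = len(group_values)
    group_mean = sum(group_values) / n
    # Weight on the group's own data grows with its sample size.
    weight = (n * tau_sq) / (n * tau_sq + sigma_sq)
    return weight * group_mean + (1 - weight) * mu0


# Hypothetical groups that share the same sample mean (10.0) but differ in size.
small_group = [10.0, 10.0]
large_group = [10.0] * 50
mu0, tau_sq, sigma_sq = 5.0, 1.0, 4.0

small_est = partial_pooling_mean(small_group, mu0, tau_sq, sigma_sq)  # about 6.67
large_est = partial_pooling_mean(large_group, mu0, tau_sq, sigma_sq)  # about 9.63
```

The two-observation group is pulled most of the way toward the population mean of 5, while the fifty-observation group stays close to its own sample mean.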

In ML: Hierarchical models appear in:

Transfer learning — sharing parameters across related tasks
Recommender systems — user preferences as random effects
Meta-learning — learning from distributions over tasks
Mixed effects models — clinical trials with multiple sites

Approximate Bayesian Inference

Exact posteriors are rarely available outside conjugate models. Modern Bayesian methods use approximations:

Variational Inference (VI)

Approximate the posterior $P(\theta \mid \mathcal{D})$ with a simpler distribution $q(\theta)$ by minimizing the KL divergence:

$$q^*(\theta) = \arg\min_{q \in \mathcal{Q}} \text{KL}(q(\theta) \| P(\theta \mid \mathcal{D}))$$

This converts inference into an optimization problem — something we know how to do efficiently.
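The posterior inside the KL term is unknown, so the objective cannot be evaluated directly. A standard identity, written here with the joint $P(\mathcal{D}, \theta) = P(\mathcal{D} \mid \theta) P(\theta)$, shows why the problem is still tractable:

```latex
\log P(\mathcal{D})
  = \underbrace{\mathbb{E}_{q(\theta)}\left[ \log P(\mathcal{D}, \theta) - \log q(\theta) \right]}_{\text{ELBO}(q)}
  + \text{KL}\left( q(\theta) \,\|\, P(\theta \mid \mathcal{D}) \right)
```

Because $\log P(\mathcal{D})$ does not depend on $q$, minimizing the KL divergence is equivalent to maximizing the first term, the evidence lower bound (ELBO), which needs only the likelihood and the prior.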

VI is used in:

Variational Autoencoders (VAEs) — learning latent representations
Bayesian Neural Networks — uncertainty-aware deep learning
Topic models (LDA) — document-topic distributions

Markov Chain Monte Carlo (MCMC)

Generate samples from the posterior by constructing a Markov chain whose stationary distribution is $P(\theta \mid \mathcal{D})$. We cover MCMC in depth in the sampling methods article.
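As a taste of the idea, here is a minimal random-walk Metropolis sketch that targets the Beta(9, 5) posterior from the coin example. The step size, burn-in length and sample count are arbitrary illustrative choices, and practical samplers are considerably more sophisticated.

```python
import math
import random


def log_unnormalized_posterior(p, a=9.0, b=5.0):
    """Log of p^(a-1) * (1-p)^(b-1); the normalizing constant is never needed."""
    if not 0.0 < p < 1.0:
        return -math.inf
    return (a - 1) * math.log(p) + (b - 1) * math.log(1 - p)


def metropolis(num_samples, step=0.2, start=0.5, seed=0):
    rng = random.Random(seed)
    current = start
    current_logp = log_unnormalized_posterior(current)
    samples = []
    for _ in range(num_samples):
        proposal = current + rng.gauss(0.0, step)  # symmetric random-walk proposal
        proposal_logp = log_unnormalized_posterior(proposal)
        # Accept with probability min(1, posterior ratio).
        accept_prob = math.exp(min(0.0, proposal_logp - current_logp))
        if rng.random() < accept_prob:
            current, current_logp = proposal, proposal_logp
        samples.append(current)
    return samples


samples = metropolis(20_000)
kept = samples[2_000:]            # discard early samples as burn-in
estimate = sum(kept) / len(kept)  # should land near the exact mean 9/14
```

Only the ratio of posterior densities appears in the acceptance step, so the evidence $P(\mathcal{D})$ cancels and never has to be computed.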

Laplace Approximation

Approximate the posterior as a Gaussian centered at the MAP estimate:

$$P(\theta \mid \mathcal{D}) \approx \mathcal{N}\left(\theta_{\text{MAP}}, \left[-\nabla^2 \log P(\theta \mid \mathcal{D})\big|_{\theta_{\text{MAP}}}\right]^{-1}\right)$$

The covariance is the inverse Hessian of the negative log-posterior at the MAP point. This is fast but only accurate when the posterior is unimodal and roughly Gaussian.
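For a one-parameter posterior the Hessian is just a second derivative, so the approximation can be checked by hand. A sketch for the Beta(9, 5) posterior from the coin example, with `laplace_beta` as an illustrative name:

```python
import math


def laplace_beta(a, b):
    """Gaussian (Laplace) approximation to a Beta(a, b) posterior with a, b > 1."""
    mode = (a - 1) / (a + b - 2)  # MAP estimate
    # Second derivative of log[p^(a-1) (1-p)^(b-1)], evaluated at the mode.
    second_deriv = -(a - 1) / mode**2 - (b - 1) / (1 - mode) ** 2
    variance = -1.0 / second_deriv  # inverse of the negative second derivative
    return mode, variance


mode, approx_var = laplace_beta(9, 5)
exact_var = (9 * 5) / ((9 + 5) ** 2 * (9 + 5 + 1))  # variance of Beta(9, 5)
print(f"Laplace sd={math.sqrt(approx_var):.3f}, exact sd={math.sqrt(exact_var):.3f}")
# prints: Laplace sd=0.136, exact sd=0.124
```

The approximation is centered at the mode (about 0.667) rather than the mean (about 0.643) and is symmetric, so it misses the skew of the true posterior.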

Bayesian Deep Learning

Applying Bayesian principles to neural networks:

MC Dropout

Gal and Ghahramani (2016) showed that dropout can be interpreted as approximate Bayesian inference. Keeping dropout active at test time and running the network multiple times with different dropout masks gives a distribution of predictions:

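# MC Dropout: approximate Bayesian uncertainty
# Assumes `model` is a torch.nn.Module containing dropout layers
# and `x_test` is an input tensor.
import torch

model.train()  # keep dropout active (this also switches BatchNorm to training mode)
with torch.no_grad():  # gradients are not needed for prediction
    # 100 stochastic forward passes, each with a fresh dropout mask
    predictions = torch.stack([model(x_test) for _ in range(100)])

mean_pred = predictions.mean(dim=0)    # average prediction
uncertainty = predictions.std(dim=0)   # spread across passes as an uncertainty estimate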

Deep Ensembles

Training multiple networks with different initializations and averaging their predictions is often interpreted as a rough approximation to Bayesian model averaging. Lakshminarayanan et al. (2017) showed this gives well-calibrated uncertainty estimates.
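For regression, one common way to combine members is to treat the ensemble as a uniform mixture of Gaussians and summarize it by its mean and variance. The sketch below assumes each network outputs a predictive mean and variance for a given input; the numbers are hypothetical, not results from any trained model.

```python
def combine_ensemble(member_means, member_variances):
    """Moment-match a uniform mixture of Gaussian predictions to one mean and variance."""
    m = len(member_means)
    mean = sum(member_means) / m
    # Law of total variance: average member variance plus spread of member means.
    avg_noise = sum(member_variances) / m
    disagreement = sum((mu - mean) ** 2 for mu in member_means) / m
    return mean, avg_noise + disagreement


# Hypothetical outputs of five independently trained networks for one input.
means = [2.0, 2.2, 1.9, 2.1, 1.8]
variances = [0.10, 0.12, 0.09, 0.11, 0.08]
ens_mean, ens_var = combine_ensemble(means, variances)
```

The disagreement term grows on inputs where the members disagree with each other, which is what lets an ensemble flag unfamiliar inputs.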

Summary

Bayesian inference maintains full posterior distributions, not just point estimates
Conjugate priors give closed-form posteriors — Beta-Binomial and Gaussian-Gaussian are the key examples
The posterior predictive integrates over parameter uncertainty for better-calibrated predictions
Bayes factors compare models while naturally penalizing complexity
Hierarchical models share information across groups via partial pooling
When exact inference is intractable, use variational inference, MCMC, or Laplace approximation
In deep learning, MC Dropout and ensembles approximate Bayesian uncertainty
For computational methods that sample from posteriors, see Sampling Methods

References

Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). Bayesian Data Analysis (3rd ed.). Chapman & Hall/CRC.
Murphy, K. P. (2022). Probabilistic Machine Learning: An Introduction. MIT Press. Chapters 4-5.
Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. Chapters 2-3.
Gal, Y., & Ghahramani, Z. (2016). “Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning.” ICML 2016. arXiv:1506.02142
Lakshminarayanan, B., Pritzel, A., & Blundell, C. (2017). “Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles.” NeurIPS 2017. arXiv:1612.01474
Blei, D. M., Kucukelbir, A., & McAuliffe, J. D. (2017). “Variational Inference: A Review for Statisticians.” Journal of the American Statistical Association, 112(518), 859–877. arXiv:1601.00670