Bayesian Inference

Full Bayesian reasoning: posterior distributions, conjugate priors, predictive distributions, and how Bayesian methods differ from frequentist approaches.

Probability & Statistics March 6, 2026 9 min read

Beyond Point Estimates

MLE and MAP give us single best-guess parameter values. But a point estimate throws away valuable information: how uncertain are we?

Full Bayesian inference keeps the entire posterior distribution over parameters. Instead of saying “the coin’s bias is 0.7,” Bayesian inference says “the bias is probably between 0.6 and 0.8, with peak probability around 0.7.” This distinction matters enormously when data is scarce or decisions are high-stakes.

The Bayesian Framework

Start with Bayes’ theorem applied to model parameters $\theta$ and observed data $\mathcal{D}$:

$$P(\theta \mid \mathcal{D}) = \frac{P(\mathcal{D} \mid \theta) \cdot P(\theta)}{P(\mathcal{D})}$$

Each term has a specific role:

| Term | Name | Role |
| --- | --- | --- |
| $P(\theta \mid \mathcal{D})$ | Posterior | Updated belief after seeing data |
| $P(\mathcal{D} \mid \theta)$ | Likelihood | How probable the data is for each $\theta$ |
| $P(\theta)$ | Prior | Belief before seeing data |
| $P(\mathcal{D})$ | Evidence (marginal likelihood) | Normalizing constant |

The evidence term ensures the posterior integrates to 1:

$$P(\mathcal{D}) = \int P(\mathcal{D} \mid \theta) \cdot P(\theta) \, d\theta$$

This integral is why Bayesian inference is computationally challenging — it’s often intractable in high dimensions.
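For a one-dimensional parameter, the evidence integral can be approximated on a grid. The sketch below is a minimal illustration for a coin-flip model, assuming a uniform prior and 7 heads in 10 flips; the function name `grid_posterior` and the grid size are illustrative choices, not part of any library.

```python
import math

def grid_posterior(k, n, prior, num_points=1001):
    """Approximate the posterior over a coin's bias on an evenly spaced grid.

    k, n  : observed heads and total flips
    prior : function mapping theta in [0, 1] to a normalized prior density
    Returns (grid, posterior_density, evidence).
    """
    step = 1.0 / (num_points - 1)
    grid = [i * step for i in range(num_points)]
    # Likelihood times prior at each grid point (the numerator of Bayes' theorem)
    unnorm = [math.comb(n, k) * t**k * (1 - t)**(n - k) * prior(t) for t in grid]
    # Trapezoidal rule approximates the evidence integral P(D)
    evidence = step * (sum(unnorm) - 0.5 * (unnorm[0] + unnorm[-1]))
    posterior = [u / evidence for u in unnorm]
    return grid, posterior, evidence

grid, posterior, evidence = grid_posterior(k=7, n=10, prior=lambda t: 1.0)
print(f"evidence ~ {evidence:.4f}")  # exact value under a uniform prior is 1/(n + 1)
```

The grid approach needs a number of points that grows exponentially with the number of parameters, which is one way to see why the integral becomes intractable in high dimensions.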

Conjugate Priors

A prior is conjugate to a likelihood if the posterior has the same distributional form as the prior. This gives closed-form updates, avoiding the need for numerical integration.

| Likelihood | Conjugate Prior | Posterior |
| --- | --- | --- |
| Bernoulli/Binomial | $\text{Beta}(\alpha, \beta)$ | $\text{Beta}(\alpha + k, \beta + n - k)$ |
| Poisson | $\text{Gamma}(\alpha, \beta)$ (rate $\beta$) | $\text{Gamma}(\alpha + \sum x_i, \beta + n)$ |
| Gaussian (known $\sigma$) | $\mathcal{N}(\mu_0, \sigma_0^2)$ on the mean | $\mathcal{N}(\mu_n, \sigma_n^2)$ |
| Gaussian (known $\mu$) | $\text{Inv-Gamma}(\alpha, \beta)$ on the variance | $\text{Inv-Gamma}(\alpha', \beta')$ with $\alpha' = \alpha + n/2$, $\beta' = \beta + \tfrac{1}{2}\sum (x_i - \mu)^2$ |
| Multinomial | $\text{Dirichlet}(\boldsymbol{\alpha})$ | $\text{Dirichlet}(\boldsymbol{\alpha} + \mathbf{c})$ |

Here $k$ is the number of successes in $n$ trials, $x_i$ are the observed counts, and $\mathbf{c}$ is the vector of per-category counts.

See the distributions article for details on each of these distribution families. The exponential family article explains why conjugate priors exist naturally for this family.
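Because conjugate updates only add observed counts to the prior parameters, they reduce to a few lines of arithmetic. The following sketch covers the Poisson and Multinomial rows of the table; the function names and the data are hypothetical, and the Gamma prior is assumed to use the rate parameterization.

```python
def gamma_poisson_update(alpha, beta, counts):
    """Gamma(alpha, beta) prior (beta is the rate) updated with Poisson counts."""
    return alpha + sum(counts), beta + len(counts)

def dirichlet_multinomial_update(alphas, category_counts):
    """Dirichlet(alphas) prior updated with per-category counts."""
    return [a + c for a, c in zip(alphas, category_counts)]

# Hypothetical data: events per hour observed over four hours
post_alpha, post_beta = gamma_poisson_update(alpha=2.0, beta=1.0, counts=[3, 5, 4, 2])
print(post_alpha, post_beta)    # 16.0 5.0
print(post_alpha / post_beta)   # posterior mean rate: 3.2

# Hypothetical data: counts for a three-category outcome
print(dirichlet_multinomial_update([1, 1, 1], [4, 0, 6]))  # [5, 1, 7]
```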

Example: Beta-Binomial

Suppose we want to estimate a coin’s bias $p$.

Prior: $p \sim \text{Beta}(2, 2)$, a mild belief that the coin is roughly fair.

Data: 7 heads in 10 flips ($k = 7$, $n = 10$).

Posterior:

$$p \mid \text{data} \sim \text{Beta}(2 + 7, 2 + 3) = \text{Beta}(9, 5)$$

The posterior mean:

$$\mathbb{E}[p \mid \text{data}] = \frac{9}{9 + 5} \approx 0.643$$

Compare this to:

  • MLE: $\hat{p} = 7/10 = 0.700$ (purely data-driven)
  • MAP: $\hat{p} = (7 + 2 - 1)/(10 + 2 + 2 - 2) = 8/12 \approx 0.667$ (mode of posterior)
  • Posterior mean: $9/14 \approx 0.643$ (mean of posterior, a common Bayesian point estimate)

The posterior mean is shrunk toward the prior mean (0.5) more than MAP, reflecting the full distribution’s shape rather than just its peak.
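The three estimates can be reproduced with a short helper. This is a minimal sketch for the coin example, assuming the Beta(2, 2) prior used here; `beta_binomial_summary` is an illustrative name rather than a library function.

```python
def beta_binomial_summary(alpha, beta, k, n):
    """Posterior Beta parameters and three point estimates for a coin's bias."""
    a_post, b_post = alpha + k, beta + (n - k)
    return {
        "posterior": (a_post, b_post),
        "mle": k / n,
        # Mode of Beta(a, b); this formula is valid only when a > 1 and b > 1
        "map": (a_post - 1) / (a_post + b_post - 2),
        "mean": a_post / (a_post + b_post),
    }

summary = beta_binomial_summary(alpha=2, beta=2, k=7, n=10)
for name, value in summary.items():
    print(name, value)
```

The ordering mean < MAP < MLE holds in this example because the prior mean (0.5) lies below the observed frequency (0.7).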

Example: Gaussian-Gaussian

For Gaussian data with known variance $\sigma^2$, and a Gaussian prior on the mean:

Prior: $\mu \sim \mathcal{N}(\mu_0, \sigma_0^2)$

Posterior after observing $n$ data points with sample mean $\bar{x}$:

$$\mu \mid \text{data} \sim \mathcal{N}(\mu_n, \sigma_n^2)$$

where:

$$\mu_n = \frac{\sigma^2 \mu_0 + n \sigma_0^2 \bar{x}}{\sigma^2 + n \sigma_0^2} \qquad \sigma_n^2 = \frac{\sigma^2 \sigma_0^2}{\sigma^2 + n \sigma_0^2}$$

The posterior mean is a precision-weighted average of the prior mean and the sample mean. The more data you have (larger $n$), the closer $\mu_n$ gets to $\bar{x}$.

Key insight: The posterior precision (inverse variance) is the sum of prior precision and data precision: $1/\sigma_n^2 = 1/\sigma_0^2 + n/\sigma^2$. Information from independent sources adds.
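The update formulas and the precision identity can be checked numerically. The sketch below assumes a known observation variance and uses made-up data; `gaussian_posterior` is an illustrative helper name.

```python
def gaussian_posterior(mu0, sigma0_sq, sigma_sq, data):
    """Posterior mean and variance of a Gaussian mean with known variance sigma_sq."""
    n = len(data)
    xbar = sum(data) / n
    denom = sigma_sq + n * sigma0_sq
    mu_n = (sigma_sq * mu0 + n * sigma0_sq * xbar) / denom
    sigma_n_sq = sigma_sq * sigma0_sq / denom
    return mu_n, sigma_n_sq

# Hypothetical measurements with known variance 1.0 and a wide prior centered at 0
data = [4.8, 5.3, 5.1, 4.9, 5.4]
mu_n, sigma_n_sq = gaussian_posterior(mu0=0.0, sigma0_sq=4.0, sigma_sq=1.0, data=data)

print(mu_n, sigma_n_sq)
# Posterior precision equals prior precision plus n times the data precision
print(1 / sigma_n_sq, 1 / 4.0 + len(data) / 1.0)
```

With five observations the posterior mean already sits close to the sample mean of 5.1, since the prior variance is large relative to the variance of the sample mean.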

Posterior Predictive Distribution

Once we have the posterior over parameters, we can make predictions that integrate over our uncertainty:

$$P(x_{\text{new}} \mid \mathcal{D}) = \int P(x_{\text{new}} \mid \theta) \cdot P(\theta \mid \mathcal{D}) \, d\theta$$

This is the posterior predictive distribution. Instead of plugging in a single $\hat{\theta}$, we average predictions over all plausible parameter values, weighted by their posterior probability.

Why This Matters

With a point estimate:

$$P(x_{\text{new}} \mid \hat{\theta}) = \text{single prediction}$$

With full Bayesian:

$$P(x_{\text{new}} \mid \mathcal{D}) = \text{prediction that accounts for parameter uncertainty}$$

The Bayesian prediction is generally better calibrated — it’s wider when we’re uncertain and narrower when we’re confident.

Example: After 3 coin flips (all heads), MLE predicts the next flip is heads with probability 1.0. With a $\text{Beta}(1, 1)$ prior, the posterior is $\text{Beta}(4, 1)$, and the Bayesian predictive probability of heads is its mean: $P(\text{heads}) = 4/5 = 0.8$, which is more reasonable.
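For a Beta prior on a coin's bias, the predictive integral has a closed form, and it can also be approximated by averaging over posterior draws. The sketch below shows both routes for the three-heads example; the function names, sample count, and seed are illustrative assumptions.

```python
import random

def predictive_heads(alpha, beta, k, n):
    """Closed-form P(next flip is heads | data) under a Beta(alpha, beta) prior."""
    return (alpha + k) / (alpha + beta + n)

def predictive_heads_mc(alpha, beta, k, n, num_samples=100_000, seed=0):
    """Monte Carlo version: average P(heads | p) = p over posterior draws of p."""
    rng = random.Random(seed)
    draws = (rng.betavariate(alpha + k, beta + n - k) for _ in range(num_samples))
    return sum(draws) / num_samples

exact = predictive_heads(1, 1, k=3, n=3)
approx = predictive_heads_mc(1, 1, k=3, n=3)
print(exact)   # 0.8
print(approx)  # close to 0.8, up to Monte Carlo error
```

The Monte Carlo route is the one that generalizes: when no closed form exists, predictions are averaged over samples from whatever approximate posterior is available.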

Bayesian Model Comparison

The evidence $P(\mathcal{D})$, often dismissed as “just a normalizing constant,” is actually the key to Bayesian model comparison.

The Bayes factor compares two models:

$$\text{BF}_{12} = \frac{P(\mathcal{D} \mid M_1)}{P(\mathcal{D} \mid M_2)} = \frac{\int P(\mathcal{D} \mid \theta_1, M_1) P(\theta_1 \mid M_1) \, d\theta_1}{\int P(\mathcal{D} \mid \theta_2, M_2) P(\theta_2 \mid M_2) \, d\theta_2}$$

| Bayes Factor | Evidence |
| --- | --- |
| 1 to 3 | Barely worth mentioning |
| 3 to 10 | Moderate |
| 10 to 30 | Strong |
| 30 to 100 | Very strong |
| > 100 | Decisive |

The Bayes factor naturally penalizes model complexity — a model with more parameters must spread its prior over a larger space, reducing the marginal likelihood unless the data strongly supports it. This is called the Bayesian Occam’s razor.
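A small worked case makes the Occam effect concrete. The sketch below compares two hypothetical models for coin data: $M_1$ fixes the bias at 0.5 (no free parameters), while $M_2$ places a Beta(1, 1) prior on the bias. Both models and the helper names are assumptions chosen for illustration.

```python
import math

def log_beta_fn(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def evidence_fixed(k, n, p):
    """Marginal likelihood when the model fixes the bias at p."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def evidence_beta(k, n, alpha, beta):
    """Marginal likelihood with p ~ Beta(alpha, beta), integrated analytically."""
    log_ml = log_beta_fn(alpha + k, beta + n - k) - log_beta_fn(alpha, beta)
    return math.comb(n, k) * math.exp(log_ml)

# Bayes factor BF_12 = P(D | M1) / P(D | M2) for different head counts in 10 flips
for k in (5, 7, 10):
    bf = evidence_fixed(k, 10, 0.5) / evidence_beta(k, 10, 1, 1)
    print(k, round(bf, 3))
```

Under the uniform prior, $M_2$ assigns marginal probability $1/11$ to every possible head count, so it cannot concentrate on balanced outcomes the way $M_1$ does. With 5 or 7 heads the simpler model is mildly favored; only an extreme result such as 10 heads shifts the evidence strongly toward $M_2$.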

Bayesian vs Frequentist

These are two fundamentally different philosophies of probability:

| Aspect | Frequentist | Bayesian |
| --- | --- | --- |
| Probability | Long-run frequency of events | Degree of belief |
| Parameters | Fixed but unknown | Random variables |
| Inference | Point estimates + confidence intervals | Posterior distributions |
| Prior knowledge | Not encoded as a prior | Explicitly encoded |
| Uncertainty | Via sampling distributions | Via posterior width |
| Computation | Often analytical or optimization-based | Often requires MCMC or other approximations |

When to Use Each

Frequentist (MLE, hypothesis tests):

  • Large datasets where the prior doesn’t matter
  • When stakeholders expect p-values and confidence intervals
  • When computational resources are limited

Bayesian:

  • Small datasets where prior knowledge helps
  • When you need uncertainty quantification
  • Sequential decision-making (updating beliefs as data arrives)
  • When you want to compare models naturally (Bayes factors)

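In practice, the distinction blurs. Regularized MLE corresponds to MAP estimation: an L2 penalty acts as a Gaussian prior, and an L1 penalty acts as a Laplace prior. Neural network ensembles can be viewed as rough approximations to Bayesian posteriors. The best practitioners use both frameworks where appropriate.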

Hierarchical Bayesian Models

When you have groups of related parameters, hierarchical (multilevel) models share information across groups:

$$\begin{aligned} \mu_j &\sim \mathcal{N}(\mu_0, \tau^2) \quad \text{(group means drawn from population)} \\ X_{ij} &\sim \mathcal{N}(\mu_j, \sigma^2) \quad \text{(observations within groups)} \end{aligned}$$

The hyperparameters $\mu_0$ and $\tau^2$ are also given priors and inferred from data. This creates partial pooling: small groups borrow strength from larger groups.
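Partial pooling is easiest to see with the hyperparameters held fixed, in which case each group mean has the same closed-form posterior as in the Gaussian-Gaussian example. The sketch below makes that simplifying assumption (a full hierarchical model would infer $\mu_0$ and $\tau^2$ as well); the group data and the name `partial_pooling` are hypothetical.

```python
def partial_pooling(groups, mu0, tau_sq, sigma_sq):
    """Posterior mean of each group mean mu_j, with mu0, tau_sq, sigma_sq held fixed."""
    estimates = {}
    for name, obs in groups.items():
        n = len(obs)
        xbar = sum(obs) / n
        # Weight on the group's own data grows with its sample size
        weight = n * tau_sq / (n * tau_sq + sigma_sq)
        estimates[name] = weight * xbar + (1 - weight) * mu0
    return estimates

# Two hypothetical groups with the same sample mean but very different sizes
groups = {"small": [9.0], "large": [9.0] * 50}
estimates = partial_pooling(groups, mu0=5.0, tau_sq=1.0, sigma_sq=4.0)
print(estimates)
```

Both groups have a sample mean of 9.0, but the single-observation group is pulled most of the way toward the population mean of 5.0, while the 50-observation group stays close to its own data.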

In ML: Hierarchical models appear in:

  • Transfer learning — sharing parameters across related tasks
  • Recommender systems — user preferences as random effects
  • Meta-learning — learning from distributions over tasks
  • Mixed effects models — clinical trials with multiple sites

Approximate Bayesian Inference

Exact posteriors are rarely available outside conjugate models. Modern Bayesian methods use approximations:

Variational Inference (VI)

Approximate the posterior $P(\theta \mid \mathcal{D})$ with a simpler distribution $q(\theta)$ by minimizing the KL divergence:

$$q^*(\theta) = \arg\min_{q \in \mathcal{Q}} \text{KL}(q(\theta) \,\|\, P(\theta \mid \mathcal{D}))$$

This converts inference into an optimization problem — something we know how to do efficiently.
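The KL divergence above involves the unknown posterior, so it cannot be minimized directly. A standard identity, written here in the notation of this article, shows how the problem is usually recast in terms of the evidence lower bound (ELBO):

```latex
\log P(\mathcal{D})
  = \underbrace{\mathbb{E}_{q(\theta)}\!\left[\log P(\mathcal{D} \mid \theta)
      + \log P(\theta) - \log q(\theta)\right]}_{\text{ELBO}(q)}
  + \text{KL}\!\left(q(\theta) \,\|\, P(\theta \mid \mathcal{D})\right)
```

Since $\log P(\mathcal{D})$ does not depend on $q$, maximizing the ELBO is equivalent to minimizing the KL divergence, and the ELBO only requires the likelihood and the prior, not the intractable evidence.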

VI is used in:

  • Variational Autoencoders (VAEs) — learning latent representations
  • Bayesian Neural Networks — uncertainty-aware deep learning
  • Topic models (LDA) — document-topic distributions

Markov Chain Monte Carlo (MCMC)

Generate samples from the posterior by constructing a Markov chain whose stationary distribution is $P(\theta \mid \mathcal{D})$. We cover MCMC in depth in the sampling methods article.
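As a rough preview, a random-walk Metropolis sampler fits in a few lines. The sketch below targets the Beta(9, 5) posterior from the coin example so the result can be checked against the exact mean; the step size, chain length, burn-in, and function names are illustrative assumptions, not tuned recommendations.

```python
import math
import random

def log_unnormalized_posterior(p, k=7, n=10, alpha=2.0, beta=2.0):
    """log(likelihood * prior) up to a constant; the evidence is never needed."""
    if not 0.0 < p < 1.0:
        return -math.inf
    return (alpha + k - 1) * math.log(p) + (beta + n - k - 1) * math.log(1 - p)

def metropolis(log_target, num_samples, start=0.5, step=0.2, seed=0):
    """Random-walk Metropolis with a symmetric Gaussian proposal."""
    rng = random.Random(seed)
    current, current_lp = start, log_target(start)
    samples = []
    for _ in range(num_samples):
        proposal = current + rng.gauss(0.0, step)
        proposal_lp = log_target(proposal)
        # Accept with probability min(1, target(proposal) / target(current))
        if rng.random() < math.exp(min(0.0, proposal_lp - current_lp)):
            current, current_lp = proposal, proposal_lp
        samples.append(current)
    return samples

samples = metropolis(log_unnormalized_posterior, num_samples=20_000)
kept = samples[2_000:]  # discard burn-in
posterior_mean = sum(kept) / len(kept)
print(posterior_mean)   # should be close to the exact posterior mean 9/14
```

Only ratios of the target density appear in the acceptance step, which is why the sampler works with the unnormalized posterior.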

Laplace Approximation

Approximate the posterior as a Gaussian centered at the MAP estimate:

$$P(\theta \mid \mathcal{D}) \approx \mathcal{N}\left(\theta_{\text{MAP}}, \left[-\nabla^2 \log P(\theta \mid \mathcal{D})\big|_{\theta_{\text{MAP}}}\right]^{-1}\right)$$

The covariance is the inverse Hessian of the negative log-posterior at the MAP point. This is fast but only accurate when the posterior is unimodal and roughly Gaussian.
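For a one-parameter posterior the Hessian is a single second derivative, so the approximation can be worked out by hand. The sketch below applies it to the Beta(9, 5) posterior from the coin example and compares the result with the exact variance; `laplace_beta` is an illustrative helper name.

```python
import math

def laplace_beta(a, b):
    """Laplace approximation to a Beta(a, b) density; requires a > 1 and b > 1."""
    mode = (a - 1) / (a + b - 2)
    # Negative second derivative of (a-1)*log(p) + (b-1)*log(1-p) at the mode
    neg_hessian = (a - 1) / mode**2 + (b - 1) / (1 - mode)**2
    return mode, 1.0 / neg_hessian  # Gaussian mean and variance

mode, laplace_var = laplace_beta(9, 5)
exact_var = 9 * 5 / ((9 + 5) ** 2 * (9 + 5 + 1))  # variance of Beta(9, 5)

print(mode, math.sqrt(laplace_var))   # Gaussian centered at the MAP estimate
print(9 / 14, math.sqrt(exact_var))   # exact posterior mean and standard deviation
```

The Gaussian is centered at the mode rather than the mean and slightly overstates the spread here, because the Beta(9, 5) posterior is skewed rather than symmetric.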

Bayesian Deep Learning

Applying Bayesian principles to neural networks:

MC Dropout

Gal and Ghahramani (2016) showed that dropout at test time approximates Bayesian inference. Running the network multiple times with different dropout masks gives a distribution of predictions:

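import torch

# MC Dropout: approximate Bayesian uncertainty
# `model` and `x_test` are assumed to be defined elsewhere.
model.eval()  # keep BatchNorm and other layers in inference mode
for module in model.modules():
    if isinstance(module, torch.nn.Dropout):
        module.train()  # re-enable dropout only (extend for Dropout2d etc.)

with torch.no_grad():  # gradients are not needed for prediction
    # 100 stochastic forward passes, each with a fresh dropout mask
    predictions = torch.stack([model(x_test) for _ in range(100)])

mean_pred = predictions.mean(dim=0)    # predictive mean
uncertainty = predictions.std(dim=0)   # spread across passes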

Deep Ensembles

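Training multiple networks with different random initializations and averaging their predictions can be interpreted as a rough approximation to the Bayesian posterior predictive, with each network acting like one sample of plausible parameters. Lakshminarayanan et al. (2017) showed this gives well-calibrated uncertainty estimates.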

Summary

  • Bayesian inference maintains full posterior distributions, not just point estimates
  • Conjugate priors give closed-form posteriors — Beta-Binomial and Gaussian-Gaussian are the key examples
  • The posterior predictive integrates over parameter uncertainty for better-calibrated predictions
  • Bayes factors compare models while naturally penalizing complexity
  • Hierarchical models share information across groups via partial pooling
  • When exact inference is intractable, use variational inference, MCMC, or Laplace approximation
  • In deep learning, MC Dropout and ensembles approximate Bayesian uncertainty
  • For computational methods that sample from posteriors, see Sampling Methods

References

  • Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). Bayesian Data Analysis (3rd ed.). Chapman & Hall/CRC.
  • Murphy, K. P. (2022). Probabilistic Machine Learning: An Introduction. MIT Press. Chapters 4-5.
  • Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. Chapters 2-3.
  • Gal, Y., & Ghahramani, Z. (2016). “Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning.” ICML 2016. arXiv:1506.02142
  • Lakshminarayanan, B., Pritzel, A., & Blundell, C. (2017). “Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles.” NeurIPS 2017. arXiv:1612.01474
  • Blei, D. M., Kucukelbir, A., & McAuliffe, J. D. (2017). “Variational Inference: A Review for Statisticians.” Journal of the American Statistical Association, 112(518), 859-877. arXiv:1601.00670
