Beyond Point Estimates
MLE and MAP give us single best-guess parameter values. But a point estimate throws away valuable information: how uncertain are we?
Full Bayesian inference keeps the entire posterior distribution over parameters. Instead of saying “the coin’s bias is 0.7,” Bayesian inference says “the bias is probably between 0.6 and 0.8, with peak probability around 0.7.” This distinction matters enormously when data is scarce or decisions are high-stakes.
The Bayesian Framework
Start with Bayes’ theorem applied to model parameters $\theta$ and observed data $D$:

$$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)}$$
Each term has a specific role:
| Term | Name | Role |
|---|---|---|
| $p(\theta \mid D)$ | Posterior | Updated belief after seeing data |
| $p(D \mid \theta)$ | Likelihood | How probable the data is for each $\theta$ |
| $p(\theta)$ | Prior | Belief before seeing data |
| $p(D)$ | Evidence (marginal likelihood) | Normalizing constant |
The evidence term $p(D)$ ensures the posterior integrates to 1:

$$p(D) = \int p(D \mid \theta)\, p(\theta)\, d\theta$$
This integral is why Bayesian inference is computationally challenging — it’s often intractable in high dimensions.
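In one dimension the integral is easy to approximate on a grid, which makes the roles of the four terms concrete. The following is a minimal sketch, assuming a coin with 7 heads in 10 flips and a uniform prior on the bias; the function name `grid_posterior` and the grid size are illustrative choices, not part of any library.

```python
from math import comb

def grid_posterior(k, n, num_points=1001):
    """Approximate the posterior over a coin's bias on an evenly spaced grid.

    Returns (grid, posterior density values, evidence).
    Assumes a uniform prior density on [0, 1].
    """
    step = 1.0 / (num_points - 1)
    grid = [i * step for i in range(num_points)]
    prior = [1.0 for _ in grid]  # uniform prior density
    likelihood = [comb(n, k) * t**k * (1 - t) ** (n - k) for t in grid]
    unnormalized = [l * p for l, p in zip(likelihood, prior)]
    # Evidence p(D): integral of likelihood * prior, via the trapezoidal rule.
    evidence = step * (sum(unnormalized) - 0.5 * (unnormalized[0] + unnormalized[-1]))
    posterior = [u / evidence for u in unnormalized]
    return grid, posterior, evidence

grid, posterior, evidence = grid_posterior(k=7, n=10)
mode = grid[posterior.index(max(posterior))]
print(f"evidence ~ {evidence:.4f}, posterior mode ~ {mode:.2f}")
```

A grid with $m$ points per dimension needs $m^d$ evaluations for $d$ parameters, which is one way to see why this approach breaks down in high dimensions.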
Conjugate Priors
A prior is conjugate to a likelihood if the posterior has the same distributional form as the prior. This gives closed-form updates, avoiding the need for numerical integration.
| Likelihood | Conjugate Prior | Posterior |
|---|---|---|
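| Bernoulli/Binomial | $\text{Beta}(\alpha, \beta)$ | $\text{Beta}(\alpha + k,\ \beta + n - k)$ |
| Poisson | $\text{Gamma}(\alpha, \beta)$ | $\text{Gamma}(\alpha + \sum_i x_i,\ \beta + n)$ |
| Gaussian (known $\sigma^2$) | $\mathcal{N}(\mu_0, \sigma_0^2)$ on $\mu$ | $\mathcal{N}(\mu_n, \sigma_n^2)$ |
| Gaussian (known $\mu$) | $\text{Inv-Gamma}(\alpha, \beta)$ on $\sigma^2$ | $\text{Inv-Gamma}(\alpha + n/2,\ \beta + \tfrac{1}{2}\sum_i (x_i - \mu)^2)$ |
| Multinomial | $\text{Dirichlet}(\boldsymbol{\alpha})$ | $\text{Dirichlet}(\boldsymbol{\alpha} + \mathbf{c})$ |

Here $n$ is the number of observations, $k$ the number of successes, and $\mathbf{c}$ the vector of category counts; the Gamma distribution uses the rate parameterization.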
See the distributions article for details on each of these distribution families. The exponential family article explains why conjugate priors exist naturally for this family.
Example: Beta-Binomial
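Suppose we want to estimate a coin’s bias $\theta$.
Prior: $\theta \sim \text{Beta}(2, 2)$, a mild belief that the coin is roughly fair.
Data: 7 heads in 10 flips ($k = 7$, $n = 10$).
Posterior: $\theta \mid D \sim \text{Beta}(2 + 7,\ 2 + 3) = \text{Beta}(9, 5)$
The posterior mean: $\mathbb{E}[\theta \mid D] = \frac{9}{9 + 5} = \frac{9}{14} \approx 0.643$
Compare this to:
- MLE: $\hat{\theta}_{\text{MLE}} = 7/10 = 0.700$ (purely data-driven)
- MAP: $\hat{\theta}_{\text{MAP}} = \frac{9 - 1}{9 + 5 - 2} = \frac{8}{12} \approx 0.667$ (mode of posterior)
- Posterior mean: $9/14 \approx 0.643$ (mean of posterior, a common Bayesian point estimate)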
The posterior mean is shrunk toward the prior mean (0.5) more than MAP, reflecting the full distribution’s shape rather than just its peak.
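The conjugate update is just addition of counts, so the three estimates can be checked in a few lines. This sketch assumes a Beta(2, 2) prior as one concrete choice of "mild belief that the coin is roughly fair"; the helper names are hypothetical.

```python
def beta_binomial_update(alpha, beta, heads, flips):
    """Conjugate update: Beta(alpha, beta) prior plus binomial data gives a Beta posterior."""
    return alpha + heads, beta + (flips - heads)

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

def beta_mode(a, b):
    """Mode of a Beta(a, b) distribution; valid when a > 1 and b > 1."""
    return (a - 1) / (a + b - 2)

heads, flips = 7, 10
a_post, b_post = beta_binomial_update(2, 2, heads, flips)  # assumed Beta(2, 2) prior

mle = heads / flips
map_estimate = beta_mode(a_post, b_post)
posterior_mean = beta_mean(a_post, b_post)

print(f"posterior: Beta({a_post}, {b_post})")
print(f"MLE={mle:.3f}  MAP={map_estimate:.3f}  mean={posterior_mean:.3f}")
# posterior: Beta(9, 5)
# MLE=0.700  MAP=0.667  mean=0.643
```

Trying a stronger prior such as Beta(20, 20) pulls both the MAP and the posterior mean much closer to 0.5, which is the shrinkage effect described above.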
Example: Gaussian-Gaussian
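For Gaussian data with known variance $\sigma^2$, and a Gaussian prior on the mean:
Prior: $\mu \sim \mathcal{N}(\mu_0, \sigma_0^2)$
Posterior after observing $n$ data points with sample mean $\bar{x}$: $\mu \mid D \sim \mathcal{N}(\mu_n, \sigma_n^2)$
where:

$$\mu_n = \frac{\mu_0 / \sigma_0^2 + n\bar{x} / \sigma^2}{1 / \sigma_0^2 + n / \sigma^2}, \qquad \sigma_n^2 = \left(\frac{1}{\sigma_0^2} + \frac{n}{\sigma^2}\right)^{-1}$$

The posterior mean is a precision-weighted average of the prior mean and the sample mean. The more data you have (larger $n$), the closer $\mu_n$ gets to $\bar{x}$.
Key insight: The posterior precision (inverse variance) is the sum of prior precision and data precision: $1/\sigma_n^2 = 1/\sigma_0^2 + n/\sigma^2$. Information from independent sources adds.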
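The precision-weighted update can be sketched directly. The function name, the prior settings, and the four data points below are made up for illustration; `sigma_sq` is the known observation variance and `sigma0_sq` the prior variance on the mean.

```python
def gaussian_posterior(mu0, sigma0_sq, sigma_sq, data):
    """Posterior over a Gaussian mean with known variance and a Gaussian prior.

    Returns (posterior mean, posterior variance).
    """
    n = len(data)
    x_bar = sum(data) / n
    prior_precision = 1.0 / sigma0_sq
    data_precision = n / sigma_sq
    post_precision = prior_precision + data_precision  # precisions add
    mu_n = (prior_precision * mu0 + data_precision * x_bar) / post_precision
    return mu_n, 1.0 / post_precision

# Hypothetical setup: prior N(0, 1), observation variance 4, four observations.
mu_n, sigma_n_sq = gaussian_posterior(
    mu0=0.0, sigma0_sq=1.0, sigma_sq=4.0, data=[2.1, 1.9, 2.4, 1.6]
)
print(f"posterior mean={mu_n:.3f}, posterior variance={sigma_n_sq:.3f}")
```

In this setup the prior precision and the data precision are both 1, so the posterior mean lands halfway between the prior mean (0) and the sample mean (2).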
Posterior Predictive Distribution
Once we have the posterior over parameters, we can make predictions that integrate over our uncertainty:

$$p(x_{\text{new}} \mid D) = \int p(x_{\text{new}} \mid \theta)\, p(\theta \mid D)\, d\theta$$

This is the posterior predictive distribution. Instead of plugging in a single $\hat{\theta}$, we average predictions over all plausible parameter values, weighted by their posterior probability.
Why This Matters
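With a point estimate: $p(x_{\text{new}} \mid D) \approx p(x_{\text{new}} \mid \hat{\theta})$
With full Bayesian: $p(x_{\text{new}} \mid D) = \int p(x_{\text{new}} \mid \theta)\, p(\theta \mid D)\, d\theta$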
The Bayesian prediction is generally better calibrated — it’s wider when we’re uncertain and narrower when we’re confident.
Example: After 3 coin flips (all heads), MLE predicts the next flip is heads with probability 1.0. The Bayesian predictive (with Beta(1,1) prior) gives $P(\text{heads} \mid D) = \frac{3 + 1}{3 + 2} = 0.8$, which is more reasonable.
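The three-heads example can be reproduced both in closed form and by literally averaging over posterior draws. This is a small sketch using only the standard library; the function names are illustrative, and the Monte Carlo version is included only to show what "integrating over uncertainty" means operationally.

```python
import random

def plug_in_predictive(heads, flips):
    """P(next flip = heads) using the MLE as a single point estimate."""
    return heads / flips

def beta_posterior_predictive(heads, flips, alpha=1.0, beta=1.0):
    """Closed form: the posterior mean of theta under a Beta(alpha, beta) prior."""
    return (alpha + heads) / (alpha + beta + flips)

def monte_carlo_predictive(heads, flips, alpha=1.0, beta=1.0, num_draws=20000, seed=0):
    """Average P(heads | theta) = theta over draws of theta from the posterior."""
    rng = random.Random(seed)
    draws = [rng.betavariate(alpha + heads, beta + flips - heads) for _ in range(num_draws)]
    return sum(draws) / num_draws

print(plug_in_predictive(3, 3))         # 1.0
print(beta_posterior_predictive(3, 3))  # 0.8
print(round(monte_carlo_predictive(3, 3), 2))  # close to the closed-form value
```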
Bayesian Model Comparison
The evidence — often dismissed as “just a normalizing constant” — is actually the key to Bayesian model comparison.
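The Bayes factor compares two models $M_1$ and $M_2$ by the ratio of their evidences:

$$\text{BF}_{12} = \frac{p(D \mid M_1)}{p(D \mid M_2)}$$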
| Bayes factor $\text{BF}_{12}$ | Strength of evidence for $M_1$ |
|---|---|
| 1 to 3 | Barely worth mentioning |
| 3 to 10 | Moderate |
| 10 to 30 | Strong |
| 30 to 100 | Very strong |
| > 100 | Decisive |
The Bayes factor naturally penalizes model complexity — a model with more parameters must spread its prior over a larger space, reducing the marginal likelihood unless the data strongly supports it. This is called the Bayesian Occam’s razor.
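A coin example makes the Occam effect visible. The sketch below compares a hypothetical "flexible" model (bias unknown, uniform Beta(1, 1) prior) against a "fair coin" model (bias fixed at 0.5), using the closed-form Beta-Binomial marginal likelihood. The model choices and function names are assumptions made for illustration.

```python
from math import comb, exp, lgamma, log

def log_beta_fn(a, b):
    """Log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_evidence_beta_binomial(k, n, a, b):
    """log p(D | M) for k heads in n flips when theta ~ Beta(a, b)."""
    return log(comb(n, k)) + log_beta_fn(a + k, b + n - k) - log_beta_fn(a, b)

def log_evidence_fixed(k, n, theta):
    """log p(D | M) when theta is fixed (a model with no free parameters)."""
    return log(comb(n, k)) + k * log(theta) + (n - k) * log(1 - theta)

def bayes_factor_flexible_vs_fair(k, n):
    """Bayes factor of the flexible model over the fair-coin model."""
    return exp(log_evidence_beta_binomial(k, n, 1, 1) - log_evidence_fixed(k, n, 0.5))

print(f"7/10 heads: BF = {bayes_factor_flexible_vs_fair(7, 10):.3f}")
print(f"9/10 heads: BF = {bayes_factor_flexible_vs_fair(9, 10):.3f}")
```

With 7 heads in 10 flips the Bayes factor is below 1 (about 0.78), so the simpler fair-coin model is mildly preferred even though the flexible model can fit the data better; with 9 heads in 10 the flexible model wins.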
Bayesian vs Frequentist
These are two fundamentally different philosophies of probability:
| Aspect | Frequentist | Bayesian |
|---|---|---|
| Probability | Long-run frequency of events | Degree of belief |
| Parameters | Fixed but unknown | Random variables |
| Inference | Point estimates + confidence intervals | Posterior distributions |
| Prior knowledge | Not used | Explicitly encoded |
| Uncertainty | Via sampling distributions | Via posterior width |
| Computation | Usually analytical | Often requires MCMC |
When to Use Each
Frequentist (MLE, hypothesis tests):
- Large datasets where the prior doesn’t matter
- When stakeholders expect p-values and confidence intervals
- When computational resources are limited
Bayesian:
- Small datasets where prior knowledge helps
- When you need uncertainty quantification
- Sequential decision-making (updating beliefs as data arrives)
- When you want to compare models naturally (Bayes factors)
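In practice, the distinction blurs. Regularized MLE is equivalent to MAP estimation (L2 regularization, for example, corresponds to a Gaussian prior on the weights). Neural network ensembles can be viewed as rough approximations to Bayesian posteriors. The best practitioners use both frameworks where appropriate.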
Hierarchical Bayesian Models
When you have groups of related parameters, hierarchical (multilevel) models share information across groups:

$$\theta_j \sim \mathcal{N}(\mu, \tau^2), \qquad x_{ij} \sim \mathcal{N}(\theta_j, \sigma^2)$$

The hyperparameters $\mu$ and $\tau$ are also given priors and inferred from data. This creates partial pooling: small groups borrow strength from larger groups.
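Partial pooling is easiest to see in a Gaussian hierarchy with the hyperparameters held fixed. The sketch below makes that simplification explicitly: it treats the population mean `mu`, the between-group variance `tau_sq`, and the within-group variance `sigma_sq` as known, whereas a full hierarchical model would infer them. The group names and values are made up.

```python
def partial_pooling(group_data, mu, tau_sq, sigma_sq):
    """Posterior mean of each group's parameter under a Gaussian hierarchy.

    Each estimate is a precision-weighted blend of the group's own mean and
    the population mean mu. Hyperparameters are assumed known here.
    """
    estimates = {}
    for name, xs in group_data.items():
        n = len(xs)
        x_bar = sum(xs) / n
        data_precision = n / sigma_sq
        weight = data_precision / (data_precision + 1.0 / tau_sq)  # weight on own data
        estimates[name] = weight * x_bar + (1.0 - weight) * mu
    return estimates

# Hypothetical groups with the same sample mean (9.0) but very different sizes.
groups = {"small": [8.0, 10.0], "large": [9.0] * 50}
estimates = partial_pooling(groups, mu=5.0, tau_sq=1.0, sigma_sq=4.0)
for name, value in estimates.items():
    print(f"{name}: {value:.3f}")
```

Both groups have the same sample mean, but the small group's estimate is pulled much further toward the population mean because its own data carry less precision.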
In ML: Hierarchical models appear in:
- Transfer learning — sharing parameters across related tasks
- Recommender systems — user preferences as random effects
- Meta-learning — learning from distributions over tasks
- Mixed effects models — clinical trials with multiple sites
Approximate Bayesian Inference
Exact posteriors are rarely available outside conjugate models. Modern Bayesian methods use approximations:
Variational Inference (VI)
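Approximate the posterior $p(\theta \mid D)$ with a simpler distribution $q(\theta)$ from a tractable family $\mathcal{Q}$ by minimizing the KL divergence:

$$q^*(\theta) = \arg\min_{q \in \mathcal{Q}} \text{KL}\big(q(\theta)\,\|\,p(\theta \mid D)\big)$$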
This converts inference into an optimization problem — something we know how to do efficiently.
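To show the optimization view in miniature, the sketch below fits a Gaussian $q$ to the posterior of a Gaussian mean using stochastic gradient ascent on the ELBO (maximizing the ELBO is equivalent to minimizing the KL divergence above). It uses the reparameterization trick with plain Python; the data, prior, step size, and function names are all illustrative assumptions, and this toy problem is chosen because the exact posterior is known to be $\mathcal{N}(1.0, 0.5)$ for comparison.

```python
import math
import random

def grad_log_joint(mu, data, mu0, sigma0_sq, sigma_sq):
    """Derivative with respect to mu of log p(D | mu) + log p(mu)."""
    return sum(x - mu for x in data) / sigma_sq - (mu - mu0) / sigma0_sq

def fit_gaussian_vi(data, mu0, sigma0_sq, sigma_sq,
                    steps=2000, lr=0.01, num_samples=10, seed=0):
    """Fit q(mu) = N(m, s^2) by stochastic gradient ascent on the ELBO."""
    rng = random.Random(seed)
    m, log_s = 0.0, 0.0  # optimize log(s) so that s stays positive
    for _ in range(steps):
        s = math.exp(log_s)
        grad_m, grad_log_s = 0.0, 0.0
        for _ in range(num_samples):
            eps = rng.gauss(0.0, 1.0)
            g = grad_log_joint(m + s * eps, data, mu0, sigma0_sq, sigma_sq)
            grad_m += g / num_samples                # reparameterized gradient for m
            grad_log_s += g * eps * s / num_samples  # reparameterized gradient for log(s)
        grad_log_s += 1.0  # gradient of the entropy term log(s) + const
        m += lr * grad_m
        log_s += lr * grad_log_s
    return m, math.exp(log_s) ** 2

vi_mean, vi_var = fit_gaussian_vi(
    data=[2.1, 1.9, 2.4, 1.6], mu0=0.0, sigma0_sq=1.0, sigma_sq=4.0
)
print(f"q mean ~ {vi_mean:.2f}, q variance ~ {vi_var:.2f}")
```

Because the true posterior is itself Gaussian here, the fitted $q$ should land close to it; for non-Gaussian posteriors the same procedure returns the closest Gaussian in the KL sense rather than the posterior itself.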
VI is used in:
- Variational Autoencoders (VAEs) — learning latent representations
- Bayesian Neural Networks — uncertainty-aware deep learning
- Topic models (LDA) — document-topic distributions
Markov Chain Monte Carlo (MCMC)
Generate samples from the posterior by constructing a Markov chain whose stationary distribution is $p(\theta \mid D)$. We cover MCMC in depth in the sampling methods article.
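As a small preview, here is a random-walk Metropolis sampler for a coin's bias, one of the simplest MCMC algorithms. It needs only the unnormalized posterior, so the evidence never has to be computed. The data (7 heads in 10 flips), the uniform prior, the proposal width, and the function names are assumptions for illustration.

```python
import math
import random

def log_unnormalized_posterior(theta, heads=7, flips=10, alpha=1.0, beta=1.0):
    """Log of likelihood * prior for a coin's bias, up to an additive constant."""
    if not 0.0 < theta < 1.0:
        return -math.inf  # zero density outside (0, 1)
    return ((alpha + heads - 1) * math.log(theta)
            + (beta + flips - heads - 1) * math.log(1.0 - theta))

def metropolis(log_target, start, num_samples, step=0.3, seed=0):
    """Random-walk Metropolis with a symmetric Gaussian proposal."""
    rng = random.Random(seed)
    current, current_lp = start, log_target(start)
    samples = []
    for _ in range(num_samples):
        proposal = current + rng.gauss(0.0, step)
        proposal_lp = log_target(proposal)
        # Accept with probability min(1, target(proposal) / target(current)).
        if rng.random() < math.exp(min(0.0, proposal_lp - current_lp)):
            current, current_lp = proposal, proposal_lp
        samples.append(current)
    return samples

samples = metropolis(log_unnormalized_posterior, start=0.5, num_samples=20000)
kept = samples[1000:]  # discard burn-in
print(f"posterior mean ~ {sum(kept) / len(kept):.2f}")
```

The exact posterior in this setup is Beta(8, 4) with mean 2/3, so the sample mean gives a quick sanity check on the chain.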
Laplace Approximation
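Approximate the posterior as a Gaussian centered at the MAP estimate:

$$p(\theta \mid D) \approx \mathcal{N}\big(\theta_{\text{MAP}},\ H^{-1}\big), \qquad H = -\nabla^2_\theta \log p(\theta \mid D)\,\Big|_{\theta = \theta_{\text{MAP}}}$$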
The covariance is the inverse Hessian of the negative log-posterior at the MAP point. This is fast but only accurate when the posterior is unimodal and roughly Gaussian.
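For a one-parameter model the Hessian is just a second derivative, which can be estimated with finite differences. The sketch below applies this to a coin posterior with 7 heads in 10 flips and a uniform prior (so the MAP is 0.7 and the exact posterior is Beta(8, 4)); the function names and step size `h` are illustrative assumptions.

```python
import math

def neg_log_posterior(theta, heads=7, flips=10):
    """Negative log posterior for a coin's bias under a uniform prior, up to a constant."""
    return -(heads * math.log(theta) + (flips - heads) * math.log(1.0 - theta))

def laplace_approximation(neg_log_post, theta_map, h=1e-4):
    """Gaussian approximation N(theta_map, 1 / H) with H from central differences."""
    hessian = (neg_log_post(theta_map + h)
               - 2.0 * neg_log_post(theta_map)
               + neg_log_post(theta_map - h)) / h**2
    return theta_map, 1.0 / hessian

laplace_mean, laplace_var = laplace_approximation(neg_log_posterior, theta_map=0.7)

# Exact Beta(8, 4) variance for comparison: a*b / ((a+b)^2 * (a+b+1)).
exact_var = (8 * 4) / ((8 + 4) ** 2 * (8 + 4 + 1))
print(f"Laplace variance ~ {laplace_var:.4f}, exact variance ~ {exact_var:.4f}")
```

The Laplace variance comes out somewhat larger than the exact one here because the Beta(8, 4) posterior is skewed, which is the kind of mismatch to expect when the posterior is not close to Gaussian.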
Bayesian Deep Learning
Applying Bayesian principles to neural networks:
MC Dropout
Gal and Ghahramani (2016) showed that keeping dropout active at test time can be interpreted as approximate Bayesian inference. Running the network multiple times with different dropout masks gives a distribution of predictions:
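```python
import torch

# MC Dropout: approximate Bayesian uncertainty.
# Assumes `model` (a torch.nn.Module with dropout layers) and `x_test` are already defined.
model.train()  # keep dropout active (note: this also puts BatchNorm layers in training mode)
predictions = []
with torch.no_grad():  # gradients are not needed for prediction
    for _ in range(100):
        predictions.append(model(x_test))  # each pass samples a new dropout mask

stacked = torch.stack(predictions)  # shape: (100, *output_shape)
mean_pred = stacked.mean(dim=0)     # predictive mean
uncertainty = stacked.std(dim=0)    # spread across stochastic passes
```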
Deep Ensembles
Training multiple networks with different initializations and averaging their predictions approximates a Bayesian posterior. Lakshminarayanan et al. (2017) showed this gives well-calibrated uncertainty estimates.
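For regression, where each ensemble member outputs a predictive mean and variance, the deep ensembles paper combines members as an equally weighted mixture of Gaussians. The sketch below shows only that combination step in plain Python; the member outputs are made-up numbers and no networks are trained here.

```python
def combine_gaussian_ensemble(means, variances):
    """Mean and variance of an equally weighted mixture of Gaussian predictions."""
    m = len(means)
    mix_mean = sum(means) / m
    # Law of total variance: average second moment minus squared mixture mean.
    mix_var = sum(v + mu**2 for mu, v in zip(means, variances)) / m - mix_mean**2
    return mix_mean, mix_var

# Hypothetical outputs of three ensemble members for one test input.
member_means = [1.0, 1.2, 0.8]
member_vars = [0.1, 0.1, 0.1]
ens_mean, ens_var = combine_gaussian_ensemble(member_means, member_vars)
print(f"ensemble mean={ens_mean:.3f}, ensemble variance={ens_var:.3f}")
```

The combined variance exceeds each member's own variance whenever the members disagree, so disagreement between networks shows up directly as extra predictive uncertainty.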
Summary
- Bayesian inference maintains full posterior distributions, not just point estimates
- Conjugate priors give closed-form posteriors — Beta-Binomial and Gaussian-Gaussian are the key examples
- The posterior predictive integrates over parameter uncertainty for better-calibrated predictions
- Bayes factors compare models while naturally penalizing complexity
- Hierarchical models share information across groups via partial pooling
- When exact inference is intractable, use variational inference, MCMC, or Laplace approximation
- In deep learning, MC Dropout and ensembles approximate Bayesian uncertainty
- For computational methods that sample from posteriors, see Sampling Methods
References
- Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). Bayesian Data Analysis (3rd ed.). Chapman & Hall/CRC.
- Murphy, K. P. (2022). Probabilistic Machine Learning: An Introduction. MIT Press. Chapters 4-5.
- Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. Chapters 2-3.
- Gal, Y., & Ghahramani, Z. (2016). “Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning.” ICML 2016. arXiv:1506.02142
- Lakshminarayanan, B., Pritzel, A., & Blundell, C. (2017). “Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles.” NeurIPS 2017. arXiv:1612.01474
- Blei, D. M., Kucukelbir, A., & McAuliffe, J. D. (2017). “Variational Inference: A Review for Statisticians.” Journal of the American Statistical Association, 112(518), 859–877. arXiv:1601.00670