From MLE to MAP
Maximum Likelihood Estimation finds parameters that best explain the data. But what if we have prior knowledge about what reasonable parameter values look like?
Maximum A Posteriori (MAP) estimation extends MLE by incorporating a prior distribution over the parameters. Instead of asking “what parameters maximize the data likelihood?”, MAP asks:
What parameters are most probable, given both the data and my prior beliefs?
Bayes’ Theorem Connection
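MAP is a direct application of Bayes’ theorem:
$$P(\theta \mid D) = \frac{P(D \mid \theta)\, P(\theta)}{P(D)}$$
- $P(\theta \mid D)$ is the posterior: what we want to maximize
- $P(D \mid \theta)$ is the likelihood: same as in MLE
- $P(\theta)$ is the prior: our belief about $\theta$ before seeing data
- $P(D)$ is the evidence: constant with respect to $\theta$ (so we can ignore it for optimization)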
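Since $P(D)$ doesn’t depend on $\theta$, maximizing the posterior is equivalent to:
$$\hat{\theta}_{\text{MAP}} = \arg\max_{\theta} \; P(D \mid \theta)\, P(\theta)$$
Or in log form:
$$\hat{\theta}_{\text{MAP}} = \arg\max_{\theta} \; \left[ \log P(D \mid \theta) + \log P(\theta) \right]$$
This is beautifully interpretable: the MAP objective is the MLE objective (the log-likelihood) plus a log-prior term that acts as a penalty. In short, MAP = MLE + prior penalty.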
MLE vs MAP: A Visual Intuition
Think of it this way:
- MLE looks only at the data and finds the peak of the likelihood
- MAP balances the data (likelihood) with prior knowledge (prior)
- The result is pulled from the MLE toward the prior
With very little data, the prior dominates. With lots of data, the likelihood dominates and MAP converges to MLE.
Key insight: MAP with a uniform (flat) prior is exactly MLE. MLE is just a special case of MAP where we have no prior preference.
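The pull toward the prior, and its fading as data accumulates, can be checked numerically. The sketch below assumes a coin-flip model with a Beta prior, for which the MAP estimate has a closed form. The Beta(5, 5) prior, the 80% heads rate, and the function names are illustrative choices, not part of the original text.

```python
def mle(heads: int, flips: int) -> float:
    """Maximum likelihood estimate of the heads probability."""
    return heads / flips


def map_estimate(heads: int, flips: int, alpha: float, beta: float) -> float:
    """Mode of the Beta(alpha + heads, beta + tails) posterior (valid when the mode is interior)."""
    return (heads + alpha - 1) / (flips + alpha + beta - 2)


# Illustrative assumptions: Beta(5, 5) prior, data with 80% heads at every size.
ALPHA, BETA = 5.0, 5.0
gaps = []
for flips in (5, 50, 500, 5000):
    heads = flips * 4 // 5
    gap = abs(map_estimate(heads, flips, ALPHA, BETA) - mle(heads, flips))
    gaps.append(gap)
    print(f"n={flips:5d}  MLE={mle(heads, flips):.3f}  "
          f"MAP={map_estimate(heads, flips, ALPHA, BETA):.3f}  gap={gap:.4f}")

# A flat Beta(1, 1) prior reproduces the MLE exactly.
flat_prior_map = map_estimate(4, 5, 1.0, 1.0)
```

The gap between the two estimates shrinks at each step because the prior contributes a fixed number of pseudo-counts while the real counts keep growing.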
Example: Coin Flips with a Prior
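Scenario: You flip a coin 3 times and get 3 heads. MLE says $\hat{p}_{\text{MLE}} = 3/3 = 1$, meaning the coin always lands heads. That seems extreme.
MAP approach: Use a Beta prior $\text{Beta}(\alpha, \beta)$ on the heads probability $p$. With $h$ heads in $n$ flips the posterior is $\text{Beta}(\alpha + h, \beta + n - h)$, and for $\alpha, \beta > 1$ its mode is the MAP estimate. With a symmetric prior $\alpha = \beta > 1$ (a mild preference for fair coins when the common value is small, such as $\alpha = \beta = 2$):
$$\hat{p}_{\text{MAP}} = \frac{h + \alpha - 1}{n + \alpha + \beta - 2} = \frac{3 + \alpha - 1}{3 + 2\alpha - 2} = \frac{\alpha + 2}{2\alpha + 1}$$
The MAP estimate is strictly below $1$ (for example, $4/5 = 0.8$ when $\alpha = \beta = 2$) instead of the MLE's $1$; it has been regularized by the prior toward $0.5$. With more data, the effect of the prior shrinks.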
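To make the coin example concrete, here is a small sketch that computes the posterior mode in closed form and confirms it by brute-force maximization of the log posterior. The hyperparameters $\alpha = \beta = 2$ are an assumption made for illustration, and `log_posterior` is a hypothetical helper name.

```python
import math


def log_posterior(p: float, heads: int, tails: int, alpha: float, beta: float) -> float:
    """Unnormalized log posterior: Bernoulli log-likelihood plus Beta log-prior."""
    log_likelihood = heads * math.log(p) + tails * math.log(1 - p)
    log_prior = (alpha - 1) * math.log(p) + (beta - 1) * math.log(1 - p)
    return log_likelihood + log_prior


heads, tails = 3, 0          # the scenario from the text
alpha, beta = 2.0, 2.0       # assumed hyperparameters, for illustration only

# Closed-form posterior mode.
closed_form = (heads + alpha - 1) / (heads + tails + alpha + beta - 2)

# Brute-force check: evaluate the log posterior on a grid inside (0, 1).
grid = [i / 1000 for i in range(1, 1000)]
grid_argmax = max(grid, key=lambda p: log_posterior(p, heads, tails, alpha, beta))

print(f"closed form: {closed_form:.3f}, grid search: {grid_argmax:.3f}")
```

Under these assumptions both routes give 0.8: the prior adds one pseudo-head and one pseudo-tail, so the estimate behaves like 4 heads in 5 flips.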
The Prior-Regularization Connection
This is one of the deepest insights in machine learning: priors correspond to regularization.
Gaussian Prior = L2 Regularization (Ridge)
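If we place a zero-mean Gaussian prior on the parameters:
$$P(\theta) = \mathcal{N}(\theta \mid 0, \sigma^2 I) \propto \exp\!\left(-\frac{\|\theta\|_2^2}{2\sigma^2}\right)$$
Then MAP becomes:
$$\hat{\theta}_{\text{MAP}} = \arg\max_{\theta} \left[ \log P(D \mid \theta) - \frac{1}{2\sigma^2}\|\theta\|_2^2 \right] = \arg\min_{\theta} \left[ -\log P(D \mid \theta) + \lambda \|\theta\|_2^2 \right]$$
where $\lambda = \frac{1}{2\sigma^2}$. This is exactly L2 regularization (Ridge regression): a narrower prior (smaller $\sigma^2$) means a stronger penalty.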
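The step from the Gaussian density to the squared-norm penalty is short enough to write out. The derivation below assumes an independent zero-mean Gaussian prior with a common variance $\sigma^2$ on each of $d$ parameters; the symbols $d$ and $\theta_j$ are notation introduced here for the sketch.

$$
\begin{aligned}
\log P(\theta) &= \log \prod_{j=1}^{d} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{\theta_j^2}{2\sigma^2}\right) \\
&= -\frac{1}{2\sigma^2} \sum_{j=1}^{d} \theta_j^2 \;-\; \frac{d}{2}\log\!\left(2\pi\sigma^2\right) \\
&= -\frac{1}{2\sigma^2}\|\theta\|_2^2 + \text{const}
\end{aligned}
$$

The constant does not depend on $\theta$, so it drops out of the $\arg\max$. Adding the remaining term to the log-likelihood and flipping the sign gives a minimization with penalty weight $\frac{1}{2\sigma^2}$ on $\|\theta\|_2^2$.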
Laplace Prior = L1 Regularization (Lasso)
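If we place a Laplace prior with scale $b$ on each parameter:
$$P(\theta) = \prod_j \frac{1}{2b} \exp\!\left(-\frac{|\theta_j|}{b}\right) \propto \exp\!\left(-\frac{\|\theta\|_1}{b}\right)$$
Then MAP becomes:
$$\hat{\theta}_{\text{MAP}} = \arg\min_{\theta} \left[ -\log P(D \mid \theta) + \lambda \|\theta\|_1 \right], \qquad \lambda = \frac{1}{b}$$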
This is L1 regularization (Lasso), which encourages sparsity — many parameters become exactly zero.
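The difference between the two penalties is easiest to see in one dimension. The sketch below assumes a single observation $y$ of a parameter $\theta$ with unit-variance Gaussian noise, so the negative log-likelihood is $\frac{1}{2}(y - \theta)^2$; both minimizers then have closed forms. The function names and the value of `lam` are illustrative.

```python
def map_laplace(y: float, lam: float) -> float:
    """argmin over theta of 0.5 * (y - theta)**2 + lam * abs(theta): soft-thresholding."""
    if abs(y) <= lam:
        return 0.0
    return y - lam if y > 0 else y + lam


def map_gaussian(y: float, lam: float) -> float:
    """argmin over theta of 0.5 * (y - theta)**2 + lam * theta**2: proportional shrinkage."""
    return y / (1.0 + 2.0 * lam)


lam = 1.0  # illustrative penalty strength
for y in (3.0, 0.5, -0.2, -2.5):
    print(f"y={y:+.1f}  L1 MAP={map_laplace(y, lam):+.3f}  L2 MAP={map_gaussian(y, lam):+.3f}")
```

The L1 solution snaps every observation with $|y| \le \lambda$ to exactly zero, while the L2 solution only rescales it, which is the scalar version of sparsity versus shrinkage.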
Every time you add a regularization term to your loss function, you’re implicitly doing MAP estimation with a specific prior.
MAP for Common Models
Linear Regression with Gaussian Prior
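$$\hat{w}_{\text{ridge}} = \arg\min_{w} \; \|y - Xw\|_2^2 + \lambda \|w\|_2^2 = (X^\top X + \lambda I)^{-1} X^\top y$$
This is Ridge regression (with Gaussian noise, $\lambda$ is the ratio of noise variance to prior variance). The Gaussian prior says “I expect the weights to be small.” The result: every weight shrinks toward zero, but in general none become exactly zero.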
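A minimal NumPy sketch of this shrinkage, assuming the penalized least-squares objective $\|y - Xw\|_2^2 + \lambda\|w\|_2^2$ and solving its normal equations directly. The synthetic data, the value $\lambda = 10$, and the helper name `ridge_fit` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (illustrative): 20 observations, 5 features, known weights.
X = rng.normal(size=(20, 5))
true_w = np.array([2.0, -1.0, 0.5, 0.0, 0.0])
y = X @ true_w + rng.normal(scale=0.5, size=20)


def ridge_fit(X: np.ndarray, y: np.ndarray, lam: float) -> np.ndarray:
    """Solve (X^T X + lam * I) w = X^T y; lam = 0 gives ordinary least squares."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)


w_mle = ridge_fit(X, y, lam=0.0)    # flat prior: plain least squares
w_map = ridge_fit(X, y, lam=10.0)   # Gaussian prior: ridge

print("MLE weights:", np.round(w_mle, 3))
print("MAP weights:", np.round(w_map, 3))
print("weight norms:", np.linalg.norm(w_mle), np.linalg.norm(w_map))
```

The ridge weight vector has a smaller norm than the least-squares one, yet the weights on the two irrelevant features stay nonzero, which is the behaviour the Gaussian prior predicts.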
Linear Regression with Laplace Prior
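$$\hat{w}_{\text{lasso}} = \arg\min_{w} \; \|y - Xw\|_2^2 + \lambda \|w\|_1$$
This is Lasso regression. The Laplace prior says “I expect many weights to be zero.” The result: automatic feature selection, because the MAP solution sets some weights to exactly zero.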
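Unlike Ridge, Lasso has no closed-form solution in general, but cyclic coordinate descent with soft-thresholding is a standard way to compute it. The sketch below minimizes $\frac{1}{2}\|y - Xw\|_2^2 + \lambda\|w\|_1$ (the factor $\frac{1}{2}$ is a convention that only rescales $\lambda$) on a tiny hand-built dataset with orthogonal columns. The data, `lam`, and function names are illustrative assumptions.

```python
import numpy as np


def soft_threshold(value: float, threshold: float) -> float:
    """Shrink value toward zero by threshold, clipping to exactly zero inside the band."""
    if abs(value) <= threshold:
        return 0.0
    return value - threshold if value > 0 else value + threshold


def lasso_coordinate_descent(X, y, lam, n_sweeps=100):
    """Minimize 0.5 * ||y - X w||^2 + lam * ||w||_1 by cyclic coordinate descent."""
    n_features = X.shape[1]
    w = np.zeros(n_features)
    column_sq_norms = (X ** 2).sum(axis=0)
    for _ in range(n_sweeps):
        for j in range(n_features):
            # Residual with feature j's current contribution added back in.
            partial_residual = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ partial_residual
            w[j] = soft_threshold(rho, lam) / column_sq_norms[j]
    return w


# Tiny hand-built example (illustrative): orthogonal columns, one weak feature.
X = np.array([[1.0,  1.0,  1.0],
              [1.0, -1.0,  1.0],
              [1.0,  1.0, -1.0],
              [1.0, -1.0, -1.0]])
y = X @ np.array([2.0, 0.1, -1.0])

w_lasso = lasso_coordinate_descent(X, y, lam=1.0)
print(w_lasso)
```

The weak second feature (true weight 0.1) is set to exactly zero, while the two strong weights are shrunk to roughly 1.75 and -0.75: feature selection and shrinkage in one step.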
Elastic Net
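$$\hat{w}_{\text{enet}} = \arg\min_{w} \; \|y - Xw\|_2^2 + \lambda_1 \|w\|_1 + \lambda_2 \|w\|_2^2$$
This combines both priors: some sparsity (L1) with grouping of correlated features (L2).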
Connection to the Exponential Family
For exponential family distributions with conjugate priors, the MAP estimate has a clean interpretation: it’s MLE with pseudo-observations added from the prior. The conjugate prior’s hyperparameters act as imaginary data points that regularize the estimate.
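The coin-flip model gives a simple worked instance of this pseudo-observation reading. Assume a Bernoulli likelihood with $h$ heads in $n$ flips and a conjugate $\text{Beta}(\alpha, \beta)$ prior with $\alpha, \beta > 1$ (notation chosen here for illustration):

$$
\hat{p}_{\text{MLE}} = \frac{h}{n},
\qquad
\hat{p}_{\text{MAP}} = \frac{h + (\alpha - 1)}{n + (\alpha - 1) + (\beta - 1)}
$$

The MAP estimate is the MLE formula applied to an augmented dataset that contains $\alpha - 1$ imaginary heads and $\beta - 1$ imaginary tails on top of the real flips. As $n$ grows, the fixed number of imaginary flips becomes negligible and the two estimates coincide.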
Choosing the Prior
The choice of prior matters, especially with limited data:
| Prior | Effect | When to Use |
|---|---|---|
| Uniform (flat) | No regularization (reduces to MLE) | Large datasets, no prior knowledge |
| Gaussian (narrow) | Strong shrinkage toward zero | When you expect small parameter values |
| Gaussian (wide) | Weak shrinkage | Default “gentle” regularization |
| Laplace | Sparsity (many zeros) | High-dimensional data, feature selection |
| Beta | Bounded parameters | Probabilities, proportions |
Empirical Bayes
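In practice, the prior’s hyperparameters (like the regularization strength $\lambda$) are often chosen from the data rather than fixed in advance. Estimating them from the data, classically by maximizing the marginal likelihood, is called Empirical Bayes: we let the data inform our prior. Cross-validation over $\lambda$ is the common practical counterpart, although it is not Empirical Bayes in the strict sense.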
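A minimal sketch of choosing the regularization strength by k-fold cross-validation, assuming a ridge model fitted in closed form. The synthetic data, the candidate grid, and the helper names `ridge_fit` and `cv_error` are hypothetical choices for illustration.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Closed-form ridge solution of (X^T X + lam * I) w = X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)


def cv_error(X, y, lam, n_folds=5):
    """Mean squared validation error of ridge regression across contiguous folds."""
    folds = np.array_split(np.arange(len(y)), n_folds)
    errors = []
    for held_out in folds:
        train = np.setdiff1d(np.arange(len(y)), held_out)
        w = ridge_fit(X[train], y[train], lam)
        errors.append(np.mean((y[held_out] - X[held_out] @ w) ** 2))
    return float(np.mean(errors))


# Synthetic data (illustrative): few observations relative to features.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 10))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=1.0, size=30)

lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]
scores = {lam: cv_error(X, y, lam) for lam in lambdas}
best_lam = min(scores, key=scores.get)

print("CV error per lambda:", scores)
print("selected lambda:", best_lam)
```

Each candidate corresponds to a different prior variance, so picking the one with the lowest held-out error amounts to letting the data select how strongly the prior should constrain the weights.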
MAP vs Full Bayesian Inference
MAP gives a point estimate: the single most probable parameter value. Full Bayesian inference computes the entire posterior distribution $P(\theta \mid D)$.
| Aspect | MAP | Full Bayesian |
|---|---|---|
| Output | Single point | Entire distribution |
| Uncertainty | No | Yes (posterior width) |
| Computation | Optimization | Integration (often intractable) |
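| Prediction | Plug-in: $P(y \mid x, \hat{\theta}_{\text{MAP}})$ | Posterior predictive: $\int P(y \mid x, \theta)\, P(\theta \mid D)\, d\theta$ |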
Full Bayesian inference is more principled but more expensive. MAP is a practical compromise.
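For a conjugate model the full posterior is available in closed form, which makes the contrast easy to see. The sketch below assumes a Beta(2, 2) prior and 3 heads in 3 flips (illustrative values) and compares the MAP plug-in prediction with the posterior predictive probability, which for a Bernoulli model equals the posterior mean.

```python
import math

# Assumed for illustration: Beta(2, 2) prior, 3 heads in 3 flips.
alpha, beta = 2.0, 2.0
heads, tails = 3, 0

# Conjugacy: the posterior is Beta(a, b) with updated counts.
a, b = alpha + heads, beta + tails

map_estimate = (a - 1) / (a + b - 2)    # posterior mode (point estimate)
posterior_mean = a / (a + b)            # also P(next flip is heads | data)
posterior_sd = math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))

print(f"MAP plug-in prediction   P(heads) = {map_estimate:.3f}")
print(f"Posterior predictive     P(heads) = {posterior_mean:.3f}")
print(f"Posterior standard deviation      = {posterior_sd:.3f}")
```

The two predictions differ (0.8 versus about 0.714) because the posterior is skewed, and the standard deviation of roughly 0.16 is uncertainty information that the MAP point estimate alone does not carry.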
When MAP Helps Most
MAP (regularization) is most valuable when:
- Small datasets — the prior prevents overfitting
- High-dimensional data — many features, few observations
- Ill-conditioned problems — when MLE is unstable
- Prior knowledge exists — you genuinely know something about the parameters
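The ill-conditioned case can be illustrated with two nearly identical features. This sketch assumes linear regression, where MLE solves the normal equations with matrix $X^\top X$ and MAP with a Gaussian prior solves them with $X^\top X + \lambda I$. The data and the value of `lam` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: the second feature is almost a copy of the first.
x1 = rng.normal(size=50)
x2 = x1 + 1e-4 * rng.normal(size=50)
X = np.column_stack([x1, x2])

gram = X.T @ X
lam = 1.0  # assumed penalty strength
cond_mle = np.linalg.cond(gram)
cond_map = np.linalg.cond(gram + lam * np.eye(2))

print(f"condition number, MLE normal equations: {cond_mle:.3e}")
print(f"condition number, MAP normal equations: {cond_map:.3e}")
```

Adding $\lambda I$ lifts the smallest eigenvalue away from zero, so the regularized system is far better conditioned and its solution is much less sensitive to noise in the data.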
Summary
- MAP combines the likelihood (data) with a prior (beliefs) via Bayes’ theorem
- MAP = MLE + regularization penalty
- Gaussian prior gives L2 regularization (Ridge), Laplace prior gives L1 (Lasso)
- With enough data, MAP converges to MLE — the data overwhelms the prior
- Regularization strength is inversely proportional to the prior variance (for a Gaussian prior, $\lambda = 1/(2\sigma^2)$)
- MAP is a point estimate; full Bayesian inference gives the complete posterior
- Every regularized model is implicitly a MAP estimator with a specific prior
- Next: the EM algorithm handles MLE when data has missing or latent variables