The Big Idea
Given some data and a model with unknown parameters, Maximum Likelihood Estimation (MLE) finds the parameter values that make the observed data most probable.
It answers a simple question: Of all possible parameter values, which ones would have been most likely to generate the data I actually observed?
MLE is arguably the most widely used estimation method in all of statistics and machine learning. It builds directly on the probability foundations and distributions we covered previously.
The Likelihood Function
The likelihood function is just the probability of the observed data $D$ viewed as a function of the parameters $\theta$:

$$L(\theta) = P(D \mid \theta)$$
Key distinction: Probability is a function of outcomes (with fixed parameters). Likelihood is a function of parameters (with fixed data). Same formula, different perspective.
If the data points $x_1, \dots, x_n$ are independent and identically distributed (i.i.d.), the likelihood factors into a product:

$$L(\theta) = \prod_{i=1}^{n} P(x_i \mid \theta)$$
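To make the "function of parameters, fixed data" view concrete, here is a minimal sketch that evaluates the i.i.d. likelihood of one fixed set of coin flips under several candidate parameter values. The data and the helper name `likelihood` are made up for illustration.

```python
flips = [1, 0, 1, 1]  # fixed, made-up data: 1 = heads, 0 = tails

def likelihood(p, data):
    """Product of per-flip Bernoulli probabilities (i.i.d. assumption)."""
    result = 1.0
    for x in data:
        result *= p if x == 1 else (1 - p)
    return result

# Same data, different parameter values: the likelihood changes with p.
for p in (0.25, 0.5, 0.75):
    print(p, likelihood(p, flips))
```

With three heads in four flips, the candidate closest to 0.75 scores highest, which is the behavior the MLE formalizes.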
Log-Likelihood: A Practical Trick
Products of many small probabilities are numerically unstable (they can underflow to zero) and hard to differentiate. Taking the logarithm converts products into sums:

$$\ell(\theta) = \log L(\theta) = \sum_{i=1}^{n} \log P(x_i \mid \theta)$$

Since $\log$ is monotonically increasing, maximizing the log-likelihood gives the same answer as maximizing the likelihood. This is the form used in practice.
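The underflow problem is easy to reproduce. The sketch below assumes a hypothetical dataset of 2000 observations, each assigned probability 0.5 by the model, and compares the raw product with the sum of logs.

```python
import math

# Hypothetical data: 2000 observations, each with probability 0.5 under the model.
probs = [0.5] * 2000

# The direct product underflows to exactly 0.0 in double precision.
likelihood = 1.0
for p in probs:
    likelihood *= p

# The sum of log-probabilities stays an ordinary finite number.
log_likelihood = sum(math.log(p) for p in probs)

print(likelihood)      # 0.0 (underflow)
print(log_likelihood)  # about -1386.29
```

Once the product has collapsed to zero, every parameter value looks equally bad, so the optimizer has nothing to compare. The log-likelihood keeps the ranking intact.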
MLE Recipe
- Write down the likelihood for your model
- Take the log to get the log-likelihood
- Differentiate with respect to $\theta$ and set equal to zero
- Solve for $\hat{\theta}$
Let’s apply this to three concrete examples.
Example 1: Bernoulli (Coin Flips)
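Setup: Flip a coin $n$ times, observe $k$ heads. What is the MLE of the probability of heads $p$?

Likelihood:

$$L(p) = p^{k} (1 - p)^{n - k}$$

Log-likelihood:

$$\ell(p) = k \log p + (n - k) \log(1 - p)$$

Differentiate and set to zero:

$$\frac{d\ell}{dp} = \frac{k}{p} - \frac{n - k}{1 - p} = 0$$

Solve:

$$\hat{p} = \frac{k}{n}$$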
The MLE for a coin’s bias is simply the fraction of heads observed. Intuitive and elegant.
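As a sanity check, the closed-form answer (the fraction of heads) can be compared against a brute-force search over the log-likelihood. This is only a sketch: the data (7 heads in 10 flips) and the grid resolution are arbitrary choices.

```python
import math

def bernoulli_log_likelihood(p, k, n):
    """Log-likelihood of k heads in n flips for heads-probability p (0 < p < 1)."""
    return k * math.log(p) + (n - k) * math.log(1 - p)

n, k = 10, 7            # illustrative data: 7 heads in 10 flips
p_closed_form = k / n   # fraction of heads

# Brute-force check: evaluate the log-likelihood on a fine grid of p values.
grid = [i / 1000 for i in range(1, 1000)]
p_grid = max(grid, key=lambda p: bernoulli_log_likelihood(p, k, n))

print(p_closed_form, p_grid)
```

Both approaches land on the same value, which is what the calculus predicts for a log-likelihood with a single peak.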
Example 2: Gaussian (Normal Distribution)
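Setup: Given $n$ observations $x_1, \dots, x_n$ from a normal distribution, estimate the mean $\mu$ and variance $\sigma^2$.

Log-likelihood:

$$\ell(\mu, \sigma^2) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2$$

MLE solutions (derived by taking partial derivatives with respect to $\mu$ and $\sigma^2$ and setting them to zero):

$$\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{\mu})^2$$

Note: The MLE variance divides by $n$, not $n - 1$. This makes it slightly biased (on average it underestimates the true variance). The unbiased version divides by $n - 1$, which is Bessel's correction.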
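The difference between the two variance conventions can be seen directly in code. The sketch below draws a small simulated sample (the true mean 5 and standard deviation 2 are arbitrary assumptions) and computes both versions by hand, then checks them against Python's `statistics` module.

```python
import random
import statistics

random.seed(0)
# Hypothetical sample: 10 draws from a normal with mean 5 and standard deviation 2.
data = [random.gauss(5.0, 2.0) for _ in range(10)]
n = len(data)

mu_mle = sum(data) / n
squared_devs = sum((x - mu_mle) ** 2 for x in data)
var_mle = squared_devs / n             # MLE: divides by n
var_unbiased = squared_devs / (n - 1)  # Bessel's correction: divides by n - 1

# The standard library exposes both conventions.
assert abs(var_mle - statistics.pvariance(data)) < 1e-9
assert abs(var_unbiased - statistics.variance(data)) < 1e-9

print(mu_mle, var_mle, var_unbiased)
```

The MLE variance is always the smaller of the two by a factor of (n - 1) / n, so the gap matters for small samples and vanishes as n grows.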
Example 3: Linear Regression
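In linear regression, we model $y = \mathbf{w}^\top \mathbf{x} + \epsilon$, where $\epsilon \sim \mathcal{N}(0, \sigma^2)$.

The log-likelihood is:

$$\ell(\mathbf{w}) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \mathbf{w}^\top \mathbf{x}_i)^2$$

The first term does not depend on $\mathbf{w}$, and the second is a negative multiple of the squared residuals, so maximizing this is equivalent to minimizing the sum of squared errors:

$$\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \sum_{i=1}^{n} (y_i - \mathbf{w}^\top \mathbf{x}_i)^2$$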
This is why ordinary least squares and MLE give the same answer for linear regression with Gaussian noise.
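A small numerical check makes the equivalence tangible. The sketch below uses a made-up dataset with one feature plus an intercept (a special case of the model above), fits it with the closed-form least squares formulas, and confirms that nudging the parameters away from that solution lowers the Gaussian log-likelihood. The noise level `sigma=1.0` is an arbitrary assumption; any fixed value gives the same maximizer.

```python
import math

# Small illustrative dataset (made up for this sketch).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]
n = len(xs)

# Closed-form ordinary least squares for y = w * x + b.
x_bar = sum(xs) / n
y_bar = sum(ys) / n
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
w_ols = s_xy / s_xx
b_ols = y_bar - w_ols * x_bar

def sse(w, b):
    """Sum of squared errors of the line y = w * x + b."""
    return sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys))

def gaussian_log_likelihood(w, b, sigma=1.0):
    """Log-likelihood under y = w * x + b + Gaussian noise with std sigma."""
    return -0.5 * n * math.log(2 * math.pi * sigma ** 2) - sse(w, b) / (2 * sigma ** 2)

# Perturbing the OLS solution raises the SSE and lowers the log-likelihood.
best_ll = gaussian_log_likelihood(w_ols, b_ols)
for dw, db in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.1)]:
    assert sse(w_ols + dw, b_ols + db) > sse(w_ols, b_ols)
    assert gaussian_log_likelihood(w_ols + dw, b_ols + db) < best_ll

print(w_ols, b_ols)
```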
Connection to the Exponential Family
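For exponential family distributions, MLE reduces to moment matching: set the expected sufficient statistics under the model equal to the observed average of the sufficient statistics, $\mathbb{E}_{\hat{\theta}}[T(x)] = \frac{1}{n} \sum_{i=1}^{n} T(x_i)$. Because the log-partition function is convex, the log-likelihood is concave in the natural parameters, so any solution of this equation is a global maximum, and it is unique when the family is minimal (no redundant sufficient statistics).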
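The Poisson distribution is a convenient illustration, chosen here as an example rather than taken from the text above. Its sufficient statistic is the count itself and its expected value is the rate, so moment matching says the MLE of the rate is the sample mean. The sketch below checks this against a grid search on made-up counts.

```python
import math

counts = [2, 0, 3, 1, 4, 2, 2]  # made-up count data

def poisson_log_likelihood(lam):
    """Poisson log-likelihood of the counts for rate lam > 0."""
    return sum(x * math.log(lam) - lam - math.lgamma(x + 1) for x in counts)

# Moment matching: E[X] = lam under the model, so set lam to the sample mean.
lam_moment = sum(counts) / len(counts)

# Brute-force maximization over a grid of candidate rates.
grid = [i / 1000 for i in range(1, 10001)]
lam_grid = max(grid, key=poisson_log_likelihood)

print(lam_moment, lam_grid)
```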
Properties of MLE
MLE has several desirable theoretical properties:
Consistency
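As $n \to \infty$, the MLE converges in probability to the true parameter value, provided the model is correctly specified and mild regularity conditions hold. More data means better estimates. The Law of Large Numbers is the key ingredient in the proof.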
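Consistency can be observed in a quick simulation. The sketch below assumes a coin with true bias 0.3 (an arbitrary choice) and computes the MLE, the fraction of heads, for increasing sample sizes.

```python
import random

random.seed(42)
true_p = 0.3  # assumed "true" coin bias for the simulation

def mle_from_flips(n):
    """Simulate n flips and return the fraction of heads."""
    heads = sum(1 for _ in range(n) if random.random() < true_p)
    return heads / n

estimates = {n: mle_from_flips(n) for n in (10, 1_000, 100_000)}
for n, p_hat in estimates.items():
    print(n, p_hat, abs(p_hat - true_p))
```

Individual runs fluctuate, so the error is not guaranteed to shrink at every step, but the typical error falls roughly like one over the square root of n.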
Asymptotic Normality
For large $n$, the MLE is approximately normally distributed around the true parameter, with a variance that shrinks in proportion to $1/n$, thanks to the Central Limit Theorem. This lets us build approximate confidence intervals.
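For the coin example, this normal approximation leads to the familiar Wald interval. The sketch below uses made-up data (420 heads in 1000 flips) and the rounded critical value 1.96 for a nominal 95% interval. It is a large-sample approximation and can behave poorly when the estimate is near 0 or 1.

```python
import math

n, k = 1000, 420  # illustrative data: 420 heads in 1000 flips
p_hat = k / n

# Large-sample standard error of the Bernoulli MLE: sqrt(p (1 - p) / n),
# with the unknown p replaced by its estimate.
std_err = math.sqrt(p_hat * (1 - p_hat) / n)

z = 1.96  # approximate 97.5th percentile of the standard normal
ci_low, ci_high = p_hat - z * std_err, p_hat + z * std_err

print(ci_low, ci_high)  # roughly 0.389 to 0.451
```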
Efficiency
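Under standard regularity conditions, the MLE is asymptotically efficient: as $n \to \infty$, its variance attains the Cramér-Rao lower bound, the smallest variance any unbiased estimator can achieve. In that large-sample sense, no well-behaved estimator does better.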
Invariance
If $\hat{\theta}$ is the MLE of $\theta$, then $g(\hat{\theta})$ is the MLE of $g(\theta)$ for any function $g$. Want the MLE of $\sigma$? Just take the square root of the MLE of $\sigma^2$.
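The standard deviation example can be checked numerically. The sketch below uses made-up observations, computes the standard deviation as the square root of the variance MLE, and compares it with a direct grid search over the standard deviation in the Gaussian log-likelihood. The grid step of 0.0001 is an arbitrary choice, so agreement is only expected up to that resolution.

```python
import math

data = [2.1, 3.4, 1.9, 4.2, 3.0]  # made-up observations
n = len(data)
mu_hat = sum(data) / n
sum_sq = sum((x - mu_hat) ** 2 for x in data)

# Invariance: the MLE of sigma is the square root of the MLE of sigma^2.
var_hat = sum_sq / n
sigma_from_invariance = math.sqrt(var_hat)

def log_likelihood_sigma(sigma):
    """Gaussian log-likelihood in sigma (constant terms dropped, mean fixed at mu_hat)."""
    return -n * math.log(sigma) - sum_sq / (2 * sigma ** 2)

# Direct maximization over sigma on a grid with step 0.0001.
grid = [i / 10000 for i in range(1, 50001)]
sigma_grid = max(grid, key=log_likelihood_sigma)

print(sigma_from_invariance, sigma_grid)
```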
Limitations
MLE isn’t perfect:
- Overfitting: MLE can overfit with small data. It has no mechanism to prefer simpler models
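- No built-in uncertainty quantification: MLE gives a point estimate, not a distribution over parameters (confidence intervals rely on the large-sample normal approximation)
- Sensitive to model specification: If the assumed model family is wrong, the estimate can fit the data well by the model's own measure and still be misleading
- Can be biased: For finite samples, maximum likelihood estimates may be systematically off (as in the Gaussian variance example)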
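The small-data problem shows up in the simplest possible case. In the sketch below (made-up data: three flips, all heads), the MLE declares tails impossible.

```python
n, k = 3, 3    # three flips, all heads
p_hat = k / n  # MLE of the probability of heads

# Under this estimate, tails is assigned probability zero.
prob_tails = 1 - p_hat

print(p_hat, prob_tails)  # 1.0 0.0
```

Three flips are hardly enough to rule out tails, but the likelihood alone has no way to express that.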
These limitations motivate Bayesian approaches like MAP estimation, which we cover in the next article.
MLE in Machine Learning
MLE appears everywhere in ML, often in disguise:
| ML Method | What MLE Gives You |
|---|---|
| Linear regression | Least squares solution |
| Logistic regression | Cross-entropy loss minimization |
| Neural networks | Standard training with cross-entropy or MSE loss |
| Naive Bayes | Parameter estimates from counting |
| Gaussian Mixture Models | EM algorithm (iterative MLE) |
When you minimize a standard loss function in ML, you are very often doing MLE (or a regularized version of it) under some implied probabilistic model of the data.
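The logistic regression row can be verified in a few lines. The sketch below uses made-up labels and predicted probabilities and shows that binary cross-entropy is the negative Bernoulli log-likelihood averaged over the examples. The function names are illustrative, not taken from any library.

```python
import math

labels = [1, 0, 1, 1, 0]           # made-up binary labels
preds = [0.9, 0.2, 0.7, 0.6, 0.1]  # made-up predicted P(y = 1)

def bernoulli_log_likelihood(ys, ps):
    """Sum of log-probabilities the model assigns to the observed labels."""
    return sum(math.log(p) if y == 1 else math.log(1 - p) for y, p in zip(ys, ps))

def binary_cross_entropy(ys, ps):
    """Mean binary cross-entropy loss."""
    total = sum(y * math.log(p) + (1 - y) * math.log(1 - p) for y, p in zip(ys, ps))
    return -total / len(ys)

nll = -bernoulli_log_likelihood(labels, preds) / len(labels)
bce = binary_cross_entropy(labels, preds)

print(nll, bce)  # the two values agree
```

Minimizing the cross-entropy over model parameters is therefore the same optimization as maximizing the likelihood of the labels.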
Summary
- MLE finds parameters that maximize the probability of observed data
- In practice, we maximize the log-likelihood (turns products into sums)
- For common distributions, MLE gives clean, intuitive formulas
- Under regularity conditions, MLE is consistent, asymptotically efficient, and asymptotically normal
- Minimizing squared error (regression) and cross-entropy (classification) are both MLE
- Main weakness: prone to overfitting, gives point estimates only
- Next: MAP estimation adds prior knowledge to overcome MLE’s limitations