Maximum Likelihood Estimation

Probability & Statistics Series 6 / 13

The Big Idea

Given some data and a model with unknown parameters, Maximum Likelihood Estimation (MLE) finds the parameter values that make the observed data most probable.

It answers a simple question: Of all possible parameter values, which ones would have been most likely to generate the data I actually observed?

\hat{\theta}_{\text{MLE}} = \arg\max_{\theta} \, P(\text{data} \mid \theta)

MLE is arguably the most widely used estimation method in all of statistics and machine learning. It builds directly on the probability foundations and distributions we covered previously.

The Likelihood Function

The likelihood function $L(\theta)$ is just the probability of the data (or the probability density, for continuous data) viewed as a function of the parameters:

L(\theta) = P(X_1, X_2, \ldots, X_n \mid \theta)

Key distinction: Probability is a function of outcomes (with fixed parameters). Likelihood is a function of parameters (with fixed data). Same formula, different perspective.

If the data points are independent and identically distributed (i.i.d.), the likelihood factors into a product:

L(\theta) = \prod_{i=1}^{n} P(X_i \mid \theta)

Log-Likelihood: A Practical Trick

Products of many probabilities are numerically unstable (each factor is at most 1, so the product can underflow to zero) and awkward to differentiate. Taking the logarithm converts products into sums:

\log L(\theta) = \sum_{i=1}^{n} \log P(X_i \mid \theta)

Since $\log$ is monotonically increasing, maximizing the log-likelihood gives the same answer as maximizing the likelihood. This is the form used in practice.
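To make the underflow concrete, here is a minimal sketch using made-up data: 2000 observations that each have probability 0.5 under the model. The numbers are purely illustrative.

```python
import math

# Hypothetical data: 2000 i.i.d. observations, each with probability 0.5 under the model.
probs = [0.5] * 2000

# Direct product: 0.5**2000 is far below the smallest positive double, so it underflows.
likelihood = 1.0
for p in probs:
    likelihood *= p

# Sum of logs: stays comfortably inside floating-point range.
log_likelihood = sum(math.log(p) for p in probs)

print(likelihood)      # 0.0
print(log_likelihood)  # about -1386.29
```

The product carries no information once it hits zero, while the log-likelihood can still be compared across parameter values.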

MLE Recipe

1. Write down the likelihood $P(\text{data} \mid \theta)$ for your model
2. Take the log to get the log-likelihood
3. Differentiate with respect to $\theta$ and set equal to zero
4. Solve for $\theta$

Let’s apply this to three concrete examples.

Example 1: Bernoulli (Coin Flips)

Setup: Flip a coin $n$ times, observe $k$ heads. What is the maximum likelihood estimate of the probability of heads $p$?

Likelihood:

L(p) = p^{k} (1 - p)^{n - k}

Log-likelihood:

\log L(p) = k \log(p) + (n - k) \log(1 - p)

Differentiate and set to zero:

\frac{d}{dp} \log L = \frac{k}{p} - \frac{n - k}{1 - p} = 0

Solve. Multiplying through by $p(1 - p)$ gives $k(1 - p) = (n - k)p$, which simplifies to $k = np$:

\hat{p}_{\text{MLE}} = \frac{k}{n}

The MLE for a coin’s bias is simply the fraction of heads observed. Intuitive and elegant.
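As a sanity check, the closed form can be compared against a brute-force search over the log-likelihood. This is a minimal sketch; the counts (7 heads in 10 flips) and the function name `bernoulli_log_likelihood` are made up for illustration.

```python
import math

def bernoulli_log_likelihood(p, k, n):
    """log L(p) = k*log(p) + (n - k)*log(1 - p), valid for 0 < p < 1."""
    return k * math.log(p) + (n - k) * math.log(1 - p)

# Hypothetical data: 7 heads in 10 flips.
k, n = 7, 10

# Closed-form MLE from the derivation above.
p_closed_form = k / n

# Brute-force check: evaluate the log-likelihood on a grid strictly inside (0, 1).
grid = [i / 1000 for i in range(1, 1000)]
p_grid = max(grid, key=lambda p: bernoulli_log_likelihood(p, k, n))

print(p_closed_form)  # 0.7
print(p_grid)         # 0.7
```

The grid avoids the endpoints 0 and 1, where the log-likelihood is undefined whenever both heads and tails were observed.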

Example 2: Gaussian (Normal Distribution)

Setup: Given $n$ i.i.d. observations $x_1, \ldots, x_n$ from a normal distribution, estimate the mean $\mu$ and variance $\sigma^2$.

Log-likelihood:

\log L(\mu, \sigma^2) = -\frac{n}{2} \log(2\pi) - \frac{n}{2} \log(\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2

MLE solutions (derived by taking partial derivatives):

\hat{\mu}_{\text{MLE}} = \frac{1}{n} \sum_{i=1}^{n} x_i \quad \text{(the sample mean)}

\hat{\sigma}^2_{\text{MLE}} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{\mu})^2 \quad \text{(the uncorrected sample variance)}

Note: The MLE variance divides by $n$, not $n-1$. This makes it slightly biased: its expected value is $\frac{n-1}{n}\sigma^2$, so on average it underestimates the true variance. The unbiased version divides by $n-1$, which is Bessel’s correction.
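The difference between the two denominators is easy to see on a small sample. The sketch below uses made-up numbers and only the Python standard library; `statistics.pvariance` and `statistics.variance` are included as a cross-check of the hand-written formulas.

```python
import statistics

# Hypothetical sample of 8 observations.
x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(x)

# MLE of the mean: the sample mean.
mu_hat = sum(x) / n

# Sum of squared deviations from the estimated mean.
ss = sum((xi - mu_hat) ** 2 for xi in x)

var_mle = ss / n             # MLE: divides by n
var_unbiased = ss / (n - 1)  # Bessel's correction: divides by n - 1

print(mu_hat, var_mle)  # 5.0 4.0
print(var_unbiased)     # about 4.571

# Cross-check against the standard library.
print(statistics.pvariance(x), statistics.variance(x))
```

The gap between the two estimates shrinks as $n$ grows, since they differ only by the factor $n/(n-1)$.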

Example 3: Linear Regression

In linear regression, we model $y = X\mathbf{w} + \epsilon$, where each component of the noise $\epsilon$ is drawn independently from $\mathcal{N}(0, \sigma^2)$.

The log-likelihood is:

\log L(\mathbf{w}) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \mathbf{x}_i^\top \mathbf{w})^2

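The first term does not depend on $\mathbf{w}$, and the second is a negative multiple of the squared error, so maximizing the log-likelihood over $\mathbf{w}$ is equivalent to minimizing the sum of squared errors: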

\hat{\mathbf{w}}_{\text{MLE}} = \arg\min_{\mathbf{w}} \sum_{i=1}^{n} (y_i - \mathbf{x}_i^\top \mathbf{w})^2

This is why ordinary least squares and MLE give the same answer for linear regression with Gaussian noise.
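The equivalence can be checked numerically on a toy problem. The sketch below assumes a single feature plus an intercept (the intercept plays the role of a constant column in $X$), fixes $\sigma^2 = 1$ for convenience, and uses made-up data; the function name `gaussian_log_likelihood` is illustrative.

```python
import math

# Hypothetical data, roughly y = 1 + 2x plus noise.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]
n = len(xs)

# Ordinary least squares for one feature plus an intercept (closed form).
x_bar = sum(xs) / n
y_bar = sum(ys) / n
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

def gaussian_log_likelihood(w0, w1, sigma2=1.0):
    """Log-likelihood of the data under y = w0 + w1*x + noise, noise ~ N(0, sigma2)."""
    sse = sum((y - (w0 + w1 * x)) ** 2 for x, y in zip(xs, ys))
    return -0.5 * n * math.log(2 * math.pi * sigma2) - sse / (2 * sigma2)

best = gaussian_log_likelihood(intercept, slope)

# Nudging the OLS weights in any direction should lower the log-likelihood.
perturbed = [
    gaussian_log_likelihood(intercept + d0, slope + d1)
    for d0 in (-0.1, 0.0, 0.1)
    for d1 in (-0.1, 0.0, 0.1)
    if (d0, d1) != (0.0, 0.0)
]

print(round(slope, 2), round(intercept, 2))  # 1.99 1.04
print(all(ll < best for ll in perturbed))    # True
```

The value of $\sigma^2$ rescales the log-likelihood but does not move its maximizer in $\mathbf{w}$, which is why it can be fixed arbitrarily here.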

Connection to the Exponential Family

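For exponential family distributions, MLE reduces to moment matching: set the expected sufficient statistics equal to the observed sufficient statistics. Because the log-partition function $A(\boldsymbol{\eta})$ is convex, the log-likelihood is concave in the natural parameters $\boldsymbol{\eta}$, so any stationary point is a global maximum; when $A$ is strictly convex (a minimal representation), the MLE is unique whenever it exists.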
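The moment-matching condition follows in two lines. The sketch below assumes the standard exponential-family form with base measure $h(x)$ and sufficient statistic $T(x)$; those two symbols are introduced here for illustration and do not appear elsewhere in this article.

```latex
% Assumed density: p(x | eta) = h(x) exp( eta^T T(x) - A(eta) )
\log L(\boldsymbol{\eta})
  = \sum_{i=1}^{n} \log h(x_i)
  + \boldsymbol{\eta}^\top \sum_{i=1}^{n} T(x_i)
  - n \, A(\boldsymbol{\eta})

% Setting the gradient to zero and using \nabla A(eta) = E_eta[T(X)]:
\nabla_{\boldsymbol{\eta}} \log L = 0
\;\Longrightarrow\;
\mathbb{E}_{\hat{\boldsymbol{\eta}}}\!\left[ T(X) \right]
  = \nabla A(\hat{\boldsymbol{\eta}})
  = \frac{1}{n} \sum_{i=1}^{n} T(x_i)
```

The Bernoulli example is one instance: the sufficient statistic is the outcome itself, so matching its expectation $p$ to its observed average gives $\hat{p} = k/n$.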

Properties of MLE

MLE has several desirable theoretical properties:

Consistency

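As $n \to \infty$, the MLE converges in probability to the true parameter value, provided the model is correctly specified and identifiable and standard regularity conditions hold. More data means better estimates. The Law of Large Numbers is the key ingredient of the proof: the average log-likelihood converges to its expectation, which is maximized at the true parameter.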

Asymptotic Normality

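For large $n$, the MLE is approximately normally distributed around the true parameter, with variance equal to the inverse of the per-observation Fisher information divided by $n$. Under regularity conditions, this follows from applying the Central Limit Theorem to the derivative of the log-likelihood, and it lets us build approximate confidence intervals.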
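For the coin-flip model, this yields the familiar normal-approximation (Wald) interval, with standard error $\sqrt{\hat{p}(1-\hat{p})/n}$. The sketch below uses made-up counts and the rounded critical value 1.96 for a nominal 95% interval; it is an illustration, not a recommendation of the Wald interval over alternatives.

```python
import math

# Hypothetical data: 70 heads in 100 flips.
k, n = 70, 100
p_hat = k / n  # MLE of p

# Plug-in standard error from the asymptotic variance p(1 - p) / n.
se = math.sqrt(p_hat * (1 - p_hat) / n)

# Approximate 97.5th percentile of the standard normal distribution.
z = 1.96

lower = p_hat - z * se
upper = p_hat + z * se
print(round(lower, 3), round(upper, 3))  # 0.61 0.79
```

Because the interval rests on a large-sample approximation, it can be inaccurate for small $n$ or for $\hat{p}$ near 0 or 1.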

Efficiency

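Under standard regularity conditions, the MLE is asymptotically efficient: as $n \to \infty$, its variance approaches the Cramér-Rao lower bound, which is the smallest variance any unbiased estimator can achieve. In this large-sample sense, no well-behaved estimator does better, although for finite $n$ other estimators can have lower error.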

Invariance

If $\hat{\theta}_{\text{MLE}}$ is the MLE of $\theta$, then $g(\hat{\theta}_{\text{MLE}})$ is the MLE of $g(\theta)$ for any function $g$. Want the MLE of $\sigma$? Just take the square root of the MLE of $\sigma^2$.
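The $\sigma$ example can be verified directly: maximize the Gaussian log-likelihood over $\sigma$ itself and compare with the square root of the variance MLE. This is a minimal sketch on made-up data, with the mean fixed at its MLE and a simple grid search standing in for a proper optimizer.

```python
import math

# Hypothetical sample of 8 observations.
x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(x)
mu_hat = sum(x) / n
ss = sum((xi - mu_hat) ** 2 for xi in x)

# Invariance: MLE of sigma is the square root of the MLE of sigma^2.
sigma_from_invariance = math.sqrt(ss / n)

def log_likelihood_sigma(sigma):
    """Gaussian log-likelihood parameterized by sigma, with mu fixed at mu_hat."""
    return -0.5 * n * math.log(2 * math.pi) - n * math.log(sigma) - ss / (2 * sigma ** 2)

# Direct maximization over sigma on a grid from 0.50 to 5.00 in steps of 0.01.
grid = [i / 100 for i in range(50, 501)]
sigma_from_grid = max(grid, key=log_likelihood_sigma)

print(sigma_from_invariance, sigma_from_grid)  # 2.0 2.0
```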

Limitations

MLE isn’t perfect:

Overfitting: MLE can overfit with small data. It has no mechanism to prefer simpler models
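No built-in uncertainty quantification: MLE returns a single point estimate, not a distribution over parameters; uncertainty has to be added separately (for example, through large-sample confidence intervals)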
Sensitive to model specification: If the model is wrong, MLE can be misleading
Can be biased: For finite samples, MLE estimates may be systematically off (like the variance example)

These limitations motivate Bayesian approaches like MAP estimation, which we cover in the next article.
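The small-data problem is easy to reproduce with the coin-flip model. In this made-up example, three flips all land heads, so the MLE declares tails impossible.

```python
# Hypothetical tiny sample: 3 flips, all heads (1 = heads, 0 = tails).
flips = [1, 1, 1]
p_hat = sum(flips) / len(flips)
print(p_hat)  # 1.0

# Under the fitted model tails has probability zero, so a single tail
# in new data drives the likelihood of that data to zero.
new_flips = [1, 0, 1]
likelihood_new = 1.0
for f in new_flips:
    likelihood_new *= p_hat if f == 1 else (1 - p_hat)
print(likelihood_new)  # 0.0
```

The estimate fits the observed data perfectly and generalizes badly, which is the overfitting behavior listed above.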

MLE in Machine Learning

MLE appears everywhere in ML, often in disguise:

ML Method	What MLE Gives You
Linear regression	Least squares solution
Logistic regression	Cross-entropy loss minimization
Neural networks	Standard training with cross-entropy or MSE loss
Naive Bayes	Parameter estimates from counting
Gaussian Mixture Models	EM algorithm (iterative MLE)

When you minimize a loss function in ML, you’re often doing MLE (or a regularized version of it): many standard losses are negative log-likelihoods under some assumed noise or label model.
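For binary classification, the correspondence is exact: the binary cross-entropy loss is the negative Bernoulli log-likelihood of the labels, averaged over examples. The sketch below checks this on made-up labels and predicted probabilities.

```python
import math

# Hypothetical binary labels and model-predicted probabilities of class 1.
labels = [1, 0, 1, 1, 0]
preds = [0.9, 0.2, 0.7, 0.6, 0.1]
n = len(labels)

# Bernoulli log-likelihood of the labels under the predicted probabilities.
log_likelihood = sum(
    math.log(q) if y == 1 else math.log(1 - q)
    for y, q in zip(labels, preds)
)

# Binary cross-entropy as usually written: mean of -[y log q + (1 - y) log(1 - q)].
cross_entropy = -sum(
    y * math.log(q) + (1 - y) * math.log(1 - q)
    for y, q in zip(labels, preds)
) / n

print(math.isclose(cross_entropy, -log_likelihood / n))  # True
```

Minimizing the cross-entropy over model parameters therefore picks the same parameters as maximizing the likelihood.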

Summary

MLE finds parameters that maximize the probability of observed data
In practice, we maximize the log-likelihood (turns products into sums)
For common distributions, MLE gives clean, intuitive formulas
Under regularity conditions, MLE is consistent, asymptotically normal, and asymptotically efficient
Minimizing squared error (regression) and cross-entropy (classification) are both MLE
Main weakness: prone to overfitting, gives point estimates only
Next: MAP estimation adds prior knowledge to overcome MLE’s limitations

References

Fisher, R. A. (1922). “On the Mathematical Foundations of Theoretical Statistics.” Philosophical Transactions of the Royal Society A, 222(594-604), 309–368.
Casella, G., & Berger, R. L. (2002). Statistical Inference (2nd ed.). Duxbury/Thomson Learning.
Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. Chapter 2.
Murphy, K. P. (2022). Probabilistic Machine Learning: An Introduction. MIT Press. Chapter 4.
Myung, I. J. (2003). “Tutorial on Maximum Likelihood Estimation.” Journal of Mathematical Psychology, 47(1), 90–100.