Maximum Likelihood Estimation

How to find the best parameters for a model by maximizing the probability of observed data.

Probability & Statistics · February 18, 2026 · 6 min read

The Big Idea

Given some data and a model with unknown parameters, Maximum Likelihood Estimation (MLE) finds the parameter values that make the observed data most probable.

It answers a simple question: of all possible parameter values, which ones assign the highest probability to the data I actually observed?

$$\hat{\theta}_{\text{MLE}} = \arg\max_{\theta} \, P(\text{data} \mid \theta)$$

MLE is arguably the most widely used estimation method in all of statistics and machine learning. It builds directly on the probability foundations and distributions we covered previously.

The Likelihood Function

The likelihood function $L(\theta)$ is just the probability of the data (or the probability density, for continuous data) viewed as a function of the parameters:

$$L(\theta) = P(X_1, X_2, \ldots, X_n \mid \theta)$$

Key distinction: Probability is a function of outcomes (with fixed parameters). Likelihood is a function of parameters (with fixed data). Same formula, different perspective.

If the data points are independent and identically distributed (i.i.d.), the likelihood factors into a product:

$$L(\theta) = \prod_{i=1}^{n} P(X_i \mid \theta)$$

Log-Likelihood: A Practical Trick

Products are numerically unstable (they can underflow to zero) and hard to differentiate. Taking the logarithm converts products into sums:

$$\log L(\theta) = \sum_{i=1}^{n} \log P(X_i \mid \theta)$$

Since $\log$ is monotonically increasing, maximizing the log-likelihood gives the same answer as maximizing the likelihood. This is the form used in practice.
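
To see the underflow problem concretely, here is a minimal sketch in plain Python. The setup (2000 observations, each with probability 0.5 under the model) is a made-up example, not data from this article.

```python
import math

# Hypothetical setup: 2000 i.i.d. observations, each with probability 0.5
# under the model (for example, flips scored against p = 0.5).
probs = [0.5] * 2000

# Direct product: 0.5**2000 is far below the smallest positive float,
# so the running product underflows to exactly 0.0.
likelihood = 1.0
for p in probs:
    likelihood *= p

# Sum of logs: the same quantity on a log scale, easily representable.
log_likelihood = sum(math.log(p) for p in probs)

print(likelihood)       # 0.0
print(log_likelihood)   # roughly -1386.29
```

The product carries no usable information once it hits zero, while the log-likelihood can still be compared across parameter values.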

MLE Recipe

  1. Write down the likelihood $P(\text{data} \mid \theta)$ for your model
  2. Take the log to get the log-likelihood
  3. Differentiate with respect to $\theta$ and set equal to zero
  4. Solve for $\theta$, and check that the solution is a maximum rather than a minimum or saddle point

Let’s apply this to three concrete examples.

Example 1: Bernoulli (Coin Flips)

Setup: Flip a coin $n$ times, observe $k$ heads. What’s the MLE of the probability of heads $p$?

Likelihood:

$$L(p) = p^{k} (1 - p)^{n - k}$$

Log-likelihood:

$$\log L(p) = k \log(p) + (n - k) \log(1 - p)$$

Differentiate and set to zero:

$$\frac{d}{dp} \log L = \frac{k}{p} - \frac{n - k}{1 - p} = 0$$

Solve:

$$\hat{p}_{\text{MLE}} = \frac{k}{n}$$

The MLE for a coin’s bias is simply the fraction of heads observed. Intuitive and elegant.
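
As a sanity check on the closed form, the sketch below compares $k/n$ against a brute-force search over a grid of candidate values. The data (7 heads in 10 flips) and the function name `bernoulli_log_likelihood` are illustrative assumptions.

```python
import math

def bernoulli_log_likelihood(p, k, n):
    """Log-likelihood of k heads in n flips for heads-probability p (0 < p < 1)."""
    return k * math.log(p) + (n - k) * math.log(1 - p)

# Hypothetical data: 7 heads in 10 flips.
k, n = 7, 10
p_hat = k / n  # closed-form MLE

# Brute-force check on a grid of candidate values 0.001, 0.002, ..., 0.999.
grid = [i / 1000 for i in range(1, 1000)]
p_grid = max(grid, key=lambda p: bernoulli_log_likelihood(p, k, n))

print(p_hat, p_grid)
```

Both approaches should land on the same value, up to the resolution of the grid.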

Example 2: Gaussian (Normal Distribution)

Setup: Given $n$ observations from a normal distribution, estimate the mean $\mu$ and variance $\sigma^2$.

Log-likelihood:

$$\log L(\mu, \sigma^2) = -\frac{n}{2} \log(2\pi) - \frac{n}{2} \log(\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2$$

MLE solutions (derived by taking partial derivatives):

$$\hat{\mu}_{\text{MLE}} = \frac{1}{n} \sum_{i=1}^{n} x_i \quad \text{(the sample mean)}$$

$$\hat{\sigma}^2_{\text{MLE}} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{\mu})^2 \quad \text{(the sample variance)}$$
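
For readers who want the intermediate step, the partial derivatives of the Gaussian log-likelihood work out as follows (treating $\sigma^2$ as a single variable, and using the same symbols as the log-likelihood above):

$$
\begin{aligned}
\frac{\partial}{\partial \mu} \log L &= \frac{1}{\sigma^2} \sum_{i=1}^{n} (x_i - \mu) = 0
\;\Rightarrow\; \hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i \\
\frac{\partial}{\partial \sigma^2} \log L &= -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{i=1}^{n} (x_i - \mu)^2 = 0
\;\Rightarrow\; \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{\mu})^2
\end{aligned}
$$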

Note: The MLE variance divides by $n$, not $n-1$. This makes it slightly biased: on average it underestimates the true variance by a factor of $(n-1)/n$. The unbiased version divides by $n-1$, which is Bessel’s correction.
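
The two conventions are easy to compare side by side. This is a small sketch on a made-up sample; it relies on Python's standard `statistics` module, where `pvariance` divides by $n$ and `variance` divides by $n-1$.

```python
import statistics

# Hypothetical sample, for illustration only.
x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(x)

mu_hat = sum(x) / n                                 # MLE of the mean
var_mle = sum((xi - mu_hat) ** 2 for xi in x) / n   # MLE variance: divide by n
var_unbiased = var_mle * n / (n - 1)                # Bessel's correction

# The standard library exposes both conventions.
assert abs(var_mle - statistics.pvariance(x)) < 1e-12
assert abs(var_unbiased - statistics.variance(x)) < 1e-12

print(mu_hat, var_mle, var_unbiased)
```

With only 8 points the gap between the two estimates is noticeable; it shrinks as $n$ grows.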

Example 3: Linear Regression

In linear regression, we model $y = X\mathbf{w} + \epsilon$, where each component of the noise is drawn independently: $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$.

The log-likelihood is:

$$\log L(\mathbf{w}) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \mathbf{x}_i^\top \mathbf{w})^2$$

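The first term does not depend on $\mathbf{w}$, and the second is a negative multiple of the sum of squared errors. Maximizing the log-likelihood over $\mathbf{w}$ is therefore equivalent to minimizing the sum of squared errors: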

$$\hat{\mathbf{w}}_{\text{MLE}} = \arg\min_{\mathbf{w}} \sum_{i=1}^{n} (y_i - \mathbf{x}_i^\top \mathbf{w})^2$$

This is why ordinary least squares and MLE give the same answer for linear regression with Gaussian noise.
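
A quick numerical check of that equivalence, as a sketch: fit a one-feature model with an intercept by the textbook least-squares formulas, then confirm that perturbing the weights lowers the Gaussian log-likelihood. The data, the fixed `sigma2=1.0`, and the helper name `gaussian_log_likelihood` are assumptions made for illustration.

```python
import math

# Hypothetical 1-D data with an intercept: y is roughly 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]
n = len(xs)

# Closed-form least-squares fit for y = w0 + w1 * x.
x_bar = sum(xs) / n
y_bar = sum(ys) / n
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
w1 = s_xy / s_xx
w0 = y_bar - w1 * x_bar

def gaussian_log_likelihood(w0, w1, sigma2=1.0):
    """Log-likelihood of the data under y = w0 + w1*x + N(0, sigma2) noise."""
    sse = sum((y - (w0 + w1 * x)) ** 2 for x, y in zip(xs, ys))
    return -0.5 * n * math.log(2 * math.pi * sigma2) - sse / (2 * sigma2)

# Nudging the least-squares weights in each of these directions
# lowers the log-likelihood.
best = gaussian_log_likelihood(w0, w1)
for dw0, dw1 in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.1)]:
    assert gaussian_log_likelihood(w0 + dw0, w1 + dw1) < best

print(w0, w1)
```

The value chosen for `sigma2` changes the height of the log-likelihood but not the location of its maximum in the weights.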

Connection to the Exponential Family

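For exponential family distributions, MLE reduces to moment matching: set the expected sufficient statistics equal to the sample average of the observed sufficient statistics. The convexity of the log-partition function $A(\boldsymbol{\eta})$ makes the log-likelihood concave in the natural parameters, so any stationary point is a global maximum. When $A$ is strictly convex (a minimal representation), the MLE is unique whenever it exists.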
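
The moment-matching condition falls out in a few lines. The sketch below assumes the standard natural-parameter form, where $h(x)$ is the base measure and $T(x)$ the sufficient statistic; these two symbols are not defined elsewhere in this article.

$$
\begin{aligned}
p(x \mid \boldsymbol{\eta}) &= h(x) \exp\!\big(\boldsymbol{\eta}^\top T(x) - A(\boldsymbol{\eta})\big) \\
\log L(\boldsymbol{\eta}) &= \sum_{i=1}^{n} \log h(x_i) + \boldsymbol{\eta}^\top \sum_{i=1}^{n} T(x_i) - n A(\boldsymbol{\eta}) \\
\nabla_{\boldsymbol{\eta}} \log L &= \sum_{i=1}^{n} T(x_i) - n \nabla A(\boldsymbol{\eta}) = 0
\;\Rightarrow\; \nabla A(\hat{\boldsymbol{\eta}}) = \mathbb{E}_{\hat{\boldsymbol{\eta}}}[T(X)] = \frac{1}{n} \sum_{i=1}^{n} T(x_i)
\end{aligned}
$$

The last equality uses the fact that the gradient of the log-partition function equals the expected sufficient statistic.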

Properties of MLE

MLE has several desirable theoretical properties:

Consistency

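As $n \to \infty$, the MLE converges in probability to the true parameter value, provided the model is correctly specified and standard regularity conditions hold. More data means better estimates. The Law of Large Numbers is the key ingredient in the proof.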
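
Consistency is easy to watch in a simulation. This sketch draws coin flips with a made-up true parameter of 0.3 and prints the Bernoulli MLE for growing sample sizes; the helper name `bernoulli_mle` and the seed are arbitrary choices.

```python
import random

def bernoulli_mle(n, p_true, rng):
    """Simulate n flips with heads-probability p_true and return k/n."""
    heads = sum(1 for _ in range(n) if rng.random() < p_true)
    return heads / n

rng = random.Random(0)  # fixed seed so the run is reproducible
p_true = 0.3            # hypothetical "true" parameter

for n in (10, 100, 1_000, 10_000, 100_000):
    p_hat = bernoulli_mle(n, p_true, rng)
    print(f"n={n:>6}  p_hat={p_hat:.4f}  error={abs(p_hat - p_true):.4f}")
```

The error will not shrink monotonically on every run, but it tends toward zero as $n$ increases.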

Asymptotic Normality

For large $n$, the MLE is approximately normally distributed around the true parameter, with variance given by the inverse Fisher information. The result follows from the Central Limit Theorem under regularity conditions. This lets us build confidence intervals.
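
One common way to turn this into an interval is the Wald construction, sketched here for the coin-flip case. It plugs the MLE into the large-sample variance $p(1-p)/n$; the function name `wald_interval` and the data (60 heads in 100 flips) are illustrative, and the approximation is known to be poor when $n$ is small or $\hat{p}$ is near 0 or 1.

```python
import math

def wald_interval(k, n, z=1.96):
    """Approximate 95% interval for a Bernoulli p from k heads in n flips.

    Uses the large-sample normal approximation
    p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n).
    """
    p_hat = k / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se

low, high = wald_interval(k=60, n=100)  # hypothetical data
print(f"{low:.3f} to {high:.3f}")       # 0.504 to 0.696
```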

Efficiency

Under regularity conditions, the MLE is asymptotically efficient: as $n \to \infty$, its variance attains the Cramér-Rao lower bound, the smallest variance any unbiased estimator can have. In that asymptotic sense, no regular estimator does better.

Invariance

If $\hat{\theta}_{\text{MLE}}$ is the MLE of $\theta$, then $g(\hat{\theta}_{\text{MLE}})$ is the MLE of $g(\theta)$ for any function $g$. Want the MLE of $\sigma$? Just take the square root of the MLE of $\sigma^2$.
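
The square-root example can be checked numerically. In this sketch, on a made-up sample, the standard deviation obtained through invariance is compared with a direct grid search over $\sigma$ of the Gaussian log-likelihood; the helper name `log_likelihood_sigma` is an assumption.

```python
import math

# Hypothetical sample, for illustration only.
x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(x)
mu_hat = sum(x) / n
var_mle = sum((xi - mu_hat) ** 2 for xi in x) / n
sigma_from_invariance = math.sqrt(var_mle)

def log_likelihood_sigma(sigma):
    """Gaussian log-likelihood as a function of sigma, with mu fixed at mu_hat."""
    sse = sum((xi - mu_hat) ** 2 for xi in x)
    return (-n * math.log(sigma)
            - 0.5 * n * math.log(2 * math.pi)
            - sse / (2 * sigma ** 2))

# Maximize directly over sigma on a grid 0.01, 0.02, ..., 10.00.
grid = [i / 100 for i in range(1, 1001)]
sigma_direct = max(grid, key=log_likelihood_sigma)

print(sigma_from_invariance, sigma_direct)
```

Both routes should agree up to the grid resolution, without re-deriving anything for $\sigma$.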

Limitations

MLE isn’t perfect:

  • Overfitting: MLE can overfit with small data. It has no mechanism to prefer simpler models
  • No built-in uncertainty quantification: MLE returns a point estimate, not a distribution over parameters, so any uncertainty statement has to come from the asymptotic approximations above
  • Sensitive to model specification: If the model is wrong, MLE can be misleading
  • Can be biased: For finite samples, MLE estimates may be systematically off (like the variance example)
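
The overfitting point is easiest to see with a tiny sample. In this made-up example, three flips that all land heads lead the MLE to conclude that tails is impossible.

```python
# Hypothetical tiny sample: three flips, all heads (1 = heads, 0 = tails).
flips = [1, 1, 1]

p_hat = sum(flips) / len(flips)   # MLE = k / n
prob_tails_next = 1 - p_hat

print(p_hat, prob_tails_next)     # 1.0 0.0
```

Nothing in the estimator pulls the answer back toward a more plausible value; it simply reports what maximizes the probability of the three flips it saw.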

These limitations motivate Bayesian approaches like MAP estimation, which we cover in the next article.

MLE in Machine Learning

MLE appears everywhere in ML, often in disguise:

| ML Method | What MLE Gives You |
| --- | --- |
| Linear regression | Least squares solution |
| Logistic regression | Cross-entropy loss minimization |
| Neural networks | Standard training with cross-entropy or MSE loss |
| Naive Bayes | Parameter estimates from counting |
| Gaussian Mixture Models | EM algorithm (iterative MLE) |

When you minimize a loss function in ML, you’re almost always doing MLE (or a regularized version of it).
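
For the classification case, the link is direct: binary cross-entropy is the negative Bernoulli log-likelihood, averaged over examples. A small sketch with made-up labels and predicted probabilities:

```python
import math

# Hypothetical binary labels and predicted probabilities of class 1.
labels = [1, 0, 1, 1]
preds = [0.9, 0.2, 0.7, 0.6]
n = len(labels)

# Bernoulli log-likelihood of the labels under the predictions.
log_lik = sum(math.log(q if y == 1 else 1 - q) for y, q in zip(labels, preds))

# Binary cross-entropy loss, averaged over examples.
bce = -sum(y * math.log(q) + (1 - y) * math.log(1 - q)
           for y, q in zip(labels, preds)) / n

# Cross-entropy is the negative log-likelihood divided by n,
# so minimizing one maximizes the other.
assert abs(bce - (-log_lik / n)) < 1e-12

print(bce)
```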

Summary

  • MLE finds parameters that maximize the probability of observed data
  • In practice, we maximize the log-likelihood (turns products into sums)
  • For common distributions, MLE gives clean, intuitive formulas
  • Under regularity conditions, MLE is consistent, asymptotically efficient, and asymptotically normal
  • Minimizing squared error (regression) and cross-entropy (classification) are both MLE
  • Main weakness: prone to overfitting, gives point estimates only
  • Next: MAP estimation adds prior knowledge to overcome MLE’s limitations

References

  • Fisher, R. A. (1922). “On the Mathematical Foundations of Theoretical Statistics.” Philosophical Transactions of the Royal Society A, 222(594-604), 309-368.
  • Casella, G., & Berger, R. L. (2002). Statistical Inference (2nd ed.). Duxbury/Thomson Learning.
  • Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. Chapter 2.
  • Murphy, K. P. (2022). Probabilistic Machine Learning: An Introduction. MIT Press. Chapter 4.
  • Myung, I. J. (2003). “Tutorial on Maximum Likelihood Estimation.” Journal of Mathematical Psychology, 47(1), 90-100.
