One Family to Rule Them All
The Bernoulli, Gaussian, Poisson, Exponential, Beta, and Gamma distributions from the previous article might seem like a disconnected collection. But they all share a common mathematical structure — they are all members of the exponential family.
Understanding this family unifies seemingly separate concepts: sufficient statistics, conjugate priors, maximum likelihood, and generalized linear models all become special cases of a single framework.
Definition
A distribution belongs to the exponential family if its PDF or PMF can be written as:
$$p(x \mid \eta) = h(x) \exp\left(\eta^\top T(x) - A(\eta)\right)$$
where:
- $\eta$: the natural (canonical) parameters
- $T(x)$: the sufficient statistics
- $A(\eta)$: the log-partition function (ensures the distribution normalizes to 1)
- $h(x)$: the base measure (does not depend on parameters)
This looks abstract, but it becomes concrete with examples.
Examples
Bernoulli
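The Bernoulli($\mu$) distribution: $p(x \mid \mu) = \mu^x (1-\mu)^{1-x}$ for $x \in \{0, 1\}$.
Rewriting:
$$p(x \mid \mu) = \exp\left(x \log\frac{\mu}{1-\mu} + \log(1-\mu)\right)$$
| Component | Value |
|---|---|
| Natural parameter | $\eta = \log\frac{\mu}{1-\mu}$ (the log-odds) |
| Sufficient statistic | $T(x) = x$ |
| Log-partition | $A(\eta) = \log(1 + e^\eta)$ |
| Base measure | $h(x) = 1$ |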
The natural parameter is the log-odds — the same quantity that logistic regression models directly.
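A quick numerical check can make the rewriting concrete. The sketch below assumes the success probability is called `mu` and uses the standard Bernoulli components $T(x) = x$, $A(\eta) = \log(1 + e^\eta)$, $h(x) = 1$; the function names are illustrative only.

```python
import math

def bernoulli_pmf(x, mu):
    """Standard form: mu^x * (1 - mu)^(1 - x) for x in {0, 1}."""
    return mu**x * (1.0 - mu)**(1 - x)

def bernoulli_expfam(x, eta):
    """Exponential family form h(x) * exp(eta * T(x) - A(eta)), with h(x) = 1 and T(x) = x."""
    log_partition = math.log1p(math.exp(eta))  # A(eta) = log(1 + e^eta)
    return math.exp(eta * x - log_partition)

mu = 0.3
eta = math.log(mu / (1.0 - mu))  # natural parameter = log-odds

for x in (0, 1):
    # The two columns should agree up to floating-point error.
    print(x, bernoulli_pmf(x, mu), bernoulli_expfam(x, eta))
```

Both parameterizations describe the same distribution; only the coordinate used for the parameter changes.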
Gaussian (Known Variance)
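For $\mathcal{N}(\mu, \sigma^2)$ with known $\sigma^2$:
| Component | Value |
|---|---|
| Natural parameter | $\eta = \mu / \sigma^2$ |
| Sufficient statistic | $T(x) = x$ |
| Log-partition | $A(\eta) = \sigma^2 \eta^2 / 2 = \mu^2 / (2\sigma^2)$ |
| Base measure | $h(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{x^2}{2\sigma^2}\right)$ |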
Gaussian (Unknown Mean and Variance)
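When both parameters are unknown, the exponential family form uses a 2-dimensional natural parameter:
$$\eta = \begin{pmatrix} \mu / \sigma^2 \\ -1 / (2\sigma^2) \end{pmatrix}, \qquad T(x) = \begin{pmatrix} x \\ x^2 \end{pmatrix}$$
The sufficient statistics are $x$ and $x^2$, which is why the sample mean and sample variance together capture all information about $(\mu, \sigma^2)$.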
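The change of coordinates can be sketched in a few lines. This assumes the common parameterization $\eta = (\mu/\sigma^2,\; -1/(2\sigma^2))$ with $T(x) = (x, x^2)$; the helper names and the sample data are made up for illustration.

```python
def to_natural(mu, sigma2):
    """Map (mean, variance) to the natural parameters (eta1, eta2)."""
    return mu / sigma2, -1.0 / (2.0 * sigma2)

def from_natural(eta1, eta2):
    """Invert the map; eta2 must be negative for a valid variance."""
    sigma2 = -1.0 / (2.0 * eta2)
    return eta1 * sigma2, sigma2

def mean_variance_from_sums(n, sum_x, sum_x2):
    """Recover the sample mean and (biased) variance from the totals of T(x) = (x, x^2)."""
    mean = sum_x / n
    return mean, sum_x2 / n - mean**2

data = [1.0, 2.0, 4.0, 5.0]
n, sum_x, sum_x2 = len(data), sum(data), sum(v * v for v in data)

mean, variance = mean_variance_from_sums(n, sum_x, sum_x2)
eta1, eta2 = to_natural(mean, variance)
print(mean, variance, eta1, eta2, from_natural(eta1, eta2))
```

Only the three numbers `n`, `sum_x`, and `sum_x2` are needed; the raw observations can be discarded once those totals are stored.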
Poisson
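For Poisson($\lambda$): $p(x \mid \lambda) = \frac{\lambda^x e^{-\lambda}}{x!} = \frac{1}{x!} \exp\left(x \log\lambda - \lambda\right)$
| Component | Value |
|---|---|
| Natural parameter | $\eta = \log\lambda$ |
| Sufficient statistic | $T(x) = x$ |
| Log-partition | $A(\eta) = e^\eta = \lambda$ |
| Base measure | $h(x) = 1 / x!$ |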
Summary of Members
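| Distribution | $\eta$ | $T(x)$ | $A(\eta)$ |
|---|---|---|---|
| Bernoulli($\mu$) | $\log\frac{\mu}{1-\mu}$ | $x$ | $\log(1 + e^\eta)$ |
| Poisson($\lambda$) | $\log\lambda$ | $x$ | $e^\eta$ |
| Exponential($\lambda$) | $-\lambda$ | $x$ | $-\log(-\eta)$ |
| Gaussian($\mu$, known $\sigma^2$) | $\mu / \sigma^2$ | $x$ | $\sigma^2 \eta^2 / 2$ |
| Beta($\alpha, \beta$) | $(\alpha - 1, \beta - 1)$ | $(\log x, \log(1 - x))$ | $\log\Gamma(\alpha) + \log\Gamma(\beta) - \log\Gamma(\alpha + \beta)$ |
| Gamma($\alpha$, rate $\beta$) | $(\alpha - 1, -\beta)$ | $(\log x, x)$ | $\log\Gamma(\alpha) - \alpha \log\beta$ |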
Sufficient Statistics
The sufficient statistic $T(x)$ captures everything the data tells us about the parameter $\theta$. No information is lost by reducing the full dataset to its sufficient statistics.
Fisher-Neyman Factorization Theorem
$T(x)$ is sufficient for $\theta$ if and only if the likelihood can be factored as:
$$p(x \mid \theta) = g(T(x), \theta)\, h(x)$$
where $g$ depends on the data only through $T(x)$, and $h$ doesn’t depend on $\theta$.
Practical Implication
For $n$ i.i.d. observations $x_1, \dots, x_n$ from an exponential family, the total sufficient statistic is:
$$T(x_1, \dots, x_n) = \sum_{i=1}^n T(x_i)$$
This means:
- For Bernoulli: $\sum_i x_i$ (number of successes) is sufficient; you don’t need the raw data
- For Gaussian: $\left(\sum_i x_i, \sum_i x_i^2\right)$ is sufficient; sample mean and sample variance capture everything
- For Poisson: $\sum_i x_i$ (total count) is sufficient
Key insight: This is why MLE solutions for exponential family distributions always involve simple functions of sufficient statistics. The entire dataset compresses into a fixed-dimensional summary without losing any information about the parameters.
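To see the compression at work, the sketch below compares the Bernoulli log-likelihood computed from raw data with the one computed from the summary (sample size and number of successes) alone. The two datasets are invented for illustration and differ only in ordering.

```python
import math

def bernoulli_log_likelihood(data, mu):
    """Log-likelihood from the raw 0/1 observations."""
    return sum(x * math.log(mu) + (1 - x) * math.log(1.0 - mu) for x in data)

def log_likelihood_from_summary(n, successes, mu):
    """Same quantity, using only n and the sufficient statistic sum(x_i)."""
    return successes * math.log(mu) + (n - successes) * math.log(1.0 - mu)

data_a = [1, 0, 0, 1, 1, 0, 1, 0]
data_b = [0, 0, 0, 0, 1, 1, 1, 1]  # different order, same number of successes

for mu in (0.2, 0.5, 0.8):
    full_a = bernoulli_log_likelihood(data_a, mu)
    full_b = bernoulli_log_likelihood(data_b, mu)
    summary = log_likelihood_from_summary(len(data_a), sum(data_a), mu)
    # All three values agree for every mu, up to floating-point error.
    print(mu, full_a, full_b, summary)
```

Because the likelihood is identical for every value of the parameter, any inference about it is the same whether it starts from the raw data or from the summary.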
The Log-Partition Function
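The log-partition function $A(\eta)$ is far more than a normalizing constant. Its derivatives generate the moments of the sufficient statistic:
First Derivative = Mean
$$\nabla_\eta A(\eta) = \mathbb{E}[T(x)]$$
Second Derivative = Variance
$$\nabla^2_\eta A(\eta) = \mathrm{Cov}[T(x)]$$
Since covariance matrices are positive semi-definite, $A(\eta)$ is always convex, so the log-likelihood is concave in $\eta$ and any stationary point is a global maximum. The maximizer is unique when the representation is minimal (no redundant sufficient statistics), which makes $A$ strictly convex. It can still fail to exist at the boundary, for example with Bernoulli data consisting only of zeros.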
Example: Bernoulli
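For Bernoulli, $A(\eta) = \log(1 + e^\eta)$, so:
$$A'(\eta) = \frac{e^\eta}{1 + e^\eta} = \sigma(\eta) = \mu, \qquad A''(\eta) = \sigma(\eta)\left(1 - \sigma(\eta)\right) = \mu(1 - \mu)$$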
The sigmoid function emerges naturally from the Bernoulli log-partition function — this is the deep reason why logistic regression uses the sigmoid.
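The moment-generating property can be checked numerically. The sketch below differentiates the Bernoulli log-partition function $A(\eta) = \log(1 + e^\eta)$ with central finite differences and compares the results with the sigmoid and with the Bernoulli variance. The step sizes and the test point `eta = 0.7` are arbitrary choices.

```python
import math

def log_partition(eta):
    """Bernoulli log-partition function A(eta) = log(1 + e^eta)."""
    return math.log1p(math.exp(eta))

def sigmoid(eta):
    return 1.0 / (1.0 + math.exp(-eta))

def first_derivative(f, x, step=1e-5):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + step) - f(x - step)) / (2.0 * step)

def second_derivative(f, x, step=1e-4):
    """Central finite-difference approximation of f''(x)."""
    return (f(x + step) - 2.0 * f(x) + f(x - step)) / step**2

eta = 0.7
mean_numeric = first_derivative(log_partition, eta)
variance_numeric = second_derivative(log_partition, eta)
mean_exact = sigmoid(eta)                        # E[x] = mu
variance_exact = mean_exact * (1.0 - mean_exact)  # Var[x] = mu * (1 - mu)

print(mean_numeric, mean_exact)
print(variance_numeric, variance_exact)
```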
MLE for the Exponential Family
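For $n$ i.i.d. observations $x_1, \dots, x_n$, the MLE has an elegant form. The log-likelihood is:
$$\ell(\eta) = \eta^\top \sum_{i=1}^n T(x_i) - n A(\eta) + \sum_{i=1}^n \log h(x_i)$$
Setting the gradient to zero:
$$\nabla A(\hat{\eta}) = \mathbb{E}_{\hat{\eta}}[T(x)] = \frac{1}{n} \sum_{i=1}^n T(x_i)$$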
The MLE is found by moment matching: set the expected sufficient statistics equal to the observed sufficient statistics.
This is why MLE always produces intuitive formulas for exponential family distributions:
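- Bernoulli: $\hat{\mu} = \frac{1}{n} \sum_i x_i$ (sample mean matches population mean)
- Gaussian: $\hat{\mu} = \bar{x}$, $\hat{\sigma}^2 = \frac{1}{n} \sum_i (x_i - \bar{x})^2$
- Poisson: $\hat{\lambda} = \bar{x}$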
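Moment matching can also be carried out numerically, which is how it works when no closed form is available. The sketch below fits a Poisson model by Newton's method on the natural parameter, using $A(\eta) = e^\eta$, and recovers the sample mean. The counts, the starting point, and the iteration budget are arbitrary illustrative choices.

```python
import math

counts = [2, 0, 3, 1, 4, 2, 1, 3]
mean_stat = sum(counts) / len(counts)  # observed mean of T(x) = x

eta = 0.0  # arbitrary starting point for the natural parameter
for _ in range(50):
    gradient = mean_stat - math.exp(eta)  # (1/n) * dl/d eta = mean_stat - A'(eta)
    curvature = math.exp(eta)             # A''(eta), always positive
    eta += gradient / curvature           # Newton step on the concave objective

lambda_hat = math.exp(eta)  # back to the usual rate parameter
print(lambda_hat, mean_stat)
```

At convergence the model's expected sufficient statistic $A'(\hat{\eta})$ equals the observed average, which for the Poisson is the closed-form answer $\hat{\lambda} = \bar{x}$.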
Conjugate Priors
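A major reason the exponential family matters for Bayesian inference: every exponential family distribution has a natural conjugate prior, and it has the form:
$$p(\eta \mid \chi, \nu) \propto \exp\left(\eta^\top \chi - \nu A(\eta)\right)$$
where $\chi$ and $\nu$ are hyperparameters that can be interpreted as:
- $\chi$: “prior pseudo-observations” (total sufficient statistic from imagined prior data)
- $\nu$: “prior sample size” (how many imagined observations)
After observing $n$ data points with total sufficient statistic $\sum_{i=1}^n T(x_i)$:
$$p(\eta \mid x_1, \dots, x_n) \propto \exp\left(\eta^\top \left(\chi + \sum_{i=1}^n T(x_i)\right) - (\nu + n) A(\eta)\right)$$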
The posterior has the same form as the prior, with updated hyperparameters. This is exactly the conjugate update rule we saw for Beta-Binomial and Gaussian-Gaussian in the distributions article.
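The update amounts to adding the observed sufficient statistics and the sample size to the prior hyperparameters. The sketch below uses the hypothetical names `chi` and `nu` for those hyperparameters and compares the generic update with the familiar Beta-Bernoulli rule. The bookkeeping `a = chi`, `b = nu - chi` is an assumption of this example (it corresponds to placing the prior density on the natural parameter); other texts shift these by one.

```python
def conjugate_update(chi, nu, stats):
    """Generic update: add the total sufficient statistic and the sample size."""
    return chi + sum(stats), nu + len(stats)

def beta_bernoulli_update(a, b, data):
    """Familiar rule: add successes to a and failures to b."""
    successes = sum(data)
    return a + successes, b + len(data) - successes

data = [1, 0, 1, 1, 0, 1]  # Bernoulli observations, T(x) = x

# Prior with a = 2, b = 3, i.e. chi = 2 pseudo-successes out of nu = 5 pseudo-observations.
chi_post, nu_post = conjugate_update(2.0, 5.0, data)
a_post, b_post = beta_bernoulli_update(2.0, 3.0, data)

print(chi_post, nu_post)  # updated (chi, nu)
print(a_post, b_post)     # matches (chi_post, nu_post - chi_post)
```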
Generalized Linear Models (GLMs)
The exponential family is the foundation of Generalized Linear Models, which extend linear regression to non-Gaussian responses.
A GLM has three components:
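- Random component: The response $y$ follows an exponential family distribution
- Systematic component: A linear predictor $\eta = \mathbf{x}^\top \boldsymbol{\beta}$
- Link function: $g(\mu) = \eta$, connecting the mean $\mu = \mathbb{E}[y \mid \mathbf{x}]$ to the linear predictor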
Common GLMs
| Response Type | Distribution | Link | Name |
|---|---|---|---|
| Continuous | Gaussian | Identity: $g(\mu) = \mu$ | Linear regression |
| Binary | Bernoulli | Logit: $g(\mu) = \log\frac{\mu}{1-\mu}$ | Logistic regression |
| Count | Poisson | Log: $g(\mu) = \log\mu$ | Poisson regression |
| Positive continuous | Gamma | Inverse: $g(\mu) = 1/\mu$ | Gamma regression |
The canonical link function is $g = (A')^{-1}$, which maps the mean $\mu = A'(\eta)$ to the natural parameter $\eta$. Using the canonical link simplifies the math and guarantees concave log-likelihoods.
Key insight: Logistic regression is not an ad hoc model — it’s the natural GLM for binary data. The sigmoid (logistic) function appears because it’s the inverse of the canonical link for the Bernoulli distribution.
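With the canonical link, the gradient of the log-likelihood takes the simple form "residual times features", $\sum_i (y_i - \mu_i)\,\mathbf{x}_i$. The sketch below fits a logistic regression by plain gradient ascent on a tiny invented dataset. The data, the step size of 0.5, and the iteration count are illustrative assumptions, not a recommended fitting procedure (libraries typically use iteratively reweighted least squares or similar).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score(weights, features, labels):
    """Average gradient of the Bernoulli log-likelihood under the logit link."""
    grad = [0.0] * len(weights)
    for x, y in zip(features, labels):
        eta = sum(w * xj for w, xj in zip(weights, x))  # linear predictor
        residual = y - sigmoid(eta)                     # y - mu
        for j, xj in enumerate(x):
            grad[j] += residual * xj / len(labels)
    return grad

# First column is the intercept; the labels overlap, so the MLE is finite.
features = [[1.0, v] for v in (-2.0, -1.0, -1.0, 0.0, 0.0, 1.0, 1.0, 2.0)]
labels = [0, 0, 1, 0, 1, 0, 1, 1]

weights = [0.0, 0.0]
for _ in range(2000):
    grad = score(weights, features, labels)
    weights = [w + 0.5 * g for w, g in zip(weights, grad)]  # gradient ascent step

final_grad = score(weights, features, labels)
print(weights, final_grad)
```

Because the log-likelihood is concave under the canonical link, a vanishing gradient is enough to certify that the fitted weights are the global maximum.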
Information Geometry
The exponential family has deep connections to information theory and differential geometry.
Fisher Information
For an exponential family distribution, the Fisher information matrix equals the Hessian of the log-partition function:
$$I(\eta) = \nabla^2_\eta A(\eta) = \mathrm{Cov}[T(x)]$$
The Fisher information measures how much information a sample carries about the parameters. It determines:
- The Cramér-Rao lower bound on estimator variance
- The asymptotic variance of MLE: $\sqrt{n}\,(\hat{\eta} - \eta) \xrightarrow{d} \mathcal{N}\left(0, I(\eta)^{-1}\right)$
- The geometry of the statistical manifold (the space of distributions)
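For a one-parameter family the identity between the Fisher information and the second derivative of the log-partition function can be verified by direct enumeration. The sketch below does this for the Bernoulli distribution in its natural parameterization, where the score is $x - A'(\eta)$; the value `eta = -0.4` is arbitrary.

```python
import math

def sigmoid(eta):
    return 1.0 / (1.0 + math.exp(-eta))

def fisher_information_bernoulli(eta):
    """Expected squared score, summing over the two outcomes x in {0, 1}."""
    mean = sigmoid(eta)
    total = 0.0
    for x, prob in ((0, 1.0 - mean), (1, mean)):
        score = x - mean  # derivative of eta * x - A(eta) with respect to eta
        total += prob * score**2
    return total

eta = -0.4
fisher = fisher_information_bernoulli(eta)
hessian_of_A = sigmoid(eta) * (1.0 - sigmoid(eta))  # A''(eta) for the Bernoulli

print(fisher, hessian_of_A)
```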
KL Divergence
The KL divergence between two exponential family distributions has a simple form in terms of the log-partition function:
$$D_{\mathrm{KL}}\left(p_{\eta_1} \,\|\, p_{\eta_2}\right) = A(\eta_2) - A(\eta_1) - (\eta_2 - \eta_1)^\top \nabla A(\eta_1)$$
This is the Bregman divergence associated with $A$, a generalization of squared distance. This connection is why variational inference and natural gradient methods work efficiently for exponential family models.
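The Bregman form can be checked against a brute-force computation. The sketch below does so for two Poisson distributions, using $\eta = \log\lambda$ and $A(\eta) = e^\eta$; the rates 3 and 5 and the truncation of the sum at 200 terms are arbitrary choices made for illustration.

```python
import math

def poisson_log_pmf(x, lam):
    return x * math.log(lam) - lam - math.lgamma(x + 1)

def kl_direct(lam1, lam2, max_count=200):
    """Sum p1(x) * log(p1(x) / p2(x)) over a truncated range of counts."""
    total = 0.0
    for x in range(max_count):
        log_p1 = poisson_log_pmf(x, lam1)
        total += math.exp(log_p1) * (log_p1 - poisson_log_pmf(x, lam2))
    return total

def kl_bregman(lam1, lam2):
    """A(eta2) - A(eta1) - (eta2 - eta1) * A'(eta1) with A(eta) = e^eta."""
    eta1, eta2 = math.log(lam1), math.log(lam2)
    return math.exp(eta2) - math.exp(eta1) - (eta2 - eta1) * math.exp(eta1)

print(kl_direct(3.0, 5.0), kl_bregman(3.0, 5.0))
```

The Bregman version needs only the log-partition function and its gradient, with no sum or integral over the sample space.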
Why This Matters for Deep Learning
Even in deep learning, the exponential family appears:
- Output layers: The final layer of a neural network typically models an exponential family distribution — softmax for categorical (Multinomial), sigmoid for binary (Bernoulli), linear for continuous (Gaussian)
- Loss functions: Cross-entropy and MSE loss are negative log-likelihoods of exponential family distributions
- Natural gradient descent: Uses the Fisher information to precondition gradients, improving optimization in the space of distributions
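- Variational autoencoders: The encoder and prior are typically Gaussian, an exponential family member with a closed-form KL term; the reparameterization trick itself relies on the Gaussian's location-scale form rather than on exponential family membership in general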
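The link between loss functions and exponential family likelihoods can be made explicit for the binary case. The sketch below compares the usual binary cross-entropy, computed from the sigmoid output, with the Bernoulli negative log-likelihood written in natural-parameter form, $A(\eta) - \eta\,y$ with $A(\eta) = \log(1 + e^\eta)$. The function names and test values are illustrative.

```python
import math

def sigmoid(eta):
    return 1.0 / (1.0 + math.exp(-eta))

def binary_cross_entropy(y, prob):
    """Cross-entropy between a 0/1 label and a predicted probability."""
    return -(y * math.log(prob) + (1 - y) * math.log(1.0 - prob))

def bernoulli_nll_natural(y, eta):
    """Negative log-likelihood A(eta) - eta * T(y), where eta is the logit."""
    return math.log1p(math.exp(eta)) - eta * y

for y, logit in [(1, 2.0), (0, 2.0), (1, -1.5)]:
    # Both columns agree: the network's logit plays the role of the natural parameter.
    print(y, logit, binary_cross_entropy(y, sigmoid(logit)), bernoulli_nll_natural(y, logit))
```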
Distributions Outside the Exponential Family
Not all useful distributions belong to the exponential family:
- Uniform with unknown endpoints — the support depends on the parameters
- Student’s $t$: used in robust statistics and hypothesis testing
- Mixture distributions — mixtures of exponential family members are generally not in the family (this is what makes the EM algorithm necessary)
A distribution fails to be an exponential family member when its support depends on the parameter, or when it cannot be factored into the required form.
Summary
- The exponential family provides a unified framework for most distributions used in ML
- Its canonical form, $p(x \mid \eta) = h(x) \exp\left(\eta^\top T(x) - A(\eta)\right)$, reveals deep structure
- Sufficient statistics compress data without losing information about parameters
- The log-partition function generates moments and, being convex, makes the log-likelihood concave in the natural parameters
- Conjugate priors exist naturally for all exponential family members
- GLMs extend linear regression to any exponential family response
- Fisher information, KL divergence, and natural gradients all simplify for this family
- Neural network output layers and loss functions are grounded in exponential family distributions