The Exponential Family

Probability & Statistics Series 4 / 13

One Family to Rule Them All

The Bernoulli, Gaussian, Poisson, Exponential, Beta, and Gamma distributions from the previous article might seem like a disconnected collection. But they all share a common mathematical structure — they are all members of the exponential family.

Understanding this family unifies seemingly separate concepts: sufficient statistics, conjugate priors, maximum likelihood, and generalized linear models all become special cases of a single framework.

Definition

A distribution belongs to the exponential family if its PDF or PMF can be written as:

P(x \mid \boldsymbol{\eta}) = h(x) \exp\!\left(\boldsymbol{\eta}^\top \mathbf{T}(x) - A(\boldsymbol{\eta})\right)

where:

$\boldsymbol{\eta}$ — the natural (canonical) parameters
$\mathbf{T}(x)$ — the sufficient statistics
$A(\boldsymbol{\eta})$ — the log-partition function (ensures the distribution normalizes to 1)
$h(x)$ — the base measure (does not depend on parameters)

This looks abstract, but it becomes concrete with examples.

Examples

Bernoulli

The Bernoulli( $p$ ) distribution: $P(x) = p^x (1-p)^{1-x}$ for $x \in \{0, 1\}$ .

Rewriting:

P(x) = \exp\!\left(x \log\frac{p}{1-p} + \log(1-p)\right)

Component	Value
Natural parameter $\eta$	$\log\frac{p}{1-p}$ (the log-odds)
Sufficient statistic $T(x)$	$x$
Log-partition $A(\eta)$	$\log(1 + e^\eta)$
Base measure $h(x)$	$1$

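The natural parameter is the log-odds, the same quantity that logistic regression models directly. The log-partition entry follows from $1 - p = 1/(1 + e^\eta)$, so the term $\log(1-p)$ in the exponent equals $-\log(1 + e^\eta) = -A(\eta)$.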
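As a quick numerical check of the Bernoulli table, the sketch below evaluates the PMF in both its standard form and its exponential family form. The function names are illustrative and do not come from any library.

```python
import math

def bernoulli_pmf(x, p):
    """Standard form: p^x * (1 - p)^(1 - x)."""
    return p ** x * (1 - p) ** (1 - x)

def bernoulli_expfam(x, p):
    """Exponential family form: h(x) * exp(eta * T(x) - A(eta))."""
    eta = math.log(p / (1 - p))       # natural parameter (log-odds)
    A = math.log(1 + math.exp(eta))   # log-partition, equal to -log(1 - p)
    h = 1.0                           # base measure
    return h * math.exp(eta * x - A)  # T(x) = x

p = 0.3
for x in (0, 1):
    print(x, bernoulli_pmf(x, p), bernoulli_expfam(x, p))
```

Both columns should agree up to floating-point rounding for any $p$ strictly between 0 and 1.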

Gaussian (Known Variance)

For $\mathcal{N}(\mu, \sigma^2)$ with known $\sigma^2$ :

P(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)

= \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(\frac{\mu}{\sigma^2} x - \frac{x^2}{2\sigma^2} - \frac{\mu^2}{2\sigma^2}\right)

Component	Value
Natural parameter $\eta$	$\mu / \sigma^2$
Sufficient statistic $T(x)$	$x$
Log-partition $A(\eta)$	$\sigma^2 \eta^2 / 2$
Base measure $h(x)$	$\frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{x^2}{2\sigma^2}\right)$

Gaussian (Unknown Mean and Variance)

When both parameters are unknown, the exponential family form uses a 2-dimensional natural parameter:

\boldsymbol{\eta} = \begin{bmatrix} \mu/\sigma^2 \\ -1/(2\sigma^2) \end{bmatrix}, \quad \mathbf{T}(x) = \begin{bmatrix} x \\ x^2 \end{bmatrix}

The sufficient statistics are $x$ and $x^2$ — which is why the sample mean and sample variance together capture all information about $(\mu, \sigma^2)$ .
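The two-parameter form can be checked the same way. The text does not state $A(\boldsymbol{\eta})$ or $h(x)$ for this case, so the sketch below assumes the standard choices $A(\boldsymbol{\eta}) = -\eta_1^2/(4\eta_2) - \tfrac{1}{2}\log(-2\eta_2)$ and $h(x) = 1/\sqrt{2\pi}$. Function names are illustrative.

```python
import math

def normal_pdf(x, mu, sigma2):
    """Standard Gaussian density with mean mu and variance sigma2."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def normal_expfam(x, mu, sigma2):
    """Same density written as h(x) * exp(eta . T(x) - A(eta))."""
    eta1 = mu / sigma2            # multiplies T1(x) = x
    eta2 = -1.0 / (2 * sigma2)    # multiplies T2(x) = x^2
    # Assumed log-partition for this parameterization (not given in the text).
    A = -eta1 ** 2 / (4 * eta2) - 0.5 * math.log(-2 * eta2)
    h = 1.0 / math.sqrt(2 * math.pi)  # base measure, free of parameters
    return h * math.exp(eta1 * x + eta2 * x ** 2 - A)

for x in (-1.0, 0.5, 3.0):
    print(x, normal_pdf(x, 1.0, 2.0), normal_expfam(x, 1.0, 2.0))
```

Substituting $\eta_1 = \mu/\sigma^2$ and $\eta_2 = -1/(2\sigma^2)$ gives $A = \mu^2/(2\sigma^2) + \log\sigma$, which is exactly the constant needed to complete the square.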

Poisson

For Poisson( $\lambda$ ): $P(x) = \frac{\lambda^x e^{-\lambda}}{x!}$

P(x) = \frac{1}{x!} \exp\!\left(x \log\lambda - \lambda\right)

Component	Value
Natural parameter $\eta$	$\log \lambda$
Sufficient statistic $T(x)$	$x$
Log-partition $A(\eta)$	$e^\eta$
Base measure $h(x)$	$1/x!$

Summary of Members

Distribution	$\eta$	$T(x)$	$A(\eta)$
Bernoulli( $p$ )	$\log\frac{p}{1-p}$	$x$	$\log(1 + e^\eta)$
Poisson( $\lambda$ )	$\log\lambda$	$x$	$e^\eta$
Exponential( $\lambda$ )	$-\lambda$	$x$	$-\log(-\eta)$
Gaussian( $\mu$ , known $\sigma^2$ )	$\mu/\sigma^2$	$x$	$\sigma^2\eta^2/2$
Beta( $\alpha, \beta$ )	$(\alpha-1, \beta-1)$	$(\log x, \log(1-x))$	$\log B(\alpha, \beta)$
Gamma( $\alpha, \beta$ )	$(\alpha-1, -\beta)$	$(\log x, x)$	$\log\Gamma(\alpha) - \alpha\log\beta$
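The $A(\eta)$ column is what makes each density integrate to 1: $A(\eta) = \log \int h(x) \exp(\eta\, T(x))\, dx$. As a rough sketch, the Exponential row can be checked by numerical integration with $h(x) = 1$ and $T(x) = x$. The integration range and step count below are arbitrary choices for illustration.

```python
import math

def log_partition_numeric(eta, upper=30.0, steps=300_000):
    """Midpoint-rule estimate of log of the integral of exp(eta * x) over [0, upper].

    Assumes eta < 0 so the integrand decays and truncation at `upper` is harmless.
    """
    dx = upper / steps
    total = sum(math.exp(eta * (i + 0.5) * dx) for i in range(steps)) * dx
    return math.log(total)

lam = 2.0
eta = -lam                                # natural parameter of Exponential(lam)
numeric = log_partition_numeric(eta)
closed_form = -math.log(-eta)             # table entry: A(eta) = -log(-eta)
print(numeric, closed_form)
```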

Sufficient Statistics

The sufficient statistic $\mathbf{T}(x)$ captures everything the data tells us about the parameter $\boldsymbol{\eta}$ . No information is lost by reducing the full dataset to its sufficient statistics.

Fisher-Neyman Factorization Theorem

$T(\mathbf{x})$ is sufficient for $\theta$ if and only if the likelihood can be factored as:

P(\mathbf{x} \mid \theta) = g(T(\mathbf{x}), \theta) \cdot h(\mathbf{x})

where $g$ depends on the data only through $T(\mathbf{x})$ , and $h$ doesn’t depend on $\theta$ .

Practical Implication

For $n$ i.i.d. observations from an exponential family, the total sufficient statistic is:

\mathbf{T}_{\text{total}} = \sum_{i=1}^{n} \mathbf{T}(x_i)

This means:

For Bernoulli: $\sum x_i$ (number of successes) is sufficient — you don’t need the raw data
For Gaussian: $(\sum x_i, \sum x_i^2)$ is sufficient — sample mean and sample variance capture everything
For Poisson: $\sum x_i$ (total count) is sufficient

Key insight: This is why MLE solutions for exponential family distributions depend on the data only through the sufficient statistics. The entire dataset compresses into a summary whose dimension stays fixed as $n$ grows, without losing any information about the parameters.
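The compression claim can be checked directly for the Bernoulli case. In the sketch below, two made-up datasets with the same length and the same number of successes give identical log-likelihoods for every $p$, and both match a function that sees only $(\sum x_i, n)$.

```python
import math

def bernoulli_log_lik(data, p):
    """Log-likelihood computed from the raw observations."""
    return sum(x * math.log(p) + (1 - x) * math.log(1 - p) for x in data)

def log_lik_from_stat(total, n, p):
    """Log-likelihood computed from the sufficient statistic alone."""
    return total * math.log(p) + (n - total) * math.log(1 - p)

# Two different orderings with the same sufficient statistic (3 successes in 8).
data_a = [1, 1, 0, 0, 1, 0, 0, 0]
data_b = [0, 0, 0, 1, 0, 1, 1, 0]

for p in (0.2, 0.5, 0.8):
    print(p, bernoulli_log_lik(data_a, p), bernoulli_log_lik(data_b, p),
          log_lik_from_stat(3, 8, p))
```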

The Log-Partition Function

The log-partition function $A(\boldsymbol{\eta})$ is far more than a normalizing constant. Its derivatives give the mean and covariance of the sufficient statistic $\mathbf{T}(x)$ (more generally, its cumulants):

First Derivative = Mean

\nabla_{\boldsymbol{\eta}} A(\boldsymbol{\eta}) = \mathbb{E}[\mathbf{T}(x)]

Second Derivative = Variance

\nabla^2_{\boldsymbol{\eta}} A(\boldsymbol{\eta}) = \text{Cov}[\mathbf{T}(x)]

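Since covariance matrices are positive semi-definite, $A(\boldsymbol{\eta})$ is always convex, so the log-likelihood is concave in $\boldsymbol{\eta}$ and any stationary point is a global maximum. When the representation is minimal (no redundant sufficient statistics), $A$ is strictly convex and the MLE is unique whenever it exists.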

Example: Bernoulli

For Bernoulli: $A(\eta) = \log(1 + e^\eta)$

\frac{dA}{d\eta} = \frac{e^\eta}{1 + e^\eta} = \sigma(\eta) = p \quad \text{(the mean)}

\frac{d^2A}{d\eta^2} = \sigma(\eta)(1 - \sigma(\eta)) = p(1-p) \quad \text{(the variance)}

The sigmoid function emerges naturally from the Bernoulli log-partition function — this is the deep reason why logistic regression uses the sigmoid.
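The derivative identities can be checked without any calculus by using finite differences. The sketch below does this for the Bernoulli and Poisson log-partition functions. The helper name and step size are illustrative choices.

```python
import math

def derivs(A, eta, eps=1e-4):
    """Central finite-difference estimates of A'(eta) and A''(eta)."""
    first = (A(eta + eps) - A(eta - eps)) / (2 * eps)
    second = (A(eta + eps) - 2 * A(eta) + A(eta - eps)) / eps ** 2
    return first, second

# Bernoulli with p = 0.3: expect mean p and variance p * (1 - p).
p = 0.3
bern_mean, bern_var = derivs(lambda e: math.log(1 + math.exp(e)),
                             math.log(p / (1 - p)))

# Poisson with lambda = 4: expect mean and variance both equal to lambda.
lam = 4.0
pois_mean, pois_var = derivs(math.exp, math.log(lam))

print(bern_mean, bern_var, pois_mean, pois_var)
```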

MLE for the Exponential Family

For $n$ i.i.d. observations, the MLE has an elegant form. The log-likelihood is:

\log L(\boldsymbol{\eta}) = \boldsymbol{\eta}^\top \sum_{i=1}^n \mathbf{T}(x_i) - n A(\boldsymbol{\eta}) + \sum_{i=1}^n \log h(x_i)

Setting the gradient to zero:

\nabla A(\boldsymbol{\eta}) = \frac{1}{n} \sum_{i=1}^n \mathbf{T}(x_i)

The MLE is found by moment matching: set the expected sufficient statistics equal to the observed sufficient statistics.

This is why the MLE takes such intuitive forms for the familiar exponential family distributions:

Bernoulli: $\hat{p} = \bar{x}$ (sample mean matches population mean)
Gaussian: $\hat{\mu} = \bar{x}$ , $\hat{\sigma}^2 = \overline{x^2} - \bar{x}^2$
Poisson: $\hat{\lambda} = \bar{x}$
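When $\nabla A$ has no convenient inverse, the moment-matching equation can be solved numerically. The sketch below uses Newton's method on a scalar natural parameter, with Poisson as the test case so the answer can be compared against $\hat{\lambda} = \bar{x}$. The function name, starting point, and toy counts are assumptions for illustration, and plain Newton steps are not guaranteed to converge for every family or starting point.

```python
import math

def mle_natural_param(t_bar, grad_A, hess_A, eta0=0.0, iters=50):
    """Solve grad_A(eta) = t_bar for a scalar eta with Newton's method."""
    eta = eta0
    for _ in range(iters):
        eta -= (grad_A(eta) - t_bar) / hess_A(eta)
    return eta

counts = [2, 4, 3, 5, 1, 4, 3, 2]     # made-up Poisson-like counts
t_bar = sum(counts) / len(counts)      # observed mean of T(x) = x

# For Poisson, A(eta) = exp(eta), so A' and A'' are both exp.
eta_hat = mle_natural_param(t_bar, math.exp, math.exp)
lambda_hat = math.exp(eta_hat)
print(lambda_hat, t_bar)
```

The recovered $\hat{\lambda}$ should match the sample mean, which is the closed-form answer listed above.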

Conjugate Priors

A major reason the exponential family matters for Bayesian inference: every exponential family distribution has a natural conjugate prior, and it has the form:

P(\boldsymbol{\eta} \mid \boldsymbol{\chi}, \nu) \propto \exp\!\left(\boldsymbol{\eta}^\top \boldsymbol{\chi} - \nu A(\boldsymbol{\eta})\right)

where $\boldsymbol{\chi}$ and $\nu$ are hyperparameters that can be interpreted as:

$\boldsymbol{\chi}$ : “prior pseudo-observations” (total sufficient statistic from imagined prior data)
$\nu$ : “prior sample size” (how many imagined observations)

After observing $n$ data points with total sufficient statistic $\mathbf{T}_{\text{total}}$ :

P(\boldsymbol{\eta} \mid \text{data}) \propto \exp\!\left(\boldsymbol{\eta}^\top (\boldsymbol{\chi} + \mathbf{T}_{\text{total}}) - (n + \nu) A(\boldsymbol{\eta})\right)

The posterior has the same form as the prior, with updated hyperparameters. This is exactly the conjugate update rule we saw for Beta-Binomial and Gaussian-Gaussian in the distributions article.
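For the Bernoulli likelihood the update is two additions. The sketch below assumes the prior is read as a density over $\eta$; under that convention a change of variables to $p$ gives a Beta$(\chi, \nu - \chi)$ distribution, so $\chi$ plays the role of prior successes and $\nu$ of prior trials. The function names and data are illustrative.

```python
def conjugate_update(chi, nu, data):
    """Posterior hyperparameters: add the total sufficient statistic and n."""
    return chi + sum(data), nu + len(data)

def to_beta(chi, nu):
    """Map (chi, nu) to Beta(alpha, beta) on p, under the convention above."""
    return chi, nu - chi

chi0, nu0 = 2.0, 4.0                      # equivalent to a Beta(2, 2) prior on p
data = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]     # 7 successes in 10 trials (made up)

chi_n, nu_n = conjugate_update(chi0, nu0, data)
alpha_n, beta_n = to_beta(chi_n, nu_n)
posterior_mean = chi_n / nu_n             # mean of the Beta posterior on p
print(alpha_n, beta_n, posterior_mean)
```

The result agrees with the usual Beta update of adding successes to $\alpha$ and failures to $\beta$.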

Generalized Linear Models (GLMs)

The exponential family is the foundation of Generalized Linear Models, which extend linear regression to non-Gaussian responses.

A GLM has three components:

Random component: The response $y$ follows an exponential family distribution
Systematic component: A linear predictor $\boldsymbol{\eta} = \mathbf{X}\mathbf{w}$
Link function: $g(\mu) = \boldsymbol{\eta}$ , connecting the mean to the linear predictor

Common GLMs

Response Type	Distribution	Link	Name
Continuous	Gaussian	Identity: $g(\mu) = \mu$	Linear regression
Binary	Bernoulli	Logit: $g(\mu) = \log\frac{\mu}{1-\mu}$	Logistic regression
Count	Poisson	Log: $g(\mu) = \log\mu$	Poisson regression
Positive continuous	Gamma	Inverse: $g(\mu) = 1/\mu$	Gamma regression

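The canonical link function is $g = ({\nabla A})^{-1}$ , which maps the mean to the natural parameter, so with the canonical link the linear predictor is the natural parameter itself. Using the canonical link simplifies the math and guarantees a log-likelihood that is concave in $\mathbf{w}$ .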

Key insight: Logistic regression is not an ad hoc model — it’s the natural GLM for binary data. The sigmoid (logistic) function appears because it’s the inverse of the canonical link for the Bernoulli distribution.
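With the canonical link, the gradient of the log-likelihood reduces to $\sum_i (y_i - \mu_i)\,\mathbf{x}_i$. The sketch below fits a logistic regression by plain gradient ascent on a small made-up dataset. The data, learning rate, and iteration count are arbitrary illustrative choices, not a recommended training setup.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy, non-separable data: each row is (1, x) so w[0] acts as the intercept.
X = [(1.0, -2.0), (1.0, -1.0), (1.0, -1.0), (1.0, 0.0),
     (1.0, 0.0), (1.0, 1.0), (1.0, 1.0), (1.0, 2.0)]
y = [0, 0, 1, 0, 1, 1, 0, 1]

def gradient(w):
    """Gradient of the Bernoulli log-likelihood: sum_i (y_i - mu_i) * x_i."""
    g = [0.0, 0.0]
    for xi, yi in zip(X, y):
        mu = sigmoid(w[0] * xi[0] + w[1] * xi[1])   # inverse canonical link
        for j in range(2):
            g[j] += (yi - mu) * xi[j]
    return g

w = [0.0, 0.0]
for _ in range(2000):
    g = gradient(w)
    w = [wj + 0.1 * gj for wj, gj in zip(w, g)]     # ascent step

fitted = [sigmoid(w[0] * xi[0] + w[1] * xi[1]) for xi in X]
print(w, sum(fitted), sum(y))
```

At the optimum the gradient vanishes, so the fitted probabilities sum to the observed number of successes. This is the same moment-matching condition as in the MLE section, applied to the intercept column.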

Information Geometry

The exponential family has deep connections to information theory and differential geometry.

Fisher Information

For an exponential family distribution, the Fisher information matrix equals the Hessian of the log-partition function:

\mathbf{I}(\boldsymbol{\eta}) = \nabla^2 A(\boldsymbol{\eta}) = \text{Cov}[\mathbf{T}(x)]

The Fisher information measures how much information a sample carries about the parameters. It determines:

The Cramér-Rao lower bound on estimator variance
The asymptotic covariance of the MLE: $\text{Cov}(\hat{\boldsymbol{\eta}}_{\text{MLE}}) \approx \mathbf{I}(\boldsymbol{\eta})^{-1}/n$ for large $n$
The geometry of the statistical manifold (the space of distributions)
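The identity $\mathbf{I}(\eta) = A''(\eta)$ can be checked for the Poisson distribution, where the score with respect to $\eta$ is $T(x) - A'(\eta) = x - e^\eta$. The sketch below computes the expected squared score by summing over the PMF. The truncation point of the sum is an arbitrary choice that is safe for small $\lambda$.

```python
import math

def poisson_pmf(x, lam):
    """Poisson PMF evaluated in log space for numerical stability."""
    return math.exp(x * math.log(lam) - lam - math.lgamma(x + 1))

lam = 4.0
eta = math.log(lam)

# Fisher information as the expected squared score with respect to eta.
fisher = sum(poisson_pmf(x, lam) * (x - math.exp(eta)) ** 2 for x in range(200))

# Hessian of the log-partition function A(eta) = exp(eta).
hessian_A = math.exp(eta)
print(fisher, hessian_A)
```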

KL Divergence

The KL divergence between two exponential family distributions has a simple form in terms of the log-partition function:

D_{\text{KL}}(P_{\boldsymbol{\eta}_1} \| P_{\boldsymbol{\eta}_2}) = A(\boldsymbol{\eta}_2) - A(\boldsymbol{\eta}_1) - \nabla A(\boldsymbol{\eta}_1)^\top (\boldsymbol{\eta}_2 - \boldsymbol{\eta}_1)

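This is the Bregman divergence of $A$ evaluated at $(\boldsymbol{\eta}_2, \boldsymbol{\eta}_1)$, a generalization of squared distance. Because it needs only $A$ and its gradient, the KL divergence between two members of the same family is available in closed form, which is one reason variational inference and natural gradient methods are tractable for exponential family models.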
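As a sanity check, the sketch below compares the Bregman formula with a direct evaluation of $\sum_x P_1(x) \log\frac{P_1(x)}{P_2(x)}$ for two Poisson distributions. The rates and the truncation point of the sum are illustrative choices.

```python
import math

def poisson_log_pmf(x, lam):
    return x * math.log(lam) - lam - math.lgamma(x + 1)

def kl_direct(lam1, lam2, x_max=200):
    """KL(P1 || P2) by summing over the (truncated) support."""
    return sum(
        math.exp(poisson_log_pmf(x, lam1))
        * (poisson_log_pmf(x, lam1) - poisson_log_pmf(x, lam2))
        for x in range(x_max)
    )

def kl_bregman(lam1, lam2):
    """KL(P1 || P2) from the log-partition function A(eta) = exp(eta)."""
    eta1, eta2 = math.log(lam1), math.log(lam2)
    A, grad_A = math.exp, math.exp
    return A(eta2) - A(eta1) - grad_A(eta1) * (eta2 - eta1)

print(kl_direct(3.0, 5.0), kl_bregman(3.0, 5.0))
```

For the Poisson case the Bregman expression simplifies to $\lambda_1 \log(\lambda_1/\lambda_2) + \lambda_2 - \lambda_1$.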

Why This Matters for Deep Learning

Even in deep learning, the exponential family appears:

Output layers: The final layer of a neural network typically models an exponential family distribution — softmax for categorical (Multinomial), sigmoid for binary (Bernoulli), linear for continuous (Gaussian)
Loss functions: Cross-entropy and MSE loss are negative log-likelihoods of exponential family distributions
Natural gradient descent: Uses the Fisher information to precondition gradients, improving optimization in the space of distributions
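Variational autoencoders: With a Gaussian encoder and a Gaussian prior, both exponential family members, the KL term of the training objective has a closed form. The reparameterization trick itself relies on the Gaussian being a location-scale family rather than on exponential family structure.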
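The loss-function point can be made concrete for the binary case. Treating the network output as the natural parameter $\eta$ (the logit), the Bernoulli negative log-likelihood is $A(\eta) - y\eta$ with $A(\eta) = \log(1 + e^\eta)$. The sketch below checks that this matches binary cross-entropy computed from the probability. Function names are illustrative.

```python
import math

def softplus(z):
    """A(eta) = log(1 + exp(eta)), written in a numerically stable form."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def bce_from_prob(y, logit):
    """Binary cross-entropy computed from p = sigmoid(logit)."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def bce_expfam(y, logit):
    """Negative log-likelihood in exponential family form: A(eta) - y * eta."""
    return softplus(logit) - y * logit

for y in (0, 1):
    for logit in (-2.0, 0.0, 1.5):
        print(y, logit, bce_from_prob(y, logit), bce_expfam(y, logit))
```

The second form avoids computing $\log p$ for probabilities near 0 or 1, which is one common reason to implement the loss on logits.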

Distributions Outside the Exponential Family

Not all useful distributions belong to the exponential family:

Uniform $[a, b]$ with unknown endpoints — the support depends on the parameters
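Student’s $t$ (used in robust statistics and hypothesis testing): its log-density does not separate into a finite sum of parameter functions multiplied by data functions, so no fixed-dimensional sufficient statistic exists
Mixture distributions: mixtures of exponential family members are generally not in the family, which is why fitting them typically relies on iterative methods such as the EM algorithm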

A distribution fails to be an exponential family member when its support depends on the parameter, or when it cannot be factored into the required form.

Summary

The exponential family provides a unified framework for most distributions used in ML
Its canonical form — $h(x) \exp(\boldsymbol{\eta}^\top \mathbf{T}(x) - A(\boldsymbol{\eta}))$ — reveals deep structure
Sufficient statistics compress data without losing information about parameters
The log-partition function $A(\boldsymbol{\eta})$ generates the mean and covariance of the sufficient statistics, and its convexity makes the negative log-likelihood convex
Conjugate priors exist naturally for all exponential family members
GLMs extend linear regression to any exponential family response
Fisher information, KL divergence, and natural gradients all simplify for this family
Neural network output layers and loss functions are grounded in exponential family distributions
