The Exponential Family

A unifying framework for probability distributions: sufficient statistics, conjugate priors, and why most ML distributions share a common structure.

Probability & Statistics March 6, 2026 9 min read

One Family to Rule Them All

The Bernoulli, Gaussian, Poisson, Exponential, Beta, and Gamma distributions from the previous article might seem like a disconnected collection. But they all share a common mathematical structure — they are all members of the exponential family.

Understanding this family unifies seemingly separate concepts: sufficient statistics, conjugate priors, maximum likelihood, and generalized linear models all become special cases of a single framework.

Definition

A distribution belongs to the exponential family if its PDF or PMF can be written as:

$$P(x \mid \boldsymbol{\eta}) = h(x) \exp\!\left(\boldsymbol{\eta}^\top \mathbf{T}(x) - A(\boldsymbol{\eta})\right)$$

where:

  • $\boldsymbol{\eta}$: the natural (canonical) parameters
  • $\mathbf{T}(x)$: the sufficient statistics
  • $A(\boldsymbol{\eta})$: the log-partition function (ensures the distribution normalizes to 1)
  • $h(x)$: the base measure (does not depend on the parameters)

This looks abstract, but it becomes concrete with examples.

Examples

Bernoulli

The Bernoulli($p$) distribution: $P(x) = p^x (1-p)^{1-x}$ for $x \in \{0, 1\}$.

Rewriting:

$$P(x) = \exp\!\left(x \log\frac{p}{1-p} + \log(1-p)\right)$$

| Component | Value |
| --- | --- |
| Natural parameter $\eta$ | $\log\frac{p}{1-p}$ (the log-odds) |
| Sufficient statistic $T(x)$ | $x$ |
| Log-partition $A(\eta)$ | $\log(1 + e^\eta)$ |
| Base measure $h(x)$ | $1$ |

The natural parameter is the log-odds — the same quantity that logistic regression models directly.
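As a quick sanity check, the two forms of the Bernoulli PMF can be compared numerically. This is a minimal sketch; the function names `bernoulli_pmf` and `bernoulli_expfam` are illustrative and not from any library.

```python
import math

def bernoulli_pmf(x, p):
    """Standard form: p^x (1-p)^(1-x)."""
    return p**x * (1 - p)**(1 - x)

def bernoulli_expfam(x, p):
    """Exponential family form h(x) exp(eta*T(x) - A(eta)) with h(x) = 1, T(x) = x."""
    eta = math.log(p / (1 - p))        # natural parameter (log-odds)
    A = math.log(1 + math.exp(eta))    # log-partition function
    return math.exp(eta * x - A)

# The two forms should agree for every p in (0, 1) and x in {0, 1}.
for p in (0.1, 0.5, 0.9):
    for x in (0, 1):
        assert math.isclose(bernoulli_pmf(x, p), bernoulli_expfam(x, p))
```

The check works because $A(\eta) = \log(1 + e^\eta) = -\log(1-p)$, which is exactly the $\log(1-p)$ term in the rewritten PMF with its sign flipped.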

Gaussian (Known Variance)

For $\mathcal{N}(\mu, \sigma^2)$ with known $\sigma^2$:

$$\begin{aligned}
P(x) &= \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) \\
&= \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(\frac{\mu}{\sigma^2} x - \frac{x^2}{2\sigma^2} - \frac{\mu^2}{2\sigma^2}\right)
\end{aligned}$$

| Component | Value |
| --- | --- |
| Natural parameter $\eta$ | $\mu / \sigma^2$ |
| Sufficient statistic $T(x)$ | $x$ |
| Log-partition $A(\eta)$ | $\sigma^2 \eta^2 / 2$ |
| Base measure $h(x)$ | $\frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{x^2}{2\sigma^2}\right)$ |

Gaussian (Unknown Mean and Variance)

When both parameters are unknown, the exponential family form uses a 2-dimensional natural parameter:

$$\boldsymbol{\eta} = \begin{bmatrix} \mu/\sigma^2 \\ -1/(2\sigma^2) \end{bmatrix}, \quad \mathbf{T}(x) = \begin{bmatrix} x \\ x^2 \end{bmatrix}$$

The sufficient statistics are $x$ and $x^2$, which is why the sample mean and sample variance together capture all information about $(\mu, \sigma^2)$.
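The two-parameter form can be checked the same way. The sketch below assumes the log-partition function $A(\boldsymbol{\eta}) = -\eta_1^2/(4\eta_2) - \tfrac{1}{2}\log(-2\eta_2)$ with base measure $h(x) = 1/\sqrt{2\pi}$; this expression is not stated in the text above but follows from substituting $\eta_1 = \mu/\sigma^2$ and $\eta_2 = -1/(2\sigma^2)$ into the Gaussian density. The function names are illustrative.

```python
import math

def gaussian_pdf(x, mu, sigma2):
    """Standard Gaussian density with mean mu and variance sigma2."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def gaussian_expfam(x, mu, sigma2):
    """Same density written as h(x) exp(eta . T(x) - A(eta)) with T(x) = (x, x^2)."""
    eta1 = mu / sigma2
    eta2 = -1.0 / (2 * sigma2)
    # Log-partition in natural parameters (derived for this sketch):
    # equals mu^2 / (2 sigma^2) + log(sigma)
    A = -eta1 ** 2 / (4 * eta2) - 0.5 * math.log(-2 * eta2)
    h = 1.0 / math.sqrt(2 * math.pi)   # base measure, free of parameters
    return h * math.exp(eta1 * x + eta2 * x ** 2 - A)

assert math.isclose(gaussian_pdf(0.7, 1.5, 2.0), gaussian_expfam(0.7, 1.5, 2.0))
```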

Poisson

For Poisson($\lambda$): $P(x) = \frac{\lambda^x e^{-\lambda}}{x!}$

$$P(x) = \frac{1}{x!} \exp\!\left(x \log\lambda - \lambda\right)$$

| Component | Value |
| --- | --- |
| Natural parameter $\eta$ | $\log \lambda$ |
| Sufficient statistic $T(x)$ | $x$ |
| Log-partition $A(\eta)$ | $e^\eta$ |
| Base measure $h(x)$ | $1/x!$ |

Summary of Members

| Distribution | $\eta$ | $T(x)$ | $A(\eta)$ |
| --- | --- | --- | --- |
| Bernoulli($p$) | $\log\frac{p}{1-p}$ | $x$ | $\log(1 + e^\eta)$ |
| Poisson($\lambda$) | $\log\lambda$ | $x$ | $e^\eta$ |
| Exponential($\lambda$) | $-\lambda$ | $x$ | $-\log(-\eta)$ |
| Gaussian($\mu$, known $\sigma^2$) | $\mu/\sigma^2$ | $x$ | $\sigma^2\eta^2/2$ |
| Beta($\alpha, \beta$) | $(\alpha-1, \beta-1)$ | $(\log x, \log(1-x))$ | $\log B(\alpha, \beta)$ |
| Gamma($\alpha, \beta$) | $(\alpha-1, -\beta)$ | $(\log x, x)$ | $\log\Gamma(\alpha) - \alpha\log\beta$ |
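Rows of this table can be plugged into one generic density function. The sketch below does this for the Poisson and Exponential rows; the helper `expfam_density` is a hypothetical name, and the base measures ($1/x!$ for Poisson, $1$ for Exponential) are supplied by hand because the table does not list them.

```python
import math

def expfam_density(x, eta, T, A, h):
    """Generic scalar exponential family density: h(x) * exp(eta * T(x) - A(eta))."""
    return h(x) * math.exp(eta * T(x) - A(eta))

lam = 2.5

# Poisson row: eta = log(lambda), T(x) = x, A(eta) = exp(eta), h(x) = 1/x!
poisson = expfam_density(
    3, math.log(lam),
    T=lambda x: x, A=math.exp, h=lambda x: 1 / math.factorial(x),
)

# Exponential row: eta = -lambda, T(x) = x, A(eta) = -log(-eta), h(x) = 1
exponential = expfam_density(
    0.8, -lam,
    T=lambda x: x, A=lambda eta: -math.log(-eta), h=lambda x: 1.0,
)

# Compare against the textbook formulas.
assert math.isclose(poisson, lam**3 * math.exp(-lam) / math.factorial(3))
assert math.isclose(exponential, lam * math.exp(-lam * 0.8))
```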

Sufficient Statistics

The sufficient statistic $\mathbf{T}(x)$ captures everything the data tells us about the parameter $\boldsymbol{\eta}$. No information is lost by reducing the full dataset to its sufficient statistics.

Fisher-Neyman Factorization Theorem

$T(\mathbf{x})$ is sufficient for $\theta$ if and only if the likelihood can be factored as:

$$P(\mathbf{x} \mid \theta) = g(T(\mathbf{x}), \theta) \cdot h(\mathbf{x})$$

where $g$ depends on the data only through $T(\mathbf{x})$, and $h$ doesn't depend on $\theta$.

Practical Implication

For $n$ i.i.d. observations from an exponential family, the total sufficient statistic is:

$$\mathbf{T}_{\text{total}} = \sum_{i=1}^{n} \mathbf{T}(x_i)$$

This means:

  • For Bernoulli: $\sum x_i$ (number of successes) is sufficient, so you don't need the raw data
  • For Gaussian: $(\sum x_i, \sum x_i^2)$ is sufficient, so the sample mean and sample variance capture everything
  • For Poisson: $\sum x_i$ (total count) is sufficient

Key insight: This is why MLE solutions for exponential family distributions always involve simple functions of sufficient statistics. The entire dataset compresses into a fixed-dimensional summary without losing any information about the parameters.
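To make sufficiency concrete, the sketch below builds two different Bernoulli datasets with the same length and the same number of successes and checks that they produce identical log-likelihoods at every parameter value tried. The datasets and the function name `bernoulli_loglik` are made up for illustration.

```python
import math

def bernoulli_loglik(data, p):
    """Log-likelihood of i.i.d. Bernoulli(p) observations."""
    return sum(x * math.log(p) + (1 - x) * math.log(1 - p) for x in data)

# Different raw data, same sufficient statistic: n = 8 and sum(x) = 3.
a = [1, 1, 0, 0, 1, 0, 0, 0]
b = [0, 0, 0, 1, 0, 1, 0, 1]
assert len(a) == len(b) and sum(a) == sum(b)

# The likelihood cannot tell the two datasets apart for any p.
for p in (0.2, 0.5, 0.7):
    assert math.isclose(bernoulli_loglik(a, p), bernoulli_loglik(b, p))
```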

The Log-Partition Function

The log-partition function $A(\boldsymbol{\eta})$ is far more than a normalizing constant. It generates the moments of the distribution:

First Derivative = Mean

$$\nabla_{\boldsymbol{\eta}} A(\boldsymbol{\eta}) = \mathbb{E}[\mathbf{T}(x)]$$

Second Derivative = Variance

$$\nabla^2_{\boldsymbol{\eta}} A(\boldsymbol{\eta}) = \text{Cov}[\mathbf{T}(x)]$$

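Since covariance matrices are positive semi-definite, $A(\boldsymbol{\eta})$ is always convex, so the log-likelihood is concave in $\boldsymbol{\eta}$ and any stationary point is a global maximum. The maximizer is unique when the representation is minimal (the Hessian is then positive definite), provided the MLE exists. It can fail to exist at the boundary, for example when every Bernoulli observation equals 1.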

Example: Bernoulli

For Bernoulli: $A(\eta) = \log(1 + e^\eta)$

$$\frac{dA}{d\eta} = \frac{e^\eta}{1 + e^\eta} = \sigma(\eta) = p \quad \text{(the mean)}$$

$$\frac{d^2A}{d\eta^2} = \sigma(\eta)(1 - \sigma(\eta)) = p(1-p) \quad \text{(the variance)}$$

The sigmoid function emerges naturally from the Bernoulli log-partition function — this is the deep reason why logistic regression uses the sigmoid.
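The derivative identities can be checked numerically with finite differences. This is a rough sketch: the step sizes are arbitrary choices, and the helper names are illustrative.

```python
import math

def A_bernoulli(eta):
    """Bernoulli log-partition function log(1 + e^eta)."""
    return math.log1p(math.exp(eta))

def derivative(f, x, h=1e-5):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

def second_derivative(f, x, h=1e-4):
    """Central finite-difference approximation of f''(x)."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / h**2

p = 0.3
eta = math.log(p / (1 - p))              # natural parameter for p = 0.3

mean = derivative(A_bernoulli, eta)      # should be close to p
var = second_derivative(A_bernoulli, eta)  # should be close to p * (1 - p)

assert abs(mean - p) < 1e-6
assert abs(var - p * (1 - p)) < 1e-5
```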

MLE for the Exponential Family

For $n$ i.i.d. observations, the MLE has an elegant form. The log-likelihood is:

$$\log L(\boldsymbol{\eta}) = \boldsymbol{\eta}^\top \sum_{i=1}^n \mathbf{T}(x_i) - n A(\boldsymbol{\eta}) + \sum_{i=1}^n \log h(x_i)$$

Setting the gradient to zero:

$$\nabla A(\boldsymbol{\eta}) = \frac{1}{n} \sum_{i=1}^n \mathbf{T}(x_i)$$

The MLE is found by moment matching: set the expected sufficient statistics equal to the observed sufficient statistics.

This is why MLE always produces intuitive formulas for exponential family distributions:

  • Bernoulli: $\hat{p} = \bar{x}$ (sample mean matches population mean)
  • Gaussian: $\hat{\mu} = \bar{x}$, $\hat{\sigma}^2 = \overline{x^2} - \bar{x}^2$
  • Poisson: $\hat{\lambda} = \bar{x}$
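The moment-matching equation $\nabla A(\boldsymbol{\eta}) = \frac{1}{n}\sum_i \mathbf{T}(x_i)$ can also be solved numerically. The sketch below uses Newton's method on the Bernoulli case, where the closed-form answer is known, so the result can be checked. Newton's method is one possible solver chosen here for illustration; the function name and the data are made up.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_bernoulli_eta(data, steps=20):
    """Solve A'(eta) = mean(T(x)) by Newton's method.

    Assumes the sample mean lies strictly between 0 and 1;
    otherwise no finite eta satisfies the equation.
    """
    t_bar = sum(data) / len(data)    # observed mean sufficient statistic
    eta = 0.0
    for _ in range(steps):
        mean = sigmoid(eta)          # A'(eta): expected sufficient statistic
        var = mean * (1 - mean)      # A''(eta): variance of the sufficient statistic
        eta -= (mean - t_bar) / var  # Newton step on A'(eta) - t_bar = 0
    return eta

data = [1, 0, 0, 1, 1, 0, 1, 1, 0, 1]   # 6 successes out of 10
eta_hat = fit_bernoulli_eta(data)
p_hat = sigmoid(eta_hat)

# Moment matching recovers the sample mean, and eta_hat is its log-odds.
assert abs(p_hat - 0.6) < 1e-9
assert abs(eta_hat - math.log(0.6 / 0.4)) < 1e-9
```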

Conjugate Priors

A major reason the exponential family matters for Bayesian inference: every exponential family distribution has a natural conjugate prior, and it has the form:

$$P(\boldsymbol{\eta} \mid \boldsymbol{\chi}, \nu) \propto \exp\!\left(\boldsymbol{\eta}^\top \boldsymbol{\chi} - \nu A(\boldsymbol{\eta})\right)$$

where $\boldsymbol{\chi}$ and $\nu$ are hyperparameters that can be interpreted as:

  • $\boldsymbol{\chi}$: "prior pseudo-observations" (total sufficient statistic from imagined prior data)
  • $\nu$: "prior sample size" (how many imagined observations)

After observing $n$ data points with total sufficient statistic $\mathbf{T}_{\text{total}}$:

$$P(\boldsymbol{\eta} \mid \text{data}) \propto \exp\!\left(\boldsymbol{\eta}^\top (\boldsymbol{\chi} + \mathbf{T}_{\text{total}}) - (n + \nu) A(\boldsymbol{\eta})\right)$$

The posterior has the same form as the prior, with updated hyperparameters. This is exactly the conjugate update rule we saw for Beta-Binomial and Gaussian-Gaussian in the distributions article.
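The update rule amounts to two additions. The sketch below applies it to Bernoulli data and checks it against the familiar Beta update. It assumes the correspondence that a prior with hyperparameters $(\chi, \nu)$ over $\eta$ is a Beta$(\chi, \nu - \chi)$ prior over $p$, which follows from the change of variables $p = \sigma(\eta)$ and is not stated in the text above. The function names are illustrative.

```python
def conjugate_update(chi, nu, data, T=lambda x: x):
    """Generic conjugate update: chi <- chi + sum_i T(x_i), nu <- nu + n."""
    return chi + sum(T(x) for x in data), nu + len(data)

def to_beta(chi, nu):
    """Bernoulli case: (chi, nu) over eta corresponds to Beta(chi, nu - chi) over p."""
    return chi, nu - chi

# Start from a Beta(2, 3) prior, written in (chi, nu) form.
a, b = 2.0, 3.0
chi, nu = a, a + b

data = [1, 0, 1, 1, 0, 1]                 # 4 successes, 2 failures
chi_post, nu_post = conjugate_update(chi, nu, data)
a_post, b_post = to_beta(chi_post, nu_post)

# Matches the textbook Beta update: add successes to a, failures to b.
assert (a_post, b_post) == (a + 4, b + 2)
```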

Generalized Linear Models (GLMs)

The exponential family is the foundation of Generalized Linear Models, which extend linear regression to non-Gaussian responses.

A GLM has three components:

  1. Random component: The response $y$ follows an exponential family distribution
  2. Systematic component: A linear predictor $\boldsymbol{\eta} = \mathbf{X}\mathbf{w}$
  3. Link function: $g(\mu) = \boldsymbol{\eta}$, connecting the mean to the linear predictor

Common GLMs

| Response Type | Distribution | Link | Name |
| --- | --- | --- | --- |
| Continuous | Gaussian | Identity: $g(\mu) = \mu$ | Linear regression |
| Binary | Bernoulli | Logit: $g(\mu) = \log\frac{\mu}{1-\mu}$ | Logistic regression |
| Count | Poisson | Log: $g(\mu) = \log\mu$ | Poisson regression |
| Positive continuous | Gamma | Inverse: $g(\mu) = 1/\mu$ | Gamma regression |

The canonical link function is $g = (\nabla A)^{-1}$, which maps the mean to the natural parameter. Using the canonical link simplifies the math and guarantees concave log-likelihoods.

Key insight: Logistic regression is not an ad hoc model — it’s the natural GLM for binary data. The sigmoid (logistic) function appears because it’s the inverse of the canonical link for the Bernoulli distribution.
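With the canonical link, the gradient of the log-likelihood reduces to $\sum_i (y_i - \mu_i)\,\mathbf{x}_i$ and the negative Hessian to $\sum_i \mu_i(1-\mu_i)\,\mathbf{x}_i \mathbf{x}_i^\top$. The sketch below fits a one-feature logistic regression with an intercept using plain Newton steps. It is a toy illustration under stated assumptions: the data are invented, there is no step-size control or regularization, and the function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score(w0, w1, xs, ys):
    """Gradient of the log-likelihood: sum_i (y_i - mu_i) * [1, x_i]."""
    g0 = g1 = 0.0
    for x, y in zip(xs, ys):
        r = y - sigmoid(w0 + w1 * x)   # residual y_i - mu_i
        g0 += r
        g1 += r * x
    return g0, g1

def fit_logistic(xs, ys, steps=25):
    """Newton's method for logistic regression with intercept w0 and slope w1."""
    w0 = w1 = 0.0
    for _ in range(steps):
        g0, g1 = score(w0, w1, xs, ys)
        # Negative Hessian: sum_i mu_i (1 - mu_i) * [1, x_i][1, x_i]^T
        h00 = h01 = h11 = 0.0
        for x in xs:
            mu = sigmoid(w0 + w1 * x)
            v = mu * (1 - mu)
            h00 += v
            h01 += v * x
            h11 += v * x * x
        det = h00 * h11 - h01 * h01
        # Solve the 2x2 system (negative Hessian) * step = gradient by hand.
        w0 += (h11 * g0 - h01 * g1) / det
        w1 += (h00 * g1 - h01 * g0) / det
    return w0, w1

# Invented, non-separable toy data: larger x makes y = 1 more likely.
xs = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
ys = [0, 0, 1, 0, 0, 1, 0, 1, 1, 1]

w0_hat, w1_hat = fit_logistic(xs, ys)
g0, g1 = score(w0_hat, w1_hat, xs, ys)
```

At the fitted weights both gradient components should be numerically zero, which is the GLM version of moment matching: the fitted means reproduce the observed totals $\sum_i y_i$ and $\sum_i x_i y_i$.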

Information Geometry

The exponential family has deep connections to information theory and differential geometry.

Fisher Information

For an exponential family distribution, the Fisher information matrix equals the Hessian of the log-partition function:

$$\mathbf{I}(\boldsymbol{\eta}) = \nabla^2 A(\boldsymbol{\eta}) = \text{Cov}[\mathbf{T}(x)]$$

The Fisher information measures how much information a sample carries about the parameters. It determines:

  • The Cramér-Rao lower bound on estimator variance
  • The asymptotic variance of the MLE: $\text{Var}(\hat{\boldsymbol{\eta}}_{\text{MLE}}) \approx \mathbf{I}(\boldsymbol{\eta})^{-1}/n$ for large $n$
  • The geometry of the statistical manifold (the space of distributions)
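For the Bernoulli distribution the identity between Fisher information and the Hessian of $A$ can be verified exactly, because the expectation is a two-term sum. The sketch below computes the Fisher information as the variance of the score and compares it with $A''(\eta)$; the function name is illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fisher_information_bernoulli(eta):
    """Fisher information as E[score^2], summed exactly over x in {0, 1}.

    The score of log P(x | eta) = eta * x - A(eta) is x - A'(eta) = x - p.
    """
    p = sigmoid(eta)
    return sum(prob * (x - p) ** 2 for x, prob in ((0, 1 - p), (1, p)))

eta = 0.8
info = fisher_information_bernoulli(eta)
hessian = sigmoid(eta) * (1 - sigmoid(eta))   # A''(eta) for Bernoulli

assert math.isclose(info, hessian)

# Cramér-Rao bound on the variance of an unbiased estimator of eta from n samples.
n = 100
cramer_rao_bound = 1.0 / (n * info)
```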

KL Divergence

The KL divergence between two exponential family distributions has a simple form in terms of the log-partition function:

$$D_{\text{KL}}(P_{\boldsymbol{\eta}_1} \,\|\, P_{\boldsymbol{\eta}_2}) = A(\boldsymbol{\eta}_2) - A(\boldsymbol{\eta}_1) - \nabla A(\boldsymbol{\eta}_1)^\top (\boldsymbol{\eta}_2 - \boldsymbol{\eta}_1)$$

This is the Bregman divergence associated with $A$, a generalization of squared distance. This connection is why variational inference and natural gradient methods work efficiently for exponential family models.
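The Bregman form can be compared against the direct definition of KL divergence. The sketch below does this for two Bernoulli distributions; the function names and the values of `p1` and `p2` are arbitrary illustrations.

```python
import math

def A(eta):
    """Bernoulli log-partition function."""
    return math.log1p(math.exp(eta))

def grad_A(eta):
    """Derivative of A: the sigmoid, i.e. the mean parameter."""
    return 1.0 / (1.0 + math.exp(-eta))

def kl_bregman(eta1, eta2):
    """KL(P_eta1 || P_eta2) written as a Bregman divergence of A."""
    return A(eta2) - A(eta1) - grad_A(eta1) * (eta2 - eta1)

def kl_direct(p1, p2):
    """KL divergence between Bernoulli(p1) and Bernoulli(p2) from the definition."""
    return p1 * math.log(p1 / p2) + (1 - p1) * math.log((1 - p1) / (1 - p2))

def logit(p):
    return math.log(p / (1 - p))

p1, p2 = 0.3, 0.6
assert math.isclose(kl_bregman(logit(p1), logit(p2)), kl_direct(p1, p2))
```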

Why This Matters for Deep Learning

Even in deep learning, the exponential family appears:

  • Output layers: The final layer of a neural network typically models an exponential family distribution — softmax for categorical (Multinomial), sigmoid for binary (Bernoulli), linear for continuous (Gaussian)
  • Loss functions: Cross-entropy and MSE loss are negative log-likelihoods of exponential family distributions
  • Natural gradient descent: Uses the Fisher information to precondition gradients, improving optimization in the space of distributions
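  • Variational autoencoders: Gaussian encoders and priors are exponential family members, which gives the KL term of the ELBO a closed form. The reparameterization trick itself relies on the Gaussian's location-scale structure ($z = \mu + \sigma\epsilon$) rather than on exponential family membership in general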
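The loss-function claim can be made concrete for binary classification. With $h(x) = 1$, the Bernoulli negative log-likelihood is $A(\eta) - \eta y$, where $\eta$ is the network's output logit. The sketch below checks that this equals binary cross-entropy on the sigmoid probability; the function names are illustrative and not taken from any deep learning framework.

```python
import math

def nll_from_logit(y, logit):
    """Bernoulli negative log-likelihood in exponential family form: A(eta) - eta * y."""
    return math.log1p(math.exp(logit)) - logit * y

def binary_cross_entropy(y, p):
    """Standard binary cross-entropy on a predicted probability p."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

logit = 1.3
p = 1.0 / (1.0 + math.exp(-logit))   # sigmoid output layer

# The two losses agree for both possible labels.
for y in (0, 1):
    assert math.isclose(nll_from_logit(y, logit), binary_cross_entropy(y, p))
```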

Distributions Outside the Exponential Family

Not all useful distributions belong to the exponential family:

  • Uniform $[a, b]$ with unknown endpoints: the support depends on the parameters
  • Student's $t$: used in robust statistics and hypothesis testing
  • Mixture distributions — mixtures of exponential family members are generally not in the family (this is what makes the EM algorithm necessary)

A distribution fails to be an exponential family member when its support depends on the parameter, or when it cannot be factored into the required form.

Summary

  • The exponential family provides a unified framework for most distributions used in ML
  • Its canonical form, $h(x) \exp(\boldsymbol{\eta}^\top \mathbf{T}(x) - A(\boldsymbol{\eta}))$, reveals deep structure
  • Sufficient statistics compress data without losing information about parameters
  • The log-partition function $A(\boldsymbol{\eta})$ generates moments, and its convexity makes the log-likelihood concave
  • Conjugate priors exist naturally for all exponential family members
  • GLMs extend linear regression to any exponential family response
  • Fisher information, KL divergence, and natural gradients all simplify for this family
  • Neural network output layers and loss functions are grounded in exponential family distributions

