Probability Distributions

A deep dive into the key distributions used in ML: Bernoulli, Binomial, Poisson, Uniform, Exponential, Gaussian, Multivariate Gaussian, Beta, and Gamma.

Probability & Statistics March 6, 2026 8 min read

Why Distributions Matter

Every probabilistic model in machine learning assumes some distribution over the data. Choosing the right distribution is choosing the right inductive bias — it tells the model what kind of patterns to expect.

In the random variables article, we introduced PMFs, PDFs, and the mechanics of working with random variables. Here we go deep: for each distribution, we cover its definition, parameters, properties, and where it appears in ML.

Bernoulli Distribution

The simplest distribution: a single trial with two outcomes.

$$X \sim \text{Bernoulli}(p) \quad \Rightarrow \quad P(X = 1) = p, \quad P(X = 0) = 1 - p$$

Properties:

$$\mathbb{E}[X] = p \qquad \text{Var}(X) = p(1 - p)$$

In ML: The output of binary classification. Logistic regression models each prediction as a Bernoulli random variable with $p = \sigma(\mathbf{w}^\top \mathbf{x})$.
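To make the link to logistic regression concrete, here is a minimal sketch of the Bernoulli PMF and the per-example negative log-likelihood it induces. The weights `w` and features `x` are hypothetical values chosen for illustration, and the function names are ours rather than any library's API.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function: maps a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def bernoulli_pmf(x: int, p: float) -> float:
    """P(X = x) for X ~ Bernoulli(p), where x is 0 or 1."""
    return p if x == 1 else 1.0 - p

def bernoulli_nll(y: int, w: list, x: list) -> float:
    """Negative log-likelihood of label y when p = sigmoid(w . x)."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    return -math.log(bernoulli_pmf(y, p))

# Hypothetical weights and features (assumed values, not from the article).
w = [0.5, -1.0]
x = [2.0, 1.0]

p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))  # score is 0, so p = 0.5
mean, var = p, p * (1.0 - p)                       # E[X] = p, Var(X) = p(1 - p)
loss = bernoulli_nll(1, w, x)                      # equals log(2) when p = 0.5
```

Summing this loss over a dataset gives the familiar binary cross-entropy objective.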

Binomial Distribution

The sum of $n$ independent $\text{Bernoulli}(p)$ trials.

$$X \sim \text{Binomial}(n, p) \quad \Rightarrow \quad P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}$$

Properties:

$$\mathbb{E}[X] = np \qquad \text{Var}(X) = np(1 - p)$$

Example: If a spam filter classifies each email correctly with probability 0.9, independently across emails, then the number of correctly classified emails out of 100 follows $\text{Binomial}(100, 0.9)$.
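The spam-filter example can be checked numerically. The sketch below evaluates the PMF directly with the standard library; the threshold of 85 is an arbitrary choice for illustration.

```python
import math

def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 100, 0.9

mean = n * p                 # E[X] = np
variance = n * p * (1 - p)   # Var(X) = np(1 - p)

# Probability that at least 85 of the 100 emails are classified correctly.
prob_at_least_85 = sum(binomial_pmf(k, n, p) for k in range(85, n + 1))

# Sanity check: the PMF sums to 1 over k = 0..n.
total_mass = sum(binomial_pmf(k, n, p) for k in range(n + 1))

print(mean, variance, prob_at_least_85)
```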

In ML: Model evaluation — counting correct predictions over a test set. Bootstrap sampling also relies on binomial-like resampling.

Poisson Distribution

Models the number of events in a fixed interval, given a constant average rate $\lambda$.

$$X \sim \text{Poisson}(\lambda) \quad \Rightarrow \quad P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}$$

Properties:

$$\mathbb{E}[X] = \lambda \qquad \text{Var}(X) = \lambda$$

The mean and variance being equal is a key signature. If your count data has variance much larger than its mean, the Poisson model is a poor fit (overdispersion).
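A quick diagnostic for overdispersion is the ratio of sample variance to sample mean, which should be close to 1 for Poisson data. The sketch below uses made-up counts, and `dispersion_index` is a name we introduce for illustration.

```python
import statistics

def dispersion_index(counts: list) -> float:
    """Sample variance divided by sample mean (about 1 under a Poisson model)."""
    return statistics.variance(counts) / statistics.mean(counts)

# Made-up hourly event counts with a few large bursts (illustrative only).
counts = [0, 1, 0, 2, 15, 0, 1, 22, 0, 3]

ratio = dispersion_index(counts)
print(ratio)  # well above 1, so a plain Poisson model would fit poorly
```

A ratio far above 1 suggests trying a model with a separate variance parameter, such as the negative binomial.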

Key insight: The Poisson distribution is the limit of the Binomial when $n \to \infty$ and $p \to 0$ while $np = \lambda$ stays constant. It models rare events in large populations.
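The limit can be observed numerically by fixing $\lambda$ and letting $n$ grow with $p = \lambda / n$. This is a small sketch; the values of $\lambda$ and $k$ are arbitrary.

```python
import math

def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for X ~ Poisson(lam)."""
    return lam**k * math.exp(-lam) / math.factorial(k)

lam, k = 3.0, 2
target = poisson_pmf(k, lam)

# Absolute gap between Binomial(n, lam / n) and Poisson(lam) at k, for growing n.
gaps = {n: abs(binomial_pmf(k, n, lam / n) - target) for n in (10, 100, 10_000)}

for n, gap in gaps.items():
    print(n, gap)  # the gap shrinks as n increases
```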

In ML: Count regression (Poisson regression), modeling word frequencies, event rate estimation, and anomaly detection on count data.

Uniform Distribution

Discrete Uniform

Every outcome is equally likely over a finite set $\{a, a+1, \ldots, b\}$:

$$P(X = k) = \frac{1}{b - a + 1}$$

Continuous Uniform

Equal probability density over an interval $[a, b]$:

$$f(x) = \frac{1}{b - a} \quad \text{for } a \leq x \leq b$$

Properties (continuous case):

$$\mathbb{E}[X] = \frac{a + b}{2} \qquad \text{Var}(X) = \frac{(b - a)^2}{12}$$

In ML: Random initialization of weights, random search for hyperparameters, and as a non-informative prior in Bayesian inference (a uniform prior says “all parameter values are equally plausible”).

Exponential Distribution

Models the time between events in a Poisson process.

$$X \sim \text{Exponential}(\lambda) \quad \Rightarrow \quad f(x) = \lambda e^{-\lambda x} \quad \text{for } x \geq 0$$

Properties:

$$\mathbb{E}[X] = \frac{1}{\lambda} \qquad \text{Var}(X) = \frac{1}{\lambda^2}$$

The exponential distribution is memoryless: $P(X > s + t \mid X > s) = P(X > t)$. The probability of waiting another $t$ minutes is independent of how long you’ve already waited.
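Memorylessness follows directly from the survival function $P(X > x) = e^{-\lambda x}$, since $e^{-\lambda (s + t)} / e^{-\lambda s} = e^{-\lambda t}$. A minimal numerical check, with arbitrary values for $\lambda$, $s$, and $t$:

```python
import math

def exp_survival(x: float, lam: float) -> float:
    """P(X > x) for X ~ Exponential(lam), valid for x >= 0."""
    return math.exp(-lam * x)

lam, s, t = 0.5, 3.0, 2.0

# P(X > s + t | X > s) = P(X > s + t) / P(X > s)
conditional = exp_survival(s + t, lam) / exp_survival(s, lam)
unconditional = exp_survival(t, lam)

print(conditional, unconditional)  # the two values agree up to rounding
```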

In ML: Modeling inter-arrival times, survival analysis, and as a prior for positive-valued parameters.

Gaussian (Normal) Distribution

The most important distribution in all of statistics and ML.

$$X \sim \mathcal{N}(\mu, \sigma^2) \quad \Rightarrow \quad f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$

Properties:

$$\mathbb{E}[X] = \mu \qquad \text{Var}(X) = \sigma^2$$

Why the Gaussian is Everywhere

  1. Central Limit Theorem: The suitably standardized sum of many independent, identically distributed random variables with finite variance converges in distribution to a Gaussian, regardless of the original distribution. This is explored in depth in the convergence article.

  2. Maximum entropy: Among all distributions with a given mean and variance, the Gaussian has the highest entropy. It is the “most uncertain” distribution under those constraints — making it the most conservative assumption.

  3. Analytical convenience: The Gaussian is closed under linear transformations, marginalization, and conditioning. This makes it the backbone of linear models, Kalman filters, and Gaussian processes.
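The Central Limit Theorem in the list above can be illustrated with a small simulation. The sketch below sums Uniform(0, 1) draws, standardizes the sum using the uniform's mean $1/2$ and variance $1/12$, and checks that the result looks roughly standard normal. The sample sizes and seed are arbitrary choices.

```python
import math
import random
import statistics

def standardized_sum(n: int, rng: random.Random) -> float:
    """Sum n Uniform(0, 1) draws, then subtract the mean and divide by the std."""
    total = sum(rng.random() for _ in range(n))
    return (total - n * 0.5) / math.sqrt(n / 12.0)

rng = random.Random(0)  # fixed seed so the run is reproducible
samples = [standardized_sum(30, rng) for _ in range(20_000)]

sample_mean = statistics.mean(samples)
sample_std = statistics.pstdev(samples)
within_one = sum(abs(z) <= 1.0 for z in samples) / len(samples)

# Expect roughly 0, 1, and 0.68 if the Gaussian approximation is good.
print(sample_mean, sample_std, within_one)
```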

The Standard Normal

The special case $\mu = 0$ and $\sigma = 1$ is the standard normal. Any Gaussian $X \sim \mathcal{N}(\mu, \sigma^2)$ can be standardized:

$$Z = \frac{X - \mu}{\sigma} \sim \mathcal{N}(0, 1)$$

This standardization lets us compare values across different scales.

68-95-99.7 rule: About 68% of values fall within $\pm 1\sigma$ of the mean, 95% within $\pm 2\sigma$, and 99.7% within $\pm 3\sigma$.
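These percentages come from the standard normal CDF and can be reproduced with Python's `statistics.NormalDist`. The helper names and the example values of $\mu$, $\sigma$, and $x$ below are ours, chosen for illustration.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

def mass_within(k: float) -> float:
    """Probability that a Gaussian falls within k standard deviations of its mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

def z_score(x: float, mu: float, sigma: float) -> float:
    """Standardize x under N(mu, sigma^2)."""
    return (x - mu) / sigma

for k in (1, 2, 3):
    print(k, round(mass_within(k), 4))  # about 0.6827, 0.9545, 0.9973

# Illustrative values: 130 on a scale with mean 100 and std 15 is 2 stds above the mean.
z = z_score(130.0, 100.0, 15.0)
```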

In ML: Gaussian noise assumptions underpin linear regression, Gaussian Naive Bayes, Gaussian Mixture Models, variational autoencoders (VAEs), and the initialization of neural network weights.

Multivariate Gaussian

The generalization of the Gaussian to $d$ dimensions:

$$\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$$

$$f(\mathbf{x}) = \frac{1}{(2\pi)^{d/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right)$$

Where:

  • $\boldsymbol{\mu} \in \mathbb{R}^d$ is the mean vector
  • $\boldsymbol{\Sigma} \in \mathbb{R}^{d \times d}$ is the covariance matrix (symmetric and positive semi-definite in general; the density above requires it to be positive definite so that $\boldsymbol{\Sigma}^{-1}$ exists)

Properties

The covariance matrix encodes both the spread (diagonal elements) and correlations (off-diagonal elements) between dimensions.

Three special cases:

  • Spherical: $\boldsymbol{\Sigma} = \sigma^2 \mathbf{I}$, equal variance in all directions, no correlation
  • Diagonal: $\boldsymbol{\Sigma} = \text{diag}(\sigma_1^2, \ldots, \sigma_d^2)$, different variances, no correlation
  • Full: arbitrary $\boldsymbol{\Sigma}$, different variances and correlations

Conditional and Marginal

One of the most powerful properties: if $\mathbf{x} = [\mathbf{x}_1, \mathbf{x}_2]^\top$ is jointly Gaussian, then:

  • Marginals are Gaussian: $\mathbf{x}_1 \sim \mathcal{N}(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_{11})$
  • Conditionals are Gaussian: $\mathbf{x}_1 \mid \mathbf{x}_2 \sim \mathcal{N}(\boldsymbol{\mu}_{1|2}, \boldsymbol{\Sigma}_{1|2})$

This closure property is why Gaussian models are so tractable.
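The text above does not spell out the conditional parameters. As a sketch, we assume the standard formulas for the bivariate case, where each block is a scalar: $\mu_{1|2} = \mu_1 + \frac{\Sigma_{12}}{\Sigma_{22}}(x_2 - \mu_2)$ and $\Sigma_{1|2} = \Sigma_{11} - \frac{\Sigma_{12}^2}{\Sigma_{22}}$. The function name and numbers below are illustrative.

```python
def condition_bivariate(mu1: float, mu2: float,
                        s11: float, s12: float, s22: float,
                        x2: float) -> tuple:
    """Mean and variance of x1 given x2 for a jointly Gaussian pair.

    s11 and s22 are the variances of x1 and x2; s12 is their covariance.
    Assumes s22 > 0.
    """
    gain = s12 / s22                     # how strongly x2 informs x1
    cond_mean = mu1 + gain * (x2 - mu2)  # shift the mean toward the observation
    cond_var = s11 - gain * s12          # observing x2 can only reduce variance
    return cond_mean, cond_var

# Unit variances with correlation 0.8, observing x2 = 1.0 (illustrative values).
cond_mean, cond_var = condition_bivariate(0.0, 0.0, 1.0, 0.8, 1.0, x2=1.0)
print(cond_mean, cond_var)  # about 0.8 and 0.36
```

Note that the conditional variance does not depend on the observed value of $x_2$, only on the covariance structure.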

In ML: Gaussian Mixture Models (GMMs) for clustering, Gaussian Discriminant Analysis, Gaussian Processes, multivariate feature modeling, and the reparameterization trick in VAEs.

Beta Distribution

A distribution over probabilities, that is, values in $[0, 1]$.

$$X \sim \text{Beta}(\alpha, \beta) \quad \Rightarrow \quad f(x) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha, \beta)}$$

where $B(\alpha, \beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha + \beta)}$ is the Beta function.

Properties:

$$\mathbb{E}[X] = \frac{\alpha}{\alpha + \beta} \qquad \text{Var}(X) = \frac{\alpha \beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}$$

Shape Behavior

| Parameters | Shape | Interpretation |
|---|---|---|
| $\alpha = \beta = 1$ | Uniform | No preference |
| $\alpha = \beta > 1$ | Bell-shaped, centered at 0.5 | Preference for values near 0.5 |
| $\alpha > \beta$ | Skewed left (mass shifted toward 1) | Preference for higher values |
| $\alpha < 1, \beta < 1$ | U-shaped | Preference for extremes |

Conjugate Prior

The Beta is the conjugate prior for the Bernoulli/Binomial likelihood. If your prior is $\text{Beta}(\alpha, \beta)$ and you observe $k$ successes in $n$ trials, the posterior is:

$$P(p \mid \text{data}) = \text{Beta}(\alpha + k, \beta + n - k)$$

This is beautifully simple: just add your observations to the prior counts. We use this extensively in MAP estimation and Bayesian inference.
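The update rule is short enough to write out directly. Below is a minimal sketch with a hypothetical $\text{Beta}(2, 2)$ prior and 7 successes in 10 trials; the function names are ours.

```python
def beta_posterior(alpha: float, beta: float, k: int, n: int) -> tuple:
    """Posterior Beta parameters after observing k successes in n trials."""
    return alpha + k, beta + (n - k)

def beta_mean(alpha: float, beta: float) -> float:
    """E[X] for X ~ Beta(alpha, beta)."""
    return alpha / (alpha + beta)

# Hypothetical prior and data, chosen only for illustration.
prior_a, prior_b = 2.0, 2.0
post_a, post_b = beta_posterior(prior_a, prior_b, k=7, n=10)  # Beta(9, 5)

prior_mean = beta_mean(prior_a, prior_b)      # 0.5
posterior_mean = beta_mean(post_a, post_b)    # 9 / 14, between 0.5 and 7 / 10
print(post_a, post_b, posterior_mean)
```

The posterior mean lands between the prior mean and the observed success rate, and moves closer to the data as $n$ grows.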

In ML: Prior distributions for probabilities, Thompson sampling in bandits, Bayesian A/B testing, and Dirichlet-Multinomial models (the Dirichlet is the multivariate generalization of Beta).

Gamma Distribution

A distribution over positive real values, generalizing the Exponential.

$$X \sim \text{Gamma}(\alpha, \beta) \quad \Rightarrow \quad f(x) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha - 1} e^{-\beta x} \quad \text{for } x > 0$$

Properties:

$$\mathbb{E}[X] = \frac{\alpha}{\beta} \qquad \text{Var}(X) = \frac{\alpha}{\beta^2}$$

Note: When $\alpha = 1$, the Gamma reduces to the Exponential with rate $\beta$. The Gamma generalizes the Exponential to allow for more flexible shapes.
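The reduction to the Exponential can be verified by evaluating both densities at a few points. This sketch uses the shape-rate parameterization given above; the rate and evaluation points are arbitrary.

```python
import math

def gamma_pdf(x: float, alpha: float, beta: float) -> float:
    """Density of Gamma(alpha, beta) in the shape-rate parameterization, x > 0."""
    return beta**alpha / math.gamma(alpha) * x ** (alpha - 1) * math.exp(-beta * x)

def exponential_pdf(x: float, lam: float) -> float:
    """Density of Exponential(lam), x >= 0."""
    return lam * math.exp(-lam * x)

rate = 2.0
points = (0.1, 1.0, 4.0)

# With alpha = 1, Gamma(1, rate) and Exponential(rate) should coincide.
max_gap = max(abs(gamma_pdf(x, 1.0, rate) - exponential_pdf(x, rate)) for x in points)
print(max_gap)  # zero up to floating-point rounding
```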

In ML: Conjugate prior for the precision (inverse variance) of a Gaussian. Used in Bayesian linear regression and Gamma regression for positive-valued targets.

Distribution Selection Guide

| Data Type | Distribution | Example |
|---|---|---|
| Binary outcome | Bernoulli | Spam / not spam |
| Count of successes | Binomial | Correct predictions out of $n$ |
| Rare event count | Poisson | Server errors per hour |
| Time between events | Exponential | Time until next click |
| Continuous, symmetric | Gaussian | Measurement errors |
| Multi-dimensional continuous | Multivariate Gaussian | Feature vectors |
| Probability parameter | Beta | Click-through rate prior |
| Positive continuous | Gamma | Waiting times, precision |

Relationships Between Distributions

The distributions form a rich family of connections:

  • $\text{Bernoulli}(p)$ is $\text{Binomial}(1, p)$
  • $\text{Binomial}(n, p) \to \text{Poisson}(\lambda)$ as $n \to \infty$, $p \to 0$, $np = \lambda$
  • $\text{Binomial}(n, p)$ is approximately $\mathcal{N}(np, np(1-p))$ for large $n$ and fixed $p$ (Central Limit Theorem)
  • $\text{Exponential}(\lambda)$ is $\text{Gamma}(1, \lambda)$
  • $\text{Beta}(1, 1)$ is $\text{Uniform}(0, 1)$
  • Sum of $n$ independent $\text{Exponential}(\lambda)$ variables is $\text{Gamma}(n, \lambda)$

Understanding these connections helps you choose the right distribution and derive new results from known ones. We explore the Central Limit Theorem in depth in the next article.
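As one example of checking such a connection empirically, the sketch below simulates sums of $n$ independent Exponential draws and compares their sample mean and variance with the Gamma moments $n / \lambda$ and $n / \lambda^2$. The seed, $n$, and $\lambda$ are arbitrary, and matching two moments is only a sanity check, not a proof of the distributional identity.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is reproducible
n, lam = 5, 2.0

# Each entry is the sum of n independent Exponential(lam) draws.
# random.expovariate takes the rate (1 / mean) as its argument.
sums = [sum(rng.expovariate(lam) for _ in range(n)) for _ in range(50_000)]

sample_mean = statistics.mean(sums)
sample_var = statistics.variance(sums)

gamma_mean = n / lam       # E[X] for Gamma(n, lam): 2.5
gamma_var = n / lam**2     # Var(X) for Gamma(n, lam): 1.25

print(sample_mean, gamma_mean)
print(sample_var, gamma_var)
```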

Summary

  • Each distribution encodes specific assumptions about the data
  • Bernoulli/Binomial for binary/count outcomes, Poisson for rare events
  • The Gaussian dominates due to the Central Limit Theorem and maximum entropy
  • The Multivariate Gaussian extends to $d$ dimensions with a covariance matrix
  • Beta and Gamma serve as conjugate priors in Bayesian inference
  • Choosing the right distribution is choosing the right inductive bias for your model
