Probability Distributions

Probability & Statistics Series 3 / 13

Why Distributions Matter

Every probabilistic model in machine learning assumes some distribution over the data. Choosing the right distribution is choosing the right inductive bias — it tells the model what kind of patterns to expect.

In the random variables article, we introduced PMFs, PDFs, and the mechanics of working with random variables. Here we go deep: for each distribution, we cover its definition, parameters, properties, and where it appears in ML.

Bernoulli Distribution

The simplest distribution: a single trial with two outcomes.

X \sim \text{Bernoulli}(p) \quad \Rightarrow \quad P(X = 1) = p, \quad P(X = 0) = 1 - p

Properties:

\mathbb{E}[X] = p \qquad \text{Var}(X) = p(1 - p)

In ML: The output of binary classification. Logistic regression models each prediction as a Bernoulli random variable with $p = \sigma(\mathbf{w}^\top \mathbf{x})$.
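To make the Bernoulli view of a binary classifier concrete, here is a minimal sketch that maps a linear score to $p$ with the logistic sigmoid and evaluates the PMF. The weights, features, and helper names are illustrative assumptions, not values from a trained model.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def bernoulli_pmf(x, p):
    """P(X = x) for X ~ Bernoulli(p), with x in {0, 1}."""
    return p if x == 1 else 1.0 - p

# Hypothetical weights and features, chosen only for illustration.
w = [0.5, -1.0]
x = [2.0, 0.5]

score = sum(wi * xi for wi, xi in zip(w, x))  # the linear score w^T x
p = sigmoid(score)

print(f"p = {p:.4f}")
print(f"P(X=1) = {bernoulli_pmf(1, p):.4f}, P(X=0) = {bernoulli_pmf(0, p):.4f}")
print(f"mean = {p:.4f}, variance = {p * (1 - p):.4f}")
```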

Binomial Distribution

The sum of $n$ independent Bernoulli trials.

X \sim \text{Binomial}(n, p) \quad \Rightarrow \quad P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}

Properties:

\mathbb{E}[X] = np \qquad \text{Var}(X) = np(1 - p)

Example: If a spam filter classifies each email correctly with probability 0.9, independently of the others, and processes 100 emails, the number of correctly classified emails follows $\text{Binomial}(100, 0.9)$.

In ML: Model evaluation, where correct predictions are counted over a test set. In bootstrap sampling, the number of times a given example appears in a resample of size $n$ drawn with replacement follows $\text{Binomial}(n, 1/n)$.
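The spam-filter example can be checked numerically. The sketch below builds the full $\text{Binomial}(100, 0.9)$ PMF, under the assumption that emails are classified independently, and recovers the mean $np$ and variance $np(1-p)$ directly from it. The function name is an illustrative choice.

```python
import math

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 100, 0.9
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

total = sum(pmf)                                    # should be 1
mean = sum(k * pk for k, pk in enumerate(pmf))      # should match n * p
variance = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))  # n * p * (1 - p)
p_at_least_85 = sum(pmf[85:])                       # P(X >= 85)

print(f"sum of PMF = {total:.6f}")
print(f"mean = {mean:.4f} (np = {n * p})")
print(f"variance = {variance:.4f} (np(1-p) = {n * p * (1 - p):.4f})")
print(f"P(at least 85 correct) = {p_at_least_85:.4f}")
```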

Poisson Distribution

Models the number of events in a fixed interval, given a constant average rate $\lambda$.

X \sim \text{Poisson}(\lambda) \quad \Rightarrow \quad P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}

Properties:

\mathbb{E}[X] = \lambda \qquad \text{Var}(X) = \lambda

The mean and variance being equal is a key signature. If your count data has variance much larger than its mean, the Poisson model is a poor fit (overdispersion).
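A quick way to screen for overdispersion is to compare the sample variance with the sample mean. The counts below are made-up values used only to illustrate the check, and the ratio is a rough diagnostic rather than a formal test.

```python
import statistics

# Hypothetical hourly error counts, invented for illustration.
counts = [0, 1, 0, 2, 9, 0, 0, 14, 1, 0]

mean = statistics.mean(counts)
var = statistics.variance(counts)  # sample variance (divides by n - 1)
dispersion = var / mean            # close to 1 under a Poisson model

print(f"mean = {mean:.2f}, variance = {var:.2f}, ratio = {dispersion:.2f}")
if dispersion > 1.5:  # informal threshold, an assumption of this sketch
    print("Variance far exceeds the mean: a plain Poisson model looks doubtful.")
```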

Key insight: The Poisson distribution is the limit of the Binomial when $n \to \infty$ and $p \to 0$ while $np = \lambda$ stays constant. It models rare events in large populations.
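The limit can be observed numerically. This sketch fixes $\lambda = 3$, sets $p = \lambda / n$, and measures the largest pointwise gap between the two PMFs as $n$ grows. The function names and the cutoff `k_max` are illustrative choices.

```python
import math

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p); zero when k > n."""
    if k > n:
        return 0.0
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return lam**k * math.exp(-lam) / math.factorial(k)

def max_abs_gap(n, lam, k_max=20):
    """Largest |Binomial(n, lam/n) - Poisson(lam)| PMF difference for k <= k_max."""
    p = lam / n
    return max(
        abs(binomial_pmf(k, n, p) - poisson_pmf(k, lam)) for k in range(k_max + 1)
    )

lam = 3.0
gaps = {n: max_abs_gap(n, lam) for n in (10, 100, 1000, 10000)}
for n, gap in gaps.items():
    print(f"n = {n:>5}: max PMF gap = {gap:.6f}")
```

The gap shrinks roughly in proportion to $1/n$, which is why the Poisson is a good stand-in for a Binomial with many trials and a small success probability.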

In ML: Count regression (Poisson regression), modeling word frequencies, event rate estimation, and anomaly detection on count data.

Uniform Distribution

Discrete Uniform

Every outcome is equally likely over a finite set $\{a, a+1, \ldots, b\}$:

P(X = k) = \frac{1}{b - a + 1}

Continuous Uniform

Equal probability density over an interval $[a, b]$:

f(x) = \frac{1}{b - a} \quad \text{for } a \leq x \leq b

Properties (continuous case):

\mathbb{E}[X] = \frac{a + b}{2} \qquad \text{Var}(X) = \frac{(b - a)^2}{12}

In ML: Random initialization of weights, random search for hyperparameters, and as a non-informative prior in Bayesian inference (a uniform prior says “all parameter values are equally plausible”).

Exponential Distribution

Models the time between events in a Poisson process.

X \sim \text{Exponential}(\lambda) \quad \Rightarrow \quad f(x) = \lambda e^{-\lambda x} \quad \text{for } x \geq 0

Properties:

\mathbb{E}[X] = \frac{1}{\lambda} \qquad \text{Var}(X) = \frac{1}{\lambda^2}

The exponential distribution is memoryless: $P(X > s + t \mid X > s) = P(X > t)$. The probability of waiting at least another $t$ minutes does not depend on how long you have already waited.
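Memorylessness follows from the survival function $P(X > x) = e^{-\lambda x}$, since $e^{-\lambda(s+t)} / e^{-\lambda s} = e^{-\lambda t}$. The sketch below checks this identity exactly and then again by simulation. The values of $\lambda$, $s$, and $t$ are arbitrary.

```python
import math
import random

def survival(x, lam):
    """P(X > x) for X ~ Exponential(lam)."""
    return math.exp(-lam * x)

lam, s, t = 0.5, 2.0, 3.0  # arbitrary illustrative values

# Exact check: P(X > s + t | X > s) versus P(X > t).
conditional = survival(s + t, lam) / survival(s, lam)
unconditional = survival(t, lam)

# Monte Carlo check: keep only the draws that already exceeded s.
rng = random.Random(0)
samples = [rng.expovariate(lam) for _ in range(200_000)]
survivors = [x for x in samples if x > s]
empirical = sum(x > s + t for x in survivors) / len(survivors)

print(f"exact conditional   = {conditional:.4f}")
print(f"exact unconditional = {unconditional:.4f}")
print(f"simulated           = {empirical:.4f}")
```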

In ML: Modeling inter-arrival times, survival analysis, and as a prior for positive-valued parameters.

Gaussian (Normal) Distribution

Arguably the most important distribution in statistics and ML.

X \sim \mathcal{N}(\mu, \sigma^2) \quad \Rightarrow \quad f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)

Properties:

\mathbb{E}[X] = \mu \qquad \text{Var}(X) = \sigma^2

Why the Gaussian is Everywhere

Central Limit Theorem: The standardized sum of many independent, identically distributed random variables with finite variance converges in distribution to a Gaussian, regardless of the original distribution. This is explored in depth in the convergence article.
Maximum entropy: Among all distributions with a given mean and variance, the Gaussian has the highest entropy. It is the “most uncertain” distribution under those constraints — making it the most conservative assumption.
Analytical convenience: The Gaussian is closed under linear transformations, marginalization, and conditioning. This makes it the backbone of linear models, Kalman filters, and Gaussian processes.
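The Central Limit Theorem can be seen in a small simulation. This sketch sums 30 draws from $\text{Uniform}(0, 1)$, which is far from bell-shaped, standardizes each sum using the known mean $1/2$ and variance $1/12$, and checks that the results behave like a standard normal. The sample sizes and seed are arbitrary choices.

```python
import math
import random
import statistics

rng = random.Random(42)
n_terms = 30     # number of uniform draws in each sum
n_sums = 5000    # number of sums to simulate
mu, sigma2 = 0.5, 1 / 12  # mean and variance of Uniform(0, 1)

def standardized_sum():
    """Sum n_terms uniform draws, then subtract the mean and divide by the std."""
    total = sum(rng.random() for _ in range(n_terms))
    return (total - n_terms * mu) / math.sqrt(n_terms * sigma2)

z = [standardized_sum() for _ in range(n_sums)]
within_one = sum(abs(v) <= 1 for v in z) / n_sums

print(f"mean of z   = {statistics.mean(z):.3f}  (standard normal: 0)")
print(f"std of z    = {statistics.pstdev(z):.3f}  (standard normal: 1)")
print(f"P(|z| <= 1) = {within_one:.3f}  (standard normal: about 0.683)")
```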

The Standard Normal

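The standard normal is the special case $\mu = 0$ and $\sigma = 1$. Any Gaussian $X \sim \mathcal{N}(\mu, \sigma^2)$ can be mapped to it by subtracting the mean and dividing by the standard deviation: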

Z = \frac{X - \mu}{\sigma} \sim \mathcal{N}(0, 1)

This standardization lets us compare values across different scales.

68-95-99.7 rule: About 68% of values fall within $\pm 1\sigma$ of the mean, about 95% within $\pm 2\sigma$, and about 99.7% within $\pm 3\sigma$.
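The rule's percentages come from the standard normal CDF. For $Z \sim \mathcal{N}(0, 1)$, $P(|Z| \leq k) = \operatorname{erf}(k / \sqrt{2})$, which the sketch below evaluates with the standard library. The helper name is an illustrative choice.

```python
import math

def prob_within(k):
    """P(|Z| <= k) for a standard normal Z, via the error function."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} sigma: {prob_within(k):.4f}")
```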

In ML: Gaussian noise assumptions underpin linear regression, Gaussian Naive Bayes, Gaussian Mixture Models, variational autoencoders (VAEs), and the initialization of neural network weights.

Multivariate Gaussian

The generalization of the Gaussian to $d$ dimensions:

\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})

f(\mathbf{x}) = \frac{1}{(2\pi)^{d/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right)

Where:

$\boldsymbol{\mu} \in \mathbb{R}^d$ is the mean vector
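$\boldsymbol{\Sigma} \in \mathbb{R}^{d \times d}$ is the covariance matrix (symmetric and positive semi-definite in general; it must be positive definite for the density above to exist, since the formula uses $\boldsymbol{\Sigma}^{-1}$)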

Properties

The covariance matrix encodes both the spread (diagonal elements) and correlations (off-diagonal elements) between dimensions.

Three special cases:

Spherical: $\boldsymbol{\Sigma} = \sigma^2 \mathbf{I}$ — equal variance in all directions, no correlation
Diagonal: $\boldsymbol{\Sigma} = \text{diag}(\sigma_1^2, \ldots, \sigma_d^2)$ — different variances, no correlation
Full: arbitrary $\boldsymbol{\Sigma}$ — different variances and correlations

Conditional and Marginal

One of the most powerful properties: if $\mathbf{x} = [\mathbf{x}_1, \mathbf{x}_2]^\top$ is jointly Gaussian, then:

Marginals are Gaussian: $\mathbf{x}_1 \sim \mathcal{N}(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_{11})$
Conditionals are Gaussian: $\mathbf{x}_1 \mid \mathbf{x}_2 \sim \mathcal{N}(\boldsymbol{\mu}_{1|2}, \boldsymbol{\Sigma}_{1|2})$

This closure property is why Gaussian models are so tractable.
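For a bivariate Gaussian with scalar blocks, the standard conditioning formulas are $\mu_{1|2} = \mu_1 + \frac{\Sigma_{12}}{\Sigma_{22}}(x_2 - \mu_2)$ and $\Sigma_{1|2} = \Sigma_{11} - \frac{\Sigma_{12}\Sigma_{21}}{\Sigma_{22}}$. The sketch below applies them to made-up parameters. In higher dimensions the divisions become multiplications by $\boldsymbol{\Sigma}_{22}^{-1}$, which this two-dimensional example does not cover.

```python
def condition_bivariate(mu, cov, x2):
    """Mean and variance of x1 | x2 for a bivariate Gaussian.

    mu  : (mu1, mu2)
    cov : ((s11, s12), (s21, s22)), symmetric with s22 > 0
    x2  : observed value of the second coordinate
    """
    mu1, mu2 = mu
    (s11, s12), (s21, s22) = cov
    cond_mean = mu1 + s12 / s22 * (x2 - mu2)
    cond_var = s11 - s12 * s21 / s22
    return cond_mean, cond_var

# Hypothetical parameters, chosen only for illustration.
mu = (1.0, 2.0)
cov = ((4.0, 1.2), (1.2, 1.0))

cond_mean, cond_var = condition_bivariate(mu, cov, x2=3.0)
print(f"x1 | x2 = 3.0 ~ N({cond_mean:.2f}, {cond_var:.2f})")
```

Observing $x_2$ shifts the mean of $x_1$ in the direction of the correlation and never increases its variance: here the variance falls from 4.0 to 2.56.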

In ML: Gaussian Mixture Models (GMMs) for clustering, Gaussian Discriminant Analysis, Gaussian Processes, multivariate feature modeling, and the reparameterization trick in VAEs.

Beta Distribution

A distribution over probabilities, that is, over values in $[0, 1]$.

X \sim \text{Beta}(\alpha, \beta) \quad \Rightarrow \quad f(x) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha, \beta)}

where $B(\alpha, \beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha + \beta)}$ is the Beta function.

Properties:

\mathbb{E}[X] = \frac{\alpha}{\alpha + \beta} \qquad \text{Var}(X) = \frac{\alpha \beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}

Shape Behavior

Parameters	Shape	Interpretation
$\alpha = \beta = 1$	Uniform	No preference
$\alpha = \beta > 1$	Bell-shaped, centered at 0.5	Preference for values near 0.5
$\alpha > \beta$	Skewed left (mass concentrated toward 1)	Preference for higher values
$\alpha < 1, \beta < 1$	U-shaped	Preference for extremes

Conjugate Prior

The Beta is the conjugate prior for the Bernoulli/Binomial likelihood. If your prior is $\text{Beta}(\alpha, \beta)$ and you observe $k$ successes in $n$ trials, the posterior is:

p \mid \text{data} \sim \text{Beta}(\alpha + k, \beta + n - k)

This is beautifully simple: just add your observations to the prior counts. We use this extensively in MAP estimation and Bayesian inference.
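The update rule is short enough to write out directly. This sketch starts from a $\text{Beta}(2, 2)$ prior, observes 7 successes in 10 trials, and reports the posterior mean and variance using the formulas from the Properties above. The prior and the data are made-up values.

```python
def beta_update(alpha, beta, k, n):
    """Posterior Beta parameters after k successes in n Bernoulli trials."""
    return alpha + k, beta + (n - k)

def beta_mean(alpha, beta):
    return alpha / (alpha + beta)

def beta_variance(alpha, beta):
    total = alpha + beta
    return alpha * beta / (total**2 * (total + 1))

# Hypothetical prior and data, chosen only for illustration.
prior = (2.0, 2.0)
posterior = beta_update(*prior, k=7, n=10)

print(f"prior     Beta{prior}: mean = {beta_mean(*prior):.3f}")
print(f"posterior Beta{posterior}: mean = {beta_mean(*posterior):.3f}, "
      f"variance = {beta_variance(*posterior):.4f}")
```

The posterior mean, $9/14 \approx 0.643$, sits between the prior mean of 0.5 and the observed frequency of 0.7, and it moves toward the data as $n$ grows.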

In ML: Prior distributions for probabilities, Thompson sampling in bandits, Bayesian A/B testing, and Dirichlet-Multinomial models (the Dirichlet is the multivariate generalization of Beta).

Gamma Distribution

A distribution over positive real values, generalizing the Exponential.

X \sim \text{Gamma}(\alpha, \beta) \quad \Rightarrow \quad f(x) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha - 1} e^{-\beta x} \quad \text{for } x > 0

Properties:

\mathbb{E}[X] = \frac{\alpha}{\beta} \qquad \text{Var}(X) = \frac{\alpha}{\beta^2}

Note: When $\alpha = 1$, the Gamma reduces to the Exponential with rate $\beta$. For $\alpha > 1$ the mode moves away from zero to $(\alpha - 1)/\beta$, which allows for more flexible shapes.
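The reduction to the Exponential can be verified by evaluating both densities. This sketch implements the Gamma PDF in the rate parameterization used above and compares it with the Exponential PDF at a few points. The rate and the evaluation points are arbitrary.

```python
import math

def gamma_pdf(x, alpha, beta):
    """Gamma density with shape alpha and rate beta, for x > 0."""
    return beta**alpha / math.gamma(alpha) * x ** (alpha - 1) * math.exp(-beta * x)

def exponential_pdf(x, lam):
    """Exponential density with rate lam, for x >= 0."""
    return lam * math.exp(-lam * x)

rate = 2.0
xs = [0.1, 0.5, 1.0, 3.0]
max_gap = max(abs(gamma_pdf(x, 1.0, rate) - exponential_pdf(x, rate)) for x in xs)

for x in xs:
    print(f"x = {x}: Gamma(1, {rate}) = {gamma_pdf(x, 1.0, rate):.6f}, "
          f"Exponential({rate}) = {exponential_pdf(x, rate):.6f}")
print(f"largest difference = {max_gap:.2e}")
```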

In ML: Conjugate prior for the precision (inverse variance) of a Gaussian with known mean. Used in Bayesian linear regression and in Gamma regression for positive-valued targets.

Distribution Selection Guide

Data Type	Distribution	Example
Binary outcome	Bernoulli	Spam / not spam
Count of successes	Binomial	Correct predictions out of $n$
Rare event count	Poisson	Server errors per hour
Time between events	Exponential	Time until next click
Continuous, symmetric	Gaussian	Measurement errors
Multi-dimensional continuous	Multivariate Gaussian	Feature vectors
Probability parameter	Beta	Click-through rate prior
Positive continuous	Gamma	Waiting times, precision

Relationships Between Distributions

These distributions are closely connected:

$\text{Bernoulli}(p)$ is $\text{Binomial}(1, p)$
$\text{Binomial}(n, p) \to \text{Poisson}(\lambda)$ as $n \to \infty$, $p \to 0$, with $np = \lambda$ held constant
$\text{Binomial}(n, p) \approx \mathcal{N}(np, np(1-p))$ for large $n$ with $p$ fixed (Central Limit Theorem)
$\text{Exponential}(\lambda)$ is $\text{Gamma}(1, \lambda)$
$\text{Beta}(1, 1)$ is $\text{Uniform}(0, 1)$
Sum of $n$ independent $\text{Exponential}(\lambda)$ variables is $\text{Gamma}(n, \lambda)$
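The last relationship can be checked by simulation. This sketch sums $n = 5$ independent $\text{Exponential}(2)$ draws many times and compares the sample mean and variance with the $\text{Gamma}(n, \lambda)$ values $n/\lambda$ and $n/\lambda^2$. Matching two moments is only a sanity check, not a proof that the distributions coincide, and the seed and sample size are arbitrary.

```python
import random
import statistics

rng = random.Random(7)
n_terms, lam = 5, 2.0
n_samples = 20_000

# Each entry is the sum of n_terms independent Exponential(lam) draws.
sums = [
    sum(rng.expovariate(lam) for _ in range(n_terms)) for _ in range(n_samples)
]

sample_mean = statistics.mean(sums)
sample_var = statistics.variance(sums)
gamma_mean = n_terms / lam       # alpha / beta with alpha = n, beta = lam
gamma_var = n_terms / lam**2     # alpha / beta^2

print(f"sample mean = {sample_mean:.3f}, Gamma mean = {gamma_mean:.3f}")
print(f"sample var  = {sample_var:.3f}, Gamma var  = {gamma_var:.3f}")
```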

Understanding these connections helps you choose the right distribution and derive new results from known ones. We explore the Central Limit Theorem in depth in the next article.

Summary

Each distribution encodes specific assumptions about the data
Bernoulli/Binomial for binary/count outcomes, Poisson for rare events
The Gaussian dominates due to the Central Limit Theorem and maximum entropy
The Multivariate Gaussian extends to $d$ dimensions with a covariance matrix
Beta and Gamma serve as conjugate priors in Bayesian inference
Choosing the right distribution is choosing the right inductive bias for your model

