Why Distributions Matter
Every probabilistic model in machine learning assumes some distribution over the data. Choosing the right distribution is choosing the right inductive bias — it tells the model what kind of patterns to expect.
In the random variables article, we introduced PMFs, PDFs, and the mechanics of working with random variables. Here we go deep: for each distribution, we cover its definition, parameters, properties, and where it appears in ML.
Bernoulli Distribution
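The simplest distribution: a single trial with two outcomes, success ($X = 1$) with probability $p$ and failure ($X = 0$) with probability $1 - p$.
$$P(X = x) = p^x (1 - p)^{1 - x}, \quad x \in \{0, 1\}$$
Properties: $\mathbb{E}[X] = p$ and $\mathrm{Var}(X) = p(1 - p)$.
In ML: The output of binary classification. Logistic regression models each label as a Bernoulli random variable with $p = \sigma(\mathbf{w}^\top \mathbf{x})$, where $\sigma$ is the sigmoid function.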
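To make the link between logistic regression and the Bernoulli distribution concrete, here is a minimal sketch using only the Python standard library. The weights, features, and function names are hypothetical, chosen for illustration; the negative log-likelihood of a Bernoulli label is the quantity usually called binary cross-entropy.

```python
import math
import random

def sigmoid(z: float) -> float:
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def bernoulli_pmf(x: int, p: float) -> float:
    """P(X = x) for X ~ Bernoulli(p), with x in {0, 1}."""
    return p if x == 1 else 1.0 - p

def bernoulli_nll(x: int, p: float) -> float:
    """Negative log-likelihood of one label (binary cross-entropy)."""
    return -math.log(bernoulli_pmf(x, p))

# Hypothetical weights and features, for illustration only.
weights = [0.5, -1.0]
features = [2.0, 1.0]

score = sum(w * x for w, x in zip(weights, features))  # 0.5*2 - 1.0*1 = 0.0
p = sigmoid(score)                                     # sigmoid(0) = 0.5

# One draw from Bernoulli(p): 1 with probability p, else 0.
label = 1 if random.random() < p else 0
loss = bernoulli_nll(label, p)  # log(2) here, whichever label was drawn
```

Because `p` is exactly 0.5 in this toy setup, both labels are equally likely and the loss is the same for either one; with a more confident `p`, a wrong label would be penalized much more heavily.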
Binomial Distribution
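The number of successes in $n$ independent Bernoulli trials, each with success probability $p$.
$$P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}, \quad k = 0, 1, \dots, n$$
Properties: $\mathbb{E}[X] = np$ and $\mathrm{Var}(X) = np(1 - p)$.
Example: If a spam filter classifies each email correctly with probability 0.9, independently of the others, and processes 100 emails, the number of correctly classified emails follows $\text{Binomial}(100, 0.9)$.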
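The spam-filter example can be checked numerically. The sketch below assumes the standard Binomial PMF and the independence assumption stated in the example; `binomial_pmf` is a hypothetical helper name, not a library function.

```python
import math

def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1.0 - p) ** (n - k)

n, p = 100, 0.9
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

total = sum(pmf)                                       # should be 1
mean = sum(k * pk for k, pk in enumerate(pmf))         # n * p = 90
variance = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))  # n*p*(1-p) = 9

# Probability that at least 85 of the 100 emails are classified correctly.
prob_at_least_85 = sum(pmf[85:])
```

Summing the PMF directly recovers the closed-form mean and variance, which is a useful sanity check whenever you implement a distribution by hand.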
In ML: Model evaluation — counting correct predictions over a test set. Bootstrap sampling also relies on binomial-like resampling.
Poisson Distribution
Models the number of events in a fixed interval, given a constant average rate $\lambda$.
$$P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}, \quad k = 0, 1, 2, \dots$$
Properties: $\mathbb{E}[X] = \lambda$ and $\mathrm{Var}(X) = \lambda$.
The mean and variance being equal is a key signature. If your count data has variance much larger than its mean, the Poisson model is a poor fit (overdispersion).
Key insight: The Poisson distribution is the limit of the Binomial when $n \to \infty$ and $p \to 0$ while $np = \lambda$ stays constant. It models rare events in large populations.
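The Binomial-to-Poisson limit can be observed directly. This sketch fixes a hypothetical rate of 3 events per interval, sets $p = \lambda / n$, and measures the largest pointwise gap between the two PMFs over the first 21 counts as $n$ grows.

```python
import math

def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Binomial(n, p); zero when k exceeds n."""
    if k > n:
        return 0.0
    return math.comb(n, k) * p**k * (1.0 - p) ** (n - k)

def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for X ~ Poisson(lam)."""
    return lam**k * math.exp(-lam) / math.factorial(k)

lam = 3.0  # hypothetical average rate, held constant as n grows

def max_pmf_gap(n: int) -> float:
    """Largest |Binomial(n, lam/n) - Poisson(lam)| over k = 0..20."""
    p = lam / n
    return max(abs(binomial_pmf(k, n, p) - poisson_pmf(k, lam)) for k in range(21))

gaps = {n: max_pmf_gap(n) for n in (10, 100, 1_000, 10_000)}
for n, gap in gaps.items():
    print(f"n = {n:>6}: max PMF gap = {gap:.6f}")
```

The gap shrinks roughly in proportion to $1/n$, which is why the Poisson is a good stand-in for a Binomial with many trials and a small success probability.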
In ML: Count regression (Poisson regression), modeling word frequencies, event rate estimation, and anomaly detection on count data.
Uniform Distribution
Discrete Uniform
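Every outcome is equally likely over a finite set $\{x_1, \dots, x_n\}$:
$$P(X = x_i) = \frac{1}{n}$$
Continuous Uniform
Equal probability density over an interval $[a, b]$:
$$f(x) = \frac{1}{b - a} \ \text{for } a \le x \le b, \qquad f(x) = 0 \ \text{otherwise}$$
Properties (continuous case): $\mathbb{E}[X] = \frac{a + b}{2}$ and $\mathrm{Var}(X) = \frac{(b - a)^2}{12}$.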
In ML: Random initialization of weights, random search for hyperparameters, and as a non-informative prior in Bayesian inference (a uniform prior says “all parameter values are equally plausible”).
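As a quick empirical check of the continuous uniform, the sketch below draws samples with the standard library and compares their mean and variance with the closed forms $(a + b)/2$ and $(b - a)^2 / 12$. The interval endpoints and the seed are arbitrary choices made for illustration.

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

a, b = 2.0, 6.0  # hypothetical interval endpoints
draws = [random.uniform(a, b) for _ in range(100_000)]

sample_mean = statistics.fmean(draws)
sample_var = statistics.variance(draws)

theory_mean = (a + b) / 2        # 4.0
theory_var = (b - a) ** 2 / 12   # 16 / 12, about 1.333

print(f"mean: sample {sample_mean:.3f} vs theory {theory_mean:.3f}")
print(f"var:  sample {sample_var:.3f} vs theory {theory_var:.3f}")
```

The same `random.uniform` call is the basic building block of random hyperparameter search: each trial draws a candidate value from the chosen interval.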
Exponential Distribution
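Models the time between events in a Poisson process with rate $\lambda$.
$$f(x) = \lambda e^{-\lambda x}, \quad x \ge 0$$
Properties: $\mathbb{E}[X] = 1/\lambda$ and $\mathrm{Var}(X) = 1/\lambda^2$.
The exponential distribution is memoryless: $P(X > s + t \mid X > s) = P(X > t)$. The probability of waiting another $t$ minutes does not depend on the $s$ minutes you have already waited.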
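Memorylessness follows from the survival function $P(X > t) = e^{-\lambda t}$, and it is easy to verify numerically. In this sketch the rate and the two waiting times are hypothetical values.

```python
import math

def exp_survival(t: float, rate: float) -> float:
    """P(X > t) for X ~ Exponential(rate)."""
    return math.exp(-rate * t)

rate = 0.5        # hypothetical: one event every 2 minutes on average
s, t = 4.0, 3.0   # already waited s minutes; ask about t more

# P(X > s + t | X > s) = P(X > s + t) / P(X > s)
conditional = exp_survival(s + t, rate) / exp_survival(s, rate)
unconditional = exp_survival(t, rate)

print(f"P(X > s + t | X > s) = {conditional:.6f}")
print(f"P(X > t)             = {unconditional:.6f}")
```

The two numbers agree up to floating-point rounding because $e^{-\lambda (s + t)} / e^{-\lambda s} = e^{-\lambda t}$, whatever the value of $s$.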
In ML: Modeling inter-arrival times, survival analysis, and as a prior for positive-valued parameters.
Gaussian (Normal) Distribution
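The most widely used distribution in statistics and ML, parameterized by a mean $\mu$ and a variance $\sigma^2$.
$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$
Properties: $\mathbb{E}[X] = \mu$ and $\mathrm{Var}(X) = \sigma^2$. The density is symmetric about $\mu$, so the mean, median, and mode coincide.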
Why the Gaussian is Everywhere
- Central Limit Theorem: The standardized sum of many independent, identically distributed random variables with finite variance converges in distribution to a Gaussian, regardless of the original distribution. This is explored in depth in the convergence article.
- Maximum entropy: Among all distributions on the real line with a given mean and variance, the Gaussian has the highest entropy. It is the “most uncertain” distribution under those constraints, which makes it the most conservative assumption.
- Analytical convenience: The Gaussian is closed under linear transformations, marginalization, and conditioning. This makes it the backbone of linear models, Kalman filters, and Gaussian processes.
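The Central Limit Theorem point can be illustrated with a small simulation. The sketch below sums 30 draws from a Uniform(0, 1) distribution (mean $1/2$, variance $1/12$), standardizes the sum, and checks that the result behaves like a standard normal. The sample sizes and seed are arbitrary choices.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

def standardized_uniform_sum(n: int) -> float:
    """Sum n Uniform(0, 1) draws, then subtract the mean and divide by the std."""
    total = sum(random.random() for _ in range(n))
    return (total - n / 2) / (n / 12) ** 0.5

z_values = [standardized_uniform_sum(30) for _ in range(20_000)]

z_mean = statistics.fmean(z_values)   # close to 0
z_std = statistics.stdev(z_values)    # close to 1
# Fraction within one standard deviation; about 0.68 for a Gaussian.
frac_within_one = sum(abs(z) <= 1.0 for z in z_values) / len(z_values)

print(f"mean = {z_mean:.3f}, std = {z_std:.3f}, within 1 std = {frac_within_one:.3f}")
```

The individual draws are flat, not bell-shaped, yet the standardized sums already match the Gaussian closely at 30 terms.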
The Standard Normal
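When $\mu = 0$ and $\sigma = 1$, we obtain the standard normal $\mathcal{N}(0, 1)$. Any $X \sim \mathcal{N}(\mu, \sigma^2)$ can be mapped to it:
$$Z = \frac{X - \mu}{\sigma} \sim \mathcal{N}(0, 1)$$
This standardization lets us compare values across different scales.
68-95-99.7 rule: About 68% of values fall within $1\sigma$ of the mean, 95% within $2\sigma$, and 99.7% within $3\sigma$.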
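Both the rule and standardization can be reproduced with `statistics.NormalDist` from the Python standard library. The height distribution below is a made-up example used only to show a z-score calculation.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

def mass_within(k: float) -> float:
    """Probability that a standard normal falls within k standard deviations of 0."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

coverage = {k: mass_within(k) for k in (1, 2, 3)}
for k, prob in coverage.items():
    print(f"within {k} sigma: {prob:.4f}")

# Standardization: a hypothetical height distribution with mean 170 and std 10.
heights = NormalDist(mu=170.0, sigma=10.0)
z = (185.0 - heights.mean) / heights.stdev  # (185 - 170) / 10 = 1.5

# The probability is the same whether computed on the raw or the standardized scale.
p_raw = heights.cdf(185.0)
p_std = std_normal.cdf(z)
```

The coverage values come out near 0.6827, 0.9545, and 0.9973, which is where the rounded 68-95-99.7 figures come from.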
In ML: Gaussian noise assumptions underpin linear regression, Gaussian Naive Bayes, Gaussian Mixture Models, variational autoencoders (VAEs), and the initialization of neural network weights.
Multivariate Gaussian
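The generalization of the Gaussian to $d$ dimensions:
$$\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{d/2} \det(\boldsymbol{\Sigma})^{1/2}} \exp\left(-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right)$$
Where:
- $\boldsymbol{\mu} \in \mathbb{R}^d$ is the mean vector
- $\boldsymbol{\Sigma} \in \mathbb{R}^{d \times d}$ is the covariance matrix (symmetric, positive semi-definite; it must be positive definite for the density above to exist)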
Properties
The covariance matrix encodes both the spread (diagonal elements) and correlations (off-diagonal elements) between dimensions.
Three special cases:
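- Spherical: $\boldsymbol{\Sigma} = \sigma^2 \mathbf{I}$, with equal variance in all directions and no correlation
- Diagonal: $\boldsymbol{\Sigma} = \mathrm{diag}(\sigma_1^2, \dots, \sigma_d^2)$, with different variances and no correlation
- Full: arbitrary $\boldsymbol{\Sigma}$, with different variances and correlations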
Conditional and Marginal
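One of the most powerful properties: if $\mathbf{x} = (\mathbf{x}_1, \mathbf{x}_2)$ is jointly Gaussian with mean $(\boldsymbol{\mu}_1, \boldsymbol{\mu}_2)$ and covariance blocks $\boldsymbol{\Sigma}_{11}, \boldsymbol{\Sigma}_{12}, \boldsymbol{\Sigma}_{21}, \boldsymbol{\Sigma}_{22}$, then:
- Marginals are Gaussian: $\mathbf{x}_1 \sim \mathcal{N}(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_{11})$
- Conditionals are Gaussian: $\mathbf{x}_1 \mid \mathbf{x}_2 \sim \mathcal{N}\left(\boldsymbol{\mu}_1 + \boldsymbol{\Sigma}_{12} \boldsymbol{\Sigma}_{22}^{-1} (\mathbf{x}_2 - \boldsymbol{\mu}_2),\ \boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12} \boldsymbol{\Sigma}_{22}^{-1} \boldsymbol{\Sigma}_{21}\right)$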
This closure property is why Gaussian models are so tractable.
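In two dimensions the conditional has a simple scalar form: given $x_2$, the variable $x_1$ is Gaussian with mean $\mu_1 + \frac{\sigma_{12}}{\sigma_2^2}(x_2 - \mu_2)$ and variance $\sigma_1^2 - \frac{\sigma_{12}^2}{\sigma_2^2}$. The sketch below implements that special case; the function name and the numbers are hypothetical.

```python
def conditional_gaussian_2d(
    mu1: float, mu2: float, var1: float, var2: float, cov12: float, x2: float
) -> tuple[float, float]:
    """Mean and variance of x1 given x2 for a bivariate Gaussian."""
    cond_mean = mu1 + cov12 / var2 * (x2 - mu2)
    cond_var = var1 - cov12**2 / var2
    return cond_mean, cond_var

# Hypothetical parameters; the covariance matrix [[4, 1.5], [1.5, 1]] is positive definite.
cond_mean, cond_var = conditional_gaussian_2d(
    mu1=0.0, mu2=1.0, var1=4.0, var2=1.0, cov12=1.5, x2=3.0
)
# cond_mean = 0 + 1.5 * (3 - 1) = 3.0
# cond_var  = 4 - 1.5**2 / 1   = 1.75
```

Observing $x_2$ shifts the mean of $x_1$ in the direction of the correlation and always reduces its variance (here from 4 to 1.75), which is the mechanism behind Gaussian process regression and Kalman filter updates.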
In ML: Gaussian Mixture Models (GMMs) for clustering, Gaussian Discriminant Analysis, Gaussian Processes, multivariate feature modeling, and the reparameterization trick in VAEs.
Beta Distribution
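A distribution over probabilities, that is, values in $[0, 1]$.
$$f(x; \alpha, \beta) = \frac{x^{\alpha - 1} (1 - x)^{\beta - 1}}{B(\alpha, \beta)}, \quad \alpha, \beta > 0$$
where $B(\alpha, \beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha + \beta)}$ is the Beta function.
Properties: $\mathbb{E}[X] = \frac{\alpha}{\alpha + \beta}$ and $\mathrm{Var}(X) = \frac{\alpha\beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}$.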
Shape Behavior
| Parameters | Shape | Interpretation |
|---|---|---|
| $\alpha = \beta = 1$ | Uniform | No preference |
| $\alpha = \beta > 1$ | Bell-shaped, centered at 0.5 | Preference for fair |
| $\alpha > \beta$ | Mass shifted toward 1 (left-skewed) | Preference for higher values |
| $\alpha, \beta < 1$ | U-shaped | Preference for extremes |
Conjugate Prior
The Beta is the conjugate prior for the Bernoulli/Binomial likelihood. If your prior is $\text{Beta}(\alpha, \beta)$ and you observe $k$ successes in $n$ trials, the posterior is:
$$\text{Beta}(\alpha + k,\ \beta + n - k)$$
This is beautifully simple: just add your observations to the prior counts. We use this extensively in MAP estimation and Bayesian inference.
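The conjugate update amounts to two additions: successes are added to the first parameter and failures to the second. The sketch below applies it to a hypothetical click-through experiment (7 clicks in 10 impressions) starting from a uniform Beta(1, 1) prior; the helper names are illustrative.

```python
import random

def beta_posterior(alpha: float, beta: float, successes: int, trials: int) -> tuple[float, float]:
    """Posterior Beta parameters after observing `successes` in `trials`."""
    failures = trials - successes
    return alpha + successes, beta + failures

def beta_mean(alpha: float, beta: float) -> float:
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

# Hypothetical data: uniform prior, then 7 successes in 10 trials.
post_alpha, post_beta = beta_posterior(1.0, 1.0, successes=7, trials=10)  # (8.0, 4.0)
posterior_mean = beta_mean(post_alpha, post_beta)                         # 8 / 12

# A single posterior draw, as used in Thompson sampling.
random.seed(0)
sampled_rate = random.betavariate(post_alpha, post_beta)
```

The posterior mean of 8/12 sits between the prior mean (0.5) and the observed frequency (0.7), and it moves closer to the data as more trials accumulate.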
In ML: Prior distributions for probabilities, Thompson sampling in bandits, Bayesian A/B testing, and Dirichlet-Multinomial models (the Dirichlet is the multivariate generalization of Beta).
Gamma Distribution
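A distribution over positive real values, generalizing the Exponential. With shape $\alpha > 0$ and rate $\beta > 0$:
$$f(x; \alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha - 1} e^{-\beta x}, \quad x > 0$$
Properties: $\mathbb{E}[X] = \alpha / \beta$ and $\mathrm{Var}(X) = \alpha / \beta^2$.
Note: When $\alpha = 1$, the Gamma reduces to the Exponential with rate $\beta$. The Gamma generalizes the Exponential to allow for more flexible shapes.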
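The reduction to the Exponential can be checked by evaluating both densities. This sketch assumes the shape-rate parameterization of the Gamma, with density $\frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha - 1} e^{-\beta x}$; the function names and the rate value are hypothetical.

```python
import math

def gamma_pdf(x: float, shape: float, rate: float) -> float:
    """Gamma density in the shape-rate parameterization, for x > 0."""
    return rate**shape / math.gamma(shape) * x ** (shape - 1) * math.exp(-rate * x)

def exponential_pdf(x: float, rate: float) -> float:
    """Exponential density, for x >= 0."""
    return rate * math.exp(-rate * x)

rate = 1.5  # hypothetical rate
points = (0.1, 1.0, 5.0)

# With shape = 1 the two densities coincide at every point.
max_diff = max(abs(gamma_pdf(x, 1.0, rate) - exponential_pdf(x, rate)) for x in points)

# With shape > 1 the density is zero at the origin's limit and peaks at (shape - 1) / rate.
mode_shape_3 = (3.0 - 1.0) / rate
```

Note that some libraries parameterize the Gamma by shape and scale (scale = 1 / rate), so it is worth checking which convention a given API uses before plugging in values.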
In ML: Conjugate prior for the precision (inverse variance) of a Gaussian. Used in Bayesian linear regression and Gamma regression for positive-valued targets.
Distribution Selection Guide
| Data Type | Distribution | Example |
|---|---|---|
| Binary outcome | Bernoulli | Spam / not spam |
| Count of successes | Binomial | Correct predictions out of $n$ |
| Rare event count | Poisson | Server errors per hour |
| Time between events | Exponential | Time until next click |
| Continuous, symmetric | Gaussian | Measurement errors |
| Multi-dimensional continuous | Multivariate Gaussian | Feature vectors |
| Probability parameter | Beta | Click-through rate prior |
| Positive continuous | Gamma | Waiting times, precision |
Relationships Between Distributions
The distributions form a rich family of connections:
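- $\text{Bernoulli}(p)$ is $\text{Binomial}(1, p)$
- $\text{Binomial}(n, p) \to \text{Poisson}(\lambda)$ as $n \to \infty$, $p \to 0$, $np = \lambda$
- $\text{Binomial}(n, p) \approx \mathcal{N}(np,\ np(1 - p))$ as $n \to \infty$ (Central Limit Theorem)
- $\text{Exponential}(\lambda)$ is $\text{Gamma}(1, \lambda)$
- $\text{Uniform}(0, 1)$ is $\text{Beta}(1, 1)$
- Sum of $n$ independent $\text{Exponential}(\lambda)$ variables is $\text{Gamma}(n, \lambda)$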
Understanding these connections helps you choose the right distribution and derive new results from known ones. We explore the Central Limit Theorem in depth in the convergence article.
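One of these relationships, that a sum of $n$ independent Exponential variables with rate $\lambda$ follows a Gamma distribution with shape $n$ and rate $\lambda$, can be checked by simulation. The sketch compares sample moments with the Gamma mean $n / \lambda$ and variance $n / \lambda^2$; the parameter values and seed are arbitrary.

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable

n_terms, rate = 5, 2.0  # hypothetical: sum 5 Exponential(2) waiting times

sums = [
    sum(random.expovariate(rate) for _ in range(n_terms))
    for _ in range(50_000)
]

sample_mean = statistics.fmean(sums)
sample_var = statistics.variance(sums)

gamma_mean = n_terms / rate      # 2.5
gamma_var = n_terms / rate**2    # 1.25

print(f"mean: sample {sample_mean:.3f} vs Gamma {gamma_mean:.3f}")
print(f"var:  sample {sample_var:.3f} vs Gamma {gamma_var:.3f}")
```

Matching the first two moments does not prove the distributions are identical, but it is a cheap check that catches most parameterization mistakes, such as confusing rate with scale.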
Summary
- Each distribution encodes specific assumptions about the data
- Bernoulli/Binomial for binary/count outcomes, Poisson for rare events
- The Gaussian dominates due to the Central Limit Theorem and maximum entropy
- The Multivariate Gaussian extends to $d$ dimensions with a covariance matrix
- Beta and Gamma serve as conjugate priors in Bayesian inference
- Choosing the right distribution is choosing the right inductive bias for your model