From Events to Numbers
In the previous article, we defined probability over events like “rolling an even number.” But to do mathematics — compute averages, measure spread, fit models — we need to work with numbers, not sets.
A random variable is a function that maps outcomes from the sample space to real numbers. It bridges the world of abstract events and the world of quantitative analysis.
Discrete Random Variables
A discrete random variable takes on a countable set of values (finite or countably infinite).
Probability Mass Function (PMF)
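The PMF gives the probability of each possible value:

$$p_X(x) = P(X = x)$$

Properties:
- $p_X(x) \ge 0$ for all $x$
- $\sum_x p_X(x) = 1$

Example: For a fair die, let $X$ = the number rolled. The PMF is $p_X(x) = 1/6$ for $x \in \{1, 2, \dots, 6\}$.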
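The fair-die example can be written as a small lookup table. The sketch below is illustrative only: the names `die_pmf`, `is_valid_pmf`, and `prob` are hypothetical helpers, and exact fractions are used so the normalization check has no rounding error.

```python
from fractions import Fraction

# PMF of a fair six-sided die, stored as {value: probability}.
die_pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def is_valid_pmf(pmf):
    """Check the two PMF properties: non-negativity and normalization."""
    return all(p >= 0 for p in pmf.values()) and sum(pmf.values()) == 1

def prob(pmf, event):
    """P(X in event): sum the PMF over the values where event(x) is true."""
    return sum(p for x, p in pmf.items() if event(x))

# Probability of rolling an even number.
p_even = prob(die_pmf, lambda x: x % 2 == 0)
print(is_valid_pmf(die_pmf), p_even)
```

Any event about the roll reduces to summing PMF entries, which is why the PMF fully describes a discrete random variable.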
Common Discrete Random Variables
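| Random Variable | PMF | Support |
|---|---|---|
| Bernoulli($p$) | $p^x (1-p)^{1-x}$ | $x \in \{0, 1\}$ |
| Binomial($n, p$) | $\binom{n}{k} p^k (1-p)^{n-k}$ | $k \in \{0, 1, \dots, n\}$ |
| Poisson($\lambda$) | $\dfrac{\lambda^k e^{-\lambda}}{k!}$ | $k \in \{0, 1, 2, \dots\}$ |
| Geometric($p$), number of trials until the first success | $(1-p)^{k-1} p$ | $k \in \{1, 2, 3, \dots\}$ |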
We explore these distributions in depth in the distributions article.
Continuous Random Variables
A continuous random variable takes on any value in an interval (or all of $\mathbb{R}$).
Probability Density Function (PDF)
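The PDF describes the relative likelihood of values. Probabilities come from integrating it over an interval:

$$P(a \le X \le b) = \int_a^b f_X(x)\,dx$$

Properties:
- $f_X(x) \ge 0$ for all $x$
- $\int_{-\infty}^{\infty} f_X(x)\,dx = 1$

Key insight: For continuous variables, $P(X = x) = 0$ for every single point $x$. We can only assign probability to intervals. The PDF value $f_X(x)$ is a density, not a probability, so it can exceed 1.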
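A quick numerical check makes the "density can exceed 1" point concrete. This is a minimal sketch, assuming a Uniform(0, 0.5) variable chosen purely for illustration; `uniform_pdf` and `integrate` are hypothetical helpers, and the integral is a simple midpoint-rule approximation.

```python
def uniform_pdf(x, a=0.0, b=0.5):
    """Density of Uniform(a, b): 1/(b - a) inside [a, b], 0 outside."""
    return 1.0 / (b - a) if a <= x <= b else 0.0

def integrate(f, lo, hi, n=10_000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    h = (hi - lo) / n
    return sum(f(lo + (i + 0.5) * h) for i in range(n)) * h

density_at_point = uniform_pdf(0.25)               # a density value above 1
total_area = integrate(uniform_pdf, 0.0, 0.5)      # approximately 1
prob_interval = integrate(uniform_pdf, 0.1, 0.2)   # approximately 0.2
print(density_at_point, total_area, prob_interval)
```

The density is 2 everywhere on the support, yet the total area is still 1: only areas under the curve are probabilities.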
From PMF to PDF
Think of the PDF as the continuous analog of the PMF. Where the PMF uses sums, the PDF uses integrals:
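| Concept | Discrete | Continuous |
|---|---|---|
| Probability function | PMF: $p_X(x) = P(X = x)$ | PDF: $f_X(x)$ |
| Probability of range | $\sum_{a \le x \le b} p_X(x)$ | $\int_a^b f_X(x)\,dx$ |
| Normalization | $\sum_x p_X(x) = 1$ | $\int_{-\infty}^{\infty} f_X(x)\,dx = 1$ |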
Cumulative Distribution Function (CDF)
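The CDF works for both discrete and continuous random variables:

$$F_X(x) = P(X \le x)$$

Properties:
- $F_X$ is non-decreasing
- $\lim_{x \to -\infty} F_X(x) = 0$ and $\lim_{x \to \infty} F_X(x) = 1$
- For continuous variables: $f_X(x) = \dfrac{d}{dx} F_X(x)$ (the PDF is the derivative of the CDF)

The CDF is useful for computing probabilities of intervals:

$$P(a < X \le b) = F_X(b) - F_X(a)$$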
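To see the interval rule $P(a < X \le b) = F_X(b) - F_X(a)$ at work, the sketch below uses an exponential distribution, picked only because its CDF has a closed form. The rate value and the helper names are assumptions for illustration.

```python
import math

RATE = 2.0  # hypothetical rate parameter of the exponential distribution

def exp_pdf(x, rate=RATE):
    """Density of Exponential(rate)."""
    return rate * math.exp(-rate * x) if x >= 0 else 0.0

def exp_cdf(x, rate=RATE):
    """CDF of Exponential(rate): P(X <= x)."""
    return 1.0 - math.exp(-rate * x) if x >= 0 else 0.0

def integrate(f, lo, hi, n=10_000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    h = (hi - lo) / n
    return sum(f(lo + (i + 0.5) * h) for i in range(n)) * h

a, b = 0.5, 1.5
p_from_cdf = exp_cdf(b) - exp_cdf(a)       # F(b) - F(a)
p_from_pdf = integrate(exp_pdf, a, b)      # area under the density

# The PDF is the derivative of the CDF: compare a central difference at x = 1.
h = 1e-5
cdf_slope = (exp_cdf(1.0 + h) - exp_cdf(1.0 - h)) / (2 * h)
print(p_from_cdf, p_from_pdf, cdf_slope, exp_pdf(1.0))
```

Both routes give the same interval probability, and the numerical slope of the CDF matches the density.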
Expectation (Mean)
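The expected value is the probability-weighted average of a random variable's values, which is also its long-run average over many independent draws:

$$E[X] = \sum_x x\, p_X(x) \quad \text{(discrete)}, \qquad E[X] = \int_{-\infty}^{\infty} x\, f_X(x)\,dx \quad \text{(continuous)}$$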
Linearity of Expectation
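The most useful property of expectation is linearity. It always holds, even for dependent variables:

$$E[aX + bY + c] = a\,E[X] + b\,E[Y] + c$$

Example: What is the expected number of heads in 100 fair coin flips? Let $X_i$ be 1 if flip $i$ lands heads and 0 otherwise, so $E[X_i] = 0.5$. By linearity: $E\left[\sum_{i=1}^{100} X_i\right] = \sum_{i=1}^{100} E[X_i] = 100 \times 0.5 = 50$. No need to compute the binomial distribution.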
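The coin-flip shortcut can be checked against the long way. The sketch below computes the mean of 100 fair flips twice: once by linearity and once by summing over the full binomial PMF. It also checks linearity for a deliberately dependent pair, a die roll $X$ and $Y = 7 - X$, which is an illustrative choice rather than anything from the text.

```python
from fractions import Fraction
from math import comb

n, p = 100, Fraction(1, 2)

# Route 1: linearity of expectation. Each flip is an indicator with mean p.
mean_by_linearity = n * p

# Route 2: direct sum over the Binomial(n, p) PMF.
mean_by_pmf = sum(
    k * comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)
)

# Linearity with dependent variables: X is a die roll, Y = 7 - X.
die_pmf = {x: Fraction(1, 6) for x in range(1, 7)}
mean_x = sum(x * q for x, q in die_pmf.items())
mean_y = sum((7 - x) * q for x, q in die_pmf.items())
mean_sum = sum((x + (7 - x)) * q for x, q in die_pmf.items())
print(mean_by_linearity, mean_by_pmf, mean_x + mean_y, mean_sum)
```

The two routes agree exactly, and $E[X + Y] = E[X] + E[Y]$ holds even though $Y$ is completely determined by $X$.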
Law of the Unconscious Statistician (LOTUS)
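To compute $E[g(X)]$ for a function $g$, you don't need the distribution of $g(X)$; just use the distribution of $X$:

$$E[g(X)] = \sum_x g(x)\, p_X(x) \quad \text{(discrete)}, \qquad E[g(X)] = \int_{-\infty}^{\infty} g(x)\, f_X(x)\,dx \quad \text{(continuous)}$$

This is essential for computing moments like $E[X^2]$.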
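As a small illustration of LOTUS, the sketch below computes the second moment of a fair die directly from the PMF of $X$, then repeats the calculation the long way by first building the PMF of $Y = X^2$. The helper name `expect` is hypothetical.

```python
from fractions import Fraction

die_pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def expect(pmf, g=lambda x: x):
    """E[g(X)] = sum of g(x) * p_X(x) over x (LOTUS); g defaults to identity."""
    return sum(g(x) * p for x, p in pmf.items())

# LOTUS: weight g(x) = x^2 by the PMF of X.
second_moment = expect(die_pmf, lambda x: x**2)

# The long way: derive the PMF of Y = X^2 first, then take its mean.
pmf_of_square = {}
for x, p in die_pmf.items():
    pmf_of_square[x**2] = pmf_of_square.get(x**2, 0) + p
second_moment_long = expect(pmf_of_square)

print(second_moment, second_moment_long)
```

Both approaches give the same value, but LOTUS skips the step of deriving a new distribution.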
Variance and Standard Deviation
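Variance measures how spread out a distribution is around its mean $\mu = E[X]$:

$$\mathrm{Var}(X) = E\left[(X - \mu)^2\right] = E[X^2] - (E[X])^2$$

The second form (computational formula) is usually easier to calculate.
Standard deviation is the square root of variance, in the same units as $X$:

$$\sigma_X = \sqrt{\mathrm{Var}(X)}$$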
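Properties of Variance
- $\mathrm{Var}(aX + b) = a^2\,\mathrm{Var}(X)$
- $\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\,\mathrm{Cov}(X, Y)$
- If $X$ and $Y$ are independent: $\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)$

Key insight: Variance scales with the square of the constant. This is why doubling a quantity quadruples its variance. Adding a constant shifts the distribution without changing its spread, so it leaves the variance unchanged.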
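The quadrupling effect is easy to confirm on a fair die. This is a minimal sketch with hypothetical helpers (`expect`, `variance`, `transform`); it evaluates the variance by both the definition and the computational formula, then compares $\mathrm{Var}(X)$ with $\mathrm{Var}(2X + 3)$.

```python
from fractions import Fraction

die_pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def expect(pmf, g=lambda x: x):
    """E[g(X)] for a discrete PMF stored as {value: probability}."""
    return sum(g(x) * p for x, p in pmf.items())

def variance(pmf):
    """Variance via the definition, cross-checked with E[X^2] - (E[X])^2."""
    mean = expect(pmf)
    by_definition = expect(pmf, lambda x: (x - mean) ** 2)
    by_shortcut = expect(pmf, lambda x: x**2) - mean**2
    assert by_definition == by_shortcut
    return by_shortcut

def transform(pmf, a, b):
    """PMF of aX + b (a != 0, so distinct values stay distinct)."""
    return {a * x + b: p for x, p in pmf.items()}

var_x = variance(die_pmf)
var_scaled = variance(transform(die_pmf, 2, 3))
print(var_x, var_scaled)
```

Scaling by 2 multiplies the variance by 4, while the shift by 3 has no effect.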
Joint Distributions
When we have two or more random variables, we need their joint distribution to describe how they relate.
Joint PMF (Discrete)

$$p_{X,Y}(x, y) = P(X = x,\ Y = y)$$

Joint PDF (Continuous)

$$P\big((X, Y) \in A\big) = \iint_A f_{X,Y}(x, y)\,dx\,dy$$
Marginal Distributions
To get the distribution of one variable from the joint distribution, marginalize (sum or integrate) over the other:

$$p_X(x) = \sum_y p_{X,Y}(x, y), \qquad f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dy$$

Intuition: Marginalization “projects” the joint distribution onto one axis. If you have a scatter plot of $(X, Y)$, the marginal of $X$ is the histogram you get by ignoring the $Y$ values.
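Marginalization in the discrete case is just a grouped sum. The sketch below uses a made-up 2×2 joint PMF (the probabilities are hypothetical) and a helper `marginal` that sums out the other coordinate.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical joint PMF p(x, y), stored as {(x, y): probability}.
joint = {
    (0, 0): Fraction(1, 10),
    (0, 1): Fraction(2, 10),
    (1, 0): Fraction(3, 10),
    (1, 1): Fraction(4, 10),
}

def marginal(joint_pmf, axis):
    """Sum the joint PMF over the other variable (axis 0 keeps X, 1 keeps Y)."""
    out = defaultdict(Fraction)
    for pair, p in joint_pmf.items():
        out[pair[axis]] += p
    return dict(out)

p_x = marginal(joint, 0)
p_y = marginal(joint, 1)
print(p_x, p_y)
```

Each marginal is itself a valid PMF: its entries are non-negative and sum to 1.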
Conditional Distributions
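The distribution of $X$ given $Y = y$ (defined wherever $f_Y(y) > 0$):

$$f_{X \mid Y}(x \mid y) = \frac{f_{X,Y}(x, y)}{f_Y(y)}$$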
This is the continuous analog of conditional probability.
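The discrete counterpart replaces densities with PMFs: divide the joint PMF by the marginal of the conditioning variable. The sketch below does this for a hypothetical 2×2 joint table; `conditional_x_given_y` is an illustrative helper, not a library function.

```python
from fractions import Fraction

# Hypothetical joint PMF p(x, y), stored as {(x, y): probability}.
joint = {
    (0, 0): Fraction(1, 10),
    (0, 1): Fraction(2, 10),
    (1, 0): Fraction(3, 10),
    (1, 1): Fraction(4, 10),
}

def conditional_x_given_y(joint_pmf, y):
    """p(x | y) = p(x, y) / p(y), defined only when p(y) > 0."""
    p_y = sum(p for (_, yy), p in joint_pmf.items() if yy == y)
    if p_y == 0:
        raise ValueError("cannot condition on an event of probability zero")
    return {x: p / p_y for (x, yy), p in joint_pmf.items() if yy == y}

p_x_given_y1 = conditional_x_given_y(joint, 1)
print(p_x_given_y1)
```

Dividing by the marginal renormalizes the selected slice of the joint table so that it sums to 1 again.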
Covariance and Correlation
Covariance
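Covariance measures how two variables move together:

$$\mathrm{Cov}(X, Y) = E\big[(X - E[X])(Y - E[Y])\big] = E[XY] - E[X]\,E[Y]$$

- $\mathrm{Cov}(X, Y) > 0$: $X$ and $Y$ tend to increase together
- $\mathrm{Cov}(X, Y) < 0$: One tends to increase when the other decreases
- $\mathrm{Cov}(X, Y) = 0$: No linear relationship (but they may still be dependent!)

Warning: Zero covariance does NOT imply independence. Consider an $X$ that is symmetric about zero, such as $X \sim \text{Uniform}(-1, 1)$, and $Y = X^2$. They are clearly dependent, but $\mathrm{Cov}(X, Y) = E[X^3] - E[X]\,E[X^2] = 0$.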
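The warning can be verified exactly with a small discrete variant: take $X$ uniform on $\{-1, 0, 1\}$ and $Y = X^2$. This particular distribution is an assumption chosen so that everything can be computed with exact fractions.

```python
from fractions import Fraction

# X is uniform on {-1, 0, 1}; Y = X^2 is a deterministic function of X.
x_pmf = {x: Fraction(1, 3) for x in (-1, 0, 1)}
joint = {(x, x**2): p for x, p in x_pmf.items()}

def expect(joint_pmf, g):
    """E[g(X, Y)] under a joint PMF stored as {(x, y): probability}."""
    return sum(g(x, y) * p for (x, y), p in joint_pmf.items())

e_x = expect(joint, lambda x, y: x)
e_y = expect(joint, lambda x, y: y)
cov = expect(joint, lambda x, y: x * y) - e_x * e_y

# Independence would require p(x, y) = p(x) * p(y) for every pair.
p_x0 = sum(p for (x, _), p in joint.items() if x == 0)
p_y0 = sum(p for (_, y), p in joint.items() if y == 0)
p_x0_y0 = joint.get((0, 0), 0)
factorizes_at_origin = p_x0_y0 == p_x0 * p_y0
print(cov, p_x0_y0, p_x0 * p_y0, factorizes_at_origin)
```

The covariance is exactly zero, yet the joint probability at $(0, 0)$ is $1/3$ rather than the $1/9$ that independence would require.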
Correlation
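Pearson correlation normalizes covariance to $[-1, 1]$:

$$\rho_{X,Y} = \frac{\mathrm{Cov}(X, Y)}{\sigma_X\, \sigma_Y}$$

- $\rho = 1$: Perfect positive linear relationship
- $\rho = -1$: Perfect negative linear relationship
- $\rho = 0$: No linear relationship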
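The three cases in the list above can be reproduced on tiny data sets. The sketch below treats each list of points as an equally weighted distribution (dividing by $n$); the data and the helper name `pearson` are illustrative assumptions.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of paired data, treating points as equally likely."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (std_x * std_y)

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
rho_up = pearson(xs, [2 * x + 1 for x in xs])      # increasing line
rho_down = pearson(xs, [-3 * x + 7 for x in xs])   # decreasing line

# A perfect but nonlinear (quadratic) relationship on symmetric data.
sym = [-2.0, -1.0, 0.0, 1.0, 2.0]
rho_quad = pearson(sym, [x * x for x in sym])
print(rho_up, rho_down, rho_quad)
```

Correlation ignores the slope's magnitude and the intercept, and it reports zero for the quadratic case even though $Y$ is fully determined by $X$.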
The Covariance Matrix
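For a random vector $\mathbf{X} = (X_1, \dots, X_n)^\top$ with mean $\boldsymbol{\mu} = E[\mathbf{X}]$, the covariance matrix is:

$$\Sigma = E\big[(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})^\top\big], \qquad \Sigma_{ij} = \mathrm{Cov}(X_i, X_j)$$

The diagonal entries are the variances: $\Sigma_{ii} = \mathrm{Var}(X_i)$.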
The covariance matrix is always symmetric and positive semi-definite. It’s central to the Multivariate Gaussian and to PCA (Principal Component Analysis).
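Symmetry and positive semi-definiteness can be spot-checked on a small example. The sketch below builds the covariance matrix of a hypothetical data set (treated as an equally weighted distribution) with exact fractions, then evaluates the quadratic form $v^\top \Sigma v = \mathrm{Var}(v^\top \mathbf{X})$ on a grid of integer vectors. A grid check is only a sanity test, not a proof.

```python
from fractions import Fraction
from itertools import product

# Hypothetical data: 5 observations of a 3-dimensional vector.
data = [(1, 2, 0), (2, 1, 1), (3, 5, 1), (4, 3, 2), (5, 4, 1)]

def covariance_matrix(rows):
    """Sigma[i][j] = Cov(X_i, X_j), with each row weighted by 1/n."""
    n, d = len(rows), len(rows[0])
    mean = [Fraction(sum(r[j] for r in rows), n) for j in range(d)]
    return [
        [sum((r[i] - mean[i]) * (r[j] - mean[j]) for r in rows) / n
         for j in range(d)]
        for i in range(d)
    ]

def quad_form(matrix, v):
    """v^T M v, which equals Var(v . X) when M is a covariance matrix."""
    d = len(v)
    return sum(v[i] * matrix[i][j] * v[j] for i in range(d) for j in range(d))

sigma = covariance_matrix(data)
is_symmetric = all(sigma[i][j] == sigma[j][i] for i in range(3) for j in range(3))
is_psd_on_grid = all(
    quad_form(sigma, v) >= 0 for v in product(range(-2, 3), repeat=3)
)
print(is_symmetric, is_psd_on_grid, sigma[0][0])
```

The quadratic form is a variance, which is why it can never be negative for any direction $v$.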
Transformations of Random Variables
If $Y = g(X)$ and we know the distribution of $X$, how do we find the distribution of $Y$?
Discrete Case
Simply map probabilities, adding up every $x$ that lands on the same $y$: $p_Y(y) = \sum_{x:\, g(x) = y} p_X(x)$.
Continuous Case (Change of Variables)
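For a monotonic, differentiable function $g$ with inverse $g^{-1}$:

$$f_Y(y) = f_X\big(g^{-1}(y)\big)\,\left|\frac{d}{dy}\, g^{-1}(y)\right|$$

The absolute value of the Jacobian corrects for how $g$ stretches or compresses the probability.
Example: If $X \sim \mathcal{N}(\mu, \sigma^2)$ and $Z = (X - \mu)/\sigma$, then $Z \sim \mathcal{N}(0, 1)$. This standardization is fundamental to the Central Limit Theorem.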
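The standardization example can be checked numerically with the change-of-variables formula. In the sketch below, `pdf_of_transformed` is a hypothetical helper that builds $f_Y$ from $f_X$, the inverse map, and its derivative; the values of `MU` and `SIGMA` are arbitrary illustrative choices.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of a normal distribution with mean mu and std sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def pdf_of_transformed(f_x, g_inv, g_inv_prime):
    """Density of Y = g(X) for monotonic g: f_X(g_inv(y)) * |g_inv'(y)|."""
    return lambda y: f_x(g_inv(y)) * abs(g_inv_prime(y))

MU, SIGMA = 10.0, 3.0  # hypothetical parameters of X

# Z = (X - MU) / SIGMA, so the inverse map is x = MU + SIGMA * z.
f_z = pdf_of_transformed(
    lambda x: normal_pdf(x, MU, SIGMA),
    g_inv=lambda z: MU + SIGMA * z,
    g_inv_prime=lambda z: SIGMA,
)

points = [-3.0, -1.5, -0.5, 0.0, 0.5, 1.5, 3.0]
max_gap = max(abs(f_z(z) - normal_pdf(z)) for z in points)
print(max_gap)
```

The Jacobian factor `SIGMA` cancels the $1/\sigma$ in the density of $X$, leaving the standard normal density.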
Conditional Expectation
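The conditional expectation $E[X \mid Y]$ is itself a random variable, because it is a function of $Y$. For a specific value $y$:

$$E[X \mid Y = y] = \sum_x x\, p_{X \mid Y}(x \mid y) \quad \text{(discrete)}, \qquad E[X \mid Y = y] = \int_{-\infty}^{\infty} x\, f_{X \mid Y}(x \mid y)\,dx \quad \text{(continuous)}$$

Law of Total Expectation

$$E[X] = E\big[E[X \mid Y]\big]$$

This is incredibly powerful for computing expectations by conditioning on a simpler variable.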
Law of Total Variance

$$\mathrm{Var}(X) = E\big[\mathrm{Var}(X \mid Y)\big] + \mathrm{Var}\big(E[X \mid Y]\big)$$

In words: total variance = average within-group variance + variance of group means. This decomposition underlies ANOVA and mirrors the bias-variance decomposition in ML.
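Both laws can be verified exactly on a small grouped example. The joint PMF below is hypothetical: $Y$ is a group label and $X$ a measurement within the group. The sketch computes the mean and variance of $X$ directly, then rebuilds them from the per-group means and variances.

```python
from fractions import Fraction

# Hypothetical joint PMF p(x, y): Y is a group label, X a measurement.
joint = {
    (1, "a"): Fraction(1, 6), (2, "a"): Fraction(1, 6), (3, "a"): Fraction(1, 6),
    (4, "b"): Fraction(1, 4), (6, "b"): Fraction(1, 4),
}

def mean_var(pmf):
    """Return (mean, variance) of a PMF stored as {value: probability}."""
    m = sum(x * p for x, p in pmf.items())
    return m, sum((x - m) ** 2 * p for x, p in pmf.items())

# Marginals of Y and X.
p_y, p_x = {}, {}
for (x, y), p in joint.items():
    p_y[y] = p_y.get(y, 0) + p
    p_x[x] = p_x.get(x, 0) + p

# Conditional PMFs p(x | y) with their means and variances.
cond = {
    y: {x: p / p_y[y] for (x, yy), p in joint.items() if yy == y} for y in p_y
}
cond_stats = {y: mean_var(pmf) for y, pmf in cond.items()}

mean_x, var_x = mean_var(p_x)
mean_of_cond_means = sum(p_y[y] * cond_stats[y][0] for y in p_y)          # E[E[X|Y]]
within = sum(p_y[y] * cond_stats[y][1] for y in p_y)                      # E[Var(X|Y)]
between = sum(p_y[y] * (cond_stats[y][0] - mean_x) ** 2 for y in p_y)     # Var(E[X|Y])
print(mean_x, mean_of_cond_means, var_x, within + between)
```

Here the group means differ a lot (2 versus 5), so most of the total variance comes from the between-group term.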
In ML: Conditional expectation is what regression models estimate. When we fit $\hat{y} = f(x)$ by minimizing squared error, we're estimating the conditional mean $E[Y \mid X = x]$.
Summary
- Random variables map outcomes to numbers, enabling mathematical analysis
- PMFs describe discrete variables; PDFs describe continuous variables
- The CDF unifies both cases
- Expectation is linear regardless of dependence — the most useful property
- Variance measures spread; covariance measures co-movement
- Zero covariance does not imply independence
- The covariance matrix encodes all pairwise relationships in a random vector
- Conditional expectation is what regression models estimate
- Next: a deep dive into the key probability distributions