Random Variables and Expectation

Probability & Statistics Series 2 / 13

From Events to Numbers

In the previous article, we defined probability over events like “rolling an even number.” But to do mathematics — compute averages, measure spread, fit models — we need to work with numbers, not sets.

A random variable is a function that maps outcomes from the sample space to real numbers. It bridges the world of abstract events and the world of quantitative analysis.

Discrete Random Variables

A discrete random variable takes on a countable set of values (finite or countably infinite).

Probability Mass Function (PMF)

The PMF gives the probability of each possible value:

$$p_X(x) = P(X = x)$$

Properties:

$p_X(x) \geq 0$ for all $x$
$\sum_{x} p_X(x) = 1$

Example: For a fair die, $X$ = the number rolled. The PMF is $p_X(k) = 1/6$ for $k \in \{1, 2, 3, 4, 5, 6\}$ .
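The die example can be sketched in a few lines of Python. The names `die_pmf`, `is_valid_pmf`, and `prob` are illustrative choices, not part of any library, and exact fractions are used so the checks are not blurred by floating-point rounding.

```python
from fractions import Fraction

# PMF of a fair six-sided die, stored as {value: probability}.
die_pmf = {k: Fraction(1, 6) for k in range(1, 7)}

def is_valid_pmf(pmf):
    """Check the two PMF properties: non-negativity and normalization."""
    return all(p >= 0 for p in pmf.values()) and sum(pmf.values()) == 1

def prob(pmf, event):
    """P(X in event): sum the PMF over the values that satisfy `event`."""
    return sum(p for x, p in pmf.items() if event(x))

# The event "rolling an even number" expressed through the random variable.
p_even = prob(die_pmf, lambda x: x % 2 == 0)
print(is_valid_pmf(die_pmf), p_even)
```

Representing the PMF as a dictionary keeps the link to the definition visible: every probability question about $X$ reduces to summing entries.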

Common Discrete Random Variables

| Random Variable | PMF | Support |
| --- | --- | --- |
| Bernoulli($p$) | $p^x(1-p)^{1-x}$ | $\{0, 1\}$ |
| Binomial($n, p$) | $\binom{n}{x}p^x(1-p)^{n-x}$ | $\{0, 1, \ldots, n\}$ |
| Poisson($\lambda$) | $\frac{\lambda^x e^{-\lambda}}{x!}$ | $\{0, 1, 2, \ldots\}$ |
| Geometric($p$) | $(1-p)^{x-1}p$ | $\{1, 2, 3, \ldots\}$ |

We explore these distributions in depth in the distributions article.

Continuous Random Variables

A continuous random variable takes values in an uncountable set, typically an interval or all of $\mathbb{R}$, and its probabilities are described by a density rather than by point masses.

Probability Density Function (PDF)

The PDF $f_X(x)$ describes the relative likelihood of values near $x$. Probabilities come from integrating it over an interval:

$$P(a \leq X \leq b) = \int_{a}^{b} f_X(x) \, dx$$

Properties:

$f_X(x) \geq 0$ for all $x$
$\int_{-\infty}^{\infty} f_X(x) \, dx = 1$

Key insight: For continuous variables, $P(X = \text{exact value}) = 0$ . We can only assign probability to intervals. The PDF value $f_X(x)$ is a density, not a probability — it can exceed 1.
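A small numerical sketch makes the "density can exceed 1" point concrete. It assumes a Uniform(0, 0.5) variable, chosen here only for illustration, and uses a hand-rolled midpoint rule (`integrate`) rather than any library routine.

```python
def uniform_pdf(x, low=0.0, high=0.5):
    """Density of Uniform(low, high): 1 / (high - low) inside, 0 outside."""
    return 1.0 / (high - low) if low <= x <= high else 0.0

def integrate(f, a, b, n=10_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

density_at_quarter = uniform_pdf(0.25)            # a density value above 1
total_area = integrate(uniform_pdf, 0.0, 0.5)     # still integrates to 1
prob_interval = integrate(uniform_pdf, 0.1, 0.2)  # P(0.1 <= X <= 0.2)
print(density_at_quarter, total_area, prob_interval)
```

The density is 2 everywhere on the support, yet the total area stays at 1 because the support is only half a unit wide.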

From PMF to PDF

Think of the PDF as the continuous analog of the PMF. Where the PMF uses sums, the PDF uses integrals:

| Concept | Discrete | Continuous |
| --- | --- | --- |
| Probability function | PMF: $P(X = x)$ | PDF: $f_X(x)$ |
| Probability of range | $\sum_{x=a}^{b} P(X=x)$ | $\int_a^b f_X(x) \, dx$ |
| Normalization | $\sum_x P(X=x) = 1$ | $\int f_X(x) \, dx = 1$ |

Cumulative Distribution Function (CDF)

The CDF works for both discrete and continuous random variables:

$$F_X(x) = P(X \leq x)$$

Properties:

$F_X$ is non-decreasing
$\lim_{x \to -\infty} F_X(x) = 0$ and $\lim_{x \to \infty} F_X(x) = 1$
For continuous variables: $f_X(x) = F_X'(x)$ wherever the derivative exists (the PDF is the derivative of the CDF)

The CDF is useful for computing probabilities of intervals:

$$P(a < X \leq b) = F_X(b) - F_X(a)$$
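As a quick check of the interval formula, the sketch below builds the CDF of the fair die from its PMF and compares $F_X(5) - F_X(2)$ with the direct sum over $\{3, 4, 5\}$. The helper name `cdf` is an illustrative choice.

```python
from fractions import Fraction

die_pmf = {k: Fraction(1, 6) for k in range(1, 7)}

def cdf(pmf, x):
    """F_X(x) = P(X <= x): accumulate the PMF mass at or below x."""
    return sum((p for v, p in pmf.items() if v <= x), Fraction(0))

# P(2 < X <= 5) computed two ways.
p_interval = cdf(die_pmf, 5) - cdf(die_pmf, 2)
p_direct = sum(die_pmf[k] for k in (3, 4, 5))
print(p_interval, p_direct)
```

Note the half-open interval: the lower endpoint 2 is excluded because $F_X(2)$ already contains the mass at 2.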

Expectation (Mean)

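The expected value is the probability-weighted average of the values a random variable can take. By the law of large numbers, it is also the long-run average over many independent repetitions:

$$\mathbb{E}[X] = \sum_{x} x \cdot P(X = x) \quad \text{(discrete)}$$

$$\mathbb{E}[X] = \int_{-\infty}^{\infty} x \cdot f_X(x) \, dx \quad \text{(continuous)}$$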

Linearity of Expectation

Linearity is arguably the most useful property of expectation. It holds whenever the expectations exist, even when $X$ and $Y$ are dependent:

$$\mathbb{E}[aX + bY + c] = a\mathbb{E}[X] + b\mathbb{E}[Y] + c$$

Example: What is the expected number of heads in 100 fair coin flips? Let $X_i = 1$ if flip $i$ lands heads and $X_i = 0$ otherwise, so $\mathbb{E}[X_i] = 0.5$. By linearity: $\mathbb{E}[\sum X_i] = \sum \mathbb{E}[X_i] = 100 \cdot 0.5 = 50$. There is no need to sum over the binomial PMF.
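The coin-flip shortcut can be checked against the long route. The sketch below builds the Binomial(100, 1/2) PMF explicitly, sums $x \cdot P(X = x)$, and compares the result with the one-line linearity answer. Variable names are illustrative.

```python
from fractions import Fraction
from math import comb

n, p = 100, Fraction(1, 2)

# Direct route: build the Binomial(n, p) PMF and sum x * P(X = x).
binom_pmf = {x: comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(n + 1)}
mean_direct = sum(x * px for x, px in binom_pmf.items())

# Linearity route: add up the n per-flip expectations.
mean_linear = n * p

print(mean_direct, mean_linear)
```

Both routes agree exactly, but the linearity route never touches the 101-term PMF.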

Law of the Unconscious Statistician (LOTUS)

To compute $\mathbb{E}[g(X)]$ for a function $g$ , you don’t need the distribution of $g(X)$ — just use the distribution of $X$ :

$$\mathbb{E}[g(X)] = \sum_{x} g(x) \cdot P(X = x) \quad \text{(discrete)}$$

$$\mathbb{E}[g(X)] = \int_{-\infty}^{\infty} g(x) \cdot f_X(x) \, dx \quad \text{(continuous)}$$

This is essential for computing moments like $\mathbb{E}[X^2]$ .
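For the fair die, LOTUS gives the second moment without ever writing down the distribution of $X^2$. The helper `expect` below is a hypothetical name for a direct transcription of the discrete formula.

```python
from fractions import Fraction

die_pmf = {k: Fraction(1, 6) for k in range(1, 7)}

def expect(pmf, g=lambda x: x):
    """E[g(X)] via LOTUS: weight g(x) by P(X = x); no PMF of g(X) is needed."""
    return sum(g(x) * p for x, p in pmf.items())

mean = expect(die_pmf)                            # E[X]
second_moment = expect(die_pmf, lambda x: x**2)   # E[X^2]
print(mean, second_moment)
```

Here $\mathbb{E}[X] = 7/2$ and $\mathbb{E}[X^2] = 91/6$, which differs from $(\mathbb{E}[X])^2 = 49/4$. In general $\mathbb{E}[g(X)] \neq g(\mathbb{E}[X])$.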

Variance and Standard Deviation

Variance measures how spread out a distribution is around its mean:

$$\text{Var}(X) = \mathbb{E}\!\left[(X - \mathbb{E}[X])^2\right] = \mathbb{E}[X^2] - \left(\mathbb{E}[X]\right)^2$$

The second form (computational formula) is usually easier to calculate.

Standard deviation is the square root of variance, in the same units as $X$ :

$$\text{SD}(X) = \sigma_X = \sqrt{\text{Var}(X)}$$

Properties of Variance

$$\begin{aligned} \text{Var}(aX + b) &= a^2 \, \text{Var}(X) \\[4pt] \text{Var}(X + Y) &= \text{Var}(X) + \text{Var}(Y) + 2\text{Cov}(X, Y) \\[4pt] \text{Var}(X + Y) &= \text{Var}(X) + \text{Var}(Y) \quad \text{(if independent)} \end{aligned}$$

Key insight: Variance scales with the square of the constant. This is why doubling a quantity quadruples its variance.
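The scaling rule can be verified on the fair die. The sketch below computes $\text{Var}(X)$ with both the definitional and the computational formula, then checks $\text{Var}(2X + 3) = 4\,\text{Var}(X)$. The helper names are illustrative.

```python
from fractions import Fraction

die_pmf = {k: Fraction(1, 6) for k in range(1, 7)}

def expect(pmf, g=lambda x: x):
    """E[g(X)] for a discrete PMF stored as {value: probability}."""
    return sum(g(x) * p for x, p in pmf.items())

def variance(pmf, g=lambda x: x):
    """Var(g(X)) via the computational formula E[g(X)^2] - (E[g(X)])^2."""
    return expect(pmf, lambda x: g(x) ** 2) - expect(pmf, g) ** 2

mean = expect(die_pmf)
var_definition = expect(die_pmf, lambda x: (x - mean) ** 2)  # E[(X - E[X])^2]
var_x = variance(die_pmf)
var_scaled = variance(die_pmf, lambda x: 2 * x + 3)          # Var(2X + 3)
print(var_x, var_scaled)
```

The shift by 3 has no effect and the factor 2 contributes $2^2 = 4$, so the variance goes from $35/12$ to $35/3$.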

Joint Distributions

When we have two or more random variables, we need their joint distribution to describe how they relate.

Joint PMF (Discrete)

$$p_{X,Y}(x, y) = P(X = x, Y = y)$$

Joint PDF (Continuous)

$$P(X \in A, Y \in B) = \int_A \int_B f_{X,Y}(x, y) \, dy \, dx$$

Marginal Distributions

To get the distribution of one variable from the joint distribution, marginalize (sum or integrate) over the other:

$$p_X(x) = \sum_{y} p_{X,Y}(x, y) \quad \text{(discrete)}$$

$$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y) \, dy \quad \text{(continuous)}$$

Intuition: Marginalization “projects” the joint distribution onto one axis. If you have a scatter plot of $(X, Y)$ , the marginal of $X$ is the histogram you get by ignoring the $Y$ values.
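In the discrete case, marginalization is just a grouped sum. The sketch below uses a small made-up joint PMF over two binary variables (the numbers are hypothetical and serve only as an illustration) and a helper `marginal` that sums out the other coordinate.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical joint PMF, stored as {(x, y): probability}.
joint = {
    (0, 0): Fraction(1, 4), (0, 1): Fraction(1, 4),
    (1, 0): Fraction(1, 8), (1, 1): Fraction(3, 8),
}

def marginal(joint_pmf, axis):
    """Sum the joint PMF over the other variable (axis 0 -> X, axis 1 -> Y)."""
    out = defaultdict(Fraction)  # Fraction() is zero, so sums start at 0
    for pair, p in joint_pmf.items():
        out[pair[axis]] += p
    return dict(out)

p_x = marginal(joint, 0)
p_y = marginal(joint, 1)
print(p_x, p_y)
```

Each marginal is again a valid PMF: its entries are non-negative and sum to 1.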

Conditional Distributions

The distribution of $X$ given $Y = y$, defined wherever $f_Y(y) > 0$:

$$f_{X \mid Y}(x \mid y) = \frac{f_{X,Y}(x, y)}{f_Y(y)}$$

This is the continuous analog of conditional probability.
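The discrete counterpart, $p_{X \mid Y}(x \mid y) = p_{X,Y}(x, y) / p_Y(y)$, is easy to compute directly. The sketch below assumes a small hypothetical joint PMF, and the function name `conditional_x_given_y` is an illustrative choice.

```python
from fractions import Fraction

# Hypothetical joint PMF, stored as {(x, y): probability}.
joint = {
    (0, 0): Fraction(1, 4), (0, 1): Fraction(1, 4),
    (1, 0): Fraction(1, 8), (1, 1): Fraction(3, 8),
}

def conditional_x_given_y(joint_pmf, y):
    """p_{X|Y}(x | y) = p_{X,Y}(x, y) / p_Y(y), defined only when p_Y(y) > 0."""
    p_y = sum(p for (_, yy), p in joint_pmf.items() if yy == y)
    if p_y == 0:
        raise ValueError("cannot condition on an event of probability zero")
    return {x: p / p_y for (x, yy), p in joint_pmf.items() if yy == y}

x_given_y1 = conditional_x_given_y(joint, 1)
print(x_given_y1)
```

Dividing by $p_Y(y)$ renormalizes the slice of the joint table at $Y = y$ so that it sums to 1.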

Covariance and Correlation

Covariance

Covariance measures how two variables move together:

$$\text{Cov}(X, Y) = \mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])] = \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y]$$

$\text{Cov}(X, Y) > 0$ : $X$ and $Y$ tend to increase together
$\text{Cov}(X, Y) < 0$ : One tends to increase when the other decreases
$\text{Cov}(X, Y) = 0$ : No linear relationship (but they may still be dependent!)

Warning: Zero covariance does NOT imply independence. Consider $X \sim \text{Uniform}(-1, 1)$ and $Y = X^2$ . They are clearly dependent, but $\text{Cov}(X, Y) = 0$ .
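The claim can be verified by hand. With $X \sim \text{Uniform}(-1, 1)$ the density is $f_X(x) = 1/2$ on $[-1, 1]$, and odd powers of $x$ integrate to zero over a symmetric interval:

$$\begin{aligned} \mathbb{E}[X] &= \int_{-1}^{1} x \cdot \tfrac{1}{2} \, dx = 0 \\[4pt] \mathbb{E}[X^2] &= \int_{-1}^{1} x^2 \cdot \tfrac{1}{2} \, dx = \tfrac{1}{3} \\[4pt] \mathbb{E}[XY] &= \mathbb{E}[X^3] = \int_{-1}^{1} x^3 \cdot \tfrac{1}{2} \, dx = 0 \\[4pt] \text{Cov}(X, Y) &= \mathbb{E}[XY] - \mathbb{E}[X]\,\mathbb{E}[Y] = 0 - 0 \cdot \tfrac{1}{3} = 0 \end{aligned}$$

Yet $Y$ is completely determined by $X$, so the two variables are as dependent as they can be. Covariance only detects the linear part of a relationship.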

Correlation

Pearson correlation normalizes covariance to $[-1, 1]$ :

$$\rho_{X,Y} = \frac{\text{Cov}(X, Y)}{\sigma_X \cdot \sigma_Y}$$

$\rho = 1$ : Perfect positive linear relationship
$\rho = -1$ : Perfect negative linear relationship
$\rho = 0$ : No linear relationship
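The three cases above can be reproduced with tiny discrete examples. The sketch below computes $\rho$ from a joint PMF. The three distributions (`linear`, `anti`, `symmetric`) are made up for illustration, and `moments` and `correlation` are hypothetical helper names.

```python
from fractions import Fraction
from math import sqrt

def moments(joint_pmf):
    """Return (Cov(X, Y), Var(X), Var(Y)) for a joint PMF {(x, y): prob}."""
    ex = sum(x * p for (x, y), p in joint_pmf.items())
    ey = sum(y * p for (x, y), p in joint_pmf.items())
    exy = sum(x * y * p for (x, y), p in joint_pmf.items())
    ex2 = sum(x * x * p for (x, y), p in joint_pmf.items())
    ey2 = sum(y * y * p for (x, y), p in joint_pmf.items())
    return exy - ex * ey, ex2 - ex**2, ey2 - ey**2

def correlation(joint_pmf):
    """Pearson correlation: Cov(X, Y) / (sigma_X * sigma_Y)."""
    cov, var_x, var_y = moments(joint_pmf)
    return float(cov) / sqrt(float(var_x * var_y))

linear = {(x, 2 * x + 1): Fraction(1, 3) for x in (0, 1, 2)}   # Y = 2X + 1
anti = {(x, -x): Fraction(1, 3) for x in (0, 1, 2)}            # Y = -X
symmetric = {(x, x * x): Fraction(1, 3) for x in (-1, 0, 1)}   # Y = X^2

print(correlation(linear), correlation(anti), correlation(symmetric))
```

The last case is a discrete cousin of the $Y = X^2$ warning: the correlation is zero even though $Y$ is a deterministic function of $X$.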

The Covariance Matrix

For a random vector $\mathbf{X} = [X_1, X_2, \ldots, X_d]^\top$ , the covariance matrix is:

$$\boldsymbol{\Sigma} = \text{Cov}(\mathbf{X}) = \mathbb{E}[(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})^\top]$$

$$\boldsymbol{\Sigma} = \begin{bmatrix} \text{Var}(X_1) & \text{Cov}(X_1, X_2) & \cdots & \text{Cov}(X_1, X_d) \\ \text{Cov}(X_2, X_1) & \text{Var}(X_2) & \cdots & \text{Cov}(X_2, X_d) \\ \vdots & \vdots & \ddots & \vdots \\ \text{Cov}(X_d, X_1) & \text{Cov}(X_d, X_2) & \cdots & \text{Var}(X_d) \end{bmatrix}$$

The covariance matrix is always symmetric and positive semi-definite. It’s central to the Multivariate Gaussian and to PCA (Principal Component Analysis).
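A minimal pure-Python sketch of the definition, assuming a made-up distribution over three 3-dimensional outcomes. It builds $\boldsymbol{\Sigma}$ entry by entry and spot-checks symmetry and positive semi-definiteness through the quadratic form $\mathbf{w}^\top \boldsymbol{\Sigma} \mathbf{w} = \text{Var}(\mathbf{w}^\top \mathbf{X}) \geq 0$. In practice a numerical library would be used. All names here are illustrative.

```python
from fractions import Fraction

# Hypothetical distribution over 3-dimensional outcomes: (vector, probability).
outcomes = [
    ((1, 2, 0), Fraction(1, 4)),
    ((2, 1, 1), Fraction(1, 4)),
    ((3, 5, 1), Fraction(1, 2)),
]
d = 3

# Mean vector: mu[i] = E[X_i].
mu = [sum(v[i] * p for v, p in outcomes) for i in range(d)]

# sigma[i][j] = E[(X_i - mu_i)(X_j - mu_j)].
sigma = [
    [sum((v[i] - mu[i]) * (v[j] - mu[j]) * p for v, p in outcomes) for j in range(d)]
    for i in range(d)
]

def quadratic_form(matrix, w):
    """w^T * matrix * w, which equals Var(w . X) when matrix is sigma."""
    return sum(w[i] * matrix[i][j] * w[j] for i in range(d) for j in range(d))

print(sigma[0][0], quadratic_form(sigma, (1, -1, 2)))
```

Symmetry follows because the product $(X_i - \mu_i)(X_j - \mu_j)$ does not depend on the order of its factors, and the quadratic form is non-negative because it is itself a variance.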

Transformations of Random Variables

If $Y = g(X)$ and we know the distribution of $X$ , how do we find the distribution of $Y$ ?

Discrete Case

Simply map probabilities: $P(Y = y) = P(g(X) = y) = \sum_{x: g(x) = y} P(X = x)$ .
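The mapping rule amounts to pooling probability mass. The sketch below takes $X$ uniform on $\{-2, -1, 0, 1, 2\}$ (a hypothetical example) and $g(x) = x^2$, where two values of $x$ collapse onto each nonzero $y$. The helper name `transform_pmf` is illustrative.

```python
from collections import defaultdict
from fractions import Fraction

def transform_pmf(pmf, g):
    """PMF of Y = g(X): pool the mass of every x that maps to the same y."""
    out = defaultdict(Fraction)  # Fraction() is zero, so sums start at 0
    for x, p in pmf.items():
        out[g(x)] += p
    return dict(out)

x_pmf = {x: Fraction(1, 5) for x in (-2, -1, 0, 1, 2)}
y_pmf = transform_pmf(x_pmf, lambda x: x * x)
print(y_pmf)
```

Because $g$ is not one-to-one here, $P(Y = 4) = P(X = -2) + P(X = 2) = 2/5$, which is exactly the sum over $\{x : g(x) = y\}$ in the formula.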

Continuous Case (Change of Variables)

For a strictly monotonic, differentiable function $g$ with inverse $g^{-1}$:

$$f_Y(y) = f_X(g^{-1}(y)) \cdot \left|\frac{d}{dy} g^{-1}(y)\right|$$

The factor $\left|\frac{d}{dy} g^{-1}(y)\right|$ is the absolute value of the Jacobian in one dimension. It corrects for how $g$ stretches or compresses probability mass.

Example: If $X \sim \mathcal{N}(\mu, \sigma^2)$ and $Z = (X - \mu)/\sigma$ , then $Z \sim \mathcal{N}(0, 1)$ . This standardization is fundamental to the Central Limit Theorem.
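The standardization result follows directly from the change-of-variables formula. Here $g(x) = (x - \mu)/\sigma$ with $\sigma > 0$, so $g^{-1}(z) = \mu + \sigma z$ and $\frac{d}{dz} g^{-1}(z) = \sigma$. Using the normal density $f_X(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$:

$$\begin{aligned} f_Z(z) &= f_X(\mu + \sigma z) \cdot |\sigma| \\[4pt] &= \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(\sigma z)^2}{2\sigma^2}\right) \cdot \sigma \\[4pt] &= \frac{1}{\sqrt{2\pi}} \, e^{-z^2/2} \end{aligned}$$

This is the density of $\mathcal{N}(0, 1)$. The factor $\sigma$ from the derivative cancels the $1/\sigma$ in the original density.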

Conditional Expectation

The conditional expectation $\mathbb{E}[X \mid Y]$ is itself a random variable — it’s a function of $Y$ :

$$\mathbb{E}[X \mid Y = y] = \int x \cdot f_{X \mid Y}(x \mid y) \, dx$$

Law of Total Expectation

$$\mathbb{E}[X] = \mathbb{E}[\mathbb{E}[X \mid Y]]$$

This is incredibly powerful for computing expectations by conditioning on a simpler variable.
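A two-stage experiment shows the idea. Assume, purely for illustration, that $Y$ picks a fair 4-sided or 6-sided die with probability 1/2 each, and $X$ is the roll of the chosen die. The sketch below computes $\mathbb{E}[X]$ directly from the joint PMF and again by averaging the conditional means. Function names are hypothetical.

```python
from fractions import Fraction

# Joint PMF of (roll X, number of sides Y) for the two-stage experiment.
joint = {}
for sides in (4, 6):
    for x in range(1, sides + 1):
        joint[(x, sides)] = Fraction(1, 2) * Fraction(1, sides)

def marginal_y(joint_pmf):
    """p_Y(y): sum the joint PMF over x."""
    out = {}
    for (x, y), p in joint_pmf.items():
        out[y] = out.get(y, 0) + p
    return out

def cond_mean(joint_pmf, y):
    """E[X | Y = y] = sum_x x * p(x, y) / p_Y(y)."""
    p_y = marginal_y(joint_pmf)[y]
    return sum(x * p for (x, yy), p in joint_pmf.items() if yy == y) / p_y

p_y = marginal_y(joint)
mean_direct = sum(x * p for (x, y), p in joint.items())          # E[X]
mean_tower = sum(p_y[y] * cond_mean(joint, y) for y in p_y)      # E[E[X | Y]]
print(mean_direct, mean_tower)
```

Conditioning splits the problem into two easy pieces ($5/2$ for the 4-sided die, $7/2$ for the 6-sided die), which are then averaged with weights $p_Y(y)$.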

Law of Total Variance

$$\text{Var}(X) = \mathbb{E}[\text{Var}(X \mid Y)] + \text{Var}(\mathbb{E}[X \mid Y])$$

In words: total variance = average within-group variance + variance of group means. This decomposition underlies ANOVA, and it has the same additive structure as the bias-variance decomposition in ML.
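The decomposition can be checked numerically. The sketch below assumes a hypothetical experiment in which $Y$ selects a fair 4-sided or 6-sided die with equal probability and $X$ is the roll. It computes the within-group and between-group terms separately and compares their sum with $\text{Var}(X)$ computed directly. All names are illustrative.

```python
from fractions import Fraction

# Joint PMF of (roll X, number of sides Y); Y is 4 or 6 with probability 1/2.
joint = {
    (x, sides): Fraction(1, 2) * Fraction(1, sides)
    for sides in (4, 6)
    for x in range(1, sides + 1)
}
p_y = {4: Fraction(1, 2), 6: Fraction(1, 2)}  # marginal of Y, by construction

def cond_moment(y, k):
    """E[X^k | Y = y] computed from the joint PMF."""
    return sum(x**k * p for (x, yy), p in joint.items() if yy == y) / p_y[y]

cond_mean = {y: cond_moment(y, 1) for y in p_y}
cond_var = {y: cond_moment(y, 2) - cond_mean[y] ** 2 for y in p_y}

within = sum(p_y[y] * cond_var[y] for y in p_y)            # E[Var(X | Y)]
overall_mean = sum(p_y[y] * cond_mean[y] for y in p_y)     # E[E[X | Y]]
between = sum(
    p_y[y] * (cond_mean[y] - overall_mean) ** 2 for y in p_y
)                                                          # Var(E[X | Y])

# Var(X) computed directly as E[X^2] - (E[X])^2.
total_var = (
    sum(x**2 * p for (x, y), p in joint.items())
    - sum(x * p for (x, y), p in joint.items()) ** 2
)
print(within, between, total_var)
```

Here the within-group term is $25/12$ and the between-group term is $1/4$, which add up to the total variance $7/3$.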

In ML: Conditional expectation is what regression models estimate. The function $f(x) = \mathbb{E}[Y \mid X = x]$ minimizes expected squared error, so fitting a model under squared loss amounts to estimating the conditional mean.

Summary

Random variables map outcomes to numbers, enabling mathematical analysis
PMFs describe discrete variables; PDFs describe continuous variables
The CDF $F(x) = P(X \leq x)$ unifies both cases
Expectation is linear regardless of dependence — the most useful property
Variance measures spread; covariance measures co-movement
Zero covariance does not imply independence
The covariance matrix encodes all pairwise relationships in a random vector
Conditional expectation is what regression models estimate
Next: a deep dive into the key probability distributions

