Random Variables and Expectation

Discrete and continuous random variables, PMF, PDF, CDF, expectation, variance, covariance, and joint distributions explained.

Probability & Statistics February 18, 2026 8 min read

From Events to Numbers

In the previous article, we defined probability over events like “rolling an even number.” But to do mathematics — compute averages, measure spread, fit models — we need to work with numbers, not sets.

A random variable is a function that maps outcomes from the sample space to real numbers. It bridges the world of abstract events and the world of quantitative analysis.

Discrete Random Variables

A discrete random variable takes on a countable set of values (finite or countably infinite).

Probability Mass Function (PMF)

The PMF gives the probability of each possible value:

$$p_X(x) = P(X = x)$$

Properties:

  • $p_X(x) \geq 0$ for all $x$
  • $\sum_{x} p_X(x) = 1$

Example: For a fair die, let $X$ be the number rolled. The PMF is $p_X(k) = 1/6$ for $k \in \{1, 2, 3, 4, 5, 6\}$.
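The two PMF properties can be checked mechanically for the fair-die example. The following is a minimal sketch; the dictionary representation and the helper name `is_valid_pmf` are illustrative choices, not part of the article.

```python
from fractions import Fraction

# PMF of a fair six-sided die, stored as {value: probability}.
die_pmf = {k: Fraction(1, 6) for k in range(1, 7)}

def is_valid_pmf(pmf):
    """Check the two PMF properties: non-negativity and total mass 1."""
    return all(p >= 0 for p in pmf.values()) and sum(pmf.values()) == 1

print(is_valid_pmf(die_pmf))  # True

# P(X is even) is the sum of the PMF over the even values.
p_even = sum(p for k, p in die_pmf.items() if k % 2 == 0)
print(p_even)  # 1/2
```

Using `Fraction` keeps the arithmetic exact, so the total mass is exactly 1 rather than 0.9999999999999999.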

Common Discrete Random Variables

| Random Variable | PMF | Support |
| --- | --- | --- |
| Bernoulli($p$) | $p^x(1-p)^{1-x}$ | $\{0, 1\}$ |
| Binomial($n, p$) | $\binom{n}{x}p^x(1-p)^{n-x}$ | $\{0, 1, \ldots, n\}$ |
| Poisson($\lambda$) | $\frac{\lambda^x e^{-\lambda}}{x!}$ | $\{0, 1, 2, \ldots\}$ |
| Geometric($p$) | $(1-p)^{x-1}p$ | $\{1, 2, 3, \ldots\}$ |

We explore these distributions in depth in the distributions article.

Continuous Random Variables

A continuous random variable takes on any value in an interval (or all of $\mathbb{R}$).

Probability Density Function (PDF)

The PDF $f_X(x)$ describes the relative likelihood of values:

$$P(a \leq X \leq b) = \int_{a}^{b} f_X(x) \, dx$$

Properties:

  • $f_X(x) \geq 0$ for all $x$
  • $\int_{-\infty}^{\infty} f_X(x) \, dx = 1$

Key insight: For continuous variables, $P(X = \text{exact value}) = 0$. We can only assign probability to intervals. The PDF value $f_X(x)$ is a density, not a probability, so it can exceed 1.
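A density above 1 is easy to exhibit. The sketch below assumes a Uniform(0, 0.5) variable (an illustrative choice, not from the article) and uses a simple midpoint rule to approximate the integrals; the helper names are hypothetical.

```python
def uniform_pdf(x, lo=0.0, hi=0.5):
    """Density of Uniform(lo, hi): 1/(hi - lo) inside the interval, 0 outside."""
    return 1.0 / (hi - lo) if lo <= x <= hi else 0.0

def integrate(f, a, b, steps=10_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width

density_at_point = uniform_pdf(0.25)              # 2.0, a density greater than 1
total_mass = integrate(uniform_pdf, 0.0, 0.5)     # approximately 1
prob_interval = integrate(uniform_pdf, 0.1, 0.3)  # approximately 0.4
print(density_at_point, total_mass, prob_interval)
```

The density is 2 everywhere on the interval, yet the area under it is still 1, and probabilities come only from areas such as the one over [0.1, 0.3].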

From PMF to PDF

Think of the PDF as the continuous analog of the PMF. Where the PMF uses sums, the PDF uses integrals:

| Concept | Discrete | Continuous |
| --- | --- | --- |
| Probability function | PMF: $P(X = x)$ | PDF: $f_X(x)$ |
| Probability of range | $\sum_{x=a}^{b} P(X=x)$ | $\int_a^b f_X(x) \, dx$ |
| Normalization | $\sum_x P(X=x) = 1$ | $\int f_X(x) \, dx = 1$ |

Cumulative Distribution Function (CDF)

The CDF works for both discrete and continuous random variables:

$$F_X(x) = P(X \leq x)$$

Properties:

  • $F_X$ is non-decreasing
  • $\lim_{x \to -\infty} F_X(x) = 0$ and $\lim_{x \to \infty} F_X(x) = 1$
  • For continuous variables: $f_X(x) = F_X'(x)$ (the PDF is the derivative of the CDF)

The CDF is useful for computing probabilities of intervals:

$$P(a < X \leq b) = F_X(b) - F_X(a)$$
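The interval formula can be tried on the fair die from earlier. This is a small sketch with a hypothetical `cdf` helper that accumulates a PMF stored as a dictionary.

```python
from fractions import Fraction

die_pmf = {k: Fraction(1, 6) for k in range(1, 7)}

def cdf(pmf, x):
    """F_X(x) = P(X <= x): accumulate the PMF over values not exceeding x."""
    return sum(p for value, p in pmf.items() if value <= x)

# P(2 < X <= 5) = F_X(5) - F_X(2) = 5/6 - 2/6
prob = cdf(die_pmf, 5) - cdf(die_pmf, 2)
print(prob)  # 1/2
```

The result matches counting directly: the outcomes 3, 4, and 5 each carry probability 1/6.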

Expectation (Mean)

The expected value is the long-run average of a random variable:

$$\mathbb{E}[X] = \sum_{x} x \cdot P(X = x) \quad \text{(discrete)}$$

$$\mathbb{E}[X] = \int_{-\infty}^{\infty} x \cdot f_X(x) \, dx \quad \text{(continuous)}$$
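To connect the formula with the "long-run average" reading, the sketch below computes the exact expectation of a fair die and compares it with the average of simulated rolls. The seed and sample size are arbitrary choices made for repeatability.

```python
import random
from fractions import Fraction

# Fair die PMF, stored as {value: probability}.
die_pmf = {k: Fraction(1, 6) for k in range(1, 7)}

# Exact expectation: sum of value * probability.
exact_mean = sum(x * p for x, p in die_pmf.items())
print(exact_mean)  # 7/2

# Long-run average: simulate many rolls with a fixed seed.
rng = random.Random(0)
rolls = [rng.randint(1, 6) for _ in range(100_000)]
sample_mean = sum(rolls) / len(rolls)
print(sample_mean)  # close to 3.5; the exact digits depend on the seed
```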

Linearity of Expectation

Linearity is the most useful property of expectation. It always holds, even for dependent variables (as long as the expectations exist):

$$\mathbb{E}[aX + bY + c] = a\mathbb{E}[X] + b\mathbb{E}[Y] + c$$

Example: What is the expected number of heads in 100 fair coin flips? Let $X_i$ be 1 if flip $i$ lands heads and 0 otherwise, so $\mathbb{E}[X_i] = 0.5$. By linearity: $\mathbb{E}\left[\sum_{i=1}^{100} X_i\right] = \sum_{i=1}^{100} \mathbb{E}[X_i] = 100 \cdot 0.5 = 50$. No need to compute the binomial distribution.
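For comparison, the sketch below does the coin-flip calculation both ways: the direct sum over the Binomial(100, 1/2) PMF and the one-line linearity argument. Exact fractions are used so the two results can be compared for equality.

```python
from fractions import Fraction
from math import comb

n, p = 100, Fraction(1, 2)

# Direct route: sum k * P(X = k) over the Binomial(n, p) PMF.
direct = sum(k * comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1))

# Linearity route: n indicator variables, each with expectation p.
by_linearity = n * p

print(direct, by_linearity)  # 50 50
```

Both routes agree, but the direct sum needs 101 terms while linearity needs a single multiplication.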

Law of the Unconscious Statistician (LOTUS)

To compute $\mathbb{E}[g(X)]$ for a function $g$, you don’t need the distribution of $g(X)$; the distribution of $X$ is enough:

$$\mathbb{E}[g(X)] = \sum_{x} g(x) \cdot P(X = x) \quad \text{(discrete)}$$

$$\mathbb{E}[g(X)] = \int_{-\infty}^{\infty} g(x) \cdot f_X(x) \, dx \quad \text{(continuous)}$$

This is essential for computing moments like $\mathbb{E}[X^2]$.
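A small check of LOTUS, under the assumption (purely for illustration) that $X$ is uniform on $\{-2, -1, 0, 1, 2\}$ and $g(x) = x^2$. The sketch computes $\mathbb{E}[g(X)]$ once directly from the PMF of $X$ and once by first building the PMF of $Y = g(X)$.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical example: X uniform on {-2, -1, 0, 1, 2}, and g(x) = x^2.
pmf_x = {x: Fraction(1, 5) for x in (-2, -1, 0, 1, 2)}

def g(x):
    return x * x

# LOTUS: weight g(x) by the PMF of X directly.
lotus = sum(g(x) * p for x, p in pmf_x.items())

# Long way: first build the PMF of Y = g(X), then take its expectation.
pmf_y = defaultdict(Fraction)
for x, p in pmf_x.items():
    pmf_y[g(x)] += p
long_way = sum(y * p for y, p in pmf_y.items())

print(lotus, long_way)  # 2 2
```

Note that $g$ is not one-to-one here (both $-2$ and $2$ map to 4), which is exactly the situation where skipping the distribution of $g(X)$ saves work.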

Variance and Standard Deviation

Variance measures how spread out a distribution is around its mean:

$$\text{Var}(X) = \mathbb{E}\!\left[(X - \mathbb{E}[X])^2\right] = \mathbb{E}[X^2] - \left(\mathbb{E}[X]\right)^2$$

The second form (computational formula) is usually easier to calculate.

Standard deviation is the square root of variance, in the same units as $X$:

$$\text{SD}(X) = \sigma_X = \sqrt{\text{Var}(X)}$$

Properties of Variance

$$\begin{aligned}
\text{Var}(aX + b) &= a^2 \, \text{Var}(X) \\[4pt]
\text{Var}(X + Y) &= \text{Var}(X) + \text{Var}(Y) + 2\,\text{Cov}(X, Y) \\[4pt]
\text{Var}(X + Y) &= \text{Var}(X) + \text{Var}(Y) \quad \text{(if independent)}
\end{aligned}$$

Key insight: Variance scales with the square of the constant. This is why doubling a quantity quadruples its variance.
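The scaling rule and the agreement between the two variance formulas can be confirmed on the fair die. The helper names below are illustrative, and the transformation $2X + 3$ is an arbitrary example.

```python
from fractions import Fraction

def expectation(pmf, g=lambda x: x):
    """E[g(X)] for a discrete PMF given as {value: probability}."""
    return sum(g(x) * p for x, p in pmf.items())

def variance(pmf):
    """Var(X) via the computational formula E[X^2] - (E[X])^2."""
    return expectation(pmf, lambda x: x * x) - expectation(pmf) ** 2

die_pmf = {k: Fraction(1, 6) for k in range(1, 7)}
mean = expectation(die_pmf)

# Definition form: E[(X - E[X])^2].
var_definition = expectation(die_pmf, lambda x: (x - mean) ** 2)
var_die = variance(die_pmf)

# PMF of 2X + 3: same probabilities, transformed values.
scaled_pmf = {2 * x + 3: p for x, p in die_pmf.items()}
var_scaled = variance(scaled_pmf)

print(var_die, var_definition, var_scaled)  # 35/12 35/12 35/3
```

The shift by 3 has no effect, and the factor 2 multiplies the variance by 4, as $\text{Var}(aX + b) = a^2\,\text{Var}(X)$ predicts.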

Joint Distributions

When we have two or more random variables, we need their joint distribution to describe how they relate.

Joint PMF (Discrete)

$$p_{X,Y}(x, y) = P(X = x, Y = y)$$

Joint PDF (Continuous)

$$P(X \in A, Y \in B) = \int_A \int_B f_{X,Y}(x, y) \, dy \, dx$$

Marginal Distributions

To get the distribution of one variable from the joint distribution, marginalize (sum or integrate) over the other:

$$p_X(x) = \sum_{y} p_{X,Y}(x, y) \quad \text{(discrete)}$$

$$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y) \, dy \quad \text{(continuous)}$$

Intuition: Marginalization “projects” the joint distribution onto one axis. If you have a scatter plot of $(X, Y)$, the marginal of $X$ is the histogram you get by ignoring the $Y$ values.
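In the discrete case, marginalization is just a grouped sum. The sketch below uses a small made-up joint PMF on $\{0, 1\} \times \{0, 1\}$; the numbers and the `marginal` helper are assumptions for illustration.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical joint PMF of (X, Y), keyed by (x, y).
joint = {
    (0, 0): Fraction(1, 4), (0, 1): Fraction(1, 4),
    (1, 0): Fraction(1, 8), (1, 1): Fraction(3, 8),
}

def marginal(joint_pmf, axis):
    """Sum the joint PMF over the other variable (axis 0 keeps X, axis 1 keeps Y)."""
    result = defaultdict(Fraction)
    for outcome, p in joint_pmf.items():
        result[outcome[axis]] += p
    return dict(result)

p_x = marginal(joint, 0)  # X: 0 -> 1/2, 1 -> 1/2
p_y = marginal(joint, 1)  # Y: 0 -> 3/8, 1 -> 5/8
print(p_x, p_y)
```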

Conditional Distributions

The conditional density of $X$ given $Y = y$, defined wherever $f_Y(y) > 0$:

$$f_{X \mid Y}(x \mid y) = \frac{f_{X,Y}(x, y)}{f_Y(y)}$$

This is the continuous analog of conditional probability.
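The discrete counterpart, $p_{X \mid Y}(x \mid y) = p_{X,Y}(x, y) / p_Y(y)$, is easy to compute by hand or in code. The sketch below uses a made-up joint PMF and a hypothetical helper `conditional_x_given_y`.

```python
from fractions import Fraction

# Hypothetical joint PMF of (X, Y), keyed by (x, y).
joint = {
    (0, 0): Fraction(1, 4), (0, 1): Fraction(1, 4),
    (1, 0): Fraction(1, 8), (1, 1): Fraction(3, 8),
}

def conditional_x_given_y(joint_pmf, y):
    """p(x | y) = p(x, y) / p_Y(y), defined only when p_Y(y) > 0."""
    p_y = sum(p for (_, yy), p in joint_pmf.items() if yy == y)
    if p_y == 0:
        raise ValueError("conditioning on an event of probability zero")
    return {x: p / p_y for (x, yy), p in joint_pmf.items() if yy == y}

cond = conditional_x_given_y(joint, 1)  # X given Y = 1: 0 -> 2/5, 1 -> 3/5
print(cond)
```

Dividing by $p_Y(y)$ renormalizes the selected column of the joint table so that it sums to 1.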

Covariance and Correlation

Covariance

Covariance measures how two variables move together:

$$\text{Cov}(X, Y) = \mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])] = \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y]$$

  • $\text{Cov}(X, Y) > 0$: $X$ and $Y$ tend to increase together
  • $\text{Cov}(X, Y) < 0$: One tends to increase when the other decreases
  • $\text{Cov}(X, Y) = 0$: No linear relationship (but they may still be dependent!)

Warning: Zero covariance does NOT imply independence. Consider $X \sim \text{Uniform}(-1, 1)$ and $Y = X^2$. They are clearly dependent, but $\text{Cov}(X, Y) = 0$.
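A short worked calculation shows why the covariance vanishes in this example. It uses only the definitions above and the Uniform(-1, 1) density $f_X(x) = \tfrac{1}{2}$ on $[-1, 1]$:

$$\begin{aligned}
\mathbb{E}[X] &= \int_{-1}^{1} x \cdot \tfrac{1}{2} \, dx = 0, \\
\mathbb{E}[Y] &= \mathbb{E}[X^2] = \int_{-1}^{1} x^2 \cdot \tfrac{1}{2} \, dx = \tfrac{1}{3}, \\
\mathbb{E}[XY] &= \mathbb{E}[X^3] = \int_{-1}^{1} x^3 \cdot \tfrac{1}{2} \, dx = 0, \\
\text{Cov}(X, Y) &= \mathbb{E}[XY] - \mathbb{E}[X]\,\mathbb{E}[Y] = 0 - 0 \cdot \tfrac{1}{3} = 0.
\end{aligned}$$

The dependence is just as easy to see: $P(Y \leq \tfrac{1}{4}) = P(|X| \leq \tfrac{1}{2}) = \tfrac{1}{2}$, but $P(Y \leq \tfrac{1}{4} \mid |X| \leq \tfrac{1}{2}) = 1$, so knowing $X$ changes the distribution of $Y$.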

Correlation

Pearson correlation normalizes covariance to $[-1, 1]$:

$$\rho_{X,Y} = \frac{\text{Cov}(X, Y)}{\sigma_X \cdot \sigma_Y}$$

  • $\rho = 1$: Perfect positive linear relationship
  • $\rho = -1$: Perfect negative linear relationship
  • $\rho = 0$: No linear relationship
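Covariance and correlation can be computed exactly from a joint PMF. The sketch below assumes a small made-up joint distribution and a hypothetical `expect` helper; only the final square root uses floating point.

```python
from fractions import Fraction
from math import sqrt

# Hypothetical joint PMF of (X, Y), keyed by (x, y).
joint = {
    (0, 0): Fraction(1, 4), (0, 1): Fraction(1, 4),
    (1, 0): Fraction(1, 8), (1, 1): Fraction(3, 8),
}

def expect(joint_pmf, g):
    """E[g(X, Y)] under a discrete joint PMF."""
    return sum(g(x, y) * p for (x, y), p in joint_pmf.items())

mean_x = expect(joint, lambda x, y: x)
mean_y = expect(joint, lambda x, y: y)
cov_xy = expect(joint, lambda x, y: x * y) - mean_x * mean_y
var_x = expect(joint, lambda x, y: x * x) - mean_x ** 2
var_y = expect(joint, lambda x, y: y * y) - mean_y ** 2

# Pearson correlation: covariance divided by the product of standard deviations.
rho = float(cov_xy) / sqrt(float(var_x) * float(var_y))
print(cov_xy, var_x, var_y)  # 1/16 1/4 15/64
print(rho)                   # about 0.258, i.e. 1/sqrt(15)
```

The covariance of 1/16 is hard to interpret on its own; dividing by the standard deviations gives a unit-free value of roughly 0.26, a weak positive linear association.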

The Covariance Matrix

For a random vector $\mathbf{X} = [X_1, X_2, \ldots, X_d]^\top$, the covariance matrix is:

$$\boldsymbol{\Sigma} = \text{Cov}(\mathbf{X}) = \mathbb{E}[(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})^\top]$$

$$\boldsymbol{\Sigma} = \begin{bmatrix} \text{Var}(X_1) & \text{Cov}(X_1, X_2) & \cdots & \text{Cov}(X_1, X_d) \\ \text{Cov}(X_2, X_1) & \text{Var}(X_2) & \cdots & \text{Cov}(X_2, X_d) \\ \vdots & \vdots & \ddots & \vdots \\ \text{Cov}(X_d, X_1) & \text{Cov}(X_d, X_2) & \cdots & \text{Var}(X_d) \end{bmatrix}$$

The covariance matrix is always symmetric and positive semi-definite. It’s central to the Multivariate Gaussian and to PCA (Principal Component Analysis).
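Symmetry and positive semi-definiteness can be observed on a sample covariance matrix. The sketch below is a pure-Python illustration on synthetic data (the data-generating recipe, seed, and helper names are assumptions). It relies on the fact that $\mathbf{v}^\top \boldsymbol{\Sigma} \mathbf{v}$ is the variance of the projection $\mathbf{v}^\top \mathbf{X}$, which cannot be negative.

```python
import random

rng = random.Random(0)

# Hypothetical 3-dimensional data with built-in dependence between coordinates:
# the second coordinate equals the first plus half of the third.
samples = []
for _ in range(500):
    a, b = rng.gauss(0, 1), rng.gauss(0, 1)
    samples.append((a, a + b, 2 * b))

def covariance_matrix(rows):
    """Sample covariance matrix (dividing by n - 1) of equal-length tuples."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [
        [sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
         for j in range(d)]
        for i in range(d)
    ]

def quadratic_form(matrix, v):
    """v^T * matrix * v, the variance of the projection of the data onto v."""
    d = len(v)
    return sum(v[i] * matrix[i][j] * v[j] for i in range(d) for j in range(d))

sigma = covariance_matrix(samples)
is_symmetric = all(sigma[i][j] == sigma[j][i] for i in range(3) for j in range(3))

# Non-negative (up to rounding) for arbitrary directions.
random_directions = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(100)]
quad_values = [quadratic_form(sigma, v) for v in random_directions]

# Approximately zero along the direction where the data has no spread.
degenerate = quadratic_form(sigma, (1.0, -1.0, 0.5))
print(is_symmetric, min(quad_values), degenerate)
```

The direction $(1, -1, 0.5)$ cancels the built-in linear dependence, which is why the matrix is only positive semi-definite rather than positive definite here.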

Transformations of Random Variables

If $Y = g(X)$ and we know the distribution of $X$, how do we find the distribution of $Y$?

Discrete Case

Simply map probabilities: $P(Y = y) = P(g(X) = y) = \sum_{x:\, g(x) = y} P(X = x)$.

Continuous Case (Change of Variables)

For a strictly monotonic, differentiable function $g$ with inverse $g^{-1}$:

$$f_Y(y) = f_X(g^{-1}(y)) \cdot \left|\frac{d}{dy} g^{-1}(y)\right|$$

The absolute value of the derivative (the one-dimensional Jacobian) corrects for how $g$ stretches or compresses the probability.

Example: If $X \sim \mathcal{N}(\mu, \sigma^2)$ and $Z = (X - \mu)/\sigma$, then $Z \sim \mathcal{N}(0, 1)$. This standardization is fundamental to the Central Limit Theorem.
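As a further worked instance of the change-of-variables formula (an illustrative example, not taken from the text above), take $X \sim \text{Uniform}(0, 1)$, so $f_X(x) = 1$ on $(0, 1)$, and let $Y = g(X) = -\ln X$, which is strictly decreasing on $(0, 1)$:

$$\begin{aligned}
g^{-1}(y) &= e^{-y}, \qquad \left|\frac{d}{dy} g^{-1}(y)\right| = e^{-y}, \\
f_Y(y) &= f_X(e^{-y}) \cdot e^{-y} = 1 \cdot e^{-y} = e^{-y}, \qquad y > 0.
\end{aligned}$$

So $Y$ follows an Exponential(1) distribution. The absolute value matters here: the derivative of $g^{-1}$ is $-e^{-y}$, and without it the density would come out negative.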

Conditional Expectation

The conditional expectation $\mathbb{E}[X \mid Y]$ is itself a random variable: it is a function of $Y$. For a specific value $Y = y$, it is the number

$$\mathbb{E}[X \mid Y = y] = \int x \cdot f_{X \mid Y}(x \mid y) \, dx$$

Law of Total Expectation

$$\mathbb{E}[X] = \mathbb{E}[\mathbb{E}[X \mid Y]]$$

This is incredibly powerful for computing expectations by conditioning on a simpler variable.

Law of Total Variance

$$\text{Var}(X) = \mathbb{E}[\text{Var}(X \mid Y)] + \text{Var}(\mathbb{E}[X \mid Y])$$

In words: total variance = average within-group variance + variance of group means. This decomposition underlies ANOVA and is closely related to the bias-variance decomposition in ML.
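Both laws can be verified exactly on a two-stage experiment. The setup below is a made-up example: a fair coin $Y$ picks either a two-sided die ($Y = 0$) or a six-sided die ($Y = 1$), and $X$ is the roll. The helper names are illustrative.

```python
from fractions import Fraction

# Hypothetical two-stage experiment: Y chooses the die, X is the roll.
p_y = {0: Fraction(1, 2), 1: Fraction(1, 2)}
cond_pmf = {
    0: {x: Fraction(1, 2) for x in (1, 2)},       # X | Y = 0
    1: {x: Fraction(1, 6) for x in range(1, 7)},  # X | Y = 1
}

def mean(pmf):
    return sum(x * p for x, p in pmf.items())

def var(pmf):
    m = mean(pmf)
    return sum((x - m) ** 2 * p for x, p in pmf.items())

# Marginal PMF of X, obtained by summing p(y) * p(x | y) over y.
marginal_x = {}
for y, py in p_y.items():
    for x, px in cond_pmf[y].items():
        marginal_x[x] = marginal_x.get(x, 0) + py * px

cond_means = {y: mean(cond_pmf[y]) for y in p_y}
cond_vars = {y: var(cond_pmf[y]) for y in p_y}

# Law of total expectation: E[X] = E[E[X | Y]].
tower = sum(p_y[y] * cond_means[y] for y in p_y)

# Law of total variance: within-group part plus between-group part.
within = sum(p_y[y] * cond_vars[y] for y in p_y)
between = sum(p_y[y] * (cond_means[y] - tower) ** 2 for y in p_y)

print(tower, mean(marginal_x))            # 5/2 5/2
print(within + between, var(marginal_x))  # 31/12 31/12
```

Here the within-group part is 19/12 and the between-group part is 1, so a sizeable share of the spread in $X$ comes from which die was chosen rather than from the roll itself.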

In ML: Conditional expectation is what regression models trained with squared-error loss estimate. When we fit $f(x) = \mathbb{E}[Y \mid X = x]$, we’re estimating the conditional mean of $Y$ given $X = x$.

Summary

  • Random variables map outcomes to numbers, enabling mathematical analysis
  • PMFs describe discrete variables; PDFs describe continuous variables
  • The CDF $F(x) = P(X \leq x)$ unifies both cases
  • Expectation is linear regardless of dependence — the most useful property
  • Variance measures spread; covariance measures co-movement
  • Zero covariance does not imply independence
  • The covariance matrix encodes all pairwise relationships in a random vector
  • Conditional expectation is what regression models estimate
  • Next: a deep dive into the key probability distributions

References

  • Bertsekas, D. P., & Tsitsiklis, J. N. (2008). Introduction to Probability (2nd ed.). Athena Scientific. Chapters 2-4.
  • Blitzstein, J. K., & Hwang, J. (2019). Introduction to Probability (2nd ed.). CRC Press.
  • Ross, S. M. (2014). A First Course in Probability (9th ed.). Pearson. Chapters 4-6.
  • Casella, G., & Berger, R. L. (2002). Statistical Inference (2nd ed.). Duxbury/Thomson Learning. Chapters 2-4.
  • Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. Chapter 1.
