Convergence and the Central Limit Theorem

Probability & Statistics Series 5 / 13

The Power of Averaging

Why does the Gaussian distribution appear everywhere in nature and statistics? Why do sample means become reliable estimators as we collect more data? The answers lie in two fundamental theorems that connect sample statistics to population parameters.

These theorems justify nearly everything in statistical inference and machine learning — from MLE consistency to why stochastic gradient descent works.

Types of Convergence

Before stating the theorems, we need to understand what it means for a sequence of random variables to “converge.”

Convergence in Probability

A sequence $X_1, X_2, \ldots$ converges in probability to $X$ if:

\lim_{n \to \infty} P(|X_n - X| > \epsilon) = 0 \quad \text{for all } \epsilon > 0

Informally: the probability of $X_n$ being far from $X$ shrinks to zero. We write $X_n \xrightarrow{P} X$ .

Almost Sure Convergence

A stronger notion: $X_n \xrightarrow{\text{a.s.}} X$ if:

P\left(\lim_{n \to \infty} X_n = X\right) = 1

The sequence converges with probability 1: for almost every outcome and any fixed $\epsilon > 0$, deviations larger than $\epsilon$ occur only finitely many times.

Convergence in Distribution

The weakest notion: $X_n \xrightarrow{d} X$ if the cumulative distribution functions converge:

\lim_{n \to \infty} F_{X_n}(x) = F_X(x) \quad \text{at all continuity points of } F_X

The shape of the distribution converges, but individual realizations need not.

Key insight: Almost sure $\Rightarrow$ In probability $\Rightarrow$ In distribution. Each type implies the next, but not the reverse.
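
The reverse implications can fail. As a minimal sketch, assume $X \sim \mathcal{N}(0, 1)$ and the illustrative sequence $X_n = -X$ for every $n$ (a hypothetical example, not from the text above). Each $X_n$ has exactly the same distribution as $X$, so convergence in distribution holds trivially, yet $|X_n - X| = 2|X|$ never shrinks, so there is no convergence in probability.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)   # draws of X ~ N(0, 1)
x_n = -x                           # illustrative sequence: X_n = -X for every n

# Same distribution: empirical quantiles of X_n and X nearly coincide.
probs = [0.1, 0.25, 0.5, 0.75, 0.9]
quantile_gap = np.max(np.abs(np.quantile(x_n, probs) - np.quantile(x, probs)))

# No convergence in probability: P(|X_n - X| > 0.5) = P(|X| > 0.25) stays large.
far_fraction = np.mean(np.abs(x_n - x) > 0.5)

print(f"max quantile gap:       {quantile_gap:.3f}")
print(f"P(|X_n - X| > 0.5) est: {far_fraction:.3f}")
```

The quantile gap is only sampling noise, while the fraction of large deviations does not depend on $n$ at all.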

The Law of Large Numbers

Weak Law (WLLN)

Let $X_1, X_2, \ldots, X_n$ be i.i.d. random variables with mean $\mu$ and finite variance $\sigma^2$ . The sample mean:

\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i

converges in probability to the true mean:

\bar{X}_n \xrightarrow{P} \mu \quad \text{as } n \to \infty

Strong Law (SLLN)

Under the same conditions, the convergence also holds almost surely:

\bar{X}_n \xrightarrow{\text{a.s.}} \mu
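
The almost-sure statement concerns a single realized sequence, which the following sketch illustrates. It assumes Exponential draws with mean 2 (an arbitrary illustrative choice) and tracks the running sample mean along one path. A simulation can only suggest almost sure convergence, not prove it.

```python
import numpy as np

rng = np.random.default_rng(1)
mu = 2.0                                      # true mean of Exponential(scale=2)
x = rng.exponential(scale=mu, size=200_000)   # one long i.i.d. sequence

# Running sample mean after each new observation.
running_mean = np.cumsum(x) / np.arange(1, x.size + 1)

for n in (10, 1_000, 100_000):
    print(f"n={n:>7}: running mean = {running_mean[n - 1]:.4f}")

# Largest deviation from mu over the tail of this path (n >= 100,000).
tail_max_dev = np.max(np.abs(running_mean[100_000 - 1:] - mu))
print(f"max |mean - mu| for n >= 100000: {tail_max_dev:.4f}")
```

Once the path is close to $\mu$, it stays close for the rest of the simulated sequence, which is the behaviour the Strong Law describes.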

Proof Sketch (WLLN via Chebyshev)

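Apply Chebyshev’s inequality, $P(|Y - \mathbb{E}[Y]| \geq k\sigma_Y) \leq \frac{1}{k^2}$, to $Y = \bar{X}_n$ with $k = \epsilon / \sigma_{\bar{X}_n}$, and use $\text{Var}(\bar{X}_n) = \sigma^2 / n$ for i.i.d. samples: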

\begin{aligned} P(|\bar{X}_n - \mu| \geq \epsilon) &\leq \frac{\text{Var}(\bar{X}_n)}{\epsilon^2} \\[6pt] &= \frac{\sigma^2 / n}{\epsilon^2} \\[6pt] &= \frac{\sigma^2}{n\epsilon^2} \to 0 \end{aligned}

As $n$ grows, the variance of $\bar{X}_n$ shrinks like $1/n$ , so the sample mean concentrates around $\mu$ .
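
The bound can be checked numerically. The sketch below assumes Uniform(0, 1) draws and $\epsilon = 0.05$ (both illustrative choices) and compares the simulated probability $P(|\bar{X}_n - \mu| \geq \epsilon)$ with the Chebyshev bound $\sigma^2 / (n\epsilon^2)$.

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma2 = 0.5, 1.0 / 12.0      # mean and variance of Uniform(0, 1)
eps = 0.05
n_trials = 5_000

results = []
for n in (10, 100, 1_000):
    # n_trials independent sample means, each computed from n draws.
    means = rng.random((n_trials, n)).mean(axis=1)
    empirical = np.mean(np.abs(means - mu) >= eps)
    bound = min(1.0, sigma2 / (n * eps**2))   # Chebyshev bound, capped at 1
    results.append((n, empirical, bound))
    print(f"n={n:>5}: empirical = {empirical:.4f}   Chebyshev bound = {bound:.4f}")
```

The bound is loose (Chebyshev uses only the variance), but it is enough to force the probability to zero, which is all the proof needs.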

Why LLN Matters for ML

The Law of Large Numbers is why:

Empirical risk (average loss over training data) approximates true risk (expected loss)
Monte Carlo estimates converge to true expectations
Cross-validation scores become reliable when each score averages over enough held-out examples
MLE is consistent: under standard regularity conditions, it converges to the true parameters as the amount of data grows
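
The Monte Carlo item in the list above can be made concrete with a small sketch. It estimates $\mathbb{E}[\cos(X)]$ for $X \sim \mathcal{N}(0, 1)$, an illustrative target chosen because its exact value $e^{-1/2}$ is known, by replacing the expectation with a sample average.

```python
import math
import numpy as np

rng = np.random.default_rng(3)
true_value = math.exp(-0.5)          # exact E[cos(X)] for X ~ N(0, 1)

errors = {}
for n in (100, 10_000, 1_000_000):
    x = rng.standard_normal(n)
    estimate = np.mean(np.cos(x))    # sample average stands in for the expectation
    errors[n] = abs(estimate - true_value)
    print(f"n={n:>9}: estimate = {estimate:.4f}   abs error = {errors[n]:.4f}")
```

Empirical risk works the same way: the average loss over training examples plays the role of the sample average, and the expected loss plays the role of the true expectation.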

The Central Limit Theorem

The CLT is arguably the most important theorem in all of statistics.

Statement

Let $X_1, X_2, \ldots, X_n$ be i.i.d. with mean $\mu$ and finite variance $\sigma^2$ . Then:

\frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}} \xrightarrow{d} \mathcal{N}(0, 1) \quad \text{as } n \to \infty

Equivalently:

\bar{X}_n \approx \mathcal{N}\left(\mu, \frac{\sigma^2}{n}\right) \quad \text{for large } n

In words: whatever the original distribution, as long as its variance is finite, the sample mean is approximately Gaussian for large $n$.

What Makes This Remarkable

The original $X_i$ can be Bernoulli, Poisson, Exponential, Uniform, or any other distribution with finite variance. The CLT says that once you average enough of them, the result looks Gaussian.

This is why:

Measurement errors are approximately Gaussian (they’re sums of many small independent effects)
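Test statistics (t-stat, z-stat) approximately follow known distributions in large samples
Confidence intervals built from a normal approximation achieve close to their nominal coverage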
The Gaussian assumption in ML models is often reasonable for aggregated data
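
To see the effect on a clearly non-Gaussian input, the sketch below standardizes sample means of Exponential(1) draws (an illustrative choice with $\mu = \sigma = 1$) and tracks two simple diagnostics: the sample skewness, which is 0 for a Gaussian, and the fraction of standardized means within $\pm 1.96$, which is about 0.95 for a Gaussian.

```python
import numpy as np

rng = np.random.default_rng(4)
mu, sigma = 1.0, 1.0        # mean and std of Exponential(1)
n_trials = 10_000

def standardized_means(n):
    """Return n_trials draws of (sample mean - mu) / (sigma / sqrt(n))."""
    means = rng.exponential(scale=1.0, size=(n_trials, n)).mean(axis=1)
    return (means - mu) / (sigma / np.sqrt(n))

def skewness(z):
    """Sample skewness; 0 for a symmetric distribution such as N(0, 1)."""
    centered = z - z.mean()
    return np.mean(centered**3) / np.mean(centered**2) ** 1.5

skew, coverage = {}, {}
for n in (1, 5, 50, 500):
    z = standardized_means(n)
    skew[n] = skewness(z)
    coverage[n] = np.mean(np.abs(z) <= 1.96)
    print(f"n={n:>4}: skewness = {skew[n]:.3f}   P(|Z| <= 1.96) = {coverage[n]:.3f}")
```

The skewness of the input is inherited by the sample mean but fades as $n$ grows, which is the sense in which the average "looks Gaussian".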

Worked Example

Suppose we roll a fair die ( $\mu = 3.5$ , $\sigma^2 = 35/12 \approx 2.917$ ) and average $n = 100$ rolls. By the CLT:

\bar{X}_{100} \approx \mathcal{N}\left(3.5, \frac{2.917}{100}\right) = \mathcal{N}(3.5, 0.02917)

The standard deviation of $\bar{X}_{100}$ is $\sqrt{0.02917} \approx 0.171$. By the 68-95-99.7 rule, about 95% of averages fall within two standard deviations of the mean, that is, in $3.5 \pm 2(0.171)$, or between $3.16$ and $3.84$.

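```python
import numpy as np

n_experiments = 10000   # number of repeated experiments
n_rolls = 100           # die rolls averaged in each experiment

# randint's upper bound is exclusive, so (1, 7) draws faces 1..6
means = [np.mean(np.random.randint(1, 7, n_rolls)) for _ in range(n_experiments)]

print(f"Mean of means: {np.mean(means):.3f}")   # ≈ 3.500
print(f"Std of means:  {np.std(means):.3f}")     # ≈ 0.171
```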

Rate of Convergence

The CLT gives an approximation, but how good is it? The Berry-Esseen theorem provides a bound:

\sup_x |F_n(x) - \Phi(x)| \leq \frac{C \cdot \rho}{\sigma^3 \sqrt{n}}

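where $F_n$ is the CDF of the standardized sample mean $\sqrt{n}(\bar{X}_n - \mu)/\sigma$, $\Phi$ is the standard normal CDF, $\rho = \mathbb{E}[|X - \mu|^3]$ is the third absolute central moment (assumed finite), and $C$ is a universal constant known to be at most $0.4748$. The approximation error therefore shrinks at rate $O(1/\sqrt{n})$.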

Rule of thumb: The Gaussian approximation is usually reasonable for $n \geq 30$ , though this varies with the skewness of the original distribution.
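
The Berry-Esseen bound can be compared with the exact approximation error in a case where the CDF of the standardized sum is computable. The sketch below assumes Bernoulli($p$) inputs with $p = 0.3$ (an illustrative choice), for which $\rho = p(1-p)\left[(1-p)^2 + p^2\right]$, and uses $C = 0.4748$. The helper names are hypothetical.

```python
import math
import numpy as np

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def berry_esseen_check(n, p, C=0.4748):
    """Return (exact sup-distance, Berry-Esseen bound) for Bernoulli(p) sums."""
    sigma = math.sqrt(p * (1 - p))
    rho = p * (1 - p) * ((1 - p) ** 2 + p ** 2)      # E|X - p|^3 for Bernoulli(p)
    bound = C * rho / (sigma ** 3 * math.sqrt(n))

    # Exact Binomial(n, p) pmf, computed in log space to avoid overflow.
    log_pmf = np.array([
        math.lgamma(n + 1) - math.lgamma(j + 1) - math.lgamma(n - j + 1)
        + j * math.log(p) + (n - j) * math.log(1 - p)
        for j in range(n + 1)
    ])
    cdf = np.cumsum(np.exp(log_pmf))

    # Standardized jump locations and the normal CDF at those points.
    k = np.arange(n + 1)
    z = (k - n * p) / (sigma * math.sqrt(n))
    phi = np.array([normal_cdf(v) for v in z])

    # F_n is a step function, so the supremum is attained at a jump:
    # compare Phi with F_n just after and just before each jump.
    cdf_left = np.concatenate(([0.0], cdf[:-1]))
    actual = max(np.max(np.abs(cdf - phi)), np.max(np.abs(cdf_left - phi)))
    return actual, bound

for n in (10, 100, 1_000):
    actual, bound = berry_esseen_check(n, p=0.3)
    print(f"n={n:>5}: sup|F_n - Phi| = {actual:.4f}   bound = {bound:.4f}")
```

Both columns shrink roughly like $1/\sqrt{n}$; for lattice distributions such as the Bernoulli, the rate in the bound cannot be improved in general.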

Standard Error

The CLT tells us that the uncertainty in a sample mean is:

\text{SE}(\bar{X}_n) = \frac{\sigma}{\sqrt{n}}

This is the standard error — the standard deviation of the sampling distribution of the mean. Key properties:

It shrinks as $1/\sqrt{n}$, not $1/n$. To halve the uncertainty, you need 4 times the data
In practice, we estimate it as $\widehat{\text{SE}} = s / \sqrt{n}$ , where $s$ is the sample standard deviation
This directly determines the width of confidence intervals

Intuition: Collecting 10x more data doesn’t give 10x more precision, only $\sqrt{10} \approx 3.16$ times more. These diminishing returns are fundamental to experimental design.
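
A quick simulation makes the square-root scaling visible. The sketch below assumes Exponential(1) data (an illustrative choice with $\sigma = 1$), measures the spread of the sample mean at $n = 250$ and $n = 1000$, and also computes the plug-in estimate $s / \sqrt{n}$ from a single sample.

```python
import numpy as np

rng = np.random.default_rng(5)
n_trials = 4_000

def sd_of_mean(n):
    """Empirical std of the sample mean over n_trials repeated samples of size n."""
    samples = rng.exponential(scale=1.0, size=(n_trials, n))
    return samples.mean(axis=1).std(ddof=1)

sd_n = sd_of_mean(250)
sd_4n = sd_of_mean(1_000)
ratio = sd_n / sd_4n                  # theory: sqrt(4) = 2

# Plug-in standard error from one observed sample: s / sqrt(n).
one_sample = rng.exponential(scale=1.0, size=1_000)
se_hat = one_sample.std(ddof=1) / np.sqrt(one_sample.size)

print(f"SD of mean, n=250:   {sd_n:.4f}")
print(f"SD of mean, n=1000:  {sd_4n:.4f}")
print(f"ratio (theory 2):    {ratio:.3f}")
print(f"s/sqrt(n), n=1000:   {se_hat:.4f}")
```

Four times the data halves the spread, and the single-sample estimate lands near the simulated spread at the same $n$ without requiring repeated experiments.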

CLT for Sums

The CLT also applies to sums (not just averages):

S_n = \sum_{i=1}^{n} X_i \approx \mathcal{N}(n\mu, n\sigma^2)

This is why:

The Binomial approaches a Gaussian for large $n$ (sum of Bernoulli trials)
The Poisson approaches a Gaussian for large $\lambda$ (for integer $\lambda$, a Poisson($\lambda$) variable is a sum of $\lambda$ independent Poisson(1) variables)
Total measurement error in instruments is approximately Gaussian
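
The Binomial case in the list above is easy to check. The sketch below assumes $n = 400$ Bernoulli trials with $p = 0.3$ (illustrative values) and compares simulated sums with $\mathcal{N}(n\mu, n\sigma^2)$, both in their first two moments and in a few cumulative probabilities.

```python
import math
import numpy as np

rng = np.random.default_rng(6)
n, p = 400, 0.3
mu, sigma2 = p, p * (1 - p)            # Bernoulli mean and variance

# Each Binomial(n, p) draw is a sum S_n of n independent Bernoulli(p) trials.
sums = rng.binomial(n, p, size=100_000)

def normal_cdf(x, mean, var):
    """CDF of N(mean, var) evaluated at x."""
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0 * var)))

print(f"mean of S_n: {sums.mean():.2f}   (n*mu      = {n * mu:.2f})")
print(f"var of S_n:  {sums.var():.2f}   (n*sigma^2 = {n * sigma2:.2f})")

# Compare P(S_n <= k) with the Gaussian approximation at a few thresholds.
max_gap = 0.0
for k in (105, 120, 135):
    empirical = np.mean(sums <= k)
    approx = normal_cdf(k, n * mu, n * sigma2)
    max_gap = max(max_gap, abs(empirical - approx))
    print(f"P(S_n <= {k}): simulated = {empirical:.4f}   Gaussian = {approx:.4f}")
```

The remaining gap of a few percentage points comes mostly from approximating a discrete distribution with a continuous one.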

Multivariate CLT

For i.i.d. random vectors $\mathbf{X}_i \in \mathbb{R}^d$ with mean $\boldsymbol{\mu}$ and finite covariance $\boldsymbol{\Sigma}$:

\sqrt{n}(\bar{\mathbf{X}}_n - \boldsymbol{\mu}) \xrightarrow{d} \mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma})

This justifies the use of multivariate Gaussian models for sample means of vector-valued data.
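
As a sketch of the vector case, assume two-dimensional vectors $\mathbf{X} = (U, U + V)$ with $U, V$ independent Exponential(1) variables. This is a hypothetical construction chosen so that $\boldsymbol{\mu} = (1, 2)$ and $\boldsymbol{\Sigma}$ has entries $1, 1, 1, 2$ in closed form. The covariance of $\sqrt{n}(\bar{\mathbf{X}}_n - \boldsymbol{\mu})$ across repeated samples should then be close to $\boldsymbol{\Sigma}$.

```python
import numpy as np

rng = np.random.default_rng(7)
n, n_trials = 200, 5_000

# Illustrative non-Gaussian vectors: X = (U, U + V), U and V independent Exponential(1).
mu = np.array([1.0, 2.0])
Sigma = np.array([[1.0, 1.0],
                  [1.0, 2.0]])

u = rng.exponential(size=(n_trials, n))
v = rng.exponential(size=(n_trials, n))
x = np.stack([u, u + v], axis=-1)               # shape (n_trials, n, 2)

# One scaled, centered sample mean per trial.
z = np.sqrt(n) * (x.mean(axis=1) - mu)          # shape (n_trials, 2)

emp_cov = np.cov(z, rowvar=False)               # rows are trials, columns are coordinates
print("empirical covariance of sqrt(n) * (mean - mu):")
print(np.round(emp_cov, 3))
```

The off-diagonal entry matters: the limiting Gaussian keeps the correlation between coordinates, not just their individual variances.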

When the CLT Fails

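The classical CLT requires finite variance. For heavy-tailed distributions with infinite variance, such as the Cauchy or a Pareto with $\alpha \leq 2$, it does not apply as stated. When the tails are heavy enough (the Cauchy, or a Pareto with $\alpha < 2$), suitably rescaled sums converge to non-Gaussian stable distributions instead. The average of $n$ standard Cauchy variables, for example, is again standard Cauchy for every $n$, so averaging does not reduce its spread at all.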
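
The Cauchy case is easy to observe in simulation. Because the Cauchy has no variance, the sketch below measures spread with the interquartile range (IQR) of the sample mean and contrasts it with Gaussian data, where the IQR shrinks like $1/\sqrt{n}$. The helper name is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(8)
n_trials = 2_000

def iqr_of_means(sampler, n):
    """Interquartile range of n_trials sample means, each from n draws."""
    means = sampler((n_trials, n)).mean(axis=1)
    q25, q75 = np.quantile(means, [0.25, 0.75])
    return q75 - q25

iqr = {}
for n in (10, 1_000):
    iqr[("normal", n)] = iqr_of_means(rng.standard_normal, n)
    iqr[("cauchy", n)] = iqr_of_means(rng.standard_cauchy, n)
    print(f"n={n:>5}: IQR of mean, normal = {iqr[('normal', n)]:.3f}   "
          f"cauchy = {iqr[('cauchy', n)]:.3f}")
```

For the standard Cauchy the IQR of a single draw is 2, and the IQR of the mean stays near 2 however large $n$ becomes.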

In practice, this matters for:

Financial returns — often heavy-tailed, not well-modeled by Gaussians
Network traffic — can exhibit long-range dependence
Insurance claims — extreme events dominate

Applications in Machine Learning

SGD and the CLT

Stochastic Gradient Descent computes gradients on mini-batches. The mini-batch gradient is an average of individual gradients:

\hat{g} = \frac{1}{B} \sum_{i=1}^{B} \nabla \mathcal{L}_i(\theta)

By the CLT, when the $B$ examples are drawn independently, this average is approximately Gaussian around the true gradient with standard error $\propto 1/\sqrt{B}$. Larger batches give less noisy gradient estimates, but with diminishing returns: quadrupling $B$ only halves the noise.
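
The $1/\sqrt{B}$ scaling can be seen on a toy problem. The sketch below assumes a hypothetical one-dimensional least-squares loss $\mathcal{L}_i(\theta) = \tfrac{1}{2}(\theta x_i - y_i)^2$ on synthetic data, with mini-batches drawn with replacement so that the examples in a batch are independent. Here the "true gradient" is taken to be the full-dataset gradient.

```python
import numpy as np

rng = np.random.default_rng(9)

# Hypothetical setup: synthetic 1-D regression data, evaluated at theta = 0.
N = 10_000
x = rng.standard_normal(N)
y = 3.0 * x + rng.standard_normal(N)
theta = 0.0

per_example_grad = (theta * x - y) * x      # d L_i / d theta at the current theta
full_grad = per_example_grad.mean()         # gradient over the whole dataset

def minibatch_grad_std(batch_size, n_batches=5_000):
    """Std of the mini-batch gradient over many batches drawn with replacement."""
    idx = rng.integers(0, N, size=(n_batches, batch_size))
    return per_example_grad[idx].mean(axis=1).std(ddof=1)

noise = {B: minibatch_grad_std(B) for B in (16, 64, 256)}
for B, s in noise.items():
    print(f"B={B:>3}: std of mini-batch gradient = {s:.3f}")
print(f"full gradient: {full_grad:.3f}")
```

Each fourfold increase in batch size cuts the gradient noise by about half, so the cost per unit of noise reduction keeps rising.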

Asymptotic Normality of MLE

The CLT is the key ingredient in proving that MLE is asymptotically normal:

\hat{\theta}_{\text{MLE}} \approx \mathcal{N}\left(\theta_0, \frac{1}{n I(\theta_0)}\right)

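where $I(\theta_0)$ is the Fisher information of a single observation and $\theta_0$ is the true parameter. Under standard regularity conditions, this justifies the usual 95% confidence interval $\hat{\theta}_{\text{MLE}} \pm 1.96 / \sqrt{n I(\hat{\theta}_{\text{MLE}})}$.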
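
A minimal sketch of this result, assuming an Exponential model with rate $\lambda_0 = 2$ (an illustrative choice): the MLE of the rate is $1/\bar{X}_n$ and the Fisher information is $I(\lambda) = 1/\lambda^2$, so the predicted standard deviation of the MLE is $\lambda_0 / \sqrt{n}$.

```python
import numpy as np

rng = np.random.default_rng(10)
lam0 = 2.0                    # true rate of the Exponential distribution
n, n_trials = 500, 10_000

# MLE of the rate is 1 / sample mean; Fisher information is I(lam) = 1 / lam^2.
samples = rng.exponential(scale=1.0 / lam0, size=(n_trials, n))
lam_hat = 1.0 / samples.mean(axis=1)

predicted_sd = np.sqrt(1.0 / (n * (1.0 / lam0**2)))    # sqrt(1 / (n I(lam0)))
empirical_sd = lam_hat.std(ddof=1)

# Plug-in 95% interval: lam_hat +/- 1.96 / sqrt(n I(lam_hat)) = lam_hat +/- 1.96 lam_hat / sqrt(n).
half_width = 1.96 * lam_hat / np.sqrt(n)
coverage = np.mean(np.abs(lam_hat - lam0) <= half_width)

print(f"predicted SD of MLE: {predicted_sd:.4f}")
print(f"empirical SD of MLE: {empirical_sd:.4f}")
print(f"coverage of nominal 95% interval: {coverage:.3f}")
```

The interval uses the estimated Fisher information $I(\hat{\lambda})$ in place of the unknown $I(\lambda_0)$, which is what one would do in practice.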

Bootstrap

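The bootstrap uses the empirical distribution as a proxy for the population, which the Law of Large Numbers justifies for large samples. CLT-type arguments then show that, for statistics such as the mean, the distribution of the resampled statistic approximates the true sampling distribution. By resampling with replacement and recomputing the statistic, we can estimate standard errors and confidence intervals without assuming a parametric form for the data.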
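
A minimal sketch of the procedure for the sample mean, assuming 500 Exponential(1) observations as a stand-in for real data: resample with replacement, recompute the mean each time, and read the standard error and a percentile interval off the resampled means.

```python
import numpy as np

rng = np.random.default_rng(11)
data = rng.exponential(scale=1.0, size=500)    # stand-in for an observed sample
n_boot = 5_000

# Resample the data with replacement and recompute the statistic each time.
idx = rng.integers(0, data.size, size=(n_boot, data.size))
boot_means = data[idx].mean(axis=1)

boot_se = boot_means.std(ddof=1)                              # bootstrap standard error
formula_se = data.std(ddof=1) / np.sqrt(data.size)            # s / sqrt(n), for comparison
ci_low, ci_high = np.quantile(boot_means, [0.025, 0.975])     # percentile interval

print(f"bootstrap SE: {boot_se:.4f}")
print(f"s / sqrt(n):  {formula_se:.4f}")
print(f"95% percentile interval: [{ci_low:.3f}, {ci_high:.3f}]")
```

For the mean, the bootstrap reproduces what the formula already gives. Its value is that the same loop works for statistics such as the median or a correlation, where no simple standard-error formula is available.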

Summary

The Law of Large Numbers guarantees that sample means converge to the population mean
The Central Limit Theorem says sample means are approximately Gaussian for large $n$, whatever the original distribution, provided its variance is finite
Standard error shrinks as $1/\sqrt{n}$, so more data brings diminishing returns
The CLT justifies confidence intervals, hypothesis tests, MLE asymptotics, and the $1/\sqrt{B}$ noise scaling of mini-batch gradients in SGD
The CLT fails for heavy-tailed distributions with infinite variance
These results form the theoretical backbone of the estimation and testing frameworks that follow
