Convergence and the Central Limit Theorem

Why averages become Gaussian: the Law of Large Numbers, types of convergence, and the Central Limit Theorem explained.

Probability & Statistics March 6, 2026 7 min read

The Power of Averaging

Why does the Gaussian distribution appear everywhere in nature and statistics? Why do sample means become reliable estimators as we collect more data? The answers lie in two fundamental theorems that connect sample statistics to population parameters.

These theorems justify nearly everything in statistical inference and machine learning — from MLE consistency to why stochastic gradient descent works.

Types of Convergence

Before stating the theorems, we need to understand what it means for a sequence of random variables to “converge.”

Convergence in Probability

A sequence $X_1, X_2, \ldots$ converges in probability to $X$ if:

$$\lim_{n \to \infty} P(|X_n - X| > \epsilon) = 0 \quad \text{for all } \epsilon > 0$$

Informally: the probability of $X_n$ being far from $X$ shrinks to zero. We write $X_n \xrightarrow{P} X$.

Almost Sure Convergence

A stronger notion: $X_n \xrightarrow{\text{a.s.}} X$ if:

$$P\left(\lim_{n \to \infty} X_n = X\right) = 1$$

The sequence converges with probability 1: it’s not just that deviations become unlikely; for any fixed tolerance, deviations larger than that tolerance eventually stop happening entirely.

Convergence in Distribution

The weakest notion: $X_n \xrightarrow{d} X$ if the cumulative distribution functions converge:

$$\lim_{n \to \infty} F_{X_n}(x) = F_X(x) \quad \text{at all continuity points of } F_X$$

The shape of the distribution converges, but individual realizations need not.

Key insight: Almost sure $\Rightarrow$ in probability $\Rightarrow$ in distribution. Each type implies the next, but not the reverse.
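To see why the last implication cannot be reversed, here is a minimal sketch of a standard counterexample. The choice $X_n = -X$ with $X$ standard normal is an illustration, not something taken from the text above: $X_n$ has exactly the same distribution as $X$, yet it never gets close to $X$.

```python
import numpy as np

rng = np.random.default_rng(0)

# X is standard normal; define X_n = -X for every n (an illustrative choice).
x = rng.standard_normal(100_000)
x_n = -x

# Same distribution: the empirical quantiles of X_n and X nearly coincide.
qs = [0.1, 0.25, 0.5, 0.75, 0.9]
quantile_gap = np.max(np.abs(np.quantile(x_n, qs) - np.quantile(x, qs)))

# But not close in probability: |X_n - X| = 2|X| exceeds eps with a fixed probability.
eps = 0.5
prob_far = np.mean(np.abs(x_n - x) > eps)

print(f"max quantile gap: {quantile_gap:.3f}")
print(f"P(|X_n - X| > {eps}): {prob_far:.3f}")
```

The quantile gap should be close to zero while the deviation probability stays well away from zero, so the sequence converges in distribution but not in probability.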

The Law of Large Numbers

Weak Law (WLLN)

Let $X_1, X_2, \ldots, X_n$ be i.i.d. random variables with mean $\mu$ and finite variance $\sigma^2$. The sample mean:

$$\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i$$

converges in probability to the true mean:

$$\bar{X}_n \xrightarrow{P} \mu \quad \text{as } n \to \infty$$

Strong Law (SLLN)

Under the same conditions, the convergence also holds almost surely:

$$\bar{X}_n \xrightarrow{\text{a.s.}} \mu$$
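Almost sure convergence is a statement about individual sample paths, so one way to get a feel for it is to follow a single long sequence. The sketch below assumes Exponential draws with mean 2 (an arbitrary choice for illustration) and tracks the largest deviation of the running mean over the tail of one path.

```python
import numpy as np

rng = np.random.default_rng(1)

# One long sequence of Exponential draws with mean mu = 2 (illustrative choice).
mu = 2.0
draws = rng.exponential(scale=mu, size=200_000)

# Running sample mean after each new observation.
running_mean = np.cumsum(draws) / np.arange(1, draws.size + 1)

# Largest deviation from mu over the tail of this single path.
tail_devs = {}
for start in (100, 10_000, 100_000):
    tail_devs[start] = float(np.max(np.abs(running_mean[start:] - mu)))
    print(f"n >= {start:>7}: max |mean - mu| = {tail_devs[start]:.4f}")
```

A single simulated path cannot prove the theorem, but it shows the behaviour the SLLN describes: once the path has settled near $\mu$, it stays there.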

Proof Sketch (WLLN via Chebyshev)

Apply Chebyshev’s inequality, $P(|Y - \mathbb{E}[Y]| \geq k\sigma_Y) \leq \frac{1}{k^2}$, to $Y = \bar{X}_n$ with $k\sigma_Y = \epsilon$, using $\text{Var}(\bar{X}_n) = \sigma^2/n$ for i.i.d. variables:

$$\begin{aligned} P(|\bar{X}_n - \mu| \geq \epsilon) &\leq \frac{\text{Var}(\bar{X}_n)}{\epsilon^2} \\[6pt] &= \frac{\sigma^2 / n}{\epsilon^2} \\[6pt] &= \frac{\sigma^2}{n\epsilon^2} \to 0 \end{aligned}$$

As $n$ grows, the variance of $\bar{X}_n$ shrinks like $1/n$, so the sample mean concentrates around $\mu$.
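The Chebyshev bound can be checked numerically. This sketch assumes fair-die rolls and a tolerance of 0.25, both chosen only for illustration, and compares the simulated deviation probability with $\sigma^2 / (n\epsilon^2)$.

```python
import numpy as np

rng = np.random.default_rng(2)

# Fair die: mu = 3.5, sigma^2 = 35/12.
mu, sigma2 = 3.5, 35 / 12
eps = 0.25
n_experiments = 5_000

results = {}
for n in (50, 200, 800):
    rolls = rng.integers(1, 7, size=(n_experiments, n))  # upper bound is exclusive
    means = rolls.mean(axis=1)
    empirical = float(np.mean(np.abs(means - mu) >= eps))
    chebyshev = min(1.0, sigma2 / (n * eps**2))
    results[n] = (empirical, chebyshev)
    print(f"n={n:4d}  empirical={empirical:.4f}  Chebyshev bound={chebyshev:.4f}")
```

The bound is valid but loose: the simulated probabilities fall off much faster than $1/n$, which is typical for Chebyshev.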

Why LLN Matters for ML

The Law of Large Numbers is why:

  • Empirical risk (average loss over training data) approximates true risk (expected loss)
  • Monte Carlo estimates converge to true expectations
  • Cross-validation scores become more reliable as the amount of held-out data grows
  • MLE is consistent under standard regularity conditions: with enough data, it recovers the true parameters

The Central Limit Theorem

The CLT is arguably the most important theorem in all of statistics.

Statement

Let $X_1, X_2, \ldots, X_n$ be i.i.d. with mean $\mu$ and finite variance $\sigma^2$. Then:

$$\frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}} \xrightarrow{d} \mathcal{N}(0, 1) \quad \text{as } n \to \infty$$

Equivalently:

$$\bar{X}_n \approx \mathcal{N}\left(\mu, \frac{\sigma^2}{n}\right) \quad \text{for large } n$$

In words: regardless of the original distribution, the sample mean is approximately Gaussian for large $n$.
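As a rough numerical check of the statement, the sketch below standardizes sample means of a strongly skewed distribution and compares their empirical CDF with $\Phi$. Exponential(1), the sample size, and the evaluation points are all assumptions made for illustration.

```python
import math
import numpy as np

rng = np.random.default_rng(3)

def normal_cdf(z: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Exponential(1) is strongly skewed, with mu = 1 and sigma = 1.
mu, sigma = 1.0, 1.0
n, n_experiments = 200, 20_000

samples = rng.exponential(scale=1.0, size=(n_experiments, n))
z = (samples.mean(axis=1) - mu) / (sigma / math.sqrt(n))

# Compare the empirical CDF of the standardized means with Phi at a few points.
max_gap = 0.0
for point in (-2.0, -1.0, 0.0, 1.0, 2.0):
    empirical = float(np.mean(z <= point))
    max_gap = max(max_gap, abs(empirical - normal_cdf(point)))
    print(f"z={point:+.1f}  empirical={empirical:.3f}  Phi={normal_cdf(point):.3f}")
```

The two columns should agree to roughly two decimal places, even though individual Exponential draws look nothing like a Gaussian.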

What Makes This Remarkable

The original $X_i$ can be Bernoulli, Poisson, Exponential, Uniform, or any other distribution with finite variance. The CLT says that once you average enough of them, the result looks Gaussian.

This is why:

  • Measurement errors are approximately Gaussian (they’re sums of many small independent effects)
  • Test statistics (t-stat, z-stat) approximately follow known distributions in large samples
  • Confidence intervals work
  • The Gaussian assumption in ML models is often reasonable for aggregated data

Worked Example

Suppose we roll a fair die ($\mu = 3.5$, $\sigma^2 = 35/12 \approx 2.917$) and average $n = 100$ rolls. By the CLT:

$$\bar{X}_{100} \approx \mathcal{N}\left(3.5, \frac{2.917}{100}\right) = \mathcal{N}(3.5, 0.02917)$$

The standard deviation of $\bar{X}_{100}$ is $\sqrt{0.02917} \approx 0.171$. By the 68-95-99.7 rule, we expect the average to fall within two standard deviations of the mean, $3.5 \pm 2 \times 0.171$, that is, between $3.16$ and $3.84$, about 95% of the time.

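import numpy as np

n_experiments = 10000
n_rolls = 100
# Average n_rolls fair-die rolls per experiment; randint's upper bound (7) is exclusive
means = [np.mean(np.random.randint(1, 7, n_rolls)) for _ in range(n_experiments)]

# Compare with the CLT prediction: mean 3.5, standard deviation 0.171
print(f"Mean of means: {np.mean(means):.3f}")   # ≈ 3.500
print(f"Std of means:  {np.std(means):.3f}")     # ≈ 0.171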
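The 95% claim from the worked example can be checked in the same way. This is a small sketch using NumPy's `Generator` API with a fixed seed (an assumption for reproducibility); it counts how often the simulated average lands in the interval quoted above.

```python
import numpy as np

rng = np.random.default_rng(4)

n_experiments, n_rolls = 10_000, 100
rolls = rng.integers(1, 7, size=(n_experiments, n_rolls))  # upper bound is exclusive
means = rolls.mean(axis=1)

# Fraction of simulated averages inside the interval from the worked example.
coverage = float(np.mean((means >= 3.16) & (means <= 3.84)))
print(f"Fraction in [3.16, 3.84]: {coverage:.3f}")
```

The fraction should be close to 0.95, with small differences coming from simulation noise and from the average of die rolls being a discrete quantity.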

Rate of Convergence

The CLT gives an approximation, but how good is it? The Berry-Esseen theorem provides a bound:

$$\sup_x |F_n(x) - \Phi(x)| \leq \frac{C \cdot \rho}{\sigma^3 \sqrt{n}}$$

where $F_n$ is the CDF of the standardized sample mean, $\Phi$ is the standard normal CDF, $\rho = \mathbb{E}[|X - \mu|^3]$ is the third absolute central moment (assumed finite), and $C \leq 0.4748$. The convergence rate is $O(1/\sqrt{n})$.

Rule of thumb: The Gaussian approximation is usually reasonable for $n \geq 30$, though this varies with the skewness of the original distribution.
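To make the Berry-Esseen bound concrete, the sketch below evaluates its right-hand side for a fair die. The helper `berry_esseen_bound` is a hypothetical function written for this example, and it uses the constant 0.4748 quoted above.

```python
import math

def berry_esseen_bound(values, probs, n, c=0.4748):
    """Upper bound C * rho / (sigma^3 * sqrt(n)) for a discrete distribution."""
    mu = sum(v * p for v, p in zip(values, probs))
    sigma2 = sum((v - mu) ** 2 * p for v, p in zip(values, probs))
    rho = sum(abs(v - mu) ** 3 * p for v, p in zip(values, probs))
    return c * rho / (sigma2 ** 1.5 * math.sqrt(n))

die_values = [1, 2, 3, 4, 5, 6]
die_probs = [1 / 6] * 6

for n in (30, 100, 400):
    print(f"n={n:3d}  bound={berry_esseen_bound(die_values, die_probs, n):.4f}")
```

Because the bound scales as $1/\sqrt{n}$, quadrupling $n$ halves it. It is a worst-case guarantee, so the actual approximation error is often noticeably smaller.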

Standard Error

The CLT tells us that the uncertainty in a sample mean is:

$$\text{SE}(\bar{X}_n) = \frac{\sigma}{\sqrt{n}}$$

This is the standard error — the standard deviation of the sampling distribution of the mean. Key properties:

  • It shrinks as $1/\sqrt{n}$, not $1/n$. To halve the uncertainty, you need 4 times the data
  • In practice, we estimate it as $\widehat{\text{SE}} = s / \sqrt{n}$, where $s$ is the sample standard deviation
  • This directly determines the width of confidence intervals

Intuition: Collecting 10x more data doesn’t give 10x more precision, only $\sqrt{10} \approx 3.16$ times. These diminishing returns are fundamental to experimental design.
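The scaling of the standard error is easy to see in a short sketch. The data here are hypothetical Gamma draws, and `estimated_se` is a helper name introduced only for this example; it computes $s / \sqrt{n}$ with the sample standard deviation.

```python
import numpy as np

rng = np.random.default_rng(5)

def estimated_se(sample: np.ndarray) -> float:
    """Estimated standard error s / sqrt(n), with s the sample std (ddof=1)."""
    return float(np.std(sample, ddof=1) / np.sqrt(sample.size))

# Hypothetical data: draws from a Gamma distribution, chosen only for illustration.
data = rng.gamma(shape=2.0, scale=1.5, size=40_000)

se_by_n = {n: estimated_se(data[:n]) for n in (100, 400, 1_600, 6_400)}
for n, se in se_by_n.items():
    print(f"n={n:5d}  estimated SE={se:.4f}")
```

Each fourfold increase in $n$ should roughly halve the estimated standard error, up to the noise in $s$ itself.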

CLT for Sums

The CLT also applies to sums (not just averages):

$$S_n = \sum_{i=1}^{n} X_i \approx \mathcal{N}(n\mu, n\sigma^2)$$

This is why:

  • The Binomial approaches a Gaussian for large $n$ (sum of Bernoulli trials)
  • The Poisson approaches a Gaussian for large $\lambda$ (a Poisson($\lambda$) variable is a sum of many independent Poisson variables with smaller rates)
  • Total measurement error in instruments is approximately Gaussian
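The Binomial case can be checked exactly, since the Binomial CDF is computable in closed form. The sketch below compares it with the $\mathcal{N}(n\mu, n\sigma^2)$ approximation for $n = 100$, $p = 0.5$ (values chosen for illustration). It adds a continuity correction of 0.5, which is a common refinement and not something discussed in the text above.

```python
import math

def binomial_cdf(k: int, n: int, p: float) -> float:
    """Exact P(S_n <= k) for S_n ~ Binomial(n, p)."""
    return sum(math.comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k + 1))

def normal_cdf(z: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def normal_approx_cdf(k: int, n: int, p: float) -> float:
    """N(n*mu, n*sigma^2) approximation with mu = p and sigma^2 = p(1 - p).

    The +0.5 is a continuity correction, an addition not taken from the text.
    """
    mean, std = n * p, math.sqrt(n * p * (1 - p))
    return normal_cdf((k + 0.5 - mean) / std)

n, p = 100, 0.5
for k in (40, 50, 55, 60):
    print(f"k={k}  exact={binomial_cdf(k, n, p):.4f}  normal={normal_approx_cdf(k, n, p):.4f}")
```

For a symmetric case like this the two columns agree closely; the approximation tends to degrade when $p$ is near 0 or 1 and $n$ is small.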

Multivariate CLT

For i.i.d. random vectors $\mathbf{X}_i \in \mathbb{R}^d$ with mean $\boldsymbol{\mu}$ and finite covariance $\boldsymbol{\Sigma}$:

$$\sqrt{n}(\bar{\mathbf{X}}_n - \boldsymbol{\mu}) \xrightarrow{d} \mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma})$$

This justifies the use of multivariate Gaussian models for sample means of vector-valued data.
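A quick simulation can confirm that the covariance of the scaled mean vector matches $\boldsymbol{\Sigma}$. The construction below, two correlated coordinates built from independent Exponential(1) variables, is a hypothetical example chosen because its mean and covariance are easy to work out by hand.

```python
import numpy as np

rng = np.random.default_rng(6)

# Hypothetical non-Gaussian vectors: X = (U, U + V) with U, V ~ Exponential(1).
# Then mu = (1, 2) and Sigma = [[1, 1], [1, 2]].
mu = np.array([1.0, 2.0])
sigma = np.array([[1.0, 1.0], [1.0, 2.0]])

n, n_experiments = 200, 5_000
u = rng.exponential(size=(n_experiments, n))
v = rng.exponential(size=(n_experiments, n))
x = np.stack([u, u + v], axis=-1)            # shape (n_experiments, n, 2)

scaled = np.sqrt(n) * (x.mean(axis=1) - mu)  # sqrt(n) * (mean vector - mu)
empirical_cov = np.cov(scaled, rowvar=False)

print(np.round(empirical_cov, 2))
```

The printed matrix should be close to $\boldsymbol{\Sigma}$. This checks only the covariance; the Gaussian shape of `scaled` could be inspected separately, for example coordinate by coordinate.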

When the CLT Fails

The CLT requires finite variance. For distributions with infinite variance (heavy-tailed distributions like Cauchy or Pareto with $\alpha \leq 2$), the classical CLT does not apply. Suitably normalized sums can still converge, but typically to non-Gaussian stable distributions rather than to a Gaussian.

In practice, this matters for:

  • Financial returns: often heavy-tailed, not well-modeled by Gaussians
  • Network traffic: can be heavy-tailed and exhibit long-range dependence, which also breaks the independence assumption
  • Insurance claims: extreme events dominate
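The Cauchy case makes the failure easy to see, because the mean of $n$ standard Cauchy variables is again standard Cauchy. The sketch below measures spread with the interquartile range (a choice made here because the variance does not exist) and uses Gaussian means as a reference.

```python
import numpy as np

rng = np.random.default_rng(7)

def iqr(values: np.ndarray) -> float:
    """Interquartile range, a spread measure that exists even without a variance."""
    q25, q75 = np.quantile(values, [0.25, 0.75])
    return float(q75 - q25)

n_experiments = 4_000
spread = {}
for n in (10, 100, 1_000):
    cauchy_means = rng.standard_cauchy(size=(n_experiments, n)).mean(axis=1)
    normal_means = rng.standard_normal(size=(n_experiments, n)).mean(axis=1)
    spread[n] = (iqr(cauchy_means), iqr(normal_means))
    print(f"n={n:5d}  IQR Cauchy means={spread[n][0]:.3f}  IQR normal means={spread[n][1]:.3f}")
```

The spread of the Gaussian means shrinks like $1/\sqrt{n}$, while the spread of the Cauchy means stays roughly constant: averaging more data does not help.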

Applications in Machine Learning

SGD and the CLT

Stochastic Gradient Descent computes gradients on mini-batches. The mini-batch gradient is an average of individual gradients:

$$\hat{g} = \frac{1}{B} \sum_{i=1}^{B} \nabla \mathcal{L}_i(\theta)$$

When the mini-batch is sampled at random, the LLN says this average concentrates around the true gradient, and the CLT says the remaining noise is approximately Gaussian with standard error $\propto 1/\sqrt{B}$. Larger batches give less noisy gradient estimates, but with diminishing returns.
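The $1/\sqrt{B}$ scaling can be illustrated on a toy problem. Everything in this sketch is an assumption made for the example: a one-parameter linear model with squared-error loss, synthetic data, and mini-batches sampled with replacement.

```python
import numpy as np

rng = np.random.default_rng(8)

# Hypothetical setup: squared-error loss for a 1-D linear model y ~ theta * x,
# so the per-example gradient of 0.5 * (theta * x - y)**2 is (theta * x - y) * x.
n_examples = 50_000
x = rng.standard_normal(n_examples)
y = 3.0 * x + rng.standard_normal(n_examples)
theta = 1.0
per_example_grads = (theta * x - y) * x
full_grad = float(per_example_grads.mean())

noise = {}
for batch_size in (16, 64, 256):
    # 2,000 mini-batches sampled with replacement (an assumption for simplicity).
    idx = rng.integers(0, n_examples, size=(2_000, batch_size))
    batch_grads = per_example_grads[idx].mean(axis=1)
    noise[batch_size] = float(batch_grads.std())
    print(f"B={batch_size:3d}  mean={batch_grads.mean():+.3f}  std={noise[batch_size]:.3f}")

print(f"full-data gradient: {full_grad:+.3f}")
```

Each fourfold increase in batch size should roughly halve the gradient noise, while the average mini-batch gradient stays near the full-data gradient.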

Asymptotic Normality of MLE

The CLT is the key ingredient in proving that, under standard regularity conditions, the MLE is asymptotically normal:

$$\hat{\theta}_{\text{MLE}} \approx \mathcal{N}\left(\theta_0, \frac{1}{n I(\theta_0)}\right)$$

where $\theta_0$ is the true parameter and $I(\theta_0)$ is the Fisher information of a single observation. This is why MLE confidence intervals work.
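As a concrete check of the variance formula, consider an Exponential model with rate $\lambda$, an example chosen here for convenience. Its MLE is $1/\bar{X}_n$ and its per-observation Fisher information is $1/\lambda^2$, so the predicted standard deviation of the MLE is $\lambda/\sqrt{n}$.

```python
import numpy as np

rng = np.random.default_rng(9)

# Exponential model with rate lam: the MLE is 1 / sample mean, and the
# Fisher information per observation is I(lam) = 1 / lam**2.
true_rate = 2.0
n, n_experiments = 400, 5_000

samples = rng.exponential(scale=1.0 / true_rate, size=(n_experiments, n))
mle = 1.0 / samples.mean(axis=1)

fisher_info = 1.0 / true_rate**2
predicted_std = float(np.sqrt(1.0 / (n * fisher_info)))  # sqrt(1 / (n * I(theta_0)))

print(f"mean of MLEs: {mle.mean():.3f}  (true rate {true_rate})")
print(f"std of MLEs:  {mle.std():.4f}  (asymptotic prediction {predicted_std:.4f})")
```

The simulated standard deviation should sit close to the asymptotic prediction, and the estimates should center near the true rate with a small finite-sample bias.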

Bootstrap

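The bootstrap uses the empirical distribution as a proxy for the population, which is reasonable because the empirical distribution converges to the true one as $n$ grows (an LLN-type result). CLT-type arguments are then commonly used to show that bootstrap standard errors and confidence intervals are asymptotically valid. By resampling with replacement and recomputing the statistic, we can estimate standard errors and confidence intervals without parametric distributional assumptions.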
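A minimal sketch of the bootstrap for the sample mean follows. The lognormal sample stands in for observed data and is purely hypothetical, and the percentile interval is one of several bootstrap interval constructions.

```python
import numpy as np

rng = np.random.default_rng(10)

# Hypothetical skewed sample; in practice this would be the observed data.
data = rng.lognormal(mean=0.0, sigma=0.5, size=500)

# Resample with replacement and recompute the statistic (here, the mean).
n_boot = 2_000
idx = rng.integers(0, data.size, size=(n_boot, data.size))
boot_means = data[idx].mean(axis=1)

bootstrap_se = float(boot_means.std(ddof=1))
formula_se = float(data.std(ddof=1) / np.sqrt(data.size))
ci_low, ci_high = np.quantile(boot_means, [0.025, 0.975])  # percentile interval

print(f"bootstrap SE: {bootstrap_se:.4f}   s/sqrt(n): {formula_se:.4f}")
print(f"95% percentile interval for the mean: [{ci_low:.3f}, {ci_high:.3f}]")
```

For the mean, the bootstrap standard error should nearly match $s/\sqrt{n}$. The method earns its keep for statistics such as medians or ratios, where no simple formula is available.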

Summary

  • The Law of Large Numbers guarantees that sample means converge to the population mean
  • The Central Limit Theorem says sample means are approximately Gaussian for large $n$, regardless of the original distribution, provided the variance is finite
  • Standard error shrinks as $1/\sqrt{n}$, so more data brings diminishing returns
  • The CLT justifies confidence intervals, hypothesis tests, MLE asymptotics, and the $1/\sqrt{B}$ noise scaling of mini-batch gradients in SGD
  • The CLT fails for heavy-tailed distributions with infinite variance
  • These results form the theoretical backbone of the estimation and testing frameworks that follow

