The Power of Averaging
Why does the Gaussian distribution appear everywhere in nature and statistics? Why do sample means become reliable estimators as we collect more data? The answers lie in two fundamental theorems that connect sample statistics to population parameters.
These theorems justify nearly everything in statistical inference and machine learning — from MLE consistency to why stochastic gradient descent works.
Types of Convergence
Before stating the theorems, we need to understand what it means for a sequence of random variables to “converge.”
Convergence in Probability
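A sequence $X_1, X_2, \ldots$ converges in probability to $X$ if, for every $\epsilon > 0$:
$$ \lim_{n \to \infty} P(|X_n - X| > \epsilon) = 0 $$
Informally: the probability of $X_n$ being far from $X$ shrinks to zero. We write $X_n \xrightarrow{p} X$.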
Almost Sure Convergence
A stronger notion: $X_n \xrightarrow{a.s.} X$ if:
$$ P\left(\lim_{n \to \infty} X_n = X\right) = 1 $$
The sequence converges with probability 1 — it’s not just that deviations become unlikely, they eventually stop happening entirely.
Convergence in Distribution
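The weakest notion: $X_n \xrightarrow{d} X$ if the cumulative distribution functions converge at every point $x$ where $F_X$ is continuous:
$$ \lim_{n \to \infty} F_{X_n}(x) = F_X(x) $$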
The shape of the distribution converges, but individual realizations need not.
Key insight: Almost sure $\Rightarrow$ In probability $\Rightarrow$ In distribution. Each type implies the next, but not the reverse.
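A minimal sketch of why the last implication cannot be reversed, assuming a standard normal $X$ and the hypothetical sequence $X_n = -X$ (this construction is an illustration, not part of the text above). Every $X_n$ has exactly the same distribution as $X$, so convergence in distribution holds trivially, yet $|X_n - X| = 2|X|$ never shrinks.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative construction (an assumption): X ~ N(0, 1) and X_n = -X for every n.
x = rng.standard_normal(100_000)
x_n = -x

# Same distribution: by symmetry the quantiles of X_n match those of X.
probs = [0.1, 0.25, 0.5, 0.75, 0.9]
quantile_gap = np.max(np.abs(np.quantile(x, probs) - np.quantile(x_n, probs)))

# No convergence in probability: P(|X_n - X| > 0.5) = P(|X| > 0.25), which is about 0.8.
frac_far = np.mean(np.abs(x_n - x) > 0.5)

print(f"max quantile gap between X and X_n: {quantile_gap:.3f}")
print(f"estimated P(|X_n - X| > 0.5): {frac_far:.3f}")
```

The quantile gap is only sampling noise, while the fraction of far-apart pairs stays large no matter how far along the sequence we go.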
The Law of Large Numbers
Weak Law (WLLN)
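Let $X_1, \ldots, X_n$ be i.i.d. random variables with mean $\mu$ and finite variance $\sigma^2$. The sample mean:
$$ \bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i $$
converges in probability to the true mean:
$$ \bar{X}_n \xrightarrow{p} \mu $$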
Strong Law (SLLN)
Under the same conditions, the convergence also holds almost surely:
$$ \bar{X}_n \xrightarrow{a.s.} \mu $$
Proof Sketch (WLLN via Chebyshev)
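Using Chebyshev’s inequality, for any $\epsilon > 0$:
$$ P(|\bar{X}_n - \mu| > \epsilon) \leq \frac{\mathrm{Var}(\bar{X}_n)}{\epsilon^2} = \frac{\sigma^2}{n \epsilon^2} \to 0 \quad \text{as } n \to \infty $$
As $n$ grows, the variance of $\bar{X}_n$ shrinks like $1/n$, so the sample mean concentrates around $\mu$.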
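The Chebyshev argument can be checked numerically. The sketch below assumes Uniform(0, 1) data and a tolerance of 0.05 (both illustrative choices) and compares the simulated probability that the sample mean misses the true mean by more than the tolerance with the bound $\sigma^2 / (n \epsilon^2)$, capped at 1.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma2 = 0.5, 1.0 / 12.0   # mean and variance of Uniform(0, 1)
eps = 0.05                      # illustrative tolerance
n_experiments = 2_000

def tail_and_bound(n):
    """Return (simulated P(|mean - mu| > eps), Chebyshev bound) for sample size n."""
    sample_means = rng.random((n_experiments, n)).mean(axis=1)
    empirical = float(np.mean(np.abs(sample_means - mu) > eps))
    bound = min(1.0, sigma2 / (n * eps**2))
    return empirical, bound

for n in [10, 100, 1000]:
    empirical, bound = tail_and_bound(n)
    print(f"n={n:5d}  simulated tail={empirical:.4f}  Chebyshev bound={bound:.4f}")
```

The bound is loose, but it is enough for the proof: it goes to zero as $n$ grows, and the simulated tail probability sits below it.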
Why LLN Matters for ML
The Law of Large Numbers is why:
- Empirical risk (average loss over training data) approximates true risk (expected loss)
- Monte Carlo estimates converge to true expectations
- Cross-validation scores become more reliable as the amount of held-out data grows
- MLE is consistent — with enough data, it finds the true parameters
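As a small illustration of the Monte Carlo point, the sketch below estimates $E[X^2]$ for $X \sim \mathcal{N}(0, 1)$, whose true value is 1, by averaging more and more samples. The target quantity and the function name are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def monte_carlo_second_moment(n):
    """Estimate E[X^2] for X ~ N(0, 1) by averaging n samples; the true value is 1."""
    x = rng.standard_normal(n)
    return float(np.mean(x**2))

for n in [100, 10_000, 1_000_000]:
    print(f"n={n:>9,d}  estimate={monte_carlo_second_moment(n):.4f}")
```

Empirical risk works the same way: replace `x**2` with a per-example loss and the average approximates the expected loss.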
The Central Limit Theorem
The CLT is arguably the most important theorem in all of statistics.
Statement
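Let $X_1, \ldots, X_n$ be i.i.d. with mean $\mu$ and finite variance $\sigma^2$. Then:
$$ \sqrt{n}\,(\bar{X}_n - \mu) \xrightarrow{d} \mathcal{N}(0, \sigma^2) $$
Equivalently:
$$ \frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}} \xrightarrow{d} \mathcal{N}(0, 1) $$
In words: regardless of the original distribution, the sample mean is approximately Gaussian for large $n$.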
What Makes This Remarkable
The original $X_i$ can be Bernoulli, Poisson, Exponential, Uniform, or any other distribution with finite variance. The CLT says that once you average enough of them, the result looks Gaussian.
This is why:
- Measurement errors are approximately Gaussian (they’re sums of many small independent effects)
- Test statistics (t-stat, z-stat) follow known distributions
- Confidence intervals work
- The Gaussian assumption in ML models is often reasonable for aggregated data
Worked Example
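Suppose we roll a fair die ($\mu = 3.5$, $\sigma^2 = 35/12 \approx 2.92$) and average $n = 100$ rolls. By the CLT:
$$ \bar{X}_{100} \approx \mathcal{N}\left(3.5, \frac{2.92}{100}\right) $$
The standard deviation of $\bar{X}_{100}$ is $\sqrt{2.92 / 100} \approx 0.171$. By the 68-95-99.7 rule, we expect the average to fall between $3.5 - 2(0.171) \approx 3.16$ and $3.5 + 2(0.171) \approx 3.84$ about 95% of the time.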
import numpy as np
n_experiments = 10000
n_rolls = 100
means = [np.mean(np.random.randint(1, 7, n_rolls)) for _ in range(n_experiments)]
print(f"Mean of means: {np.mean(means):.3f}") # ≈ 3.500
print(f"Std of means: {np.std(means):.3f}") # ≈ 0.171
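As a hedged follow-up to the simulation, the sketch below checks the two-standard-deviation interval from the worked example empirically. The variable names and the seed are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n_experiments, n_rolls = 10_000, 100

mu = 3.5
sigma = np.sqrt(35 / 12)                 # standard deviation of one fair die roll
se = sigma / np.sqrt(n_rolls)            # standard deviation of the average
lower, upper = mu - 2 * se, mu + 2 * se

rolls = rng.integers(1, 7, size=(n_experiments, n_rolls))  # upper bound 7 is exclusive
means = rolls.mean(axis=1)
coverage = float(np.mean((means >= lower) & (means <= upper)))

print(f"interval: [{lower:.2f}, {upper:.2f}]")
print(f"fraction of averages inside: {coverage:.3f}")
```

The fraction should land close to 95%, though not exactly on it, since the Gaussian is only an approximation to the discrete distribution of the average.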
Rate of Convergence
The CLT gives an approximation, but how good is it? The Berry-Esseen theorem provides a bound:
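$$ \sup_x \left| F_n(x) - \Phi(x) \right| \leq \frac{C \rho}{\sigma^3 \sqrt{n}} $$
where $F_n$ is the CDF of the standardized sample mean, $\Phi$ is the standard normal CDF, $\rho = E[|X_1 - \mu|^3]$ is the third absolute central moment, and $C$ is a universal constant (the best known bounds put it below $0.5$). The convergence rate is $O(1/\sqrt{n})$.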
Rule of thumb: The Gaussian approximation is usually reasonable for $n \geq 30$, though this varies with the skewness of the original distribution.
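To make the Berry-Esseen bound concrete, the sketch below evaluates $C \rho / (\sigma^3 \sqrt{n})$ for Bernoulli($p$) data and compares it with the exact worst-case gap between the standardized Binomial CDF and the normal CDF. The constant `C = 0.4748` is an assumption (a published upper bound for the i.i.d. case), and the function names are hypothetical.

```python
import math

C = 0.4748  # assumed constant: a published upper bound for the i.i.d. case

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def berry_esseen_bound(p, n):
    """Right-hand side C * rho / (sigma^3 * sqrt(n)) for Bernoulli(p) summands."""
    q = 1.0 - p
    sigma = math.sqrt(p * q)
    rho = p * q * (p**2 + q**2)          # E|X - p|^3 for X ~ Bernoulli(p)
    return C * rho / (sigma**3 * math.sqrt(n))

def sup_cdf_distance(p, n):
    """Exact sup_x |F_n(x) - Phi(x)| for the standardized Binomial(n, p) sum."""
    sigma = math.sqrt(p * (1.0 - p))
    cdf_before, worst = 0.0, 0.0
    for k in range(n + 1):
        z = (k - n * p) / (sigma * math.sqrt(n))
        cdf_after = cdf_before + math.comb(n, k) * p**k * (1.0 - p) ** (n - k)
        # F_n is a step function, so the largest gap occurs at a jump point,
        # either just before or just after the jump.
        worst = max(worst, abs(cdf_before - normal_cdf(z)), abs(cdf_after - normal_cdf(z)))
        cdf_before = cdf_after
    return worst

for p, n in [(0.5, 30), (0.1, 30), (0.1, 300)]:
    print(f"p={p}, n={n}: gap={sup_cdf_distance(p, n):.4f}, bound={berry_esseen_bound(p, n):.4f}")
```

Skewed inputs (small `p`) have a larger ratio $\rho / \sigma^3$, which is the quantitative version of the caveat about skewness in the rule of thumb.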
Standard Error
The CLT tells us that the uncertainty in a sample mean is:
$$ \mathrm{SE}(\bar{X}_n) = \frac{\sigma}{\sqrt{n}} $$
This is the standard error — the standard deviation of the sampling distribution of the mean. Key properties:
- It shrinks as $1/\sqrt{n}$, not $1/n$. To halve the uncertainty, you need 4 times the data
- In practice, we estimate it as $s / \sqrt{n}$, where $s$ is the sample standard deviation
- This directly determines the width of confidence intervals
Intuition: Collecting 10x more data doesn’t give 10x more precision, only $\sqrt{10} \approx 3.16$ times more. These diminishing returns are fundamental to experimental design.
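A minimal sketch of the estimated standard error, assuming normally distributed data with standard deviation 2 (an illustrative choice). The helper `standard_error` is a hypothetical name, not a library function.

```python
import numpy as np

def standard_error(sample):
    """Estimated standard error of the mean: s / sqrt(n), with s computed using ddof=1."""
    sample = np.asarray(sample, dtype=float)
    return float(sample.std(ddof=1) / np.sqrt(sample.size))

rng = np.random.default_rng(0)
data = rng.normal(loc=10.0, scale=2.0, size=40_000)  # illustrative population, sigma = 2

for n in [100, 400, 1_600, 10_000]:
    print(f"n={n:6d}  estimated SE={standard_error(data[:n]):.4f}  theory={2.0 / np.sqrt(n):.4f}")
```

Going from 100 to 400 to 1,600 observations roughly halves the standard error each time, which is the "4 times the data" rule in action.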
CLT for Sums
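The CLT also applies to sums (not just averages). With $S_n = \sum_{i=1}^{n} X_i$:
$$ \frac{S_n - n\mu}{\sigma \sqrt{n}} \xrightarrow{d} \mathcal{N}(0, 1) $$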
This is why:
- The Binomial$(n, p)$ approaches a Gaussian for large $n$ (sum of Bernoulli trials)
- The Poisson$(\lambda)$ approaches a Gaussian for large $\lambda$ (sum of rare events)
- Total measurement error in instruments is approximately Gaussian
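The Binomial case can be checked exactly. The sketch below compares an exact Binomial probability with its Gaussian approximation, using mean $np$ and variance $np(1-p)$. The values of `n`, `p`, and `k` are illustrative, and the `+ 0.5` continuity correction is a standard refinement that the text above does not discuss.

```python
import math

def binomial_cdf(k, n, p):
    """Exact P(S <= k) for S ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

n, p, k = 100, 0.3, 35                         # illustrative values
mean, sd = n * p, math.sqrt(n * p * (1 - p))   # n * mu and sqrt(n) * sigma for Bernoulli(p)

exact = binomial_cdf(k, n, p)
approx = normal_cdf((k + 0.5 - mean) / sd)     # + 0.5 is the continuity correction
print(f"exact P(S <= {k}) = {exact:.4f}, normal approximation = {approx:.4f}")
```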
Multivariate CLT
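For i.i.d. random vectors $\mathbf{X}_1, \ldots, \mathbf{X}_n$ with mean $\boldsymbol{\mu}$ and covariance $\Sigma$:
$$ \sqrt{n}\,(\bar{\mathbf{X}}_n - \boldsymbol{\mu}) \xrightarrow{d} \mathcal{N}(\mathbf{0}, \Sigma) $$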
This justifies the use of multivariate Gaussian models for sample means of vector-valued data.
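A simulation sketch of the multivariate statement, assuming a hypothetical non-Gaussian vector $\mathbf{X} = (E_1, E_1 + E_2)$ with $E_1, E_2$ independent Exponential(1) variables. Its mean is $(1, 2)$ and its covariance is $\begin{pmatrix} 1 & 1 \\ 1 & 2 \end{pmatrix}$, so the covariance of the sample mean scaled by $\sqrt{n}$ should be close to that matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_experiments = 200, 5_000

# Illustrative non-Gaussian vectors: X = (E1, E1 + E2) with E1, E2 ~ Exponential(1).
e = rng.exponential(1.0, size=(n_experiments, n, 2))
x = np.stack([e[..., 0], e[..., 0] + e[..., 1]], axis=-1)

mu = np.array([1.0, 2.0])
sigma = np.array([[1.0, 1.0], [1.0, 2.0]])

scaled = np.sqrt(n) * (x.mean(axis=1) - mu)   # one scaled sample mean per experiment
empirical_cov = np.cov(scaled, rowvar=False)  # rows are experiments, columns are coordinates
print(np.round(empirical_cov, 2))
```

The off-diagonal entry shows that the Gaussian limit keeps the correlation structure of the original vectors, not just the marginal variances.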
When the CLT Fails
The CLT requires finite variance. For heavy-tailed distributions with infinite variance (such as the Cauchy, or the Pareto with shape $\alpha < 2$), the CLT does not apply. Averages do not converge to a Gaussian; suitably rescaled sums converge to non-Gaussian stable distributions instead.
In practice, this matters for:
- Financial returns — often heavy-tailed, not well-modeled by Gaussians
- Network traffic — can exhibit long-range dependence
- Insurance claims — extreme events dominate
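The failure is easy to see in simulation. The sketch below compares the spread of sample means for standard Cauchy and standard normal data, measured by the interquartile range (IQR) since the Cauchy has no variance. The helper name and the sample sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_experiments = 2_000

def iqr_of_sample_means(draw, n):
    """Interquartile range of n_experiments sample means, each computed over n draws."""
    means = draw((n_experiments, n)).mean(axis=1)
    q25, q75 = np.percentile(means, [25, 75])
    return float(q75 - q25)

for n in [10, 1000]:
    cauchy_iqr = iqr_of_sample_means(rng.standard_cauchy, n)
    normal_iqr = iqr_of_sample_means(rng.standard_normal, n)
    print(f"n={n:5d}  Cauchy IQR={cauchy_iqr:.2f}  Normal IQR={normal_iqr:.3f}")
```

The normal IQR shrinks like $1/\sqrt{n}$, while the Cauchy IQR stays near 2: the average of $n$ standard Cauchy variables is itself standard Cauchy, so averaging buys nothing.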
Applications in Machine Learning
SGD and the CLT
Stochastic Gradient Descent computes gradients on mini-batches. The mini-batch gradient is an average of individual gradients:
$$ \hat{g}_B = \frac{1}{B} \sum_{i=1}^{B} \nabla_\theta \ell_i(\theta) $$
By the CLT, this average concentrates around the true gradient with standard error $O(1/\sqrt{B})$, where $B$ is the batch size. Larger batches give less noisy gradient estimates, but with diminishing returns.
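A minimal sketch of the batch-size effect, assuming a synthetic linear-regression problem with squared loss (the data, the parameter values, and all names are illustrative). It measures how much one coordinate of the mini-batch gradient fluctuates across random batches of different sizes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear-regression data (all names and sizes here are illustrative).
N, d = 10_000, 3
X = rng.standard_normal((N, d))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.standard_normal(N)
theta = np.zeros(d)

# Per-example gradients of the squared loss 0.5 * (x . theta - y)^2.
per_example_grads = (X @ theta - y)[:, None] * X

def minibatch_gradient_std(batch_size, n_batches=4_000):
    """Std of the first gradient coordinate across randomly drawn mini-batches."""
    idx = rng.integers(0, N, size=(n_batches, batch_size))  # sample with replacement
    batch_grads = per_example_grads[idx].mean(axis=1)        # shape (n_batches, d)
    return float(batch_grads[:, 0].std())

for B in [16, 64, 256]:
    print(f"B={B:4d}  std of mini-batch gradient = {minibatch_gradient_std(B):.4f}")
```

Each fourfold increase in batch size should roughly halve the noise, so the cost per unit of noise reduction keeps rising.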
Asymptotic Normality of MLE
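The CLT is the key ingredient in proving that MLE is asymptotically normal:
$$ \sqrt{n}\,(\hat{\theta}_n - \theta_0) \xrightarrow{d} \mathcal{N}\left(0, I(\theta_0)^{-1}\right) $$
where $I(\theta_0)$ is the Fisher information. This is why MLE confidence intervals work.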
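As an illustration, consider the rate of an exponential distribution, a case chosen here as an assumption because everything is available in closed form: the MLE is $\hat{\lambda} = 1 / \bar{X}_n$ and the Fisher information is $I(\lambda) = 1/\lambda^2$, so $\sqrt{n}(\hat{\lambda} - \lambda)$ should have standard deviation close to $\lambda$.

```python
import numpy as np

rng = np.random.default_rng(0)
true_rate = 2.0                      # illustrative parameter value
n, n_experiments = 500, 5_000

# NumPy parameterizes the exponential by scale = 1 / rate.
samples = rng.exponential(scale=1.0 / true_rate, size=(n_experiments, n))
rate_mle = 1.0 / samples.mean(axis=1)             # MLE of the rate in each experiment

scaled_error = np.sqrt(n) * (rate_mle - true_rate)
predicted_std = true_rate                          # sqrt(1 / I(rate)) with I(rate) = 1 / rate^2

print(f"empirical std: {scaled_error.std():.3f}, predicted: {predicted_std:.3f}")
```

A histogram of `scaled_error` would also look approximately Gaussian, which is what licenses the usual intervals of the form estimate plus or minus a multiple of the standard error.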
Bootstrap
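The bootstrap uses the empirical distribution as a proxy for the population, which is justified by LLN-type results: as $n$ grows, the empirical distribution converges to the true one. CLT-type arguments then show that the bootstrap distribution of a statistic such as the mean approximates its sampling distribution. By resampling with replacement and recomputing the statistic, we can estimate standard errors and confidence intervals without parametric distributional assumptions.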
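A minimal bootstrap sketch for the standard error of the mean, assuming a skewed exponential sample of size 200 and 2,000 resamples (both illustrative). For the mean there is a closed-form answer, $s / \sqrt{n}$, so the two estimates can be compared directly.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.exponential(scale=1.0, size=200)   # illustrative skewed sample
n_resamples = 2_000

# Resample with replacement and recompute the mean each time.
idx = rng.integers(0, data.size, size=(n_resamples, data.size))
bootstrap_means = data[idx].mean(axis=1)

bootstrap_se = float(bootstrap_means.std(ddof=1))
formula_se = float(data.std(ddof=1) / np.sqrt(data.size))
print(f"bootstrap SE: {bootstrap_se:.4f}, s / sqrt(n): {formula_se:.4f}")
```

The appeal of the bootstrap is that the same resampling loop works for statistics with no simple standard-error formula, such as a median or a correlation.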
Summary
- The Law of Large Numbers guarantees that sample means converge to the population mean
- The Central Limit Theorem says sample means are approximately Gaussian, regardless of the original distribution
- Standard error shrinks as $1/\sqrt{n}$, so more data brings diminishing returns
- The CLT justifies confidence intervals, hypothesis tests, MLE asymptotics, and SGD convergence
- The CLT fails for heavy-tailed distributions with infinite variance
- These results form the theoretical backbone of the estimation and testing frameworks that follow