When Assumptions Break
Everything we’ve covered so far makes assumptions about the data: it’s Gaussian, it comes from an exponential family, it has finite variance. But real-world data often violates these assumptions — it may be skewed, heavy-tailed, multimodal, or just not fit any standard distribution.
Nonparametric methods make minimal assumptions about the underlying data distribution. Instead of fitting a fixed number of parameters, they let the data speak for itself. The model complexity grows with the amount of data.
Kernel Density Estimation (KDE)
The Problem
Given samples $x_1, \dots, x_n$ drawn from an unknown distribution, estimate the underlying probability density function $f(x)$ without assuming any parametric form.
The Histogram Approach (and Its Limitations)
A histogram is the simplest density estimate, but it has problems:
- Bin edges are arbitrary — shifting bins changes the shape
- Discontinuous — the estimated density jumps at bin boundaries
- Bin width tradeoff: too narrow is noisy, too wide is over-smoothed and hides real structure
The KDE Solution
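Place a smooth kernel function at each data point and average:
$$\hat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right)$$
where:
- $K$ is a kernel function (typically a Gaussian: $K(u) = \frac{1}{\sqrt{2\pi}} e^{-u^2/2}$)
- $h > 0$ is the bandwidth, which controls smoothness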
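The estimator is short enough to write out directly. The following is a minimal NumPy sketch, assuming a Gaussian kernel and a hand-picked bandwidth; the names `gaussian_kernel` and `kde_estimate` are illustrative and not part of any library.

```python
import numpy as np

def gaussian_kernel(u):
    """Standard normal density, used here as the kernel K."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def kde_estimate(x_grid, samples, h):
    """Evaluate (1 / (n h)) * sum_i K((x - x_i) / h) at each point of x_grid."""
    x_grid = np.asarray(x_grid, dtype=float)
    samples = np.asarray(samples, dtype=float)
    # Scaled distance from every grid point to every sample: shape (grid, n)
    u = (x_grid[:, None] - samples[None, :]) / h
    return gaussian_kernel(u).sum(axis=1) / (len(samples) * h)

rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=200)
x_grid = np.linspace(-6.0, 6.0, 1201)
density = kde_estimate(x_grid, samples, h=0.4)  # h=0.4 is an arbitrary choice

# Sanity check: the estimate should integrate to roughly 1 over the grid
area = np.sum(density) * (x_grid[1] - x_grid[0])
print(f"area under the estimate: {area:.3f}")
```

Because each kernel is itself a density, the average is non-negative and integrates to one, which is what the final check confirms numerically.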
Bandwidth Selection
The bandwidth $h$ is the critical parameter:
| Bandwidth | Result |
|---|---|
| Too small | Overfits: spiky, one peak per data point |
| Too large | Underfits: over-smoothed, loses structure |
| Just right | Captures the true density shape |
Principled selection methods:
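- Silverman’s rule: $h = 1.06\,\hat{\sigma}\,n^{-1/5}$, which is optimal for Gaussian data with a Gaussian kernel
- Scott’s rule: $h = \hat{\sigma}\,n^{-1/(d+4)}$ (so $n^{-1/5}$ in one dimension), which gives similar results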
- Cross-validation: Maximize held-out log-likelihood
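```python
import numpy as np
from scipy.stats import gaussian_kde

# Bimodal sample: 30% from N(-2, 0.5^2), 70% from N(2, 1^2)
data = np.concatenate([np.random.normal(-2, 0.5, 300),
                       np.random.normal(2, 1, 700)])

# Fit a Gaussian KDE using Silverman's rule-of-thumb bandwidth
kde = gaussian_kde(data, bw_method='silverman')

# Evaluate the estimated density on a regular grid
x_grid = np.linspace(-5, 6, 1000)
density = kde(x_grid)
```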
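The cross-validation option from the list above can be sketched as follows. This is one possible approach, assuming a Gaussian kernel and leave-one-out log-likelihood as the held-out score; `loo_log_likelihood` is a hypothetical helper and the candidate grid is arbitrary.

```python
import numpy as np

def loo_log_likelihood(samples, h):
    """Mean leave-one-out log-density of a Gaussian KDE with bandwidth h."""
    x = np.asarray(samples, dtype=float)
    n = len(x)
    u = (x[:, None] - x[None, :]) / h
    k = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    np.fill_diagonal(k, 0.0)                      # exclude each point's own kernel
    loo_density = k.sum(axis=1) / ((n - 1) * h)
    return np.mean(np.log(loo_density + 1e-300))  # small constant guards log(0)

rng = np.random.default_rng(0)
samples = np.concatenate([rng.normal(-2, 0.5, 150), rng.normal(2, 1.0, 350)])

candidates = np.logspace(-2, 0.5, 40)             # bandwidths from 0.01 to about 3.16
scores = np.array([loo_log_likelihood(samples, h) for h in candidates])
h_cv = candidates[np.argmax(scores)]

# Normal-reference rule of thumb, for comparison
h_rot = 1.06 * samples.std(ddof=1) * len(samples) ** (-1 / 5)
print(f"cross-validated h = {h_cv:.3f}, rule-of-thumb h = {h_rot:.3f}")
```

On multimodal data like this, rule-of-thumb bandwidths are computed from the overall standard deviation and can over-smooth, which is the usual reason to prefer a data-driven choice.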
Multivariate KDE
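KDE extends to $d$ dimensions using a multivariate kernel:
$$\hat{f}_{\mathbf{H}}(\mathbf{x}) = \frac{1}{n} \sum_{i=1}^{n} |\mathbf{H}|^{-1/2} K\!\left(\mathbf{H}^{-1/2}(\mathbf{x} - \mathbf{x}_i)\right)$$
where $\mathbf{H}$ is a symmetric positive-definite bandwidth matrix. In practice, this suffers from the curse of dimensionality: the number of samples needed for a given accuracy grows exponentially with $d$.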
In ML: KDE is used in anomaly detection (low density = anomaly), data visualization, and as a building block in nonparametric classifiers.
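As a rough illustration of the anomaly-detection use, the sketch below flags points whose estimated density falls under a threshold. The data are synthetic, and the threshold (the 1st percentile of the training densities) is an arbitrary assumption rather than a recommended setting.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=500)      # synthetic "normal" behaviour
kde = gaussian_kde(train)                   # SciPy's default bandwidth rule

# Assumed threshold: the 1st percentile of densities seen on the training data
threshold = np.percentile(kde(train), 1)

new_points = np.array([0.1, -0.5, 6.0])
is_anomaly = kde(new_points) < threshold    # low density = anomaly
print(is_anomaly)
```

Only the point at 6.0, far outside the training range, lands in a low-density region.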
Nonparametric Hypothesis Tests
When the assumptions of parametric tests (normality, equal variances) are violated, nonparametric alternatives based on ranks are robust.
The Rank Approach
Instead of using raw values, convert data to ranks (1st smallest, 2nd smallest, …). Ranks are invariant to monotonic transformations and robust to outliers.
Wilcoxon Signed-Rank Test
The nonparametric alternative to the paired t-test.
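Setup: Paired observations $(x_i, y_i)$, $i = 1, \dots, n$, testing whether the median difference is zero.
- Compute differences $d_i = x_i - y_i$
- Rank the absolute differences $|d_i|$ from smallest to largest, giving ranks $R_i$
- Sum the ranks of positive differences: $W^+ = \sum_{i:\, d_i > 0} R_i$
- Compare $W^+$ to its null distribution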
In ML: Comparing two models when performance metrics are non-normal (e.g., skewed accuracy distributions across datasets).
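The steps can be traced by hand and checked against SciPy. The accuracies below are made-up numbers for two hypothetical models evaluated on the same ten datasets; only `rankdata` and `wilcoxon` are real library calls.

```python
import numpy as np
from scipy.stats import rankdata, wilcoxon

# Hypothetical accuracies (%) of two models on the same 10 datasets
model_a = np.array([81.0, 74.0, 90.0, 66.0, 79.0, 85.0, 72.0, 88.0, 69.0, 77.0])
model_b = np.array([78.0, 75.0, 86.0, 60.0, 74.0, 87.0, 65.0, 80.0, 59.0, 68.0])

d = model_a - model_b                # step 1: paired differences
d = d[d != 0]                        # zero differences carry no sign information
ranks = rankdata(np.abs(d))          # step 2: rank the absolute differences
w_plus = ranks[d > 0].sum()          # step 3: sum of ranks of positive differences
w_minus = ranks[d < 0].sum()

# Step 4: SciPy compares the statistic to its null distribution.
# For a two-sided test it reports the smaller of the two rank sums.
result = wilcoxon(model_a, model_b)
print(f"W+ = {w_plus}, W- = {w_minus}, p = {result.pvalue:.4f}")
```

Here eight of the ten differences favour model A, including all of the largest ones, so nearly all of the rank mass ends up in `w_plus`.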
Mann-Whitney U Test (Wilcoxon Rank-Sum)
The nonparametric alternative to the two-sample t-test.
Setup: Two independent groups, testing whether they come from the same distribution.
- Combine both samples and rank them together
- Compute $U$, the number of times a value from group 1 precedes a value from group 2 in the ranking
- Compare $U$ to its null distribution
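The counting definition of $U$ can be checked directly on a small example. The two groups below are invented values with no ties; the comparison with `scipy.stats.mannwhitneyu` relies on SciPy reporting the statistic for the first sample.

```python
import numpy as np
from scipy.stats import mannwhitneyu

group1 = np.array([1.1, 2.3, 2.9, 3.8, 4.0])
group2 = np.array([3.1, 4.5, 5.2, 5.9, 6.4, 7.0])

# Count pairs (x, y), x from group 1 and y from group 2, with x < y,
# i.e. the group-1 value precedes the group-2 value in the joint ranking.
u_precedes = int(np.sum(group1[:, None] < group2[None, :]))

# SciPy's statistic for the first sample counts the pairs with x > y instead,
# so without ties the two counts add up to n1 * n2.
result = mannwhitneyu(group1, group2, alternative="two-sided")
print(f"pairs with x < y: {u_precedes}, SciPy U: {result.statistic}")
```

Either count can serve as the test statistic, since one determines the other; a value close to 0 or to $n_1 n_2$ indicates that one group tends to sit below the other.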
Kruskal-Wallis Test
The nonparametric alternative to one-way ANOVA — compares more than two groups.
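A short usage sketch with `scipy.stats.kruskal`, on synthetic skewed data where the third group is deliberately shifted (the group sizes and the shift are arbitrary assumptions):

```python
import numpy as np
from scipy.stats import kruskal

rng = np.random.default_rng(0)
# Three synthetic groups with skewed (exponential) values; the third is shifted
a = rng.exponential(1.0, size=40)
b = rng.exponential(1.0, size=40)
c = rng.exponential(1.0, size=40) + 1.5

stat, p_value = kruskal(a, b, c)
print(f"H = {stat:.2f}, p = {p_value:.2g}")
```

A small p-value indicates that at least one group differs from the others; it does not say which one, so pairwise follow-up tests are typically needed.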
Kolmogorov-Smirnov Test
Tests whether a sample comes from a specific distribution (one-sample) or whether two samples come from the same distribution (two-sample).
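The test statistic is the maximum absolute difference between the empirical CDFs $\hat{F}_1$ and $\hat{F}_2$ (in the one-sample case, $\hat{F}_2$ is replaced by the reference CDF):
$$D = \sup_x \left| \hat{F}_1(x) - \hat{F}_2(x) \right|$$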
In ML: Testing whether training and test distributions differ (dataset shift detection).
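The two-sample statistic can be computed by hand and compared with `scipy.stats.ks_2samp`. In this sketch the "train" and "test" samples are synthetic, with a mean shift added to simulate dataset shift; `ks_statistic` is an illustrative helper name.

```python
import numpy as np
from scipy.stats import ks_2samp

def ks_statistic(a, b):
    """Largest absolute gap between the empirical CDFs of a and b."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])                # ECDFs only change at data points
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return np.max(np.abs(cdf_a - cdf_b))

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=400)
test_shifted = rng.normal(0.5, 1.0, size=300)    # simulated shift in the mean

d_manual = ks_statistic(train, test_shifted)
result = ks_2samp(train, test_shifted)
print(f"D = {d_manual:.3f}, p = {result.pvalue:.2g}")
```

Evaluating both empirical CDFs only at the observed points is enough, because both are step functions that change nowhere else.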
The Bootstrap
The bootstrap is one of the most powerful and widely used nonparametric methods.
The Idea
We have one sample and want to understand the sampling distribution of a statistic (mean, median, model accuracy). The bootstrap generates that distribution by resampling with replacement from the original sample.
Algorithm
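- Given sample $X = \{x_1, \dots, x_n\}$
- For $b = 1, \dots, B$:
  - Draw $n$ samples with replacement from $X$ to get $X^{*b}$
  - Compute the statistic: $\hat{\theta}^{*b} = s(X^{*b})$
- The empirical distribution of $\hat{\theta}^{*1}, \dots, \hat{\theta}^{*B}$ approximates the sampling distribution of $\hat{\theta} = s(X)$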
Bootstrap Confidence Intervals
Percentile method: Use the $\alpha/2$ and $1 - \alpha/2$ quantiles of the bootstrap distribution as the endpoints of a $(1 - \alpha)$ confidence interval.
```python
import numpy as np

def bootstrap_ci(data, statistic, n_bootstrap=10000, alpha=0.05):
    """Percentile bootstrap confidence interval for `statistic`."""
    data = np.asarray(data)
    n = len(data)
    # Resample n indices with replacement and recompute the statistic each time
    boot_stats = np.array([
        statistic(data[np.random.randint(0, n, n)])
        for _ in range(n_bootstrap)
    ])
    lower = np.percentile(boot_stats, 100 * alpha / 2)
    upper = np.percentile(boot_stats, 100 * (1 - alpha / 2))
    return lower, upper

data = np.random.exponential(2, size=50)
ci = bootstrap_ci(data, np.median)
print(f"95% CI for median: ({ci[0]:.2f}, {ci[1]:.2f})")
```
Why Bootstrap Works
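The bootstrap is justified by the plug-in principle: the empirical distribution $\hat{F}_n$ approximates the true distribution $F$, so resampling from $\hat{F}_n$ mimics drawing new samples from $F$. As $n \to \infty$, $\hat{F}_n$ converges to $F$, and for well-behaved statistics such as means and smooth functions of means, whose sampling distributions are governed by the Central Limit Theorem, the bootstrap distribution converges to the true sampling distribution. The approximation can fail for non-smooth statistics such as the sample maximum.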
Bootstrap in ML
- Bagging (Bootstrap AGGregatING): Train multiple models on bootstrap samples and average predictions. This is the foundation of Random Forests
- .632 bootstrap: Estimates out-of-sample error using the fact that each bootstrap sample contains, on average, about 63.2% of the distinct original observations, since $1 - (1 - 1/n)^n \to 1 - e^{-1} \approx 0.632$
- Confidence intervals for model performance: More reliable than a single train/test split
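The 63.2% figure is easy to check by simulation. The sketch below counts how many distinct original indices show up in each bootstrap sample; the sample size and number of replicates are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_bootstrap = 1000, 200

# Fraction of distinct original indices appearing in each bootstrap sample
fractions = np.array([
    len(np.unique(rng.integers(0, n, size=n))) / n
    for _ in range(n_bootstrap)
])

expected = 1 - (1 - 1 / n) ** n      # exact expectation; tends to 1 - 1/e
print(f"simulated: {fractions.mean():.3f}, exact: {expected:.3f}")
```

The remaining observations, roughly 36.8% per resample, are the "out-of-bag" points that bagging methods use as a built-in validation set.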
Nonparametric Bayesian Methods
Classical Bayesian methods assume a fixed model structure. Nonparametric Bayes lets the model complexity grow with the data.
Dirichlet Process
The Dirichlet Process (DP) is a distribution over distributions. Used as a prior in a mixture model, it allows an unbounded number of mixture components, where the data determines how many are actually used.
$$G \sim \mathrm{DP}(\alpha, G_0)$$
- $\alpha$: concentration parameter (larger = more clusters)
- $G_0$: base distribution (prior over component parameters)
Chinese Restaurant Process
An intuitive metaphor for the DP:
- Customer 1 sits at table 1
- Customer $n + 1$ either:
  - Sits at existing table $k$ with probability $\frac{n_k}{n + \alpha}$, where $n_k$ is the number of customers already there
  - Starts a new table with probability $\frac{\alpha}{n + \alpha}$
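The “rich get richer”: popular tables attract more customers, so a few large clusters tend to dominate. The expected number of occupied tables grows only logarithmically with the number of customers, roughly as $\alpha \log n$; power-law cluster sizes require the two-parameter Pitman-Yor generalization.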
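The seating rule is simple to simulate. This sketch assumes the standard probabilities, $n_k / (n + \alpha)$ for joining table $k$ and $\alpha / (n + \alpha)$ for opening a new one, with $n$ customers already seated; the function name and the parameter values are illustrative.

```python
import numpy as np

def chinese_restaurant_process(n_customers, alpha, rng):
    """Seat customers one at a time; return the list of table sizes."""
    table_sizes = [1]                        # customer 1 sits at table 1
    for n in range(1, n_customers):          # n customers are already seated
        # Existing table k: n_k / (n + alpha); new table: alpha / (n + alpha)
        probs = np.array(table_sizes + [alpha], dtype=float) / (n + alpha)
        choice = rng.choice(len(probs), p=probs)
        if choice == len(table_sizes):
            table_sizes.append(1)            # open a new table
        else:
            table_sizes[choice] += 1         # join an existing table
    return table_sizes

rng = np.random.default_rng(0)
sizes = chinese_restaurant_process(500, alpha=2.0, rng=rng)
print(f"{len(sizes)} tables, largest sizes: {sorted(sizes, reverse=True)[:5]}")
```

Rerunning with a larger `alpha` tends to produce more tables, which matches the role of the concentration parameter.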
Dirichlet Process Mixture Models
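Replace the fixed number of components in a GMM with a DP prior over the mixing distribution:
$$G \sim \mathrm{DP}(\alpha, G_0), \qquad \theta_i \mid G \sim G, \qquad x_i \mid \theta_i \sim F(\theta_i)$$
where $\theta_i$ are the parameters of the component that generated $x_i$ (for example, a Gaussian mean and covariance) and $F$ is the component distribution. The model infers the number of clusters from the data, which sidesteps the need to fix the number of components in advance, as EM-based GMMs require.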
Gaussian Processes
A Gaussian Process (GP) defines a prior over functions:
$$f(x) \sim \mathcal{GP}\left(m(x),\, k(x, x')\right)$$
where $m(x)$ is the mean function and $k(x, x')$ is the covariance (kernel) function. Any finite collection of function values $f(x_1), \dots, f(x_n)$ is jointly multivariate Gaussian.
GPs provide:
- Uncertainty quantification: Prediction intervals that widen where data is sparse
- Nonparametric flexibility: No need to specify the functional form
- Bayesian model selection: The marginal likelihood naturally trades off data fit and complexity
In ML: GPs are used in Bayesian optimization (hyperparameter tuning), spatial statistics, and as priors in reinforcement learning.
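To make the uncertainty behaviour concrete, here is a minimal GP regression sketch in NumPy. It assumes a zero mean function, a squared-exponential kernel with unit length scale, and a small fixed noise variance; none of these choices come from the text above, and `rbf_kernel` and `gp_posterior` are illustrative names.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0, variance=1.0):
    """Squared-exponential covariance k(x, x') for 1-D inputs."""
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dist / length_scale**2)

def gp_posterior(x_train, y_train, x_test, noise=1e-2):
    """Posterior mean and standard deviation of a zero-mean GP at x_test."""
    k_tt = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    k_ts = rbf_kernel(x_train, x_test)
    k_ss = rbf_kernel(x_test, x_test)
    chol = np.linalg.cholesky(k_tt)              # stable alternative to inverting k_tt
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train))
    mean = k_ts.T @ alpha
    v = np.linalg.solve(chol, k_ts)
    var = np.diag(k_ss) - np.sum(v**2, axis=0)
    return mean, np.sqrt(np.maximum(var, 0.0))   # clip tiny negative round-off

x_train = np.array([-3.0, -2.0, -1.0, 0.0, 1.0])
y_train = np.sin(x_train)
x_test = np.array([0.0, 5.0])    # one point inside the data, one far outside

mean, std = gp_posterior(x_train, y_train, x_test)
print(f"std at x=0: {std[0]:.3f}, std at x=5: {std[1]:.3f}")
```

The posterior standard deviation is small at the training input and returns to the prior level far from the data, which is the widening of prediction intervals described in the list above.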
Comparison: Parametric vs Nonparametric
| Aspect | Parametric | Nonparametric |
|---|---|---|
| Assumptions | Fixed distributional form | Minimal |
| Parameters | Fixed, finite | Grows with data |
| Sample efficiency | Better (if assumptions hold) | Needs more data |
| Robustness | Sensitive to violations | Robust |
| Interpretability | Often higher | Can be opaque |
| Examples | MLE, t-test, linear regression | KDE, bootstrap, Wilcoxon, GP |
Practical wisdom: Use parametric methods when assumptions are approximately met — they’re more efficient. Use nonparametric methods when assumptions are clearly violated, when the sample size is small but outliers are present, or when you need robustness guarantees.
Summary
- Nonparametric methods make minimal distributional assumptions
- KDE estimates density by placing kernels at each data point; bandwidth selection is critical
- Rank-based tests (Wilcoxon, Mann-Whitney, Kruskal-Wallis) are robust alternatives to t-tests and ANOVA
- The bootstrap estimates sampling distributions by resampling with replacement
- Dirichlet Processes let the number of clusters grow with data
- Gaussian Processes define priors over functions with built-in uncertainty
- Choose parametric when assumptions hold; nonparametric when they don’t
References
- Wasserman, L. (2006). All of Nonparametric Statistics. Springer.
- Efron, B., & Tibshirani, R. J. (1993). An Introduction to the Bootstrap. Chapman & Hall/CRC.
- Rasmussen, C. E., & Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. MIT Press. gaussianprocess.org/gpml
- Müller, P., & Quintana, F. A. (2004). “Nonparametric Bayesian Data Analysis.” Statistical Science, 19(1), 95–110.
- Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis. Chapman & Hall/CRC.
- Ferguson, T. S. (1973). “A Bayesian Analysis of Some Nonparametric Problems.” Annals of Statistics, 1(2), 209–230.
- Hollander, M., Wolfe, D. A., & Chicken, E. (2013). Nonparametric Statistical Methods (3rd ed.). Wiley.