Nonparametric Statistics

Probability & Statistics Series 10 / 13

When Assumptions Break

Everything we’ve covered so far makes assumptions about the data: it’s Gaussian, it comes from an exponential family, it has finite variance. But real-world data often violates these assumptions — it may be skewed, heavy-tailed, multimodal, or just not fit any standard distribution.

Nonparametric methods make minimal assumptions about the underlying data distribution. Instead of fitting a fixed number of parameters, they let the data speak for itself. The model complexity grows with the amount of data.

Kernel Density Estimation (KDE)

The Problem

Given samples $x_1, \ldots, x_n$, estimate the underlying probability density function $f(x)$ without assuming any parametric form.

The Histogram Approach (and Its Limitations)

A histogram is the simplest density estimate, but it has problems:

Bin edges are arbitrary — shifting bins changes the shape
Discontinuous — the estimated density jumps at bin boundaries
Bin width tradeoff: bins that are too narrow give a noisy estimate, bins that are too wide oversmooth and hide real structure

The KDE Solution

Place a smooth kernel function $K$ at each data point and average:

\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right)

where:

$K$ is a kernel function: non-negative and integrating to 1 (typically a Gaussian: $K(u) = \frac{1}{\sqrt{2\pi}} e^{-u^2/2}$)
$h > 0$ is the bandwidth, which controls smoothness
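
The estimator above can be written out directly in NumPy. The following is a minimal sketch, not a tuned implementation: the function names `gaussian_kernel` and `kde_estimate`, the sample size, and the bandwidth `h=0.4` are illustrative assumptions. It also checks numerically that the estimate integrates to roughly 1, as a density should.

```python
import numpy as np

def gaussian_kernel(u):
    """Standard normal density K(u)."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def kde_estimate(x_eval, samples, h):
    """Evaluate f_hat(x) = 1/(n h) * sum_i K((x - x_i) / h) at each x_eval."""
    x_eval = np.asarray(x_eval, dtype=float)
    samples = np.asarray(samples, dtype=float)
    # Pairwise scaled differences, shape (len(x_eval), n)
    u = (x_eval[:, None] - samples[None, :]) / h
    return gaussian_kernel(u).sum(axis=1) / (samples.size * h)

rng = np.random.default_rng(0)           # fixed seed for reproducibility
samples = rng.normal(0.0, 1.0, size=200)
grid = np.linspace(-6.0, 6.0, 1201)
f_hat = kde_estimate(grid, samples, h=0.4)

# A valid density estimate is non-negative and integrates to about 1.
dx = grid[1] - grid[0]
area = f_hat.sum() * dx
print(f"integral of f_hat over the grid: {area:.4f}")
```

The pairwise-difference matrix costs O(n × grid size) memory, which is fine for a sketch but worth replacing with a library routine for large samples.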

Bandwidth Selection

The bandwidth $h$ is the critical parameter:

Bandwidth	Result
Too small	Overfits: spiky, one peak per data point
Too large	Underfits: over-smoothed, loses structure
Just right	Captures the true density shape

Principled selection methods:

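Silverman’s rule: $h = 1.06 \hat{\sigma} n^{-1/5}$, which is optimal when the data are Gaussian
Scott’s rule: $h = \hat{\sigma} n^{-1/5}$ in one dimension, nearly identical to Silverman’s rule. (Scott’s separate rule for histogram bin width is $3.49 \hat{\sigma} n^{-1/3}$.)
Cross-validation: Choose the $h$ that maximizes the held-out log-likelihood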

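import numpy as np
from scipy.stats import gaussian_kde

# Bimodal sample: 30% from N(-2, 0.5^2), 70% from N(2, 1^2)
rng = np.random.default_rng(0)  # fixed seed so the example is reproducible
data = np.concatenate([rng.normal(-2, 0.5, 300),
                       rng.normal(2, 1, 700)])

# Fit a Gaussian KDE with Silverman's rule and evaluate it on a grid
kde = gaussian_kde(data, bw_method='silverman')
x_grid = np.linspace(-5, 6, 1000)
density = kde(x_grid)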
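
The cross-validation option can be sketched as a leave-one-out search: for each candidate bandwidth, score every point under a KDE built from all the other points, and keep the bandwidth with the highest total log-likelihood. The helper name `loo_log_likelihood` and the candidate grid are assumptions made for illustration.

```python
import numpy as np

def loo_log_likelihood(samples, h):
    """Leave-one-out log-likelihood of a Gaussian KDE with bandwidth h."""
    x = np.asarray(samples, dtype=float)
    n = x.size
    u = (x[:, None] - x[None, :]) / h
    k = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    np.fill_diagonal(k, 0.0)              # exclude each point from its own estimate
    f_loo = k.sum(axis=1) / ((n - 1) * h)
    # Floor at the smallest positive float to avoid log(0) for isolated points
    return np.log(np.maximum(f_loo, np.finfo(float).tiny)).sum()

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(2, 1, 700)])

candidates = np.linspace(0.05, 1.5, 30)   # hypothetical search grid
scores = np.array([loo_log_likelihood(data, h) for h in candidates])
h_cv = candidates[np.argmax(scores)]

# Normal-reference rule of thumb, for comparison
h_rule = 1.06 * data.std(ddof=1) * data.size ** (-1 / 5)
print(f"cross-validated h: {h_cv:.3f}, rule-of-thumb h: {h_rule:.3f}")
```

On clearly multimodal data the rule of thumb tends to oversmooth, because the overall standard deviation is inflated by the distance between modes, so the cross-validated bandwidth will often come out smaller.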

Multivariate KDE

KDE extends to $d$ dimensions using a multivariate kernel:

\hat{f}(\mathbf{x}) = \frac{1}{n |\mathbf{H}|^{1/2}} \sum_{i=1}^{n} K\!\left(\mathbf{H}^{-1/2}(\mathbf{x} - \mathbf{x}_i)\right)

where $\mathbf{H}$ is a symmetric positive-definite $d \times d$ bandwidth matrix. In practice, this suffers from the curse of dimensionality: the number of samples needed for a given accuracy grows exponentially with $d$.
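
As a rough illustration in two dimensions, SciPy's `gaussian_kde` accepts a `(d, n)` array and uses a bandwidth matrix proportional to the sample covariance. The data, covariance, and query points below are made up for the example.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
# 500 correlated 2-D points; gaussian_kde expects shape (d, n)
cov = np.array([[1.0, 0.8], [0.8, 1.0]])
points = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=500).T

kde2d = gaussian_kde(points)             # default bandwidth: Scott's rule
# In SciPy the bandwidth matrix is the sample covariance times factor**2
H = kde2d.covariance

query = np.array([[0.0, 3.0],            # x-coordinates of the two query points
                  [0.0, -3.0]])          # y-coordinates of the two query points
dens_center, dens_far = kde2d(query)
print(f"density at (0, 0): {dens_center:.4f}, at (3, -3): {dens_far:.6f}")
```

The point (3, -3) runs against the positive correlation in the data, so its estimated density should be far lower than at the origin.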

In ML: KDE is used in anomaly detection (low density = anomaly), data visualization, and as a building block in nonparametric classifiers.

Nonparametric Hypothesis Tests

When the assumptions of parametric tests (normality, equal variances) are violated, rank-based nonparametric alternatives remain valid and are far less sensitive to outliers.

The Rank Approach

Instead of using raw values, convert data to ranks (1st smallest, 2nd smallest, …). Ranks are invariant to monotonic transformations and robust to outliers.
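
A small sketch of both properties, using `scipy.stats.rankdata` on made-up values: a strictly increasing transform such as the logarithm leaves the ranks unchanged, and so does shrinking an extreme outlier as long as it stays the largest value.

```python
import numpy as np
from scipy.stats import rankdata

values = np.array([3.2, 0.5, 7.1, 1.8, 250.0])   # last value is an outlier
ranks = rankdata(values)                          # rank 1 = smallest
print(ranks)                                      # [3. 1. 4. 2. 5.]

# Invariance: a strictly increasing transform does not change the ranks
ranks_log = rankdata(np.log(values))

# Robustness: the outlier's magnitude is irrelevant, only its position matters
tamed = values.copy()
tamed[-1] = 8.0
ranks_tamed = rankdata(tamed)

print(np.array_equal(ranks, ranks_log), np.array_equal(ranks, ranks_tamed))
```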

Wilcoxon Signed-Rank Test

The nonparametric alternative to the paired t-test.

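Setup: Paired observations $(X_i, Y_i)$, testing whether the median difference is zero (assuming the differences are symmetrically distributed about their median).

Compute differences $d_i = X_i - Y_i$ and discard any that are exactly zero
Rank the absolute differences $|d_i|$, giving tied values their average rank
Sum the ranks of positive differences: $W^+ = \sum_{d_i > 0} \text{rank}(|d_i|)$
Compare $W^+$ to its null distribution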
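
The steps above can be traced by hand and checked against `scipy.stats.wilcoxon`. The paired scores are hypothetical (say, two models evaluated on the same ten datasets, in percent). Note that SciPy's default two-sided statistic is the smaller of $W^+$ and $W^-$, not $W^+$ itself.

```python
import numpy as np
from scipy.stats import rankdata, wilcoxon

# Hypothetical paired scores (percent) for two models on the same 10 datasets
x = np.array([81, 74, 90, 66, 79, 85, 70, 92, 77, 83])
y = np.array([78, 75, 84, 58, 75, 80, 72, 85, 67, 74])

d = x - y
d = d[d != 0]                     # zero differences carry no sign information
ranks = rankdata(np.abs(d))       # tied magnitudes would get average ranks
w_plus = ranks[d > 0].sum()       # sum of ranks of positive differences
w_minus = ranks[d < 0].sum()      # sum of ranks of negative differences

res = wilcoxon(x, y)              # two-sided by default
print(f"W+ = {w_plus}, W- = {w_minus}, "
      f"scipy statistic = {res.statistic}, p = {res.pvalue:.4f}")
```

Here only two of the ten differences are negative, and they are the two smallest in magnitude, so $W^-$ is tiny and the test rejects at the 5% level.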

In ML: Comparing two models when performance metrics are non-normal (e.g., skewed accuracy distributions across datasets).

Mann-Whitney U Test (Wilcoxon Rank-Sum)

The nonparametric alternative to the two-sample t-test.

Setup: Two independent groups, testing whether they come from the same distribution.

Combine both samples and rank them together
Compute $U$ — the number of times a value from group 1 precedes a value from group 2 in the ranking
Compare $U$ to its null distribution
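
A sketch of the computation on two small made-up groups. There are two complementary $U$ values: the count of pairs in which the group 1 value precedes the group 2 value, and the count in which it exceeds it. They sum to $n_1 n_2$. SciPy's `mannwhitneyu` reports the statistic for its first argument as the count of pairs where that sample's value is larger, so the example computes both.

```python
import numpy as np
from scipy.stats import rankdata, mannwhitneyu

group1 = np.array([12, 15, 11, 19, 14, 22])
group2 = np.array([18, 25, 21, 30, 17, 27, 24])
n1, n2 = len(group1), len(group2)

# Rank the pooled sample, then sum the ranks belonging to group 1
ranks = rankdata(np.concatenate([group1, group2]))
r1 = ranks[:n1].sum()

u_exceeds = r1 - n1 * (n1 + 1) / 2     # pairs where group1 value > group2 value
u_precedes = n1 * n2 - u_exceeds       # pairs where group1 value < group2 value

# Direct pair count as a cross-check of the rank formula
pair_count = np.sum(group1[:, None] < group2[None, :])

res = mannwhitneyu(group1, group2, alternative="two-sided")
print(f"U (precedes) = {u_precedes}, U (exceeds) = {u_exceeds}, "
      f"scipy statistic = {res.statistic}, p = {res.pvalue:.4f}")
```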

Kruskal-Wallis Test

The nonparametric alternative to one-way ANOVA — compares more than two groups.
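
As an illustration, the statistic can be computed from pooled ranks and compared with `scipy.stats.kruskal`. The formula used, $H = \frac{12}{N(N+1)} \sum_j R_j^2 / n_j - 3(N+1)$ with $R_j$ the rank sum of group $j$, is the standard form without tie correction; it is not given in the text above and is stated here as an assumption. The three groups are made-up data with no ties.

```python
import numpy as np
from scipy.stats import rankdata, kruskal

groups = [
    np.array([2.1, 3.4, 1.9, 2.8, 3.0]),
    np.array([3.9, 4.6, 4.1, 3.7, 5.2]),
    np.array([6.0, 5.5, 7.1, 4.9, 6.4]),
]

# Rank all observations together, then split the ranks back into groups
pooled = np.concatenate(groups)
ranks = rankdata(pooled)
N = pooled.size
sizes = [g.size for g in groups]
rank_groups = np.split(ranks, np.cumsum(sizes)[:-1])

# H statistic (no tie correction needed because all values are distinct)
h_manual = (12.0 / (N * (N + 1))
            * sum(r.sum() ** 2 / r.size for r in rank_groups)
            - 3 * (N + 1))

res = kruskal(*groups)
print(f"manual H = {h_manual:.3f}, scipy H = {res.statistic:.3f}, p = {res.pvalue:.4f}")
```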

Kolmogorov-Smirnov Test

Tests whether a sample comes from a specific distribution (one-sample) or whether two samples come from the same distribution (two-sample).

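The test statistic is the maximum absolute difference between two CDFs: the empirical CDF and the reference CDF in the one-sample case, or the two empirical CDFs in the two-sample case: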

D = \sup_x |F_1(x) - F_2(x)|

In ML: Testing whether training and test distributions differ (dataset shift detection).
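
A minimal sketch of the two-sample statistic applied to a simulated shift check. The helper `ks_statistic`, the sample sizes, and the 0.5 mean shift are illustrative assumptions; `scipy.stats.ks_2samp` supplies the p-value.

```python
import numpy as np
from scipy.stats import ks_2samp

def ks_statistic(a, b):
    """Largest absolute gap between the empirical CDFs of a and b."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])          # the gap can only change at data points
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return np.max(np.abs(cdf_a - cdf_b))

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=500)
new_same = rng.normal(0.0, 1.0, size=500)       # same distribution as train
new_shifted = rng.normal(0.5, 1.0, size=500)    # mean shifted by 0.5

d_same = ks_statistic(train, new_same)
d_shift = ks_statistic(train, new_shifted)
res_shift = ks_2samp(train, new_shifted)
print(f"D (same) = {d_same:.3f}, D (shifted) = {d_shift:.3f}, "
      f"p (shifted) = {res_shift.pvalue:.2e}")
```

The KS test is univariate, so in practice a check like this is usually run per feature or on a one-dimensional summary such as a model score.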

The Bootstrap

The bootstrap is one of the most powerful and widely used nonparametric methods.

The Idea

We have one sample and want to understand the sampling distribution of a statistic (mean, median, model accuracy). The bootstrap generates that distribution by resampling with replacement from the original sample.

Algorithm

Given sample $\mathbf{x} = (x_1, \ldots, x_n)$
For $b = 1, \ldots, B$:
- Draw $n$ samples with replacement from $\mathbf{x}$ to get $\mathbf{x}^{*(b)}$
- Compute the statistic: $\hat{\theta}^{*(b)} = s(\mathbf{x}^{*(b)})$
The empirical distribution of $\hat{\theta}^{*(1)}, \ldots, \hat{\theta}^{*(B)}$ approximates the sampling distribution

Bootstrap Confidence Intervals

Percentile method: Use the $\alpha/2$ and $1 - \alpha/2$ quantiles of the bootstrap distribution.

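import numpy as np

def bootstrap_ci(data, statistic, n_bootstrap=10000, alpha=0.05, seed=None):
    """Percentile bootstrap confidence interval for statistic(data)."""
    data = np.asarray(data)            # allow lists as well as arrays
    rng = np.random.default_rng(seed)
    n = len(data)
    # Each replicate: resample n points with replacement, recompute the statistic
    boot_stats = np.array([
        statistic(data[rng.integers(0, n, n)])
        for _ in range(n_bootstrap)
    ])
    lower = np.percentile(boot_stats, 100 * alpha / 2)
    upper = np.percentile(boot_stats, 100 * (1 - alpha / 2))
    return lower, upper

data = np.random.default_rng(0).exponential(2, size=50)
ci = bootstrap_ci(data, np.median, seed=1)
print(f"95% CI for median: ({ci[0]:.2f}, {ci[1]:.2f})")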

Why Bootstrap Works

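The bootstrap is justified by the plug-in principle: the empirical distribution $\hat{F}$ approximates the true distribution $F$, so resampling from $\hat{F}$ mimics sampling from $F$. As $n \to \infty$, $\hat{F}$ converges to $F$, and under mild regularity conditions (smooth statistics such as the mean) the bootstrap distribution converges to the true sampling distribution. It can fail for non-smooth statistics such as the sample maximum.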
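
One way to see this numerically is a case where the answer is known: the standard error of the sample mean is $s / \sqrt{n}$, and the bootstrap should recover roughly the same value. The sample size, number of replicates, and distribution below are arbitrary choices for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
sample = rng.exponential(scale=2.0, size=n)

B = 5000
idx = rng.integers(0, n, size=(B, n))     # B resamples drawn with replacement
boot_means = sample[idx].mean(axis=1)     # statistic recomputed on each resample

se_boot = boot_means.std(ddof=1)               # spread of the bootstrap replicates
se_formula = sample.std(ddof=1) / np.sqrt(n)   # textbook standard error of the mean
print(f"bootstrap SE = {se_boot:.4f}, formula SE = {se_formula:.4f}")
```

The two numbers should agree to within a few percent; the value of the bootstrap is that the same resampling recipe applies to statistics with no closed-form standard error.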

Bootstrap in ML

Bagging (Bootstrap AGGregatING): Train multiple models on bootstrap samples and average predictions. This is the foundation of Random Forests
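.632 bootstrap: Estimates out-of-sample error using the fact that each bootstrap sample contains, on average, about 63.2% ($1 - 1/e$) of the distinct original observations; the remaining points are left out and can serve as held-out data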
Confidence intervals for model performance: More reliable than a single train/test split
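
The 63.2% figure is easy to check by simulation. The probability that a given point is missed by all $n$ draws is $(1 - 1/n)^n$, which tends to $1/e$. The values of `n` and `B` below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, B = 1000, 200

fractions = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, size=n)           # one bootstrap sample of indices
    fractions[b] = np.unique(idx).size / n     # share of distinct originals included

expected = 1 - (1 - 1 / n) ** n                # tends to 1 - 1/e as n grows
print(f"simulated: {fractions.mean():.4f}, exact for n={n}: {expected:.4f}")
```

The roughly 36.8% of points left out of each resample are the "out-of-bag" observations that bagging methods can use for validation.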

Nonparametric Bayesian Methods

Classical Bayesian methods assume a fixed model structure. Nonparametric Bayes lets the model complexity grow with the data.

Dirichlet Process

The Dirichlet Process (DP) is a distribution over distributions. It defines a prior over an infinite number of mixture components, where the data determines how many are actually used.

G \sim \text{DP}(\alpha, G_0)

$\alpha$ : concentration parameter (larger = more clusters)
$G_0$ : base distribution (prior over component parameters)

Chinese Restaurant Process

An intuitive metaphor for the DP:

Customer 1 sits at table 1
Customer $n + 1$ either:
- Sits at existing table $k$ with probability $\propto n_k$ (number already there)
- Starts a new table with probability $\propto \alpha$

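The “rich get richer”: popular tables attract more customers. This naturally produces a few large clusters and many small ones, and the expected number of occupied tables grows only logarithmically with the number of customers, roughly as $\alpha \log(1 + n/\alpha)$.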
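
The seating rule can be simulated in a few lines. This is a sketch with assumed values ($\alpha = 2$, 500 customers, 100 repetitions); the function name is hypothetical. It compares the average number of occupied tables with the exact expectation $\sum_{i=0}^{n-1} \alpha / (\alpha + i)$, which follows from the probability of opening a new table at each step.

```python
import numpy as np

def chinese_restaurant_process(n_customers, alpha, rng):
    """Seat customers one by one and return the list of table sizes."""
    table_sizes = []
    for _ in range(n_customers):
        # Existing table k has weight n_k; a new table has weight alpha
        weights = np.array(table_sizes + [alpha], dtype=float)
        choice = rng.choice(len(weights), p=weights / weights.sum())
        if choice == len(table_sizes):
            table_sizes.append(1)          # open a new table
        else:
            table_sizes[choice] += 1       # join an existing table
    return table_sizes

rng = np.random.default_rng(0)
alpha, n, runs = 2.0, 500, 100
sizes_per_run = [chinese_restaurant_process(n, alpha, rng) for _ in range(runs)]

mean_tables = np.mean([len(s) for s in sizes_per_run])
expected_tables = sum(alpha / (alpha + i) for i in range(n))
print(f"average tables: {mean_tables:.2f}, expected: {expected_tables:.2f}")
print("largest tables in first run:", sorted(sizes_per_run[0], reverse=True)[:5])
```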

Dirichlet Process Mixture Models

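Replace the fixed number of components in a Gaussian mixture model (GMM) with a DP prior over the mixing distribution, so the number of components need not be chosen in advance: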

\begin{aligned} G &\sim \text{DP}(\alpha, G_0) \\ \theta_i &\sim G \\ x_i &\sim F(\theta_i) \end{aligned}

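The number of occupied clusters is inferred from the data rather than fixed beforehand, which sidesteps the model selection step (choosing the number of components) that EM-based GMMs require. The concentration parameter $\alpha$ still influences how many clusters appear.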
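
To make the generative model concrete, the sketch below draws data from it, integrating out $G$ so that each $\theta_i$ is either a fresh draw from $G_0$ (with probability $\alpha / (\alpha + i)$) or a copy of a uniformly chosen earlier $\theta$. The choices $G_0 = \mathcal{N}(0, 5^2)$ and $F(\theta) = \mathcal{N}(\theta, 1)$ are assumptions for illustration, and the code only samples from the prior; it does not perform posterior inference.

```python
import numpy as np

def sample_dp_mixture(n, alpha, rng, base_sd=5.0, noise_sd=1.0):
    """Draw (theta_i, x_i) for i < n from a DP mixture of Gaussians."""
    thetas = np.empty(n)
    for i in range(n):
        if rng.random() < alpha / (alpha + i):
            thetas[i] = rng.normal(0.0, base_sd)      # new component from G0
        else:
            thetas[i] = thetas[rng.integers(0, i)]    # reuse an earlier component
    x = rng.normal(thetas, noise_sd)                  # x_i ~ F(theta_i)
    return thetas, x

rng = np.random.default_rng(0)
thetas, x = sample_dp_mixture(n=300, alpha=1.0, rng=rng)

centers, counts = np.unique(thetas, return_counts=True)
print(f"{centers.size} components generated for {x.size} points")
print("component sizes:", np.sort(counts)[::-1])
```

Copying a uniformly chosen earlier $\theta$ selects an existing component with probability proportional to its current size, which is the same seating rule as in the Chinese Restaurant Process.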

Gaussian Processes

A Gaussian Process (GP) defines a prior over functions:

f(\mathbf{x}) \sim \mathcal{GP}(m(\mathbf{x}), k(\mathbf{x}, \mathbf{x}'))

where $m(\mathbf{x})$ is the mean function and $k(\mathbf{x}, \mathbf{x}')$ is the covariance (kernel) function. Any finite collection of function values is jointly multivariate Gaussian.

GPs provide:

Uncertainty quantification: Prediction intervals that widen where data is sparse
Nonparametric flexibility: No need to specify the functional form
Bayesian model selection: The marginal likelihood naturally trades off data fit and complexity
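
The first property can be seen in a small NumPy sketch of GP regression with a zero mean function and a squared-exponential kernel. The kernel choice, its hyperparameters, the noise level, and the toy data are all assumptions; the posterior formulas are the standard ones for Gaussian conditioning.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0, variance=1.0):
    """Squared-exponential covariance k(x, x') for 1-D inputs."""
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dist / length_scale**2)

def gp_posterior(x_train, y_train, x_test, noise_var=1e-2):
    """Posterior mean and variance of a zero-mean GP at x_test."""
    K = rbf_kernel(x_train, x_train) + noise_var * np.eye(x_train.size)
    K_s = rbf_kernel(x_train, x_test)
    L = np.linalg.cholesky(K)                      # K = L L^T
    weights = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_s.T @ weights                         # K_s^T K^{-1} y
    v = np.linalg.solve(L, K_s)
    var = np.diag(rbf_kernel(x_test, x_test)) - np.sum(v**2, axis=0)
    return mean, var

rng = np.random.default_rng(0)
x_train = np.array([-2.0, -1.5, -1.0, 0.0, 0.5])
y_train = np.sin(x_train) + rng.normal(0.0, 0.1, size=x_train.size)

x_test = np.array([-1.2, 4.0])     # one point near the data, one far from it
mean, var = gp_posterior(x_train, y_train, x_test)
print(f"near data: mean = {mean[0]:.3f}, sd = {np.sqrt(var[0]):.3f}")
print(f"far away:  mean = {mean[1]:.3f}, sd = {np.sqrt(var[1]):.3f}")
```

Far from the training inputs the posterior variance returns to the prior variance (1.0 here), while near the data it shrinks toward the noise level.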

In ML: GPs are used in Bayesian optimization (hyperparameter tuning), spatial statistics, and as priors in reinforcement learning.

Comparison: Parametric vs Nonparametric

Aspect	Parametric	Nonparametric
Assumptions	Fixed distributional form	Minimal
Parameters	Fixed, finite	Grows with data
Sample efficiency	Better (if assumptions hold)	Needs more data
Robustness	Sensitive to violations	Robust
Interpretability	Often higher	Can be opaque
Examples	MLE, t-test, linear regression	KDE, bootstrap, Wilcoxon, GP

Practical wisdom: Use parametric methods when assumptions are approximately met — they’re more efficient. Use nonparametric methods when assumptions are clearly violated, when the sample size is small but outliers are present, or when you need robustness guarantees.

Summary

Nonparametric methods make minimal distributional assumptions
KDE estimates density by placing kernels at each data point; bandwidth selection is critical
Rank-based tests (Wilcoxon, Mann-Whitney, Kruskal-Wallis) are robust alternatives to t-tests and ANOVA
The bootstrap estimates sampling distributions by resampling with replacement
Dirichlet Processes let the number of clusters grow with data
Gaussian Processes define priors over functions with built-in uncertainty
Choose parametric when assumptions hold; nonparametric when they don’t

References

Wasserman, L. (2006). All of Nonparametric Statistics. Springer.
Efron, B., & Tibshirani, R. J. (1993). An Introduction to the Bootstrap. Chapman & Hall/CRC.
Rasmussen, C. E., & Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. MIT Press. gaussianprocess.org/gpml
Müller, P., & Quintana, F. A. (2004). “Nonparametric Bayesian Data Analysis.” Statistical Science, 19(1), 95–110.
Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis. Chapman & Hall/CRC.
Ferguson, T. S. (1973). “A Bayesian Analysis of Some Nonparametric Problems.” Annals of Statistics, 1(2), 209–230.
Hollander, M., Wolfe, D. A., & Chicken, E. (2013). Nonparametric Statistical Methods (3rd ed.). Wiley.