Nonparametric Statistics

Distribution-free methods: kernel density estimation, rank tests, bootstrap, and nonparametric Bayesian models for when assumptions fail.

Probability & Statistics March 6, 2026 8 min read

When Assumptions Break

Everything we’ve covered so far makes assumptions about the data: it’s Gaussian, it comes from an exponential family, it has finite variance. But real-world data often violates these assumptions — it may be skewed, heavy-tailed, multimodal, or just not fit any standard distribution.

Nonparametric methods make minimal assumptions about the underlying data distribution. Instead of fitting a fixed number of parameters, they let the data speak for itself. The model complexity grows with the amount of data.

Kernel Density Estimation (KDE)

The Problem

Given samples $x_1, \ldots, x_n$, estimate the underlying probability density function $f(x)$ without assuming any parametric form.

The Histogram Approach (and Its Limitations)

A histogram is the simplest density estimate, but it has problems:

  • Bin edges are arbitrary: shifting the bins changes the shape
  • Discontinuous: the estimated density jumps at bin boundaries
  • Bin width tradeoff: bins that are too narrow give a noisy estimate, bins that are too wide smooth away real structure

The KDE Solution

Place a smooth kernel function $K$ at each data point and average:

$$\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right)$$

where:

  • $K$ is a kernel function (typically a Gaussian: $K(u) = \frac{1}{\sqrt{2\pi}} e^{-u^2/2}$)
  • $h$ is the bandwidth, which controls smoothness
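
The estimator above can be written out directly in a few lines. The following is a minimal sketch, assuming a Gaussian kernel and one-dimensional data; the function names `gaussian_kernel` and `kde_evaluate` and the bandwidth value 0.4 are illustrative choices, not part of any library.

```python
import numpy as np

def gaussian_kernel(u):
    """Standard normal density, used here as the kernel K."""
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def kde_evaluate(x, samples, h):
    """Evaluate f_hat(x) = 1/(n h) * sum_i K((x - x_i) / h) at each point of x."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    samples = np.asarray(samples, dtype=float)
    u = (x[:, None] - samples[None, :]) / h      # shape (len(x), n)
    return gaussian_kernel(u).sum(axis=1) / (len(samples) * h)

rng = np.random.default_rng(0)
samples = rng.normal(0.0, 1.0, size=200)         # synthetic data for illustration
grid = np.linspace(-8.0, 8.0, 2001)
f_hat = kde_evaluate(grid, samples, h=0.4)

# A density estimate should be non-negative and integrate to about 1
area = np.sum(f_hat) * (grid[1] - grid[0])
print(f"approximate integral of f_hat: {area:.4f}")
```

Because each kernel is itself a density, the average of the kernels is non-negative and integrates to one, which the Riemann sum at the end checks numerically.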

Bandwidth Selection

The bandwidth $h$ is the critical parameter:

| Bandwidth | Result |
|---|---|
| Too small | Overfits: spiky, one peak per data point |
| Too large | Underfits: over-smoothed, loses structure |
| Just right | Captures the true density shape |

Principled selection methods:

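  • Silverman’s rule: $h = 1.06 \hat{\sigma} n^{-1/5}$, optimal for Gaussian data
  • Scott’s rule: $h = \hat{\sigma} n^{-1/5}$ in one dimension ($n^{-1/(d+4)}$ in $d$ dimensions), which gives similar results. The constant $3.49 \hat{\sigma} n^{-1/3}$ sometimes quoted under this name is Scott’s rule for histogram bin width, not a KDE bandwidth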
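  • Cross-validation: Maximize held-out log-likelihood

import numpy as np
from scipy.stats import gaussian_kde

# Bimodal sample: 30% from N(-2, 0.5^2) and 70% from N(2, 1^2)
data = np.concatenate([np.random.normal(-2, 0.5, 300),
                       np.random.normal(2, 1, 700)])

# Fit a Gaussian KDE with the bandwidth chosen by Silverman's rule
kde = gaussian_kde(data, bw_method='silverman')
x_grid = np.linspace(-5, 6, 1000)
density = kde(x_grid)  # estimated density evaluated on the grid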
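
The cross-validation option in the list above can be sketched as a leave-one-out search over candidate bandwidths. This is one possible formulation, assuming a Gaussian kernel; the helper `loo_log_likelihood` and the candidate grid are illustrative assumptions rather than a library routine.

```python
import numpy as np

def loo_log_likelihood(samples, h):
    """Leave-one-out log-likelihood of a Gaussian KDE with bandwidth h."""
    n = len(samples)
    u = (samples[:, None] - samples[None, :]) / h
    k = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    np.fill_diagonal(k, 0.0)                  # each point is excluded from its own estimate
    f_loo = k.sum(axis=1) / ((n - 1) * h)
    return np.sum(np.log(f_loo + 1e-300))     # small constant guards against log(0)

rng = np.random.default_rng(0)
samples = np.concatenate([rng.normal(-2, 0.5, 150),
                          rng.normal(2, 1.0, 350)])

candidates = np.logspace(-2, 0.5, 40)         # bandwidths from 0.01 to about 3.16
scores = np.array([loo_log_likelihood(samples, h) for h in candidates])
h_cv = candidates[np.argmax(scores)]
print(f"bandwidth chosen by leave-one-out likelihood: {h_cv:.3f}")
```

Very small bandwidths are penalized because a held-out point rarely sits under a neighbor's narrow kernel, and very large ones because the two modes blur together, so the maximum falls in between.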

Multivariate KDE

KDE extends to $d$ dimensions using a multivariate kernel:

$$\hat{f}(\mathbf{x}) = \frac{1}{n |\mathbf{H}|^{1/2}} \sum_{i=1}^{n} K\!\left(\mathbf{H}^{-1/2}(\mathbf{x} - \mathbf{x}_i)\right)$$

where $\mathbf{H}$ is a $d \times d$ bandwidth matrix. In practice, this suffers from the curse of dimensionality: the number of samples needed grows exponentially with $d$.

In ML: KDE is used in anomaly detection (low density = anomaly), data visualization, and as a building block in nonparametric classifiers.
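
As a rough illustration of density-based anomaly detection, the sketch below fits a two-dimensional KDE with `scipy.stats.gaussian_kde` and flags query points whose estimated density falls below a cutoff. The 1% quantile threshold and the synthetic data are assumptions made for the example, not a recommended setting.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=(2, 500))    # shape (d, n), as gaussian_kde expects
kde2d = gaussian_kde(train)                    # default bandwidth: Scott's rule

# Assumed cutoff: the lowest 1% of densities seen on the training data
threshold = np.quantile(kde2d(train), 0.01)

# Two query points: (0, 0) near the bulk of the data, (6, 6) far from it
queries = np.array([[0.0, 6.0],
                    [0.0, 6.0]])
is_anomaly = kde2d(queries) < threshold
print(is_anomaly)
```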

Nonparametric Hypothesis Tests

When the assumptions of parametric tests (normality, equal variances) are violated, nonparametric alternatives based on ranks are robust.

The Rank Approach

Instead of using raw values, convert data to ranks (1st smallest, 2nd smallest, …). Ranks are invariant to monotonic transformations and robust to outliers.

Wilcoxon Signed-Rank Test

The nonparametric alternative to the paired t-test.

Setup: Paired observations $(X_i, Y_i)$, testing whether the median difference is zero.

  1. Compute differences $d_i = X_i - Y_i$
  2. Rank the absolute differences $|d_i|$
  3. Sum the ranks of positive differences: $W^+ = \sum_{d_i > 0} \text{rank}(|d_i|)$
  4. Compare $W^+$ to its null distribution
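
The four steps can be traced by hand and checked against `scipy.stats.wilcoxon`. The paired scores below are made up for illustration, and zero differences are dropped before ranking, which matches SciPy's default handling.

```python
import numpy as np
from scipy.stats import rankdata, wilcoxon

# Hypothetical paired scores (e.g., two models evaluated on the same 8 datasets)
x = np.array([82.0, 91.0, 77.0, 85.0, 88.0, 79.0, 93.0, 81.0])
y = np.array([80.0, 86.0, 78.0, 77.0, 84.0, 72.0, 90.0, 75.0])

d = x - y                          # step 1: differences
d = d[d != 0]                      # zero differences carry no sign information
ranks = rankdata(np.abs(d))        # step 2: rank the absolute differences
w_plus = ranks[d > 0].sum()        # step 3: rank sum of positive differences
w_minus = ranks[d < 0].sum()

res = wilcoxon(x, y)               # step 4: SciPy supplies the null distribution
print(w_plus, w_minus, res.statistic, res.pvalue)
```

For a two-sided test SciPy reports the smaller of the two rank sums as its statistic, so `res.statistic` corresponds to `min(w_plus, w_minus)` rather than to $W^+$ itself.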

In ML: Comparing two models when performance metrics are non-normal (e.g., skewed accuracy distributions across datasets).

Mann-Whitney U Test (Wilcoxon Rank-Sum)

The nonparametric alternative to the two-sample t-test.

Setup: Two independent groups, testing whether they come from the same distribution.

  1. Combine both samples and rank them together
  2. Compute $U$, the number of times a value from group 1 precedes a value from group 2 in the ranking
  3. Compare $U$ to its null distribution
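
The pair-counting definition of $U$ can be computed directly and compared with `scipy.stats.mannwhitneyu`. The two small groups are invented for illustration. Which of the two complementary counts SciPy reports depends on its convention, so the check below only relies on the two counts summing to $n_1 n_2$.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Hypothetical measurements from two independent groups
g1 = np.array([1.1, 2.3, 2.9, 3.8, 4.0])
g2 = np.array([3.1, 4.5, 5.2, 6.0, 6.7, 7.3])

# Count pairs where the group-1 value comes first; ties count one half
less = np.sum(g1[:, None] < g2[None, :])
ties = np.sum(g1[:, None] == g2[None, :])
u_precedes = less + 0.5 * ties

res = mannwhitneyu(g1, g2, alternative="two-sided")
n_pairs = len(g1) * len(g2)
print(u_precedes, n_pairs - u_precedes, res.statistic, res.pvalue)
```

Here 28 of the 30 pairs have the group-1 value first, so the complementary count is 2.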

Kruskal-Wallis Test

The nonparametric alternative to one-way ANOVA — compares more than two groups.
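
A short sketch of the test, assuming no tied values: rank all observations together, then compare the groups' rank sums using $H = \frac{12}{N(N+1)} \sum_j \frac{R_j^2}{n_j} - 3(N+1)$, where $R_j$ is the rank sum and $n_j$ the size of group $j$. The three groups are made-up data, and the result is checked against `scipy.stats.kruskal`.

```python
import numpy as np
from scipy.stats import rankdata, kruskal

# Hypothetical measurements from three independent groups (no ties)
groups = [np.array([2.1, 3.4, 1.9, 2.8]),
          np.array([4.2, 5.1, 3.9, 4.8]),
          np.array([6.3, 5.9, 7.2, 6.8])]

all_values = np.concatenate(groups)
ranks = rankdata(all_values)                       # joint ranking of all observations
N = len(all_values)
sizes = [len(g) for g in groups]
rank_sums = [r.sum() for r in np.split(ranks, np.cumsum(sizes)[:-1])]

# H statistic without tie correction
h_manual = 12.0 / (N * (N + 1)) * sum(R**2 / n for R, n in zip(rank_sums, sizes)) - 3 * (N + 1)

res = kruskal(*groups)
print(h_manual, res.statistic, res.pvalue)
```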

Kolmogorov-Smirnov Test

Tests whether a sample comes from a specific distribution (one-sample) or whether two samples come from the same distribution (two-sample).

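The test statistic is the maximum absolute difference between the two CDFs being compared:

$$D = \sup_x |F_1(x) - F_2(x)|$$

In the two-sample case, $F_1$ and $F_2$ are the empirical CDFs of the two samples. In the one-sample case, $F_2$ is the CDF of the hypothesized distribution.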

In ML: Testing whether training and test distributions differ (dataset shift detection).
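
For a sense of how shift detection might look in practice, the sketch below computes the two-sample statistic $D$ from the empirical CDFs and compares it with `scipy.stats.ks_2samp`. The "training" and "test" features are synthetic, with a deliberately shifted mean; `ks_statistic` is an illustrative helper.

```python
import numpy as np
from scipy.stats import ks_2samp

def ks_statistic(a, b):
    """Largest absolute gap between the empirical CDFs of samples a and b."""
    a, b = np.sort(a), np.sort(b)
    pooled = np.concatenate([a, b])               # the supremum is attained at a data point
    cdf_a = np.searchsorted(a, pooled, side="right") / len(a)
    cdf_b = np.searchsorted(b, pooled, side="right") / len(b)
    return np.max(np.abs(cdf_a - cdf_b))

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, 400)                 # synthetic training feature
test_shifted = rng.normal(0.7, 1.0, 400)          # synthetic test feature with a mean shift

d_manual = ks_statistic(train, test_shifted)
res = ks_2samp(train, test_shifted)
print(d_manual, res.statistic, res.pvalue)
```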

The Bootstrap

The bootstrap is one of the most powerful and widely used nonparametric methods.

The Idea

We have one sample and want to understand the sampling distribution of a statistic (mean, median, model accuracy). The bootstrap generates that distribution by resampling with replacement from the original sample.

Algorithm

  1. Given sample $\mathbf{x} = (x_1, \ldots, x_n)$
  2. For $b = 1, \ldots, B$:
    • Draw $n$ samples with replacement from $\mathbf{x}$ to get $\mathbf{x}^{*(b)}$
    • Compute the statistic: $\hat{\theta}^{*(b)} = s(\mathbf{x}^{*(b)})$
  3. The empirical distribution of $\hat{\theta}^{*(1)}, \ldots, \hat{\theta}^{*(B)}$ approximates the sampling distribution

Bootstrap Confidence Intervals

Percentile method: Use the $\alpha/2$ and $1 - \alpha/2$ quantiles of the bootstrap distribution.

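import numpy as np

def bootstrap_ci(data, statistic, n_bootstrap=10000, alpha=0.05):
    """Percentile bootstrap confidence interval for statistic(data)."""
    data = np.asarray(data)  # allow lists as well as arrays
    n = len(data)
    # Resample n indices with replacement and recompute the statistic each time
    boot_stats = np.array([
        statistic(data[np.random.randint(0, n, n)])
        for _ in range(n_bootstrap)
    ])
    # The alpha/2 and 1 - alpha/2 quantiles of the bootstrap distribution
    lower = np.percentile(boot_stats, 100 * alpha / 2)
    upper = np.percentile(boot_stats, 100 * (1 - alpha / 2))
    return lower, upper

# Skewed data, where a normal-theory interval for the median is awkward
data = np.random.exponential(2, size=50)
ci = bootstrap_ci(data, np.median)
print(f"95% CI for median: ({ci[0]:.2f}, {ci[1]:.2f})")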

Why Bootstrap Works

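The bootstrap rests on the plug-in principle: the empirical distribution $\hat{F}$ approximates the true distribution $F$, so resampling from $\hat{F}$ mimics sampling from $F$. As $n \to \infty$, $\hat{F}$ converges to $F$, and for smooth statistics such as means (the setting where the Central Limit Theorem applies) the bootstrap distribution converges to the true sampling distribution. It can fail for non-smooth statistics such as the sample maximum.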
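
One way to see the plug-in idea at work is to compare the bootstrap standard error of the mean with the plug-in formula $\hat{\sigma}/\sqrt{n}$, which is what the bootstrap approaches as $B$ grows. The sample below is synthetic, and the sizes $n = 200$ and $B = 5000$ are arbitrary choices for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.exponential(2.0, size=200)      # synthetic skewed sample
n = len(sample)

B = 5000
idx = rng.integers(0, n, size=(B, n))        # B resamples of n indices, with replacement
boot_means = sample[idx].mean(axis=1)

se_boot = boot_means.std(ddof=1)             # spread of the bootstrap means
se_plugin = sample.std(ddof=0) / np.sqrt(n)  # standard deviation of F_hat over sqrt(n)
print(f"bootstrap SE: {se_boot:.4f}, plug-in SE: {se_plugin:.4f}")
```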

Bootstrap in ML

  • Bagging (Bootstrap AGGregatING): Train multiple models on bootstrap samples and average predictions. This is the foundation of Random Forests
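  • .632 bootstrap: Estimates out-of-sample error using the fact that each bootstrap sample contains, on average, about 63.2% ($\approx 1 - 1/e$) of the distinct original observations, so the remaining points can act as held-out data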
  • Confidence intervals for model performance: More reliable than a single train/test split
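
The 63.2% figure can be checked numerically. The probability that a given observation appears in a bootstrap sample of size $n$ is $1 - (1 - 1/n)^n$, which tends to $1 - 1/e \approx 0.632$. The sketch below, with arbitrary values of `n` and `n_reps`, compares that formula with a simulation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000          # size of the original sample
n_reps = 2000     # number of bootstrap samples to draw

# Fraction of distinct original indices that appear in each bootstrap sample
fractions = np.array([
    len(np.unique(rng.integers(0, n, size=n))) / n
    for _ in range(n_reps)
])
mean_fraction = fractions.mean()
expected = 1 - (1 - 1 / n) ** n
print(f"simulated: {mean_fraction:.4f}, formula: {expected:.4f}, limit: {1 - np.exp(-1):.4f}")
```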

Nonparametric Bayesian Methods

Classical Bayesian methods assume a fixed model structure. Nonparametric Bayes lets the model complexity grow with the data.

Dirichlet Process

The Dirichlet Process (DP) is a distribution over distributions. It defines a prior over an infinite number of mixture components, where the data determines how many are actually used.

$$G \sim \text{DP}(\alpha, G_0)$$

  • $\alpha$: concentration parameter (larger = more clusters)
  • $G_0$: base distribution (prior over component parameters)

Chinese Restaurant Process

An intuitive metaphor for the DP:

  • Customer 1 sits at table 1
  • Customer $n+1$ either:
    • Sits at existing table $k$ with probability $\propto n_k$ (number already there)
    • Starts a new table with probability $\propto \alpha$

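The “rich get richer”: popular tables attract more customers. This tends to produce a few large clusters and many small ones, with the expected number of occupied tables growing only logarithmically in $n$ (roughly $\alpha \log n$). Power-law cluster sizes arise in the Pitman-Yor generalization rather than in the DP itself.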
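
The seating rule is simple enough to simulate directly. The sketch below is an illustrative implementation (the function name `sample_crp` and the chosen values of $\alpha$ are assumptions) that seats customers one at a time and compares the average number of tables for a small and a large concentration parameter.

```python
import numpy as np

def sample_crp(n_customers, alpha, rng):
    """Return the table sizes after seating n_customers under a CRP."""
    counts = [1]                                      # customer 1 sits at table 1
    for _ in range(1, n_customers):
        # Existing table k has weight n_k; a new table has weight alpha
        weights = np.array(counts + [alpha], dtype=float)
        choice = rng.choice(len(weights), p=weights / weights.sum())
        if choice == len(counts):
            counts.append(1)                          # start a new table
        else:
            counts[choice] += 1                       # join an existing table
    return counts

rng = np.random.default_rng(0)
n_tables = {
    alpha: np.mean([len(sample_crp(300, alpha, rng)) for _ in range(20)])
    for alpha in (0.5, 5.0)
}
print(n_tables)
```

A larger $\alpha$ makes new tables more likely at every step, so the average table count for $\alpha = 5$ should come out well above the one for $\alpha = 0.5$.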

Dirichlet Process Mixture Models

Replace the fixed number of components in a Gaussian mixture model (GMM) with a DP prior over the mixing distribution:

$$\begin{aligned} G &\sim \text{DP}(\alpha, G_0) \\ \theta_i &\sim G \\ x_i &\sim F(\theta_i) \end{aligned}$$

The model infers the number of clusters from the data, which avoids having to fix the number of components in advance as EM-based GMMs require.

Gaussian Processes

A Gaussian Process (GP) defines a prior over functions:

$$f(\mathbf{x}) \sim \mathcal{GP}(m(\mathbf{x}), k(\mathbf{x}, \mathbf{x}'))$$

where $m(\mathbf{x})$ is the mean function and $k(\mathbf{x}, \mathbf{x}')$ is the covariance (kernel) function. Any finite collection of function values is jointly multivariate Gaussian.

GPs provide:

  • Uncertainty quantification: Prediction intervals that widen where data is sparse
  • Nonparametric flexibility: No need to specify the functional form
  • Bayesian model selection: The marginal likelihood naturally trades off data fit and complexity
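
The uncertainty behavior in the first bullet can be seen in a small NumPy sketch of GP regression. It assumes a zero mean function, a squared-exponential (RBF) kernel with unit length scale, and a small noise term; these choices and the helper names `rbf_kernel` and `gp_posterior` are illustrative, not taken from a particular library.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0, variance=1.0):
    """Squared-exponential covariance k(x, x') for 1-D inputs."""
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dist / length_scale**2)

def gp_posterior(x_train, y_train, x_test, noise=1e-2):
    """Posterior mean and variance of a zero-mean GP at the test inputs."""
    K = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    K_s = rbf_kernel(x_train, x_test)
    K_ss = rbf_kernel(x_test, x_test)
    L = np.linalg.cholesky(K)                         # stable alternative to inverting K
    weights = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_s.T @ weights
    v = np.linalg.solve(L, K_s)
    var = np.diag(K_ss) - np.sum(v**2, axis=0)
    return mean, var

x_train = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y_train = np.sin(x_train)
x_test = np.array([0.5, 6.0])                         # one point near the data, one far away
mean, var = gp_posterior(x_train, y_train, x_test)
print(mean, var)
```

At the test point far from the data, the posterior variance returns to roughly the prior variance and the mean to the prior mean of zero, while near the data the variance is much smaller.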

In ML: GPs are used in Bayesian optimization (hyperparameter tuning), spatial statistics, and as priors in reinforcement learning.

Comparison: Parametric vs Nonparametric

| Aspect | Parametric | Nonparametric |
|---|---|---|
| Assumptions | Fixed distributional form | Minimal |
| Parameters | Fixed, finite | Grows with data |
| Sample efficiency | Better (if assumptions hold) | Needs more data |
| Robustness | Sensitive to violations | Robust |
| Interpretability | Often higher | Can be opaque |
| Examples | MLE, t-test, linear regression | KDE, bootstrap, Wilcoxon, GP |

Practical wisdom: Use parametric methods when assumptions are approximately met — they’re more efficient. Use nonparametric methods when assumptions are clearly violated, when the sample size is small but outliers are present, or when you need robustness guarantees.

Summary

  • Nonparametric methods make minimal distributional assumptions
  • KDE estimates density by placing kernels at each data point; bandwidth selection is critical
  • Rank-based tests (Wilcoxon, Mann-Whitney, Kruskal-Wallis) are robust alternatives to t-tests and ANOVA
  • The bootstrap estimates sampling distributions by resampling with replacement
  • Dirichlet Processes let the number of clusters grow with data
  • Gaussian Processes define priors over functions with built-in uncertainty
  • Choose parametric when assumptions hold; nonparametric when they don’t

References

  • Wasserman, L. (2006). All of Nonparametric Statistics. Springer.
  • Efron, B., & Tibshirani, R. J. (1993). An Introduction to the Bootstrap. Chapman & Hall/CRC.
  • Rasmussen, C. E., & Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. MIT Press. gaussianprocess.org/gpml
  • Müller, P., & Quintana, F. A. (2004). “Nonparametric Bayesian Data Analysis.” Statistical Science, 19(1), 95–110.
  • Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis. Chapman & Hall/CRC.
  • Ferguson, T. S. (1973). “A Bayesian Analysis of Some Nonparametric Problems.” Annals of Statistics, 1(2), 209–230.
  • Hollander, M., Wolfe, D. A., & Chicken, E. (2013). Nonparametric Statistical Methods (3rd ed.). Wiley.
