Hypothesis Testing

p-values, significance levels, Type I/II errors, t-tests, and confidence intervals — the foundations of statistical inference.

Probability & Statistics March 6, 2026 9 min read

The Fundamental Question

Hypothesis testing answers a simple but critical question: Is the pattern I observe in data real, or could it have happened by chance?

You train a new model and it scores 2% higher than the baseline. Is that a genuine improvement, or just noise from a lucky test split? Hypothesis testing provides a rigorous framework to decide.

This framework builds directly on the Central Limit Theorem — the reason we can make probabilistic statements about sample statistics.

The Setup

Every hypothesis test has two competing hypotheses:

  • Null hypothesis $H_0$: The “nothing interesting” claim. There is no effect, no difference, no relationship.
  • Alternative hypothesis $H_1$ (or $H_a$): The claim we’re trying to find evidence for.

Key insight: We never “prove” $H_1$. We either reject $H_0$ (finding sufficient evidence against it) or fail to reject $H_0$ (not enough evidence). Absence of evidence is not evidence of absence.

Example

Question: Does a new drug lower blood pressure compared to a placebo?

  • $H_0$: The drug has no effect. $\mu_{\text{drug}} = \mu_{\text{placebo}}$
  • $H_1$: The drug lowers blood pressure. $\mu_{\text{drug}} < \mu_{\text{placebo}}$

Test Statistics

A test statistic is a single number computed from the data that measures how far the observed result is from what $H_0$ predicts.

The general form:

$$\text{test statistic} = \frac{\text{observed value} - \text{expected value under } H_0}{\text{standard error}}$$

If $H_0$ is true, the test statistic follows a known distribution (often Gaussian or $t$-distribution, thanks to the CLT). A test statistic far from zero, in the direction(s) named by $H_1$, means the data would be unlikely if $H_0$ were true.
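As a minimal sketch of the general form above, the helper below standardizes an observed value against what the null hypothesis predicts. The function name and all numbers are hypothetical and chosen only for illustration.

```python
def standardized_statistic(observed, expected_under_h0, standard_error):
    """(observed - expected under H0) / standard error."""
    return (observed - expected_under_h0) / standard_error

# Hypothetical numbers: the new model scores 0.87, the baseline 0.85,
# and the standard error of the accuracy estimate is assumed to be 0.008.
stat = standardized_statistic(observed=0.87, expected_under_h0=0.85,
                              standard_error=0.008)
print(f"test statistic: {stat:.2f}")
```

The same 2-point gap would give a much smaller statistic if the standard error were larger, which is why the raw difference alone cannot settle the question.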

p-Values

The p-value is the probability of observing a test statistic at least as extreme as the one computed, assuming $H_0$ is true. For a right-tailed test:

$$p = P(\text{test statistic} \geq t_{\text{observed}} \mid H_0)$$

For a two-sided test, “extreme” means far from zero in either direction, so the comparison uses absolute values: $p = P(|\text{test statistic}| \geq |t_{\text{observed}}| \mid H_0)$.

A small p-value means: “If nothing interesting were happening, this result would be very unlikely.” The smaller the p-value, the stronger the evidence against $H_0$.
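To make the definition concrete, here is a small sketch that turns a test statistic into a p-value using the $t$-distribution's survival function in SciPy. The statistic and degrees of freedom are made-up values, not results from any experiment.

```python
from scipy import stats

t_observed = 2.1  # hypothetical test statistic
df = 7            # hypothetical degrees of freedom (n - 1 with n = 8)

# Right-tailed: P(T >= t_observed | H0); sf is the survival function, 1 - CDF.
p_right = stats.t.sf(t_observed, df)

# Two-sided: P(|T| >= |t_observed| | H0); the t-distribution is symmetric,
# so this is twice the one-tail area.
p_two_sided = 2 * stats.t.sf(abs(t_observed), df)

print(f"right-tailed p = {p_right:.4f}, two-sided p = {p_two_sided:.4f}")
```

With these numbers the right-tailed p-value falls below 0.05 while the two-sided one does not, which is one reason the direction of the test should be fixed before looking at the data.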

Decision Rule

Choose a significance level $\alpha$ (typically 0.05) before looking at the data:

  • If $p \leq \alpha$: Reject $H_0$. The result is “statistically significant”
  • If $p > \alpha$: Fail to reject $H_0$. Not enough evidence

Warning: A p-value is NOT the probability that $H_0$ is true. It’s the probability of the data (or more extreme) given $H_0$. These are very different things: $P(\text{data} \mid H_0) \neq P(H_0 \mid \text{data})$. For the latter, you need Bayesian inference.

Common Misconceptions

| Misconception | Reality |
| --- | --- |
| “$p = 0.03$ means 3% chance $H_0$ is true” | $p$ is about the data, not the hypothesis |
| “$p > 0.05$ means $H_0$ is true” | Failure to reject $\neq$ proof of no effect |
| “Significant = important” | Statistical significance $\neq$ practical importance |
| “Small p = large effect” | The p-value depends on sample size as well as effect size |

Types of Errors

| | $H_0$ is actually true | $H_0$ is actually false |
| --- | --- | --- |
| Reject $H_0$ | Type I error (false positive) | Correct (true positive) |
| Fail to reject $H_0$ | Correct (true negative) | Type II error (false negative) |

Type I Error (False Positive)

Probability: $\alpha$ (the significance level)

We conclude there’s an effect when there isn’t one. Setting $\alpha = 0.05$ means we accept a 5% false positive rate in cases where $H_0$ is actually true.
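One way to see this is by simulation. The sketch below draws many datasets for which the null hypothesis is true by construction and counts how often a one-sample t-test rejects anyway. The number of experiments, the sample size, and the seed are arbitrary choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha = 0.05
n_experiments, n = 5000, 30  # arbitrary simulation settings

# Each row is one experiment drawn under H0: the true mean is exactly 0.
samples = rng.normal(loc=0.0, scale=1.0, size=(n_experiments, n))
p_values = stats.ttest_1samp(samples, popmean=0.0, axis=1).pvalue

# Every rejection here is a false positive, because H0 holds in every row.
false_positive_rate = np.mean(p_values <= alpha)
print(f"Empirical Type I error rate: {false_positive_rate:.3f}")
```

The empirical rate should land close to the chosen significance level, up to simulation noise.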

Type II Error (False Negative)

Probability: $\beta$

We miss a real effect. This depends on:

  • The true effect size: larger effects are easier to detect
  • The sample size $n$: more data means more power
  • The significance level $\alpha$: a stricter threshold means more missed effects
  • The variance $\sigma^2$: noisier data makes detection harder

Statistical Power

Power $= 1 - \beta$ is the probability of correctly detecting a real effect.

$$\text{Power} = P(\text{reject } H_0 \mid H_1 \text{ is true})$$

A common target is 80% power. Power analysis before an experiment determines the sample size needed to detect a given effect size.

Intuition: Think of power as a detector’s sensitivity. A metal detector with low power misses buried treasure. High power means you find effects that are actually there.
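A rough sketch of a power analysis follows. It uses the standard normal-approximation formula for a two-sided comparison of two group means, $n \approx 2\,(z_{1-\alpha/2} + z_{1-\beta})^2 / d^2$ per group, where $d$ is the difference in means divided by the standard deviation. The function name is hypothetical, and exact $t$-based calculations give slightly larger sample sizes.

```python
import math
from scipy import stats

def sample_size_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate n per group (normal approximation, two-sided, two groups)."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)  # critical value for the test
    z_beta = stats.norm.ppf(power)           # quantile matching the target power
    return math.ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)

for d in (0.2, 0.5, 0.8):  # standardized effect sizes to plan for
    print(f"effect size {d}: about {sample_size_per_group(d)} per group")
```

Halving the effect size roughly quadruples the required sample, which is why small effects are expensive to detect.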

Common Tests

Z-Test

For testing a mean when $\sigma$ is known (or $n$ is large):

$$Z = \frac{\bar{X} - \mu_0}{\sigma / \sqrt{n}}$$

Under $H_0$, $Z \sim \mathcal{N}(0, 1)$.

One-Sample t-Test

For testing a mean when $\sigma$ is unknown (the common case):

$$t = \frac{\bar{X} - \mu_0}{s / \sqrt{n}}$$

where $s$ is the sample standard deviation. Under $H_0$, $t \sim t_{n-1}$ (Student’s $t$-distribution with $n - 1$ degrees of freedom).

The $t$-distribution has heavier tails than the Gaussian, accounting for the extra uncertainty from estimating $\sigma$. As $n \to \infty$, $t_{n-1} \to \mathcal{N}(0, 1)$.
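As an illustration, the sketch below runs a one-sample t-test with `scipy.stats.ttest_1samp` and recomputes the statistic from the formula above to show the two agree. The scores and the null value `mu_0` are hypothetical.

```python
import numpy as np
from scipy import stats

scores = np.array([85.3, 87.1, 86.5, 88.2, 84.9, 87.8, 86.1, 88.5])
mu_0 = 85.0  # hypothetical value claimed by H0

# Library version (two-sided by default).
result = stats.ttest_1samp(scores, popmean=mu_0)

# Manual version: t = (mean - mu_0) / (s / sqrt(n)), with s using ddof=1.
n = len(scores)
t_manual = (scores.mean() - mu_0) / (scores.std(ddof=1) / np.sqrt(n))
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

print(f"scipy:  t = {result.statistic:.3f}, p = {result.pvalue:.4f}")
print(f"manual: t = {t_manual:.3f}, p = {p_manual:.4f}")
```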

Two-Sample t-Test

Comparing means of two independent groups:

$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$

Example: Is model A’s accuracy significantly different from model B’s? Collect accuracy scores on $k$ test splits for each model and apply a two-sample t-test.
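The statistic above does not pool the two variances, which corresponds to Welch's version of the test. In SciPy that is `ttest_ind` with `equal_var=False`. The sketch below uses made-up accuracy scores and assumes the two sets of splits are independent of each other.

```python
import numpy as np
from scipy import stats

# Hypothetical accuracies on independent test splits for each model.
model_a = np.array([0.85, 0.87, 0.86, 0.88, 0.84])
model_b = np.array([0.82, 0.85, 0.83, 0.86, 0.81])

# Welch's t-test: does not assume equal variances in the two groups.
result = stats.ttest_ind(model_a, model_b, equal_var=False)

# Manual statistic, term by term from the formula.
se = np.sqrt(model_a.var(ddof=1) / len(model_a) + model_b.var(ddof=1) / len(model_b))
t_manual = (model_a.mean() - model_b.mean()) / se

print(f"t = {result.statistic:.3f} (manual {t_manual:.3f}), p = {result.pvalue:.4f}")
```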

Paired t-Test

When observations are paired (same subjects, before/after):

$$t = \frac{\bar{d}}{s_d / \sqrt{n}}$$

where $d_i = X_{1,i} - X_{2,i}$ are the paired differences.

In ML: Comparing two models on the same test folds — each fold gives a paired observation.
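A short sketch of the paired case, using `scipy.stats.ttest_rel` on hypothetical per-fold accuracies. The manual line applies the formula above to the fold-wise differences.

```python
import numpy as np
from scipy import stats

# Hypothetical accuracies of two models evaluated on the same 5 folds.
fold_acc_a = np.array([0.85, 0.87, 0.86, 0.88, 0.84])
fold_acc_b = np.array([0.83, 0.86, 0.83, 0.86, 0.83])

result = stats.ttest_rel(fold_acc_a, fold_acc_b)

# Manual version: a one-sample t-test on the paired differences.
d = fold_acc_a - fold_acc_b
t_manual = d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))

print(f"t = {result.statistic:.3f} (manual {t_manual:.3f}), p = {result.pvalue:.4f}")
```

Pairing removes fold-to-fold variation that both models share, so the paired test is usually more sensitive than an unpaired test on the same numbers.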

Chi-Squared Test

For testing relationships between categorical variables:

$$\chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i}$$

where $O_i$ are observed counts and $E_i$ are expected counts under $H_0$ (independence).

In ML: Feature selection — testing whether a categorical feature is associated with the target variable.
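As a sketch, the test of independence can be run with `scipy.stats.chi2_contingency` on a contingency table of counts. The table below is invented. `correction=False` turns off the continuity correction SciPy applies to 2x2 tables by default, so the result matches the plain formula above.

```python
import numpy as np
from scipy import stats

# Hypothetical counts: rows = feature category, columns = target class.
observed = np.array([[30, 10],
                     [20, 40]])

chi2, p, dof, expected = stats.chi2_contingency(observed, correction=False)

# Manual statistic: sum over cells of (O - E)^2 / E.
chi2_manual = ((observed - expected) ** 2 / expected).sum()

print(f"chi2 = {chi2:.3f} (manual {chi2_manual:.3f}), dof = {dof}, p = {p:.5f}")
```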

Confidence Intervals

A confidence interval is the “inversion” of a hypothesis test. Instead of testing a specific value, it gives a range of plausible values for a parameter.

A $(1 - \alpha)$ confidence interval for the mean:

$$\bar{X} \pm t_{\alpha/2, n-1} \cdot \frac{s}{\sqrt{n}}$$

Interpretation

A 95% confidence interval means: if we repeated the experiment many times, 95% of the resulting intervals would contain the true parameter.

Warning: It does NOT mean “there’s a 95% probability the true value is in this specific interval.” The true value is fixed — it’s either in the interval or not. The probability statement is about the procedure, not the specific interval.

import numpy as np
from scipy import stats

# Example sample of 8 measurements
data = np.array([85.3, 87.1, 86.5, 88.2, 84.9, 87.8, 86.1, 88.5])
n = len(data)
mean = np.mean(data)
se = stats.sem(data)  # standard error of the mean: s / sqrt(n)
# 95% CI from the t-distribution with n - 1 degrees of freedom
ci = stats.t.interval(0.95, df=n-1, loc=mean, scale=se)

print(f"Mean: {mean:.2f}")
print(f"95% CI: ({ci[0]:.2f}, {ci[1]:.2f})")

Relationship to Hypothesis Tests

A 95% confidence interval and a two-sided test at $\alpha = 0.05$ are equivalent:

  • If $\mu_0$ is inside the CI, the test fails to reject $H_0$
  • If $\mu_0$ is outside the CI, the test rejects $H_0$

Confidence intervals are often more informative than p-values because they show both the direction and magnitude of the effect.
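The equivalence can be checked numerically. The sketch below reuses the data from the snippet above, builds the 95% interval, and then tests several candidate null values. The list of candidates is an arbitrary choice for illustration.

```python
import numpy as np
from scipy import stats

data = np.array([85.3, 87.1, 86.5, 88.2, 84.9, 87.8, 86.1, 88.5])
alpha = 0.05

low, high = stats.t.interval(1 - alpha, df=len(data) - 1,
                             loc=data.mean(), scale=stats.sem(data))

checks = []
for mu_0 in (85.0, 86.0, 87.0, 88.5):  # hypothetical null values
    p = stats.ttest_1samp(data, popmean=mu_0).pvalue
    inside_ci = bool(low <= mu_0 <= high)
    rejected = bool(p <= alpha)
    checks.append((mu_0, inside_ci, rejected))
    print(f"mu_0 = {mu_0}: inside CI = {inside_ci}, rejected = {rejected}")
```

For every candidate, exactly one of "inside the interval" and "rejected" holds.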

Multiple Testing Problem

Running many tests inflates the false positive rate. If you test 20 independent hypotheses at $\alpha = 0.05$, the probability of at least one false positive is:

$$1 - (1 - 0.05)^{20} = 1 - 0.95^{20} \approx 0.64$$

A 64% chance of a false discovery — far from the intended 5%.

Bonferroni Correction

The simplest fix: use $\alpha / m$ as the significance level, where $m$ is the number of tests:

$$\alpha_{\text{adjusted}} = \frac{\alpha}{m}$$

For 20 tests at $\alpha = 0.05$: test each at $0.05/20 = 0.0025$.

This controls the family-wise error rate (FWER) — the probability of any false positive. The cost: greatly reduced power.
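The arithmetic behind this can be sketched in a few lines. The helper below (a hypothetical name) computes the family-wise error rate for $m$ independent tests, with and without the Bonferroni adjustment. Independence is an assumption of the formula, not of the correction itself.

```python
def family_wise_error_rate(per_test_alpha, m):
    """P(at least one false positive) across m independent tests under H0."""
    return 1 - (1 - per_test_alpha) ** m

alpha = 0.05
for m in (1, 5, 20, 100):
    uncorrected = family_wise_error_rate(alpha, m)
    bonferroni = family_wise_error_rate(alpha / m, m)
    print(f"m = {m:>3}: uncorrected FWER = {uncorrected:.3f}, "
          f"with Bonferroni = {bonferroni:.3f}")
```

With the adjusted threshold the family-wise rate stays at or just below the original significance level for every $m$.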

Benjamini-Hochberg (FDR)

A less conservative approach that controls the false discovery rate (FDR) — the expected proportion of false positives among rejections:

  1. Sort the $m$ p-values: $p_{(1)} \leq p_{(2)} \leq \cdots \leq p_{(m)}$
  2. Find the largest $k$ such that $p_{(k)} \leq \frac{k}{m} \cdot \alpha$
  3. Reject all hypotheses with $p \leq p_{(k)}$

FDR control is standard in high-dimensional settings like genomics and is increasingly used in ML feature selection.
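The three steps above can be sketched directly in NumPy. The function name and the example p-values are hypothetical. The function returns a boolean mask in the original order of the inputs.

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Return a boolean mask of hypotheses rejected by the BH procedure."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)                         # step 1: sort the p-values
    thresholds = alpha * np.arange(1, m + 1) / m  # (k / m) * alpha for k = 1..m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()            # step 2: largest k (zero-based)
        reject[order[: k + 1]] = True             # step 3: reject all p <= p_(k)
    return reject

p_values = [0.035, 0.001, 0.5, 0.028, 0.025]
print(benjamini_hochberg(p_values))
```

In this example 0.025 exceeds its own threshold of 0.02 but is still rejected, because a larger p-value further down the sorted list meets its threshold. Bonferroni at 0.05/5 = 0.01 would reject only the smallest p-value.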

Effect Size

Statistical significance tells you whether an effect can be distinguished from chance. Effect size tells you how large it is.

Cohen’s d — the standardized mean difference:

$$d = \frac{\bar{X}_1 - \bar{X}_2}{s_{\text{pooled}}}$$

| Cohen’s d | Interpretation |
| --- | --- |
| 0.2 | Small effect |
| 0.5 | Medium effect |
| 0.8 | Large effect |

Key insight: With a large enough sample, you can get a “significant” p-value for a trivially small effect. Always report effect sizes alongside p-values. A 0.1% accuracy improvement can reach $p < 0.001$ with millions of test examples, but it’s meaningless in practice.
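The sketch below illustrates this with simulated data. It assumes the usual pooled standard deviation, $s_{\text{pooled}} = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$, which the text above does not spell out. The group sizes, the true shift of 0.02 standard deviations, and the seed are arbitrary.

```python
import numpy as np
from scipy import stats

def cohens_d(x1, x2):
    """Standardized mean difference using the pooled sample standard deviation."""
    x1, x2 = np.asarray(x1, dtype=float), np.asarray(x2, dtype=float)
    n1, n2 = len(x1), len(x2)
    pooled_var = ((n1 - 1) * x1.var(ddof=1) + (n2 - 1) * x2.var(ddof=1)) / (n1 + n2 - 2)
    return (x1.mean() - x2.mean()) / np.sqrt(pooled_var)

rng = np.random.default_rng(0)
n = 1_000_000
group_a = rng.normal(loc=0.02, scale=1.0, size=n)  # tiny true shift
group_b = rng.normal(loc=0.00, scale=1.0, size=n)

d = cohens_d(group_a, group_b)
p = stats.ttest_ind(group_a, group_b).pvalue
print(f"Cohen's d = {d:.3f}, p = {p:.2e}")
```

The p-value comes out far below 0.001 even though the effect size is well under the 0.2 threshold for a "small" effect.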

Hypothesis Testing in ML

Model Comparison

Comparing two models on the same dataset:

  1. Paired t-test on cross-validation folds: Run k-fold CV for both models, compute per-fold accuracy differences, apply a paired t-test
  2. McNemar’s test: For comparing classifiers on the same test set — counts disagreements between models
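For the second option, here is a minimal sketch of McNemar's test in its exact binomial form, which is one common variant (a chi-squared approximation also exists). Only the cases where the two classifiers disagree carry information. The function name and the correctness vectors are hypothetical.

```python
import numpy as np
from scipy import stats

def mcnemar_exact(correct_a, correct_b):
    """Exact two-sided McNemar test from per-example correctness flags."""
    correct_a = np.asarray(correct_a, dtype=bool)
    correct_b = np.asarray(correct_b, dtype=bool)
    b = int(np.sum(correct_a & ~correct_b))  # A right, B wrong
    c = int(np.sum(~correct_a & correct_b))  # A wrong, B right
    n = b + c
    if n == 0:
        return b, c, 1.0  # the models never disagree
    # Under H0 each disagreement favours either model with probability 0.5.
    p = min(1.0, 2 * stats.binom.cdf(min(b, c), n, 0.5))
    return b, c, float(p)

# Hypothetical test set: 80 both right, 12 only A right, 3 only B right, 5 both wrong.
correct_a = np.array([True] * 80 + [True] * 12 + [False] * 3 + [False] * 5)
correct_b = np.array([True] * 80 + [False] * 12 + [True] * 3 + [False] * 5)

b, c, p = mcnemar_exact(correct_a, correct_b)
print(f"only A right: {b}, only B right: {c}, p = {p:.4f}")
```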

A/B Testing

Online experiments to compare model variants:

  1. Randomly assign users to control (A) or treatment (B)
  2. Collect a metric (click-through rate, revenue, engagement)
  3. Apply a two-sample test (often z-test for proportions)
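Step 3 can be sketched as a pooled two-proportion z-test. The function name and the click counts are made up, and the normal approximation assumes reasonably large counts in both arms.

```python
import math
from scipy import stats

def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for a difference in proportions (pooled standard error)."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    p_pool = (successes_a + successes_b) / (n_a + n_b)  # rate under H0: p_a == p_b
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * stats.norm.sf(abs(z))
    return z, p_value

# Hypothetical experiment: 200 clicks from 10,000 users in A, 250 from 10,000 in B.
z, p_value = two_proportion_z_test(200, 10_000, 250, 10_000)
print(f"z = {z:.3f}, p = {p_value:.4f}")
```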

Feature Importance

Testing whether a feature has a significant relationship with the target:

  • t-test: Continuous features, two classes
  • ANOVA (F-test): Continuous features, multiple classes
  • Chi-squared: Categorical features
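For the multi-class case, a one-way ANOVA can be run with `scipy.stats.f_oneway`, passing the feature's values grouped by class. The tiny dataset below is invented so that the F statistic is easy to verify by hand.

```python
from scipy import stats

# Hypothetical values of one continuous feature, grouped by target class.
class_a = [1.0, 2.0, 3.0]
class_b = [2.0, 3.0, 4.0]
class_c = [5.0, 6.0, 7.0]

# H0: the feature has the same mean in every class.
f_stat, p_value = stats.f_oneway(class_a, class_b, class_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```

Here the between-class mean square is 13 and the within-class mean square is 1, so F = 13.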

Summary

  • Hypothesis testing quantifies whether observed patterns are real or due to chance
  • The p-value is the probability of data at least as extreme as what was observed, given $H_0$, not the probability of $H_0$
  • Type I errors (false positives) are controlled by $\alpha$; Type II errors by power analysis
  • The t-test is the workhorse of statistical testing; chi-squared handles categorical data
  • Confidence intervals are more informative than p-values — they show effect magnitude
  • Multiple testing requires correction (Bonferroni or FDR)
  • Always report effect sizes alongside significance: statistical $\neq$ practical significance
  • For principled probability assignments to hypotheses, see Bayesian inference

References

  • Casella, G., & Berger, R. L. (2002). Statistical Inference (2nd ed.). Duxbury/Thomson Learning. Chapters 8-9.
  • Wasserman, L. (2004). All of Statistics: A Concise Course in Statistical Inference. Springer. Chapters 10-11.
  • Rice, J. A. (2006). Mathematical Statistics and Data Analysis (3rd ed.). Duxbury Press.
  • Benjamini, Y., & Hochberg, Y. (1995). “Controlling the False Discovery Rate.” Journal of the Royal Statistical Society B, 57(1), 289-300.
  • Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). Lawrence Erlbaum Associates.
  • Demšar, J. (2006). “Statistical Comparisons of Classifiers over Multiple Data Sets.” Journal of Machine Learning Research, 7, 1-30.
