The Fundamental Question
Hypothesis testing answers a simple but critical question: Is the pattern I observe in data real, or could it have happened by chance?
You train a new model and it scores 2% higher than the baseline. Is that a genuine improvement, or just noise from a lucky test split? Hypothesis testing provides a rigorous framework to decide.
This framework builds directly on the Central Limit Theorem — the reason we can make probabilistic statements about sample statistics.
The Setup
Every hypothesis test has two competing hypotheses:
- Null hypothesis $H_0$: The “nothing interesting” claim. There is no effect, no difference, no relationship.
- Alternative hypothesis $H_1$ (or $H_a$): The claim we’re trying to find evidence for.
Key insight: We never “prove” $H_0$. We either reject $H_0$ (finding sufficient evidence against it) or fail to reject $H_0$ (not enough evidence). Absence of evidence is not evidence of absence.
Example
Question: Does a new drug lower blood pressure compared to a placebo?
- $H_0$: The drug has no effect.
- $H_1$: The drug lowers blood pressure.
Test Statistics
A test statistic is a single number computed from the data that measures how far the observed result is from what $H_0$ predicts.
The general form:
$$\text{test statistic} = \frac{\text{observed value} - \text{value expected under } H_0}{\text{standard error}}$$
If $H_0$ is true, the test statistic follows a known distribution (often Gaussian or $t$-distribution, thanks to the CLT). A large test statistic (in absolute value) means the data is unlikely under $H_0$.
p-Values
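The p-value is the probability of observing a test statistic at least as extreme as the one computed, assuming $H_0$ is true. For a two-sided test with statistic $T$ and observed value $t_{\text{obs}}$:
$$p = P\left(|T| \geq |t_{\text{obs}}| \mid H_0\right)$$
A small p-value means: “If nothing interesting were happening, this result would be very unlikely.” The smaller the p-value, the stronger the evidence against $H_0$.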
Decision Rule
Choose a significance level $\alpha$ (typically 0.05) before looking at the data:
- If $p \leq \alpha$: Reject $H_0$. The result is “statistically significant”
- If $p > \alpha$: Fail to reject $H_0$. Not enough evidence
Warning: A p-value is NOT the probability that $H_0$ is true. It’s the probability of the data (or more extreme) given $H_0$. These are very different things: $P(\text{data} \mid H_0) \neq P(H_0 \mid \text{data})$. For the latter, you need Bayesian inference.
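The decision rule above can be sketched in a few lines. This is a minimal illustration, assuming a test statistic that is standard normal under the null hypothesis; the summary numbers (`observed_mean`, `null_mean`, `std_error`) are hypothetical values chosen only for the example.

```python
from scipy import stats

# Hypothetical summary numbers, chosen only for illustration.
observed_mean = 0.82   # sample mean accuracy of the new model
null_mean = 0.80       # value claimed by the null hypothesis
std_error = 0.01       # assumed standard error of the sample mean
alpha = 0.05           # significance level, fixed before seeing the data

# Test statistic: distance from the null value in standard-error units.
z = (observed_mean - null_mean) / std_error

# Two-sided p-value: probability of |Z| >= |z| under a standard normal.
p_value = 2 * stats.norm.sf(abs(z))

reject_h0 = p_value <= alpha
print(f"z = {z:.2f}, p = {p_value:.4f}, reject H0: {reject_h0}")
```

Note that `alpha` is set before the p-value is computed, mirroring the requirement to choose the threshold before looking at the data.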
Common Misconceptions
| Misconception | Reality |
|---|---|
| “$p = 0.03$ means 3% chance $H_0$ is true” | $p$ is about the data, not the hypothesis |
| “$p > 0.05$ means $H_0$ is true” | Failure to reject $\neq$ proof of no effect |
| “Significant = important” | Statistical significance $\neq$ practical importance |
| “Small p = large effect” | p-value depends on sample size as well as effect size |
Types of Errors
| | $H_0$ is actually true | $H_0$ is actually false |
|---|---|---|
| Reject $H_0$ | Type I error (false positive) | Correct (true positive) |
| Fail to reject $H_0$ | Correct (true negative) | Type II error (false negative) |
Type I Error (False Positive)
Probability: $\alpha$ (the significance level)
We conclude there’s an effect when there isn’t one. Setting $\alpha = 0.05$ means we accept a 5% false positive rate.
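A quick simulation can make the false positive rate concrete. The sketch below assumes normally distributed data and a one-sample t-test; the seed, sample sizes, and number of simulated experiments are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
alpha = 0.05
n_experiments, n_samples = 10_000, 30

# Every simulated experiment draws from a population whose true mean is 0,
# so the null hypothesis (mean = 0) holds in all of them.
samples = rng.normal(loc=0.0, scale=1.0, size=(n_experiments, n_samples))

# One-sample t-test of "mean = 0", run independently on each row.
p_values = stats.ttest_1samp(samples, popmean=0.0, axis=1).pvalue

# Any rejection here is a Type I error, because the null is true by construction.
false_positive_rate = np.mean(p_values <= alpha)
print(f"Fraction of true nulls rejected: {false_positive_rate:.3f}")
```

The rejected fraction should land close to the chosen significance level, since that level is exactly the long-run rate of rejecting a true null.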
Type II Error (False Negative)
Probability: $\beta$
We miss a real effect. This depends on:
- The true effect size — larger effects are easier to detect
- The sample size — more data means more power
- The significance level — stricter threshold means more missed effects
- The variance — noisier data makes detection harder
Statistical Power
Power is the probability of correctly detecting a real effect:
$$\text{Power} = 1 - \beta = P(\text{reject } H_0 \mid H_0 \text{ is false})$$
A common target is 80% power. Power analysis before an experiment determines the sample size needed to detect a given effect size.
Intuition: Think of power as a detector’s sensitivity. A metal detector with low power misses buried treasure. High power means you find effects that are actually there.
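As a rough sketch of a power analysis, the functions below compute the power of a two-sided one-sample z-test and search for the smallest sample size that reaches a target power. The helper names `z_test_power` and `required_n` are hypothetical, and the z-test setting (known standard deviation) is an assumption made to keep the formula simple.

```python
import math
from scipy import stats

def z_test_power(effect_size, n, alpha=0.05):
    """Power of a two-sided one-sample z-test.

    effect_size is the standardized shift (mu - mu0) / sigma.
    """
    z_crit = stats.norm.ppf(1 - alpha / 2)
    shift = effect_size * math.sqrt(n)
    # Probability that the statistic lands in either rejection region
    # when the true standardized shift is `shift`.
    return stats.norm.sf(z_crit - shift) + stats.norm.cdf(-z_crit - shift)

def required_n(effect_size, power=0.80, alpha=0.05):
    """Smallest n whose power reaches the target, found by direct search."""
    n = 2
    while z_test_power(effect_size, n, alpha) < power:
        n += 1
    return n

n_needed = required_n(effect_size=0.5)
print(n_needed, round(z_test_power(0.5, n_needed), 3))
```

When the effect size is zero the function returns the significance level itself, which is a useful sanity check: with no real effect, the only rejections are false positives.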
Common Tests
Z-Test
For testing a mean when $\sigma$ is known (or $n$ is large):
$$z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$$
Under $H_0$, $z \sim \mathcal{N}(0, 1)$.
One-Sample t-Test
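For testing a mean when $\sigma$ is unknown (the common case):
$$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$$
where $s$ is the sample standard deviation. Under $H_0$, $t \sim t_{n-1}$ (Student’s $t$-distribution with $n-1$ degrees of freedom).
The $t$-distribution has heavier tails than the Gaussian, accounting for the extra uncertainty from estimating $\sigma$. As $n \to \infty$, $t_{n-1} \to \mathcal{N}(0, 1)$.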
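The one-sample t-test can be computed by hand and checked against SciPy. This is a sketch on hypothetical scores; the null value `mu0` is an assumed number picked for the example.

```python
import numpy as np
from scipy import stats

# Hypothetical measurements and an assumed null value for the mean.
data = np.array([85.3, 87.1, 86.5, 88.2, 84.9, 87.8, 86.1, 88.5])
mu0 = 85.0

n = len(data)
s = data.std(ddof=1)  # sample standard deviation (divides by n - 1)

# Manual statistic and two-sided p-value from the t-distribution with n - 1 df.
t_manual = (data.mean() - mu0) / (s / np.sqrt(n))
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# The same test via SciPy.
result = stats.ttest_1samp(data, popmean=mu0)
print(f"manual: t = {t_manual:.3f}, p = {p_manual:.4f}")
print(f"scipy:  t = {result.statistic:.3f}, p = {result.pvalue:.4f}")
```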
Two-Sample t-Test
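Comparing means of two independent groups:
$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$
where $\bar{x}_j$, $s_j^2$, and $n_j$ are the sample mean, sample variance, and size of group $j$.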
Example: Is model A’s accuracy significantly different from model B’s? Collect accuracy scores on test splits for each model and apply a two-sample t-test.
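A possible version of this comparison in code, assuming Welch's form of the two-sample test (separate variances per group, no equal-variance assumption). The accuracy values are hypothetical.

```python
import numpy as np
from scipy import stats

# Hypothetical accuracies (%) from independent test splits for two models.
model_a = np.array([85.3, 87.1, 86.5, 88.2, 84.9, 87.8])
model_b = np.array([83.9, 85.0, 84.2, 86.1, 83.5, 85.7])

# Standard error of the difference in means, using each group's own variance.
se = np.sqrt(model_a.var(ddof=1) / len(model_a) + model_b.var(ddof=1) / len(model_b))
t_manual = (model_a.mean() - model_b.mean()) / se

# equal_var=False selects Welch's t-test in SciPy.
result = stats.ttest_ind(model_a, model_b, equal_var=False)
print(f"t = {result.statistic:.3f}, p = {result.pvalue:.4f}")
```

If both models were scored on the same splits, the observations are not independent and the paired test is the better fit.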
Paired t-Test
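When observations are paired (same subjects, before/after):
$$t = \frac{\bar{d}}{s_d / \sqrt{n}}$$
where $d_i = x_i - y_i$ are the paired differences, $\bar{d}$ is their mean, and $s_d$ is their sample standard deviation.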
In ML: Comparing two models on the same test folds — each fold gives a paired observation.
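The fold-wise comparison might look like the following sketch. The per-fold accuracies are hypothetical; the point is that the paired test is just a one-sample t-test on the differences.

```python
import numpy as np
from scipy import stats

# Hypothetical per-fold accuracies for two models on the same 5 CV folds.
model_a = np.array([0.91, 0.89, 0.93, 0.90, 0.92])
model_b = np.array([0.89, 0.88, 0.90, 0.89, 0.90])

# Paired differences, one per fold.
d = model_a - model_b
t_manual = d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))

# SciPy's paired test, and the equivalent one-sample test on the differences.
paired = stats.ttest_rel(model_a, model_b)
one_sample = stats.ttest_1samp(d, popmean=0.0)
print(f"t = {paired.statistic:.3f}, p = {paired.pvalue:.4f}")
```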
Chi-Squared Test
For testing relationships between categorical variables:
$$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$$
where $O_i$ are observed counts and $E_i$ are expected counts under $H_0$ (independence).
In ML: Feature selection — testing whether a categorical feature is associated with the target variable.
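As an illustration, the sketch below computes the statistic by hand for a hypothetical 2x2 contingency table and compares it with `scipy.stats.chi2_contingency`. The counts are made up for the example.

```python
import numpy as np
from scipy import stats

# Hypothetical contingency table: rows = feature category, columns = target class.
observed = np.array([[30, 10],
                     [20, 40]])

# Expected counts under independence: (row total * column total) / grand total.
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

chi2_manual = ((observed - expected) ** 2 / expected).sum()

# correction=False disables the Yates continuity correction so both values match.
chi2, p_value, dof, expected_scipy = stats.chi2_contingency(observed, correction=False)
print(f"chi2 = {chi2:.3f}, dof = {dof}, p = {p_value:.5f}")
```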
Confidence Intervals
A confidence interval is the “inversion” of a hypothesis test. Instead of testing a specific value, it gives a range of plausible values for a parameter.
A $(1 - \alpha)$ confidence interval for the mean:
$$\bar{x} \pm t_{\alpha/2,\, n-1} \cdot \frac{s}{\sqrt{n}}$$
Interpretation
A 95% confidence interval means: if we repeated the experiment many times, 95% of the resulting intervals would contain the true parameter.
Warning: It does NOT mean “there’s a 95% probability the true value is in this specific interval.” The true value is fixed — it’s either in the interval or not. The probability statement is about the procedure, not the specific interval.
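```python
import numpy as np
from scipy import stats

# Sample observations
data = np.array([85.3, 87.1, 86.5, 88.2, 84.9, 87.8, 86.1, 88.5])
n = len(data)
mean = np.mean(data)
se = stats.sem(data)  # standard error of the mean: s / sqrt(n)

# 95% interval based on the t-distribution with n - 1 degrees of freedom
ci = stats.t.interval(0.95, df=n - 1, loc=mean, scale=se)
print(f"Mean: {mean:.2f}")
print(f"95% CI: ({ci[0]:.2f}, {ci[1]:.2f})")
```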
Relationship to Hypothesis Tests
A 95% confidence interval and a two-sided test at $\alpha = 0.05$ are equivalent:
- If $\mu_0$ is inside the CI, the test fails to reject $H_0$
- If $\mu_0$ is outside the CI, the test rejects $H_0$
Confidence intervals are often more informative than p-values because they show both the direction and magnitude of the effect.
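The equivalence can be checked numerically. The sketch below, on hypothetical data, tests several assumed null values and records whether each falls inside the 95% t-interval and whether the two-sided t-test fails to reject at the 0.05 level.

```python
import numpy as np
from scipy import stats

data = np.array([85.3, 87.1, 86.5, 88.2, 84.9, 87.8, 86.1, 88.5])
ci_low, ci_high = stats.t.interval(
    0.95, df=len(data) - 1, loc=data.mean(), scale=stats.sem(data)
)

results = []
for mu0 in (85.0, 86.5, 88.0):  # hypothetical null values
    p = stats.ttest_1samp(data, popmean=mu0).pvalue
    inside_ci = ci_low <= mu0 <= ci_high
    not_rejected = p > 0.05
    # The two flags should always agree.
    results.append((mu0, inside_ci, not_rejected))
    print(f"mu0 = {mu0}: inside CI = {inside_ci}, fail to reject = {not_rejected}")
```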
Multiple Testing Problem
Running many tests inflates the false positive rate. If you test 20 independent hypotheses at $\alpha = 0.05$, the probability of at least one false positive is:
$$P(\text{at least one false positive}) = 1 - (1 - 0.05)^{20} \approx 0.64$$
A 64% chance of a false discovery, far from the intended 5%.
Bonferroni Correction
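The simplest fix: use $\alpha / m$ as the significance level, where $m$ is the number of tests:
$$\alpha_{\text{adjusted}} = \frac{\alpha}{m}$$
For 20 tests at $\alpha = 0.05$: test each at $\alpha_{\text{adjusted}} = 0.0025$.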
This controls the family-wise error rate (FWER) — the probability of any false positive. The cost: greatly reduced power.
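The arithmetic behind the inflation and its correction is short enough to verify directly. This sketch assumes the tests are independent, as in the 20-test example above.

```python
alpha, m = 0.05, 20

# Probability of at least one false positive across m independent tests.
fwer_uncorrected = 1 - (1 - alpha) ** m

# Bonferroni: test each hypothesis at alpha / m instead.
alpha_bonferroni = alpha / m
fwer_bonferroni = 1 - (1 - alpha_bonferroni) ** m

print(f"Uncorrected FWER: {fwer_uncorrected:.3f}")
print(f"Per-test level:   {alpha_bonferroni:.4f}")
print(f"Bonferroni FWER:  {fwer_bonferroni:.3f}")
```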
Benjamini-Hochberg (FDR)
A less conservative approach that controls the false discovery rate (FDR) — the expected proportion of false positives among rejections:
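- Sort the p-values: $p_{(1)} \leq p_{(2)} \leq \dots \leq p_{(m)}$
- Find the largest $k$ such that $p_{(k)} \leq \frac{k}{m} \alpha$
- Reject all hypotheses with $p_{(i)} \leq p_{(k)}$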
FDR control is standard in high-dimensional settings like genomics and is increasingly used in ML feature selection.
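The three steps above can be sketched as a small NumPy function. The name `benjamini_hochberg` and the example p-values are hypothetical; this is an illustrative implementation of the step-up rule, not a replacement for a tested library routine.

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Return a boolean array marking which hypotheses the BH procedure rejects."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)                         # indices that sort the p-values
    thresholds = alpha * np.arange(1, m + 1) / m  # (k / m) * alpha for k = 1..m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()            # largest rank meeting its bound
        reject[order[: k + 1]] = True             # reject every hypothesis up to that rank
    return reject

# Hypothetical p-values from 8 tests.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
rejected = benjamini_hochberg(p_values)
print(rejected)
```

Because the rule looks for the largest qualifying rank, a p-value can be rejected even if it misses its own threshold, as long as a larger p-value meets its threshold. For example, with p-values 0.04, 0.045, and 0.01, all three are rejected at the 0.05 level.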
Effect Size
Statistical significance tells you whether an effect exists. Effect size tells you how large it is.
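Cohen’s d, the standardized mean difference:
$$d = \frac{\bar{x}_1 - \bar{x}_2}{s_{\text{pooled}}}$$
where $s_{\text{pooled}}$ is the pooled standard deviation of the two groups.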
| Cohen’s d | Interpretation |
|---|---|
| 0.2 | Small effect |
| 0.5 | Medium effect |
| 0.8 | Large effect |
Key insight: With a large enough sample, you can get a “significant” p-value for a trivially small effect. Always report effect sizes alongside p-values. A 0.1% accuracy improvement can be statistically significant with millions of test examples, but it’s meaningless in practice.
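A small simulation illustrates the gap between significance and effect size. The helper `cohens_d` is a hypothetical name, and the data are simulated with a deliberately tiny true difference (0.02 standard deviations) and an arbitrary seed.

```python
import numpy as np
from scipy import stats

def cohens_d(x, y):
    """Standardized mean difference using the pooled standard deviation."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * np.var(x, ddof=1) + (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2)
    return (np.mean(x) - np.mean(y)) / np.sqrt(pooled_var)

rng = np.random.default_rng(0)  # arbitrary seed
# Two simulated groups whose true means differ by only 0.02 standard deviations.
a = rng.normal(loc=0.02, scale=1.0, size=1_000_000)
b = rng.normal(loc=0.00, scale=1.0, size=1_000_000)

d = cohens_d(a, b)
p_value = stats.ttest_ind(a, b).pvalue
print(f"Cohen's d = {d:.4f}, p = {p_value:.2e}")
```

With a million observations per group the p-value is far below any conventional threshold, while the effect size stays well under the 0.2 that the table above labels small.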
Hypothesis Testing in ML
Model Comparison
Comparing two models on the same dataset:
- Paired t-test on cross-validation folds: Run k-fold CV for both models, compute per-fold accuracy differences, apply a paired t-test
- McNemar’s test: For comparing classifiers on the same test set — counts disagreements between models
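McNemar's test can be sketched from the two disagreement counts alone. The counts below are hypothetical, and the code shows the textbook exact (binomial) form alongside the chi-squared approximation without continuity correction.

```python
from scipy import stats

# Hypothetical disagreement counts on a shared test set.
b = 25  # examples model A gets right and model B gets wrong
c = 10  # examples model A gets wrong and model B gets right

n_discordant = b + c

# Exact two-sided p-value: under the null, each disagreement favours
# either model with probability 0.5, so min(b, c) is Binomial(n, 0.5).
p_exact = min(1.0, 2 * stats.binom.cdf(min(b, c), n_discordant, 0.5))

# Chi-squared approximation with 1 degree of freedom.
chi2_stat = (b - c) ** 2 / (b + c)
p_approx = stats.chi2.sf(chi2_stat, df=1)

print(f"exact p = {p_exact:.4f}, chi2 = {chi2_stat:.3f}, approx p = {p_approx:.4f}")
```

Examples where both models agree do not enter the statistic at all, which is why only the two disagreement counts are needed.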
A/B Testing
Online experiments to compare model variants:
- Randomly assign users to control (A) or treatment (B)
- Collect a metric (click-through rate, revenue, engagement)
- Apply a two-sample test (often z-test for proportions)
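For a conversion-style metric, the final step might look like this two-proportion z-test sketch. The conversion counts and arm sizes are hypothetical, and the pooled standard error is the usual choice when the null hypothesis is equal rates.

```python
import math
from scipy import stats

# Hypothetical A/B results: conversions out of users in each arm.
conv_a, n_a = 480, 10_000   # control
conv_b, n_b = 560, 10_000   # treatment

p_a, p_b = conv_a / n_a, conv_b / n_b

# Pooled rate under the null hypothesis that both arms share one conversion rate.
p_pool = (conv_a + conv_b) / (n_a + n_b)
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))

z = (p_b - p_a) / se
p_value = 2 * stats.norm.sf(abs(z))  # two-sided
print(f"lift = {p_b - p_a:.4f}, z = {z:.3f}, p = {p_value:.4f}")
```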
Feature Importance
Testing whether a feature has a significant relationship with the target:
- t-test: Continuous features, two classes
- ANOVA (F-test): Continuous features, multiple classes
- Chi-squared: Categorical features
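As one example from this list, a one-way ANOVA on a continuous feature across three classes can be run with `scipy.stats.f_oneway`. The feature values are simulated, with one class given a shifted mean and an arbitrary seed, purely for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # arbitrary seed
# Simulated feature values per class; class_c has a shifted mean.
class_a = rng.normal(loc=0.0, scale=1.0, size=50)
class_b = rng.normal(loc=0.0, scale=1.0, size=50)
class_c = rng.normal(loc=2.0, scale=1.0, size=50)

# Null hypothesis: all three classes share the same feature mean.
f_stat, p_value = stats.f_oneway(class_a, class_b, class_c)
print(f"F = {f_stat:.2f}, p = {p_value:.2e}")
```

When many features are screened this way, the resulting p-values are subject to the multiple testing problem described earlier.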
Summary
- Hypothesis testing quantifies whether observed patterns are real or due to chance
- The p-value is the probability of data at least as extreme as observed, given $H_0$, not the probability of $H_0$
- Type I errors (false positives) are controlled by $\alpha$; Type II errors by power analysis
- The t-test is the workhorse of statistical testing; chi-squared handles categorical data
- Confidence intervals are more informative than p-values — they show effect magnitude
- Multiple testing requires correction (Bonferroni or FDR)
- Always report effect sizes alongside significance: statistical significance $\neq$ practical significance
- For principled probability assignments to hypotheses, see Bayesian inference
References
- Casella, G., & Berger, R. L. (2002). Statistical Inference (2nd ed.). Duxbury/Thomson Learning. Chapters 8-9.
- Wasserman, L. (2004). All of Statistics: A Concise Course in Statistical Inference. Springer. Chapters 10-11.
- Rice, J. A. (2006). Mathematical Statistics and Data Analysis (3rd ed.). Duxbury Press.
- Benjamini, Y., & Hochberg, Y. (1995). “Controlling the False Discovery Rate.” Journal of the Royal Statistical Society B, 57(1), 289–300.
- Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). Lawrence Erlbaum Associates.
- Demšar, J. (2006). “Statistical Comparisons of Classifiers over Multiple Data Sets.” Journal of Machine Learning Research, 7, 1–30.