Hypothesis Testing

Probability & Statistics Series 9 / 13

The Fundamental Question

Hypothesis testing answers a simple but critical question: Is the pattern I observe in data real, or could it have happened by chance?

You train a new model and it scores 2% higher than the baseline. Is that a genuine improvement, or just noise from a lucky test split? Hypothesis testing provides a rigorous framework to decide.

This framework builds directly on the Central Limit Theorem — the reason we can make probabilistic statements about sample statistics.

The Setup

Every hypothesis test has two competing hypotheses:

Null hypothesis $H_0$: The “nothing interesting” claim. There is no effect, no difference, no relationship.
Alternative hypothesis $H_1$ (or $H_a$): The claim we’re trying to find evidence for.

Key insight: We never “prove” $H_1$. We either reject $H_0$ (finding sufficient evidence against it) or fail to reject $H_0$ (not enough evidence). Failing to reject does not show that $H_0$ is true: absence of evidence is not evidence of absence.

Example

Question: Does a new drug lower blood pressure compared to a placebo?

$H_0$: The drug has no effect. $\mu_{\text{drug}} = \mu_{\text{placebo}}$
$H_1$: The drug lowers blood pressure. $\mu_{\text{drug}} < \mu_{\text{placebo}}$

Test Statistics

A test statistic is a single number computed from the data that measures how far the observed result is from what $H_0$ predicts.

The general form:

$$
\text{test statistic} = \frac{\text{observed value} - \text{expected value under } H_0}{\text{standard error}}
$$

If $H_0$ is true, the test statistic follows a known distribution (often the standard Gaussian or a $t$-distribution; for large samples the CLT justifies the Gaussian approximation). A test statistic that is large in magnitude means the data would be unlikely if $H_0$ were true.

p-Values

The p-value is the probability of observing a test statistic at least as extreme as the one computed, assuming $H_0$ is true:

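$$
p = P(\text{test statistic} \geq t_{\text{observed}} \mid H_0)
$$

This is the one-sided (upper-tail) form. For a lower-tail alternative, replace $\geq$ with $\leq$. For a two-sided test, compare magnitudes: $p = P(|\text{test statistic}| \geq |t_{\text{observed}}| \mid H_0)$.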

A small p-value means: “If nothing interesting were happening, this result would be very unlikely.” The smaller the p-value, the stronger the evidence against $H_0$.
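As a minimal sketch of this definition, the snippet below turns a test statistic into a p-value, assuming the statistic is standard normal under $H_0$. The value `z_observed = 2.0` is made up for illustration; `scipy.stats.norm.sf` is the survival function (one minus the CDF).

```python
from scipy import stats

# Hypothetical observed statistic, assumed to follow N(0, 1) under H0
z_observed = 2.0

# One-sided (upper-tail) p-value: P(Z >= z_observed | H0)
p_one_sided = stats.norm.sf(z_observed)

# Two-sided p-value: P(|Z| >= |z_observed| | H0)
p_two_sided = 2 * stats.norm.sf(abs(z_observed))

print(f"one-sided p = {p_one_sided:.4f}")
print(f"two-sided p = {p_two_sided:.4f}")
```

With these inputs the one-sided p-value is about 0.023 and the two-sided p-value is about 0.046, so the choice of alternative hypothesis changes how close the result sits to a 0.05 threshold.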

Decision Rule

Choose a significance level $\alpha$ (typically 0.05) before looking at the data:

If $p \leq \alpha$: Reject $H_0$. The result is “statistically significant.”
If $p > \alpha$: Fail to reject $H_0$. There is not enough evidence.

Warning: A p-value is NOT the probability that $H_0$ is true. It’s the probability of the data (or more extreme data) given $H_0$. These are very different things: $P(\text{data} \mid H_0) \neq P(H_0 \mid \text{data})$. For the latter, you need Bayesian inference.

Common Misconceptions

Misconception	Reality
“$p = 0.03$ means a 3% chance $H_0$ is true”	$p$ is about the data, not the hypothesis
“$p > 0.05$ means $H_0$ is true”	Failure to reject $\neq$ proof of no effect
“Significant = important”	Statistical significance $\neq$ practical importance
“Small p = large effect”	The p-value depends on sample size as well as effect size

Types of Errors

	$H_0$ is actually true	$H_0$ is actually false
Reject $H_0$	Type I error (false positive)	Correct (true positive)
Fail to reject $H_0$	Correct (true negative)	Type II error (false negative)

Type I Error (False Positive)

Probability: $\alpha$ (the significance level)

We conclude there’s an effect when there isn’t one. Setting $\alpha = 0.05$ means that, when $H_0$ is true, we wrongly reject it about 5% of the time.
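The meaning of $\alpha$ as a false positive rate can be checked by simulation. The sketch below is an illustration under assumed settings (5000 simulated experiments, 30 standard-normal observations each, a fixed random seed): $H_0$ is true in every experiment, so each rejection is a Type I error.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha = 0.05
n_experiments, n = 5000, 30  # assumed simulation sizes

# Every experiment samples from N(0, 1), so H0: mu = 0 is true by construction
samples = rng.normal(loc=0.0, scale=1.0, size=(n_experiments, n))

# One one-sample t-test per row (per simulated experiment)
result = stats.ttest_1samp(samples, popmean=0.0, axis=1)

# Fraction of experiments that reject H0; each rejection is a false positive
false_positive_rate = np.mean(result.pvalue <= alpha)
print(f"Observed false positive rate: {false_positive_rate:.3f}")
```

The observed rate should land close to 0.05, with some simulation noise.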

Type II Error (False Negative)

Probability: $\beta$

We miss a real effect. This depends on:

The true effect size — larger effects are easier to detect
The sample size $n$ — more data means more power
The significance level $\alpha$ — stricter threshold means more missed effects
The variance $\sigma^2$ — noisier data makes detection harder

Statistical Power

Power $= 1 - \beta$ is the probability of correctly detecting a real effect.

$$
\text{Power} = P(\text{reject } H_0 \mid H_1 \text{ is true})
$$

A common target is 80% power. Power analysis before an experiment determines the sample size needed to detect a given effect size.

Intuition: Think of power as a detector’s sensitivity. A metal detector with low power misses buried treasure. High power means you find effects that are actually there.
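A power analysis can be sketched for the simplest case, a two-sided one-sample z-test with known $\sigma$. The helper names `power_z_test` and `sample_size_z_test` are hypothetical, and the formula is the usual normal approximation that ignores the tiny chance of rejecting in the wrong direction. The planning values (a shift of half a standard deviation, 80% power) are assumptions for illustration.

```python
import math
from scipy import stats

def power_z_test(effect, sigma, n, alpha=0.05):
    """Approximate power of a two-sided one-sample z-test.

    Ignores the negligible probability of rejecting in the wrong tail.
    """
    z_crit = stats.norm.ppf(1 - alpha / 2)
    shift = abs(effect) * math.sqrt(n) / sigma  # mean of Z under H1
    return stats.norm.cdf(shift - z_crit)

def sample_size_z_test(effect, sigma, power=0.80, alpha=0.05):
    """Smallest n that reaches the target power under the same approximation."""
    z_crit = stats.norm.ppf(1 - alpha / 2)
    z_power = stats.norm.ppf(power)
    return math.ceil(((z_crit + z_power) * sigma / abs(effect)) ** 2)

# Hypothetical planning values: detect a shift of 0.5 standard deviations
n_needed = sample_size_z_test(effect=0.5, sigma=1.0)
achieved = power_z_test(effect=0.5, sigma=1.0, n=n_needed)
print(f"n = {n_needed}, power = {achieved:.3f}")
```

Under these assumptions, 32 observations are enough for 80% power. Halving the effect size roughly quadruples the required sample size, since $n$ scales with $1/\text{effect}^2$.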

Common Tests

Z-Test

For testing a mean when $\sigma$ is known (or $n$ is large):

$$
Z = \frac{\bar{X} - \mu_0}{\sigma / \sqrt{n}}
$$

Under $H_0$, $Z \sim \mathcal{N}(0, 1)$.
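A minimal sketch of the z-test follows. The function name `z_test` is hypothetical. The sample reuses the eight values from this article's confidence interval example, while `mu0 = 86.0` and `sigma = 1.5` are assumed purely for illustration. In practice $\sigma$ is rarely known exactly.

```python
import numpy as np
from scipy import stats

def z_test(sample, mu0, sigma):
    """Two-sided one-sample z-test with a known population sigma."""
    sample = np.asarray(sample, dtype=float)
    z = (sample.mean() - mu0) / (sigma / np.sqrt(sample.size))
    p_value = 2 * stats.norm.sf(abs(z))  # both tails of N(0, 1)
    return z, p_value

# mu0 and sigma are assumed values, not estimated from the data
data = [85.3, 87.1, 86.5, 88.2, 84.9, 87.8, 86.1, 88.5]
z, p = z_test(data, mu0=86.0, sigma=1.5)
print(f"Z = {z:.3f}, p = {p:.3f}")
```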

One-Sample t-Test

For testing a mean when $\sigma$ is unknown (the common case):

$$
t = \frac{\bar{X} - \mu_0}{s / \sqrt{n}}
$$

where $s$ is the sample standard deviation. Under $H_0$, $t \sim t_{n-1}$ (Student’s $t$-distribution with $n - 1$ degrees of freedom).

The $t$-distribution has heavier tails than the Gaussian, accounting for the extra uncertainty from estimating $\sigma$. As $n \to \infty$, $t_{n-1} \to \mathcal{N}(0, 1)$.
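The formula can be checked against SciPy’s `ttest_1samp`. This is a sketch: the null value `mu0 = 85.0` is an assumption chosen for illustration, and the data are the eight values from this article’s confidence interval example.

```python
import numpy as np
from scipy import stats

data = np.array([85.3, 87.1, 86.5, 88.2, 84.9, 87.8, 86.1, 88.5])
mu0 = 85.0  # hypothetical null value

# Manual computation following the formula
n = data.size
s = data.std(ddof=1)  # sample standard deviation (divides by n - 1)
t_manual = (data.mean() - mu0) / (s / np.sqrt(n))
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)  # two-sided

# The same test via SciPy
result = stats.ttest_1samp(data, popmean=mu0)

print(f"manual: t = {t_manual:.3f}, p = {p_manual:.4f}")
print(f"scipy:  t = {result.statistic:.3f}, p = {result.pvalue:.4f}")
```

Both lines should print the same statistic and p-value. Note `ddof=1`: NumPy’s default `std` divides by $n$, which would not match the $t$ statistic.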

Two-Sample t-Test

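Comparing means of two independent groups, with $H_0$: $\mu_1 = \mu_2$. The form below is Welch’s $t$-test, which does not assume equal variances:

$$
t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}
$$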

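Example: Is model A’s accuracy significantly different from model B’s? Collect accuracy scores on $k$ independent test splits for each model and apply a two-sample t-test. If both models are scored on the same splits, the paired t-test is the appropriate choice.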
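A sketch of this comparison follows, using made-up accuracy scores for two models (the arrays `acc_a` and `acc_b` are hypothetical). Passing `equal_var=False` to `scipy.stats.ttest_ind` selects the unequal-variance form that matches the statistic above.

```python
import numpy as np
from scipy import stats

# Hypothetical accuracy scores, each model evaluated on its own independent splits
acc_a = np.array([0.91, 0.89, 0.92, 0.90, 0.93])
acc_b = np.array([0.88, 0.87, 0.90, 0.86, 0.89])

# Manual statistic: difference in means over the combined standard error
se = np.sqrt(acc_a.var(ddof=1) / acc_a.size + acc_b.var(ddof=1) / acc_b.size)
t_manual = (acc_a.mean() - acc_b.mean()) / se

# equal_var=False gives the unequal-variance (Welch) test
result = stats.ttest_ind(acc_a, acc_b, equal_var=False)

print(f"manual t = {t_manual:.3f}")
print(f"scipy  t = {result.statistic:.3f}, p = {result.pvalue:.4f}")
```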

Paired t-Test

When observations are paired (same subjects, before/after):

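$$
t = \frac{\bar{d}}{s_d / \sqrt{n}}
$$

where $d_i = X_{1,i} - X_{2,i}$ are the paired differences, $\bar{d}$ is their mean, and $s_d$ is their sample standard deviation. Under $H_0$, $t \sim t_{n-1}$.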

In ML: Comparing two models on the same test folds — each fold gives a paired observation.
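The sketch below runs a paired test on made-up per-fold accuracies (the arrays are hypothetical) and, for contrast, an unpaired test on the same numbers. A paired t-test is a one-sample t-test on the differences, which the code also checks.

```python
import numpy as np
from scipy import stats

# Hypothetical per-fold accuracies; index i is the same fold for both models
acc_a = np.array([0.91, 0.89, 0.92, 0.90, 0.93])
acc_b = np.array([0.88, 0.87, 0.90, 0.86, 0.89])

# Manual statistic on the paired differences
d = acc_a - acc_b
t_manual = d.mean() / (d.std(ddof=1) / np.sqrt(d.size))

paired = stats.ttest_rel(acc_a, acc_b)
one_sample = stats.ttest_1samp(d, popmean=0.0)  # equivalent formulation
unpaired = stats.ttest_ind(acc_a, acc_b, equal_var=False)  # ignores the pairing

print(f"paired:   t = {paired.statistic:.3f}, p = {paired.pvalue:.4f}")
print(f"unpaired: t = {unpaired.statistic:.3f}, p = {unpaired.pvalue:.4f}")
```

On these numbers the paired statistic is larger than the unpaired one, because differencing removes the fold-to-fold variation that both models share.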

Chi-Squared Test

For testing relationships between categorical variables:

$$
\chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i}
$$

where $O_i$ are observed counts and $E_i$ are expected counts under $H_0$ (independence).

In ML: Feature selection — testing whether a categorical feature is associated with the target variable.
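As a sketch, the statistic can be computed by hand for a small contingency table and compared with `scipy.stats.chi2_contingency`. The counts are made up. For 2x2 tables SciPy applies Yates’ continuity correction by default, so `correction=False` is passed here to match the plain formula.

```python
import numpy as np
from scipy import stats

# Hypothetical 2x2 table: rows = feature value, columns = target class
observed = np.array([[30, 10],
                     [20, 40]])

# Expected counts under independence: row total * column total / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

chi2_manual = ((observed - expected) ** 2 / expected).sum()

# correction=False disables Yates' correction so the result matches the formula
chi2, p_value, dof, expected_scipy = stats.chi2_contingency(observed, correction=False)

print(f"manual chi2 = {chi2_manual:.3f}")
print(f"scipy  chi2 = {chi2:.3f}, dof = {dof}, p = {p_value:.2e}")
```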

Confidence Intervals

A confidence interval is the “inversion” of a hypothesis test. Instead of testing a specific value, it gives a range of plausible values for a parameter.

A $(1 - \alpha)$ confidence interval for the mean:

$$
\bar{X} \pm t_{\alpha/2, n-1} \cdot \frac{s}{\sqrt{n}}
$$

Interpretation

A 95% confidence interval means: if we repeated the experiment many times, 95% of the resulting intervals would contain the true parameter.

Warning: It does NOT mean “there’s a 95% probability the true value is in this specific interval.” The true value is fixed — it’s either in the interval or not. The probability statement is about the procedure, not the specific interval.

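In Python, the interval can be computed with SciPy:

```python
import numpy as np
from scipy import stats

# Example measurements
data = np.array([85.3, 87.1, 86.5, 88.2, 84.9, 87.8, 86.1, 88.5])
n = len(data)
mean = np.mean(data)
se = stats.sem(data)  # standard error of the mean: s / sqrt(n), with ddof=1

# 95% t-interval: mean +/- t_{alpha/2, n-1} * se
ci = stats.t.interval(0.95, df=n-1, loc=mean, scale=se)

print(f"Mean: {mean:.2f}")
print(f"95% CI: ({ci[0]:.2f}, {ci[1]:.2f})")
```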

Relationship to Hypothesis Tests

A 95% confidence interval and a two-sided test at $\alpha = 0.05$ are equivalent:

If $\mu_0$ is inside the CI, the test fails to reject $H_0$
If $\mu_0$ is outside the CI, the test rejects $H_0$
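This equivalence can be checked numerically. The sketch below builds a 95% $t$-interval for the eight measurements used above and compares it with two-sided one-sample t-tests at several null values; the candidate values of `mu0` are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats

data = np.array([85.3, 87.1, 86.5, 88.2, 84.9, 87.8, 86.1, 88.5])
alpha = 0.05

# 95% confidence interval for the mean
low, high = stats.t.interval(1 - alpha, df=data.size - 1,
                             loc=data.mean(), scale=stats.sem(data))
print(f"95% CI: ({low:.2f}, {high:.2f})")

# Hypothetical null values to compare against the interval
for mu0 in [85.0, 86.0, 87.5, 88.0]:
    p = stats.ttest_1samp(data, popmean=mu0).pvalue
    inside = low <= mu0 <= high
    rejected = p <= alpha
    print(f"mu0 = {mu0}: inside CI = {inside}, rejected = {rejected}")
```

For every value of `mu0`, exactly one of `inside` and `rejected` is true.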

Confidence intervals are often more informative than p-values because they show both the direction and magnitude of the effect.

Multiple Testing Problem

Running many tests inflates the false positive rate. If you test 20 independent hypotheses at $\alpha = 0.05$ and every null hypothesis is true, the probability of at least one false positive is:

$$
1 - (1 - 0.05)^{20} = 1 - 0.95^{20} \approx 0.64
$$

A 64% chance of a false discovery — far from the intended 5%.
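The same calculation for other numbers of tests, as a small sketch (the function name `family_wise_error_rate` is hypothetical, and independence of the tests is assumed):

```python
def family_wise_error_rate(m, alpha=0.05):
    """P(at least one false positive) over m independent tests when every H0 is true."""
    return 1 - (1 - alpha) ** m

for m in [1, 5, 20, 100]:
    print(f"m = {m:3d}: FWER = {family_wise_error_rate(m):.3f}")
```

The rate grows quickly with $m$: by 100 tests, at least one false positive is almost guaranteed.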

Bonferroni Correction

The simplest fix: use $\alpha / m$ as the significance level, where $m$ is the number of tests:

$$
\alpha_{\text{adjusted}} = \frac{\alpha}{m}
$$

For 20 tests at $\alpha = 0.05$: test each at $0.05/20 = 0.0025$.

This controls the family-wise error rate (FWER) — the probability of any false positive. The cost: greatly reduced power.
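A minimal sketch of the correction follows. The function name `bonferroni_reject` and the p-values are made up for illustration. The last lines check that, for 20 independent tests, the adjusted threshold brings the family-wise error rate back under 5%.

```python
import numpy as np

def bonferroni_reject(pvalues, alpha=0.05):
    """Boolean mask of hypotheses rejected at the Bonferroni-adjusted level alpha / m."""
    p = np.asarray(pvalues, dtype=float)
    return p <= alpha / p.size

# Hypothetical p-values from m = 5 tests; the adjusted threshold is 0.05 / 5 = 0.01
pvalues = [0.003, 0.012, 0.04, 0.2, 0.65]
rejected = bonferroni_reject(pvalues)
print(rejected)

# Family-wise error rate for 20 independent tests, each run at 0.05 / 20
m = 20
fwer = 1 - (1 - 0.05 / m) ** m
print(f"FWER with Bonferroni, m = {m}: {fwer:.4f}")
```

Only the first p-value survives the correction, even though three of the five are below the unadjusted 0.05. This is the loss of power in action.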

Benjamini-Hochberg (FDR)

A less conservative approach that controls the false discovery rate (FDR) — the expected proportion of false positives among rejections:

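Sort the $m$ p-values: $p_{(1)} \leq p_{(2)} \leq \cdots \leq p_{(m)}$
Find the largest $k$ such that $p_{(k)} \leq \frac{k}{m} \cdot \alpha$ (if no such $k$ exists, reject nothing)
Reject all hypotheses with $p \leq p_{(k)}$, that is, the $k$ smallest p-values, including any that missed their own threshold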

FDR control is standard in high-dimensional settings like genomics and is increasingly used in ML feature selection.
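The three steps can be sketched in a few lines of NumPy. The function name `benjamini_hochberg` and the p-values are hypothetical; this is an illustration of the procedure as described above, not a replacement for a tested library implementation.

```python
import numpy as np

def benjamini_hochberg(pvalues, alpha=0.05):
    """Boolean mask of rejected hypotheses under the Benjamini-Hochberg procedure."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)                          # step 1: sort the p-values
    thresholds = alpha * np.arange(1, m + 1) / m   # (k / m) * alpha for k = 1..m
    passing = np.nonzero(p[order] <= thresholds)[0]

    reject = np.zeros(m, dtype=bool)
    if passing.size > 0:
        k = passing[-1]                  # step 2: largest passing rank (zero-based)
        reject[order[:k + 1]] = True     # step 3: reject the k smallest p-values
    return reject

# Hypothetical p-values from m = 10 tests, in their original (unsorted) order
pvalues = [0.04, 0.001, 0.205, 0.014, 0.06, 0.012, 0.216, 0.019, 0.074, 0.212]
bh = benjamini_hochberg(pvalues)
bonferroni = np.asarray(pvalues) <= 0.05 / len(pvalues)

print("BH rejections:        ", int(bh.sum()))
print("Bonferroni rejections:", int(bonferroni.sum()))
```

In this example BH rejects four hypotheses while Bonferroni rejects one. The p-value 0.012 fails its own threshold of 0.010 at rank 2, but it is still rejected because rank 4 passes (0.019 ≤ 0.020), which is why the procedure looks for the largest $k$ rather than stopping at the first failure.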

Effect Size

Statistical significance tells you whether an effect is distinguishable from chance. Effect size tells you how large it is.

Cohen’s d — the standardized mean difference:

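$$
d = \frac{\bar{X}_1 - \bar{X}_2}{s_{\text{pooled}}}
$$

where $s_{\text{pooled}} = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$ is the pooled sample standard deviation of the two groups.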

Cohen’s d	Interpretation
0.2	Small effect
0.5	Medium effect
0.8	Large effect

Key insight: With a large enough sample, you can get a “significant” p-value for a trivially small effect. Always report effect sizes alongside p-values. A 0.1% accuracy improvement can reach $p < 0.001$ with millions of test examples, yet it may be meaningless in practice.
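The gap between significance and effect size can be illustrated with simulated data. In this sketch the helper `cohens_d` is a hypothetical name, the pooled standard deviation is the usual $(n_1 + n_2 - 2)$-denominator version, and the two groups are simulated with a true difference of only 0.02 standard deviations (an assumed value) and one million samples each.

```python
import numpy as np
from scipy import stats

def cohens_d(x1, x2):
    """Standardized mean difference using the pooled sample standard deviation."""
    x1, x2 = np.asarray(x1, dtype=float), np.asarray(x2, dtype=float)
    n1, n2 = x1.size, x2.size
    pooled_var = ((n1 - 1) * x1.var(ddof=1) + (n2 - 1) * x2.var(ddof=1)) / (n1 + n2 - 2)
    return (x1.mean() - x2.mean()) / np.sqrt(pooled_var)

rng = np.random.default_rng(0)
n = 1_000_000  # assumed (very large) sample size per group

# Simulated groups whose true means differ by just 0.02 standard deviations
group_a = rng.normal(loc=0.02, scale=1.0, size=n)
group_b = rng.normal(loc=0.00, scale=1.0, size=n)

p_value = stats.ttest_ind(group_a, group_b, equal_var=False).pvalue
d = cohens_d(group_a, group_b)

print(f"p-value   = {p_value:.2e}")
print(f"Cohen's d = {d:.3f}")
```

The p-value comes out far below 0.001 while Cohen’s d stays around 0.02, an order of magnitude under the “small effect” benchmark of 0.2.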