Probability Fundamentals

Probability & Statistics Series, Part 1 of 13

What is Probability?

Probability is the mathematical framework for reasoning about uncertainty. When we say “there’s a 70% chance of rain,” we’re assigning a numerical measure to how likely an event is. This seemingly simple idea underpins all of statistics, machine learning, and data science.

At its core, probability answers the question: Given everything I know, how likely is this outcome?

Sample Spaces and Events

Every probabilistic experiment has a sample space — the set of all possible outcomes.

Coin flip: $S = \{\text{Heads}, \text{Tails}\}$
Die roll: $S = \{1, 2, 3, 4, 5, 6\}$
Two coin flips: $S = \{HH, HT, TH, TT\}$

An event is a subset of the sample space. For example, “rolling an even number” is the event $\{2, 4, 6\}$.
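The definitions above map directly onto Python sets. The following is a minimal sketch that assumes every outcome is equally likely (a fair coin, a fair die); the helper name `prob` is ours, chosen for illustration.

```python
from fractions import Fraction
from itertools import product

# Sample space for two coin flips: every ordered pair of H/T.
two_flips = {"".join(p) for p in product("HT", repeat=2)}

# Sample space for one die roll, and the event "rolling an even number".
die = set(range(1, 7))
even = {x for x in die if x % 2 == 0}

def prob(event, space):
    """P(event), assuming every outcome in `space` is equally likely."""
    return Fraction(len(event & space), len(space))

print(sorted(two_flips))  # ['HH', 'HT', 'TH', 'TT']
print(prob(even, die))    # 1/2
```

Using `Fraction` rather than floats keeps the probabilities exact, which makes later equality checks reliable.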

The Three Axioms

All of probability theory rests on three axioms, formalized by Andrey Kolmogorov in 1933:

Non-negativity: $P(A) \geq 0$ for any event $A$
Normalization: $P(S) = 1$ (something must happen)
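Additivity: For mutually exclusive events $A$ and $B$, $P(A \cup B) = P(A) + P(B)$. Kolmogorov’s full axiom extends this to any countable sequence of pairwise mutually exclusive events.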

From these three simple rules, we can derive everything else.

Derived Rules

From the axioms, several useful results follow immediately:

Complement: $P(A^c) = 1 - P(A)$
Impossible event: $P(\emptyset) = 0$
Inclusion-exclusion: $P(A \cup B) = P(A) + P(B) - P(A \cap B)$
Monotonicity: If $A \subseteq B$, then $P(A) \leq P(B)$
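As a sanity check rather than a proof, the four rules can be verified numerically on a fair die. The events `A` and `B` below are arbitrary choices made for this sketch.

```python
from fractions import Fraction

S = set(range(1, 7))  # fair six-sided die
A = {2, 4, 6}         # roll is even
B = {4, 5, 6}         # roll is greater than 3

def prob(event):
    """P(event) with equally likely outcomes in S."""
    return Fraction(len(event), len(S))

complement_ok = prob(S - A) == 1 - prob(A)
empty_ok = prob(set()) == 0
incl_excl_ok = prob(A | B) == prob(A) + prob(B) - prob(A & B)
monotone_ok = prob(A & B) <= prob(B)  # A ∩ B is a subset of B

print(complement_ok, empty_ok, incl_excl_ok, monotone_ok)  # True True True True
```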

Conditional Probability

The probability of $A$ given that $B$ has occurred, defined whenever $P(B) > 0$, is:

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$

This is one of the most important formulas in all of probability. It captures the idea that new information changes our beliefs.
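To see the formula at work, here is a small sketch on a fair die: learning that the roll is even raises the probability of a six from 1/6 to 1/3. The function name `cond_prob` is hypothetical.

```python
from fractions import Fraction

S = set(range(1, 7))  # fair six-sided die
A = {6}               # roll is a six
B = {2, 4, 6}         # roll is even

def prob(event):
    """P(event) with equally likely outcomes in S."""
    return Fraction(len(event), len(S))

def cond_prob(a, b):
    """P(a | b) = P(a ∩ b) / P(b); undefined when P(b) = 0."""
    if prob(b) == 0:
        raise ValueError("cannot condition on an event of probability zero")
    return prob(a & b) / prob(b)

print(prob(A))          # 1/6
print(cond_prob(A, B))  # 1/3
```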

Example: A medical test correctly flags 95% of people who have a disease and wrongly flags 5% of people who don’t. If 1% of the population has the disease, what’s the probability you actually have it given a positive test? The answer might surprise you: it’s much lower than 95%.

The Law of Total Probability

If events $B_1, B_2, \ldots, B_n$ partition the sample space (mutually exclusive and exhaustive), then:

$$P(A) = \sum_{i=1}^{n} P(A \mid B_i) \cdot P(B_i)$$

This lets us compute probabilities by breaking them into cases. It is the key ingredient in the denominator of Bayes’ theorem.
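A case-by-case computation might look like the sketch below. The scenario (three machines producing parts, each with its own defect rate) and all of its numbers are invented purely for illustration.

```python
from fractions import Fraction

# Hypothetical partition: which machine made the part, P(B_i).
p_machine = {
    "M1": Fraction(5, 10),
    "M2": Fraction(3, 10),
    "M3": Fraction(2, 10),
}
# Hypothetical conditional defect rates, P(A | B_i).
p_defect_given = {
    "M1": Fraction(1, 100),
    "M2": Fraction(2, 100),
    "M3": Fraction(5, 100),
}

# The cases must be exhaustive: the P(B_i) sum to 1.
assert sum(p_machine.values()) == 1

# P(A) = sum over i of P(A | B_i) * P(B_i)
p_defect = sum(p_defect_given[m] * p_machine[m] for m in p_machine)
print(p_defect, float(p_defect))  # 21/1000 0.021
```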

Independence

Two events $A$ and $B$ are independent if knowing one tells you nothing about the other:

$$P(A \mid B) = P(A)$$

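Equivalently, when $P(B) > 0$: $P(A \cap B) = P(A) \cdot P(B)$. This product form is usually taken as the definition, since it remains meaningful even when $P(B) = 0$.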

Successive coin flips are independent: the coin has no memory. Successive card draws without replacement are not, because each card removed changes the composition of the remaining deck.
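Both claims can be checked by brute-force enumeration. In this sketch the events are “first flip is heads” and “second flip is heads” for the coins, and “first card is an ace” and “second card is an ace” for the deck; cards 0 to 3 stand in for the four aces, which is an encoding we picked for convenience.

```python
from fractions import Fraction
from itertools import permutations, product

def prob(event, space):
    """P(event) on a finite space of equally likely outcomes."""
    return Fraction(sum(1 for w in space if event(w)), len(space))

def independent(a, b, space):
    """Check the product rule P(A ∩ B) = P(A) * P(B)."""
    joint = prob(lambda w: a(w) and b(w), space)
    return joint == prob(a, space) * prob(b, space)

# Two fair coin flips.
flips = list(product("HT", repeat=2))
coins_ok = independent(lambda w: w[0] == "H", lambda w: w[1] == "H", flips)

# Two cards drawn without replacement from a 52-card deck.
draws = list(permutations(range(52), 2))
cards_ok = independent(lambda w: w[0] < 4, lambda w: w[1] < 4, draws)

print(coins_ok)  # True
print(cards_ok)  # False
```

For the cards, $P(\text{both aces}) = 1/221$ while the product of the two marginals is $1/169$, so the product rule fails.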

Conditional Independence

Events $A$ and $B$ are conditionally independent given $C$ (with $P(C) > 0$) if:

$$P(A \cap B \mid C) = P(A \mid C) \cdot P(B \mid C)$$

Key insight: Independence and conditional independence are different concepts. Two events can be marginally independent but conditionally dependent (or vice versa). Conditional independence is the foundation of graphical models like Bayesian networks and Naive Bayes classifiers.
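A concrete case of “marginally independent but conditionally dependent” can be built from two fair coin flips. In the sketch below, $A$ is “first flip is heads,” $B$ is “second flip is heads,” and $C$ is “the two flips match”; the choice of $C$ is ours, made to produce the effect.

```python
from fractions import Fraction
from itertools import product

space = list(product("HT", repeat=2))

def prob(event, given=lambda w: True):
    """P(event | given) on equally likely outcomes."""
    cond = [w for w in space if given(w)]
    return Fraction(sum(1 for w in cond if event(w)), len(cond))

A = lambda w: w[0] == "H"      # first flip is heads
B = lambda w: w[1] == "H"      # second flip is heads
C = lambda w: w[0] == w[1]     # the two flips match
both = lambda w: A(w) and B(w)

marginal = prob(both) == prob(A) * prob(B)
conditional = prob(both, C) == prob(A, C) * prob(B, C)

print(marginal, conditional)  # True False
```

Once we know the flips match, learning the first flip determines the second, so $P(A \cap B \mid C) = 1/2$ even though $P(A \mid C) \cdot P(B \mid C) = 1/4$.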

Bayes’ Theorem

Bayes’ theorem lets us reverse conditional probabilities:

$$P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}$$

In words: the probability of a hypothesis given evidence equals the likelihood of the evidence given the hypothesis, times the prior probability of the hypothesis, divided by the overall probability of the evidence.

The Terms

$P(A \mid B)$ — Posterior: what we want to know (updated belief)
$P(B \mid A)$ — Likelihood: how probable the evidence is under our hypothesis
$P(A)$ — Prior: our belief before seeing evidence
$P(B)$ — Evidence: total probability of observing the data

Medical Test Example

Let’s solve that medical test problem:

Prior: $P(\text{Disease}) = 0.01$
Likelihood: $P(\text{Positive} \mid \text{Disease}) = 0.95$
False positive rate: $P(\text{Positive} \mid \text{No Disease}) = 0.05$

$$\begin{aligned} P(\text{Disease} \mid \text{Positive}) &= \frac{P(\text{Positive} \mid \text{Disease}) \cdot P(\text{Disease})}{P(\text{Positive})} \\[6pt] &= \frac{0.95 \cdot 0.01}{0.95 \cdot 0.01 + 0.05 \cdot 0.99} \\[6pt] &= \frac{0.0095}{0.0590} \\[6pt] &\approx 0.161 \end{aligned}$$

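Only about 16%, even with a 95% accurate test! This counterintuitive result happens because the disease is rare. Out of every 10,000 people tested, about 100 have the disease and 95 of them test positive, while 495 of the 9,900 healthy people also test positive. Most positive tests are false positives.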

Key insight: The prior matters enormously. When the base rate is low, even accurate tests produce mostly false positives. This is why screening tests for rare diseases are followed up with more specific confirmatory tests.
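The calculation is easy to wrap in a function, which also makes it possible to see how strongly the answer depends on the prior. This is a minimal sketch; the function and parameter names are ours, and the extra priors in the loop are illustrative values rather than data.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(Disease | Positive) via Bayes' theorem.

    The denominator is the law of total probability applied to
    the partition {Disease, No Disease}.
    """
    evidence = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / evidence

# The example from the text: 1% prevalence, 95% sensitivity, 5% false positives.
print(round(posterior(0.01, 0.95, 0.05), 3))  # 0.161

# Same test, different base rates (illustrative values).
for prior in (0.001, 0.01, 0.1, 0.5):
    print(f"prior={prior:<5} posterior={posterior(prior, 0.95, 0.05):.3f}")
```

With the test held fixed, the posterior climbs from under 2% at a 0.1% base rate to 95% at a 50% base rate.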

Why This Matters for ML

Most machine learning models are, at heart, probabilistic models, or can be read as one:

Classification estimates $P(\text{class} \mid \text{features})$ — a conditional probability
Bayesian methods use Bayes’ theorem to update model parameters
Many common loss functions are negative log-likelihoods: cross-entropy under a categorical model, squared error under Gaussian noise
Generative models learn full probability distributions over data

Understanding these foundations is essential for everything that follows in this series, starting with random variables and expectation.

Summary

Probability measures uncertainty on a scale from 0 to 1
Three axioms give rise to all of probability theory
Conditional probability captures how evidence updates beliefs
Independence means one event carries no information about another
Bayes’ theorem lets us reverse conditional probabilities
The prior probability critically affects posterior conclusions
These concepts form the language in which modern ML is written

References

Kolmogorov, A. N. (1933). Foundations of the Theory of Probability. Chelsea Publishing Company.
Bertsekas, D. P., & Tsitsiklis, J. N. (2008). Introduction to Probability (2nd ed.). Athena Scientific.
Ross, S. M. (2014). A First Course in Probability (9th ed.). Pearson.
Blitzstein, J. K., & Hwang, J. (2019). Introduction to Probability (2nd ed.). CRC Press.
Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. Chapters 1-2.