Probability Fundamentals

A first-principles guide to probability: from sample spaces and axioms to conditional probability, independence, and Bayes' theorem.

Probability & Statistics · February 15, 2026 · 5 min read

What is Probability?

Probability is the mathematical framework for reasoning about uncertainty. When we say “there’s a 70% chance of rain,” we’re assigning a numerical measure to how likely an event is. This seemingly simple idea underpins all of statistics, machine learning, and data science.

At its core, probability answers the question: Given everything I know, how likely is this outcome?

Sample Spaces and Events

Every probabilistic experiment has a sample space — the set of all possible outcomes.

  • Coin flip: $S = \{\text{Heads}, \text{Tails}\}$
  • Die roll: $S = \{1, 2, 3, 4, 5, 6\}$
  • Two coin flips: $S = \{HH, HT, TH, TT\}$

An event is a subset of the sample space. For example, “rolling an even number” is the event $\{2, 4, 6\}$.
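Sample spaces and events map naturally onto sets. The sketch below assumes every outcome is equally likely, which holds for a fair die or fair coin but not in general; the helper name `prob` is made up for illustration.

```python
from fractions import Fraction
from itertools import product

# Sample space for one roll of a fair six-sided die.
sample_space = {1, 2, 3, 4, 5, 6}

# An event is any subset of the sample space.
even = {2, 4, 6}

def prob(event, space):
    """Probability of `event`, assuming all outcomes in `space` are equally likely."""
    return Fraction(len(event & space), len(space))

print(prob(even, sample_space))  # 1/2

# Sample space for two coin flips, built from the single-flip outcomes.
two_flips = {"".join(pair) for pair in product("HT", repeat=2)}
print(sorted(two_flips))  # ['HH', 'HT', 'TH', 'TT']
```

Using `Fraction` keeps the arithmetic exact, so results can be compared with `==` without floating-point surprises.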

The Three Axioms

All of probability theory rests on three axioms, formalized by Andrey Kolmogorov in 1933:

  1. Non-negativity: $P(A) \geq 0$ for any event $A$
  2. Normalization: $P(S) = 1$ (something must happen)
  3. Additivity: For mutually exclusive events $A$ and $B$, $P(A \cup B) = P(A) + P(B)$

From these three simple rules, we can derive everything else.
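For a finite sample space, the axioms can be checked by brute force. The following is a minimal sketch using a hypothetical loaded die (the weights are invented for illustration); it enumerates all 64 events and tests each axiom directly.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical loaded die: outcome -> probability (weights invented for illustration).
weights = {1: Fraction(1, 10), 2: Fraction(1, 10), 3: Fraction(1, 10),
           4: Fraction(1, 10), 5: Fraction(1, 10), 6: Fraction(1, 2)}
S = frozenset(weights)

def P(event):
    """Probability of an event: the sum of the weights of its outcomes."""
    return sum((weights[o] for o in event), Fraction(0))

def all_events(space):
    """Every subset of a finite sample space."""
    items = sorted(space)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

events = all_events(S)  # 2**6 = 64 events

non_negative = all(P(A) >= 0 for A in events)
normalized = P(S) == 1
# Additivity only needs to hold for mutually exclusive (disjoint) pairs.
additive = all(P(A | B) == P(A) + P(B)
               for A in events for B in events if not (A & B))

print(non_negative, normalized, additive)  # True True True
```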

Derived Rules

From the axioms, several useful results follow immediately:

  • Complement: $P(A^c) = 1 - P(A)$
  • Impossible event: $P(\emptyset) = 0$
  • Inclusion-exclusion: $P(A \cup B) = P(A) + P(B) - P(A \cap B)$
  • Monotonicity: If $A \subseteq B$, then $P(A) \leq P(B)$
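These rules are easy to sanity-check numerically. The sketch below assumes a fair six-sided die and picks three illustrative events; it is a spot check on one example, not a proof.

```python
from fractions import Fraction

S = frozenset(range(1, 7))  # fair six-sided die

def P(event):
    """Probability under equally likely outcomes."""
    return Fraction(len(event), len(S))

A = frozenset({2, 4, 6})  # roll is even
B = frozenset({4, 5, 6})  # roll is greater than 3
C = frozenset({6})        # roll is six, a subset of A

assert P(S - A) == 1 - P(A)                # complement
assert P(frozenset()) == 0                 # impossible event
assert P(A | B) == P(A) + P(B) - P(A & B)  # inclusion-exclusion
assert C <= A and P(C) <= P(A)             # monotonicity

print(P(A | B))  # 2/3
```

In inclusion-exclusion, the subtraction removes the outcomes in both events (here 4 and 6), which would otherwise be counted twice.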

Conditional Probability

The probability of $A$ given that $B$ has occurred, defined whenever $P(B) > 0$, is:

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$

This is one of the most important formulas in all of probability. It captures the idea that new information changes our beliefs.
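To see information changing a probability, consider two fair dice. The sketch below computes the chance that the sum is 8, first with no information and then given that the first die shows 6. The events and the helper `P_given` are illustrative choices.

```python
from fractions import Fraction
from itertools import product

# All 36 ordered outcomes of rolling two fair dice.
S = set(product(range(1, 7), repeat=2))

def P(event):
    return Fraction(len(event), len(S))

def P_given(a, b):
    """P(a | b) = P(a and b) / P(b); undefined when P(b) == 0."""
    if P(b) == 0:
        raise ValueError("cannot condition on an event with probability zero")
    return P(a & b) / P(b)

A = {o for o in S if sum(o) == 8}  # the two dice sum to 8
B = {o for o in S if o[0] == 6}    # the first die shows 6

print(P(A))           # 5/36
print(P_given(A, B))  # 1/6
```

Learning that the first die is 6 raises the probability of a sum of 8 from 5/36 to 1/6, because conditioning on $B$ shrinks the sample space to the six outcomes in $B$.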

Example: A medical test is 95% accurate, meaning it flags 95% of people who have the disease and wrongly flags 5% of people who do not. If 1% of the population has the disease, what’s the probability you actually have it given a positive test? The answer might surprise you: it’s much lower than 95%.

The Law of Total Probability

If events $B_1, B_2, \ldots, B_n$ partition the sample space (mutually exclusive and exhaustive), then:

$$P(A) = \sum_{i=1}^{n} P(A \mid B_i) \cdot P(B_i)$$

This lets us compute probabilities by breaking them into cases. It is the key ingredient in the denominator of Bayes’ theorem.
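Here is a small sketch of the case-by-case computation. The scenario is hypothetical: three machines produce 50%, 30%, and 20% of a factory's output with defect rates of 1%, 2%, and 3%. All names and numbers are assumptions made for illustration.

```python
from fractions import Fraction

# Hypothetical partition: which machine produced the item, P(B_i).
machine_share = {"M1": Fraction(50, 100), "M2": Fraction(30, 100), "M3": Fraction(20, 100)}
# Hypothetical conditional probabilities: P(defect | B_i).
defect_given = {"M1": Fraction(1, 100), "M2": Fraction(2, 100), "M3": Fraction(3, 100)}

def total_probability(prior, likelihood):
    """P(A) = sum_i P(A | B_i) * P(B_i), where the B_i partition the sample space."""
    if sum(prior.values()) != 1:
        raise ValueError("the cases must be exhaustive: P(B_i) must sum to 1")
    return sum(likelihood[b] * prior[b] for b in prior)

print(total_probability(machine_share, defect_given))  # 17/1000
```

The overall defect rate of 1.7% is a weighted average of the per-machine rates, with weights given by each machine's share of production.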

Independence

Two events $A$ and $B$ are independent if knowing one tells you nothing about the other:

$$P(A \mid B) = P(A)$$

Equivalently: $P(A \cap B) = P(A) \cdot P(B)$.

Coin flips are independent — the coin has no memory. But drawing cards without replacement is not independent.
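Both claims can be checked by enumeration using the product form of the definition. In this sketch the deck encoding (rank 1 stands for an ace) and the helper names are assumptions chosen for illustration.

```python
from fractions import Fraction
from itertools import permutations, product

def P(event, space):
    return Fraction(len(event), len(space))

def independent(a, b, space):
    """Check P(a and b) == P(a) * P(b) under equally likely outcomes."""
    return P(a & b, space) == P(a, space) * P(b, space)

# Two fair coin flips.
flips = set(product("HT", repeat=2))
first_heads = {o for o in flips if o[0] == "H"}
second_heads = {o for o in flips if o[1] == "H"}
print(independent(first_heads, second_heads, flips))  # True

# Two cards drawn without replacement: all ordered pairs of distinct cards.
deck = [(rank, suit) for rank in range(1, 14) for suit in "SHDC"]
draws = set(permutations(deck, 2))
first_ace = {d for d in draws if d[0][0] == 1}
second_ace = {d for d in draws if d[1][0] == 1}
print(independent(first_ace, second_ace, draws))  # False
```

For the cards, each draw is an ace with probability 1/13 on its own, but both are aces with probability 1/221 rather than 1/169: removing an ace on the first draw makes an ace less likely on the second.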

Conditional Independence

Events $A$ and $B$ are conditionally independent given $C$ if:

$$P(A \cap B \mid C) = P(A \mid C) \cdot P(B \mid C)$$

Key insight: Independence and conditional independence are different concepts. Two events can be marginally independent but conditionally dependent (or vice versa). Conditional independence is the foundation of graphical models like Bayesian networks and Naive Bayes classifiers.
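A tiny example of events that are marginally independent but conditionally dependent: flip two fair coins, and condition on the event that exactly one of them landed heads. The events chosen here are illustrative.

```python
from fractions import Fraction
from itertools import product

S = set(product("HT", repeat=2))  # two fair coin flips

def P(event, given=None):
    """P(event | given) by counting; given=None conditions on the whole sample space."""
    given = S if given is None else given
    return Fraction(len(event & given), len(given))

A = {o for o in S if o[0] == "H"}        # first flip is heads
B = {o for o in S if o[1] == "H"}        # second flip is heads
C = {o for o in S if o.count("H") == 1}  # exactly one head in total

marginally_independent = P(A & B) == P(A) * P(B)
conditionally_independent = P(A & B, C) == P(A, C) * P(B, C)

print(marginally_independent)     # True
print(conditionally_independent)  # False
```

Once we know exactly one flip was heads, learning that the first flip was heads tells us the second was tails, so the two flips become dependent given $C$.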

Bayes’ Theorem

Bayes’ theorem lets us reverse conditional probabilities:

$$P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}$$

In words: the probability of a hypothesis given evidence equals the likelihood of the evidence given the hypothesis, times the prior probability of the hypothesis, divided by the overall probability of the evidence.

The Terms

  • $P(A \mid B)$ (Posterior): what we want to know (updated belief)
  • $P(B \mid A)$ (Likelihood): how probable the evidence is under our hypothesis
  • $P(A)$ (Prior): our belief before seeing evidence
  • $P(B)$ (Evidence): total probability of observing the data
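The four terms translate directly into a short function for a yes/no hypothesis, with the evidence term expanded by the law of total probability. This is a sketch: the function name, parameter names, and the spam-filter numbers are hypothetical.

```python
from fractions import Fraction

def posterior(prior, likelihood, likelihood_given_not):
    """P(A | B) for a binary hypothesis A.

    prior                -- P(A)
    likelihood           -- P(B | A)
    likelihood_given_not -- P(B | not A)
    """
    # Evidence: P(B) = P(B | A) P(A) + P(B | not A) P(not A)
    evidence = likelihood * prior + likelihood_given_not * (1 - prior)
    if evidence == 0:
        raise ValueError("the evidence has probability zero under both hypotheses")
    return likelihood * prior / evidence

# Hypothetical numbers: 20% of mail is spam; a given word appears in 60% of
# spam messages and in 5% of legitimate messages.
print(posterior(Fraction(1, 5), Fraction(3, 5), Fraction(1, 20)))  # 3/4
```

Seeing the word moves the probability of spam from the prior of 20% to a posterior of 75%.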

Medical Test Example

Let’s solve that medical test problem:

  • Prior: $P(\text{Disease}) = 0.01$
  • Likelihood: $P(\text{Positive} \mid \text{Disease}) = 0.95$
  • False positive rate: $P(\text{Positive} \mid \text{No Disease}) = 0.05$

$$
\begin{aligned}
P(\text{Disease} \mid \text{Positive}) &= \frac{P(\text{Positive} \mid \text{Disease}) \cdot P(\text{Disease})}{P(\text{Positive})} \\[6pt]
&= \frac{0.95 \cdot 0.01}{0.95 \cdot 0.01 + 0.05 \cdot 0.99} \\[6pt]
&= \frac{0.0095}{0.0590} \\[6pt]
&\approx 0.161
\end{aligned}
$$

Only about 16% — even with a 95% accurate test! This counterintuitive result happens because the disease is rare. Most positive tests are false positives.

Key insight: The prior matters enormously. When the base rate is low, even accurate tests produce mostly false positives. This is why screening tests for rare diseases are followed up with more specific confirmatory tests.
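One way to build intuition is to count people instead of multiplying probabilities, then vary the base rate. The sketch below uses the test characteristics from the example above; the population size of 10,000 and the alternative priors are assumptions for illustration.

```python
from fractions import Fraction

PREVALENCE = Fraction(1, 100)          # P(Disease)
SENSITIVITY = Fraction(95, 100)        # P(Positive | Disease)
FALSE_POSITIVE_RATE = Fraction(5, 100) # P(Positive | No Disease)

# Natural frequencies in a hypothetical population of 10,000 people.
population = 10_000
sick = population * PREVALENCE                               # 100 people
true_positives = sick * SENSITIVITY                          # 95 people
false_positives = (population - sick) * FALSE_POSITIVE_RATE  # 495 people
share_sick = true_positives / (true_positives + false_positives)
print(f"{float(share_sick):.3f}")  # 0.161

def posterior(prior):
    """P(Disease | Positive) for a given base rate, via Bayes' theorem."""
    evidence = SENSITIVITY * prior + FALSE_POSITIVE_RATE * (1 - prior)
    return SENSITIVITY * prior / evidence

# The same test applied to populations with different base rates.
for prior in [Fraction(1, 1000), Fraction(1, 100), Fraction(1, 10), Fraction(1, 2)]:
    print(f"prior={float(prior):.3f} -> posterior={float(posterior(prior)):.3f}")
# prior=0.001 -> posterior=0.019
# prior=0.010 -> posterior=0.161
# prior=0.100 -> posterior=0.679
# prior=0.500 -> posterior=0.950
```

Of the 590 positive results, only 95 come from people who are actually sick. The sweep shows the same test yielding anything from about 2% to 95% depending on the prior alone.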

Why This Matters for ML

Many machine learning models are, at heart, probabilistic models:

  • Classification estimates $P(\text{class} \mid \text{features})$, a conditional probability
  • Bayesian methods use Bayes’ theorem to update model parameters
  • Many common loss functions are derived from probabilistic principles; cross-entropy, for example, is a negative log-likelihood
  • Generative models learn full probability distributions over data
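As one concrete illustration of the loss-function point, the sketch below checks that binary cross-entropy, summed over a dataset, equals the negative log-likelihood of the labels under a Bernoulli model. The labels and predicted probabilities are made up, and the function names are illustrative.

```python
import math

def bernoulli_likelihood(labels, probs):
    """Probability of the observed labels if each y_i ~ Bernoulli(p_i), independently."""
    result = 1.0
    for y, p in zip(labels, probs):
        result *= p if y == 1 else 1 - p
    return result

def binary_cross_entropy(labels, probs):
    """Mean of -[y log p + (1 - y) log(1 - p)] over the dataset."""
    total = sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(labels, probs))
    return -total / len(labels)

# Made-up labels and predicted probabilities P(class = 1 | features).
labels = [1, 0, 1, 1]
probs = [0.9, 0.2, 0.7, 0.6]

negative_log_likelihood = -math.log(bernoulli_likelihood(labels, probs))
summed_cross_entropy = binary_cross_entropy(labels, probs) * len(labels)

print(math.isclose(negative_log_likelihood, summed_cross_entropy))  # True
```

Minimizing this loss is therefore the same as maximizing the likelihood of the training labels, under the independence assumption built into the product.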

Understanding these foundations is essential for everything that follows in this series, starting with random variables and expectation.

Summary

  • Probability measures uncertainty on a scale from 0 to 1
  • Three axioms give rise to all of probability theory
  • Conditional probability captures how evidence updates beliefs
  • Independence means one event carries no information about another
  • Bayes’ theorem lets us reverse conditional probabilities
  • The prior probability critically affects posterior conclusions
  • These concepts form the language in which modern ML is written

References

  • Kolmogorov, A. N. (1933). Foundations of the Theory of Probability. Chelsea Publishing Company.
  • Bertsekas, D. P., & Tsitsiklis, J. N. (2008). Introduction to Probability (2nd ed.). Athena Scientific.
  • Ross, S. M. (2014). A First Course in Probability (9th ed.). Pearson.
  • Blitzstein, J. K., & Hwang, J. (2019). Introduction to Probability (2nd ed.). CRC Press.
  • Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. Chapters 1-2.
