What is Probability?
Probability is the mathematical framework for reasoning about uncertainty. When we say “there’s a 70% chance of rain,” we’re assigning a numerical measure to how likely an event is. This seemingly simple idea underpins all of statistics, machine learning, and data science.
At its core, probability answers the question: Given everything I know, how likely is this outcome?
Sample Spaces and Events
Every probabilistic experiment has a sample space $\Omega$: the set of all possible outcomes.
- Coin flip: $\Omega = \{H, T\}$
- Die roll: $\Omega = \{1, 2, 3, 4, 5, 6\}$
- Two coin flips: $\Omega = \{HH, HT, TH, TT\}$
An event is a subset of the sample space. For example, “rolling an even number” is the event $A = \{2, 4, 6\}$.
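Sample spaces and events map naturally onto sets in code. The sketch below is illustrative only: the helper name `prob` is hypothetical, and it assumes every outcome is equally likely (a fair coin or die), which the axioms themselves do not require.

```python
from fractions import Fraction
from itertools import product

# Sample spaces written as Python sets of outcomes.
coin = {"H", "T"}
die = {1, 2, 3, 4, 5, 6}
two_coins = {a + b for a, b in product("HT", repeat=2)}

def prob(event, sample_space):
    """Probability of an event, ASSUMING all outcomes are equally likely."""
    # Only outcomes that belong to the sample space are counted.
    return Fraction(len(event & sample_space), len(sample_space))

even = {2, 4, 6}            # the event "rolling an even number"
print(sorted(two_coins))    # ['HH', 'HT', 'TH', 'TT']
print(prob(even, die))      # 1/2
```

Using `Fraction` keeps the results exact, which makes it easier to compare them with hand calculations.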
The Three Axioms
All of probability theory rests on three axioms, formalized by Andrey Kolmogorov in 1933:
- Non-negativity: $P(A) \geq 0$ for any event $A$
- Normalization: $P(\Omega) = 1$ (something must happen)
- Additivity: For mutually exclusive events $A$ and $B$, $P(A \cup B) = P(A) + P(B)$
From these three simple rules, we can derive everything else.
Derived Rules
From the axioms, several useful results follow immediately:
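- Complement: $P(A^c) = 1 - P(A)$
- Impossible event: $P(\emptyset) = 0$
- Inclusion-exclusion: $P(A \cup B) = P(A) + P(B) - P(A \cap B)$
- Monotonicity: If $A \subseteq B$, then $P(A) \leq P(B)$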
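The derived rules can be checked numerically on a small example. This is a sanity check on one case, not a proof. It assumes a fair six-sided die, and the events `A`, `B`, `C` are chosen purely for illustration.

```python
from fractions import Fraction

omega = frozenset(range(1, 7))   # fair six-sided die (uniform, by assumption)

def P(event):
    """Probability of an event under the uniform distribution on omega."""
    return Fraction(len(event & omega), len(omega))

A = frozenset({2, 4, 6})   # even roll
B = frozenset({4, 5, 6})   # roll greater than 3
C = frozenset({2, 4})      # a subset of A

complement_ok = P(omega - A) == 1 - P(A)
empty_ok = P(frozenset()) == 0
incl_excl_ok = P(A | B) == P(A) + P(B) - P(A & B)
monotone_ok = C <= A and P(C) <= P(A)

print(complement_ok, empty_ok, incl_excl_ok, monotone_ok)   # True True True True
```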
Conditional Probability
The probability of $A$ given that $B$ has occurred is:
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \qquad P(B) > 0$$
This is one of the most important formulas in all of probability. It captures the idea that new information changes our beliefs.
Example: A medical test is 95% accurate: it flags 95% of people who have the disease and wrongly flags 5% of people who do not. If 1% of the population has the disease, what’s the probability you actually have it given a positive test? The answer might surprise you: it’s much lower than 95%.
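To make the definition of conditional probability concrete, here is a small sketch that conditions one die event on another by counting outcomes. It assumes a fair die, and the function name `cond` is hypothetical.

```python
from fractions import Fraction

omega = set(range(1, 7))   # fair six-sided die (uniform, by assumption)

def P(event):
    return Fraction(len(event & omega), len(omega))

def cond(A, B):
    """P(A | B) = P(A and B) / P(B); undefined when P(B) = 0."""
    if P(B) == 0:
        raise ValueError("cannot condition on an event with probability 0")
    return P(A & B) / P(B)

six = {6}
even = {2, 4, 6}
print(P(six))            # 1/6
print(cond(six, even))   # 1/3
```

Learning that the roll is even doubles the probability of a six, from 1/6 to 1/3: the new information shrinks the set of outcomes still in play.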
The Law of Total Probability
If events $B_1, \dots, B_n$ partition the sample space (mutually exclusive and exhaustive), then:
$$P(A) = \sum_{i=1}^{n} P(A \mid B_i)\, P(B_i)$$
This lets us compute probabilities by breaking them into cases. It is the key ingredient in the denominator of Bayes’ theorem.
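As a sketch of this case-by-case computation, the snippet below uses a made-up two-urn setup. The urn probabilities and the function name `total_probability` are illustrative assumptions, not values from the text.

```python
from fractions import Fraction

def total_probability(priors, likelihoods):
    """P(A) = sum_i P(A | B_i) * P(B_i) for a partition B_1, ..., B_n."""
    if sum(priors) != 1:
        raise ValueError("the P(B_i) of a partition must sum to 1")
    return sum(lik * pri for lik, pri in zip(likelihoods, priors))

# Hypothetical setup: urn 1 is chosen with probability 1/3, urn 2 with 2/3.
urn_priors = [Fraction(1, 3), Fraction(2, 3)]
# Hypothetical chance of drawing a red ball from each urn.
red_given_urn = [Fraction(3, 4), Fraction(1, 4)]

print(total_probability(urn_priors, red_given_urn))   # 5/12
```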
Independence
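Two events $A$ and $B$ are independent if knowing one tells you nothing about the other:
$$P(A \cap B) = P(A)\, P(B)$$
Equivalently: $P(A \mid B) = P(A)$ whenever $P(B) > 0$.
Coin flips are independent: the coin has no memory. But successive card draws without replacement are not independent, because each card removed changes the odds for the next draw.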
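Both claims can be checked by enumerating outcomes. The sketch below assumes fair coins and, to keep the enumeration small, a hypothetical four-card deck with two aces and two kings. The helper names are illustrative.

```python
from fractions import Fraction
from itertools import permutations, product

def P(event, omega):
    """Probability of `event` (a predicate) over equally likely outcomes."""
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

def independent(A, B, omega):
    """Check the product rule P(A and B) == P(A) * P(B)."""
    def both(w):
        return A(w) and B(w)
    return P(both, omega) == P(A, omega) * P(B, omega)

# Two fair coin flips.
flips = list(product("HT", repeat=2))
def first_heads(w): return w[0] == "H"
def second_heads(w): return w[1] == "H"

# Two cards drawn without replacement from a tiny hypothetical deck.
deck = ["A1", "A2", "K1", "K2"]
draws = list(permutations(deck, 2))
def first_ace(w): return w[0].startswith("A")
def second_ace(w): return w[1].startswith("A")

print(independent(first_heads, second_heads, flips))   # True
print(independent(first_ace, second_ace, draws))       # False
```

In the card case, each draw is an ace with probability 1/2, but both are aces with probability 1/6 rather than 1/4, so the product rule fails.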
Conditional Independence
Events $A$ and $B$ are conditionally independent given $C$ if:
$$P(A \cap B \mid C) = P(A \mid C)\, P(B \mid C)$$
Key insight: Independence and conditional independence are different concepts. Two events can be marginally independent but conditionally dependent (or vice versa). Conditional independence is the foundation of graphical models like Bayesian networks and Naive Bayes classifiers.
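One way to see the difference is a small mixture model. This is a hypothetical setup, not from the text: pick a fair coin or a biased coin (heads with probability 9/10) at random, then flip it twice. Given which coin was picked, the flips are independent. Without that knowledge, the first flip carries information about the second.

```python
from fractions import Fraction
from itertools import product

# Hypothetical coins: probability of heads for each one.
heads_prob = {"fair": Fraction(1, 2), "biased": Fraction(9, 10)}

# Joint distribution over (coin, flip1, flip2); each coin is picked with prob. 1/2.
joint = {}
for coin, f1, f2 in product(heads_prob, "HT", "HT"):
    p = heads_prob[coin]
    p1 = p if f1 == "H" else 1 - p
    p2 = p if f2 == "H" else 1 - p
    joint[(coin, f1, f2)] = Fraction(1, 2) * p1 * p2

def P(pred):
    return sum(pr for w, pr in joint.items() if pred(w))

def cond(pred, given):
    return P(lambda w: pred(w) and given(w)) / P(given)

def first_heads(w): return w[1] == "H"
def second_heads(w): return w[2] == "H"
def both_heads(w): return first_heads(w) and second_heads(w)
def is_biased(w): return w[0] == "biased"

cond_indep = cond(both_heads, is_biased) == (
    cond(first_heads, is_biased) * cond(second_heads, is_biased)
)
marg_indep = P(both_heads) == P(first_heads) * P(second_heads)
print(cond_indep, marg_indep)   # True False
```

Marginally, seeing heads on the first flip makes the biased coin more plausible, which in turn raises the probability of heads on the second flip.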
Bayes’ Theorem
Bayes’ theorem lets us reverse conditional probabilities. For a hypothesis $H$ and evidence $E$:
$$P(H \mid E) = \frac{P(E \mid H)\, P(H)}{P(E)}$$
In words: the probability of a hypothesis given evidence equals the likelihood of the evidence given the hypothesis, times the prior probability of the hypothesis, divided by the overall probability of the evidence.
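The theorem follows in two lines from the definition of conditional probability. In this short derivation the symbols $H$ (hypothesis) and $E$ (evidence) are naming assumptions: the joint probability $P(H \cap E)$ is written in two ways, and the two expressions are equated.

```latex
\begin{align*}
P(H \cap E) &= P(H \mid E)\, P(E) = P(E \mid H)\, P(H) \\
\Rightarrow \quad P(H \mid E) &= \frac{P(E \mid H)\, P(H)}{P(E)}, \qquad P(E) > 0 .
\end{align*}
```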
The Terms
- $P(H \mid E)$ (posterior): what we want to know (updated belief)
- $P(E \mid H)$ (likelihood): how probable the evidence is under our hypothesis
- $P(H)$ (prior): our belief before seeing evidence
- $P(E)$ (evidence): total probability of observing the data
Medical Test Example
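Let’s solve the medical test problem from earlier. Write $D$ for having the disease and $+$ for a positive test:
- Prior: $P(D) = 0.01$
- Likelihood: $P(+ \mid D) = 0.95$
- False positive rate: $P(+ \mid \neg D) = 0.05$
$$P(D \mid +) = \frac{P(+ \mid D)\, P(D)}{P(+ \mid D)\, P(D) + P(+ \mid \neg D)\, P(\neg D)} = \frac{0.95 \times 0.01}{0.95 \times 0.01 + 0.05 \times 0.99} = \frac{0.0095}{0.059} \approx 0.161$$
Only about 16%, even with a 95% accurate test! This counterintuitive result happens because the disease is rare. Most positive tests are false positives.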
Key insight: The prior matters enormously. When the base rate is low, even accurate tests produce mostly false positives. This is why screening tests for rare diseases are followed up with more specific confirmatory tests.
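The calculation is short enough to sketch as a function, which makes it easy to see how the posterior moves with the base rate. The function name `posterior` and its parameter names are hypothetical. "95% accurate" is read here as a sensitivity of 0.95 and a false positive rate of 0.05, an interpretation that reproduces the roughly 16% figure.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(disease | positive test) via Bayes' theorem."""
    # Evidence P(+), expanded with the law of total probability.
    evidence = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / evidence

rare = posterior(prior=0.01, sensitivity=0.95, false_positive_rate=0.05)
common = posterior(prior=0.10, sensitivity=0.95, false_positive_rate=0.05)

print(round(rare, 3))     # 0.161
print(round(common, 3))   # 0.679
```

With the same test, raising the base rate from 1% to 10% lifts the posterior from about 16% to about 68%, which is the effect of the prior in action.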
Why This Matters for ML
Most machine learning models can be read, at heart, as probabilistic models:
- Classification estimates $P(y \mid x)$, a conditional probability
- Bayesian methods use Bayes’ theorem to update model parameters
- Many loss functions are derived from probabilistic principles (cross-entropy, for example, is a negative log-likelihood)
- Generative models learn full probability distributions over data
Understanding these foundations is essential for everything that follows in this series, starting with random variables and expectation.
Summary
- Probability measures uncertainty on a scale from 0 to 1
- Three axioms give rise to all of probability theory
- Conditional probability captures how evidence updates beliefs
- Independence means one event carries no information about another
- Bayes’ theorem lets us reverse conditional probabilities
- The prior probability critically affects posterior conclusions
- These concepts form the language in which modern ML is written
References
- Kolmogorov, A. N. (1933). Foundations of the Theory of Probability. Chelsea Publishing Company.
- Bertsekas, D. P., & Tsitsiklis, J. N. (2008). Introduction to Probability (2nd ed.). Athena Scientific.
- Ross, S. M. (2014). A First Course in Probability (9th ed.). Pearson.
- Blitzstein, J. K., & Hwang, J. (2019). Introduction to Probability (2nd ed.). CRC Press.
- Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. Chapters 1-2.