- 01 Probability Fundamentals 02 Random Variables and Expectation 03 Probability Distributions 04 The Exponential Family 05 Convergence and the Central Limit Theorem 06 Maximum Likelihood Estimation 07 MAP Estimation 08 The EM Algorithm 09 Hypothesis Testing 10 Nonparametric Statistics 11 Bayesian Inference 12 Probabilistic Graphical Models 13 Sampling Methods
The Curse of High Dimensions
A joint distribution over binary variables has free parameters. For , that’s over a billion parameters — impossible to estimate from any reasonable dataset.
Probabilistic Graphical Models (PGMs) solve this by exploiting conditional independence structure. If variables only interact locally, the joint distribution factors into a product of smaller terms, each involving only a few variables.
PGMs use graphs to make this structure explicit and visual. The nodes represent random variables; the edges encode dependencies.
Two Families of Graphical Models
| Property | Bayesian Networks | Markov Random Fields |
|---|---|---|
| Graph type | Directed acyclic graph (DAG) | Undirected graph |
| Interpretation | Causal / generative | Associative / symmetric |
| Factorization | Conditional probabilities | Potential functions |
| Normalization | Automatic (local CPDs) | Requires partition function |
| Common use | Causal reasoning, generation | Images, physics, social networks |
Bayesian Networks
A Bayesian network is a directed acyclic graph where each node has a conditional probability distribution given its parents .
Factorization
The joint distribution factors as:
Each factor is a conditional probability table (CPT) for discrete variables, or a conditional density for continuous ones.
Example: Disease Diagnosis
Consider a simple medical model:
- Smoking () — binary
- Pollution () — binary
- Cancer () — depends on and
- X-ray () — depends on
- Dyspnea () — depends on
The joint factors as:
Without the graph structure, we’d need parameters. With it, we need — a significant reduction.
Conditional Independence in DAGs
The graph encodes conditional independence statements. Three fundamental structures:
Chain:
- — knowing blocks information flow from to
Fork:
- — and are independent given their common cause
Collider (V-structure):
- but — conditioning on the common effect creates a dependency (explaining away)
Explaining away: If a patient has cancer (), knowing they smoke () makes pollution () a less likely explanation. Observing the effect creates competition between its causes.
D-Separation
D-separation is the graphical criterion that determines conditional independence in Bayesian networks.
A path between nodes and is blocked by a set of observed variables if it contains:
- A chain or fork node that is in (observed intermediate blocks information)
- A collider node that is NOT in (and has no descendant in )
If all paths between and are blocked by , then .
Naive Bayes as a Bayesian Network
The Naive Bayes classifier is a simple Bayesian network where the class variable is the parent of all features:
The “naive” assumption is that features are conditionally independent given the class. This is rarely true, but the classifier works surprisingly well in practice.
Markov Random Fields (Undirected Models)
A Markov Random Field (MRF) uses an undirected graph. Instead of conditional probabilities, it uses potential functions (also called factors or clique potentials).
Factorization
where:
- is the set of cliques (fully connected subgraphs)
- is the potential function for clique
- is the partition function (normalizing constant)
The potentials encode compatibility — how “happy” a configuration is. Higher potential means more likely, but the potentials are not probabilities.
The Partition Function Problem
Computing requires summing over all possible configurations — exponential in the number of variables. This intractability is the central computational challenge of undirected models.
Conditional Independence in MRFs
The Markov property for undirected graphs is simpler than d-separation:
If removing the nodes in disconnects from , they are conditionally independent given .
Ising Model
A classical MRF from statistical physics, used for binary variables on a grid:
- Node potentials : bias toward or
- Edge potentials : preference for neighbors to agree () or disagree ()
In ML: Image denoising (neighboring pixels tend to have similar values), social network modeling, and as a building block for Boltzmann machines.
Conditional Random Fields (CRFs)
CRFs are discriminative undirected models that model directly:
Unlike generative models, CRFs don’t model the input distribution — they focus entirely on the mapping from inputs to outputs.
In ML: Named entity recognition (NER), part-of-speech tagging, image segmentation, and any structured prediction task where outputs have dependencies.
Inference in Graphical Models
Given a model, we need to answer queries: What is ?
Exact Inference
Variable Elimination: Marginalize out variables one at a time, choosing an elimination order that minimizes intermediate factor sizes.
Belief Propagation (Sum-Product Algorithm): Pass messages between nodes on the graph. Exact on trees, approximate on graphs with cycles (loopy belief propagation).
For a tree-structured graph:
After messages converge, the marginal at node is proportional to the product of all incoming messages times the node potential.
Approximate Inference
When exact inference is intractable:
- Loopy Belief Propagation: Run message passing on graphs with cycles. Not guaranteed to converge, but often works well in practice
- Variational Inference: Approximate the posterior with a simpler distribution (see Bayesian inference)
- MCMC sampling: Gibbs sampling is natural for graphical models — each variable’s full conditional depends only on its Markov blanket
Learning Graphical Models
Parameter Learning (Known Structure)
Given the graph structure, learn the parameters from data:
- Complete data: MLE or MAP for each conditional distribution independently
- Missing data: EM algorithm — E-step infers missing values, M-step updates parameters
Structure Learning (Unknown Structure)
Learning the graph itself is much harder:
Score-based methods: Search over possible structures and score each one:
- BIC/AIC scores balance fit and complexity
- Bayesian structure learning computes posterior over graphs
Constraint-based methods: Use conditional independence tests to determine edges:
- The PC algorithm tests for increasing subsets
- Discovers the graph skeleton, then orients edges
In ML: Structure learning is used in causal discovery — inferring cause-effect relationships from observational data.
PGMs in Modern ML
Classical Applications
- Medical diagnosis: Bayesian networks model disease-symptom relationships
- Speech recognition: HMMs (a type of Bayesian network) for acoustic modeling
- Computer vision: MRFs for image segmentation and denoising
- NLP: CRFs for sequence labeling tasks
Connection to Deep Learning
Modern deep learning has largely replaced explicit PGMs for many tasks, but the ideas persist:
- Variational Autoencoders are latent variable models with neural network parameterization — directed graphical models trained with variational inference
- Attention mechanisms in transformers can be viewed through the lens of graphical model message passing
- Graph Neural Networks generalize belief propagation to learned message functions
- Diffusion models are hierarchical latent variable models with a specific graphical structure
The core insight of PGMs — that exploiting conditional independence makes high-dimensional inference tractable — remains as relevant as ever, even when the specific parameterization is a neural network.
Summary
- Graphical models encode conditional independence structure using graphs
- Bayesian networks (directed) factor into conditional distributions; factorization follows from the graph
- MRFs (undirected) use potential functions; the partition function is the central challenge
- D-separation determines conditional independence in DAGs; graph separation in MRFs
- Exact inference (variable elimination, belief propagation) works on trees; approximate methods handle general graphs
- CRFs are discriminative undirected models for structured prediction
- Structure learning discovers causal relationships from data
- PGM concepts (latent variables, message passing, conditional independence) underpin modern deep learning architectures
References
- Koller, D., & Friedman, N. (2009). Probabilistic Graphical Models: Principles and Techniques. MIT Press.
- Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. Chapter 8.
- Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. MIT Press. Chapters 10, 19-20.
- Jordan, M. I. (2003). “An Introduction to Probabilistic Graphical Models.” Unpublished manuscript. UC Berkeley.
- Wainwright, M. J., & Jordan, M. I. (2008). “Graphical Models, Exponential Families, and Variational Inference.” Foundations and Trends in Machine Learning, 1(1-2), 1—305.
- Pearl, J. (2009). Causality: Models, Reasoning, and Inference (2nd ed.). Cambridge University Press.
- Lafferty, J. D., McCallum, A., & Pereira, F. C. N. (2001). “Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data.” ICML 2001.